Chapter 2 2 2


  • 8/13/2019 Chapter 2 2 2

    1/13

    2.5 Implementation of a TINY scanner

    Develop the actual code for a scanner to illustrate the concepts studied so far in this chapter. Do this for the TINY language that we introduced informally in Chapter 1.

    2.5.1 Implementing a scanner for the sample language TINY

    Defining the tokens and their attributes. The tokens and token classes of TINY are summarized as follows:

    Reserved Words    Special Symbols    Other

    if                +                  number
    then              -                  (1 or more digits)
    else              *
    end               /
    repeat            =                  identifier
    until             <                  (1 or more letters)
    read              (
    write             )
                      ;
                      :=

    In addition to the tokens, TINY has the following lexical conventions.
    (1) Comments are enclosed in curly brackets {...} and cannot be nested;
    (2) The code is free format; white space consists of blanks, tabs, and newlines;
    (3) The principle of longest substring is followed in recognizing tokens.


    The DFA for the special symbols except assignment:

    Combine this DFA with DFAs that accept numbers and identifiers:

    [Figure: combined DFA. Surviving labels: a "+" transition, a "digit" transition, and "return" actions on the accepting states.]


    Then add comments, white space, and assignment to this DFA.

    From the point of view of the DFA, it is easiest to treat reserved words as identifiers, and then to look up each accepted identifier in a table of reserved words.


    The code to implement this DFA is contained in the scan.h and scan.c files (see Appendix B).

    (1) The principal procedure is getToken (lines 674-793), which consumes input characters and returns the next token recognized according to the DFA.

    (2) The implementation uses the doubly nested case analysis we have described in Section 2.3.3, with a large case list based on the state, within which are individual case lists based on the current input character.

    (3) The tokens themselves are defined as an enumerated type in globals.h (lines 174-186), which includes all the tokens listed above together with the bookkeeping tokens ENDFILE (when the end of the file is reached) and ERROR (when an erroneous character is encountered). The states of the scanner are also defined as an enumerated type, but within the scanner itself (lines 612-614).

    (4) In the case of the TINY scanner, the only attribute that is computed is the lexeme, or string value of the token recognized, and this is placed in the variable tokenString.

    This variable, together with getToken, is the only service offered to other parts of the compiler, and their definitions are collected in the header file scan.h (lines 55
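    The doubly nested case analysis described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scan.c listing from the appendix: the token and state names, the in-memory input helpers setInput, getNextChar, and ungetNextChar, and the 40-character lexeme limit are all invented for the example. The reserved-word lookup is left out, so reserved words come back as ID.

```c
#include <ctype.h>
#include <stdio.h>

/* Illustrative token and state names; the book's own lists may differ. */
typedef enum { ENDFILE, ERROR, ID, NUM, ASSIGN, PLUS, MINUS, TIMES, OVER,
               EQ, LT, LPAREN, RPAREN, SEMI } TokenType;
typedef enum { START, INASSIGN, INCOMMENT, INNUM, INID, DONE } StateType;

#define MAXTOKENLEN 40                 /* assumed lexeme limit */
char tokenString[MAXTOKENLEN + 1];     /* lexeme of the last token */

static const char *src = "";           /* hypothetical in-memory input */
static size_t pos = 0;
static int atEOF = 0;

void setInput(const char *text) { src = text; pos = 0; atEOF = 0; }

static int getNextChar(void) {
    if (src[pos] == '\0') { atEOF = 1; return EOF; }
    return (unsigned char)src[pos++];
}

/* Back up one character, unless the last read hit end of input. */
static void ungetNextChar(void) { if (!atEOF) pos--; }

TokenType getToken(void) {
    size_t len = 0;
    TokenType tok = ERROR;
    StateType state = START;
    while (state != DONE) {
        int c = getNextChar();
        int save = 1;                  /* is c part of the lexeme? */
        switch (state) {               /* outer case: current state */
        case START:
            if (isdigit(c)) state = INNUM;
            else if (isalpha(c)) state = INID;
            else if (c == ':') state = INASSIGN;
            else if (c == ' ' || c == '\t' || c == '\n') save = 0;
            else if (c == '{') { save = 0; state = INCOMMENT; }
            else {
                state = DONE;
                switch (c) {           /* inner case: input character */
                case EOF: save = 0; tok = ENDFILE; break;
                case '+': tok = PLUS; break;
                case '-': tok = MINUS; break;
                case '*': tok = TIMES; break;
                case '/': tok = OVER; break;
                case '=': tok = EQ; break;
                case '<': tok = LT; break;
                case '(': tok = LPAREN; break;
                case ')': tok = RPAREN; break;
                case ';': tok = SEMI; break;
                default:  tok = ERROR; break;
                }
            }
            break;
        case INCOMMENT:
            save = 0;
            if (c == EOF) { state = DONE; tok = ENDFILE; }
            else if (c == '}') state = START;
            break;
        case INASSIGN:
            state = DONE;
            if (c == '=') tok = ASSIGN;
            else { ungetNextChar(); save = 0; tok = ERROR; }
            break;
        case INNUM:                    /* longest substring: keep reading digits */
            if (!isdigit(c)) { ungetNextChar(); save = 0; state = DONE; tok = NUM; }
            break;
        case INID:                     /* longest substring: keep reading letters */
            if (!isalpha(c)) { ungetNextChar(); save = 0; state = DONE; tok = ID; }
            break;
        default:
            state = DONE; tok = ERROR;
            break;
        }
        if (save && len < MAXTOKENLEN) tokenString[len++] = (char)c;
    }
    tokenString[len] = '\0';
    return tok;
}
```

    A caller would invoke setInput once on the program text and then call getToken repeatedly until it returns ENDFILE, reading tokenString after each call when the lexeme is needed.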


    2.5.2 Reserved Words Versus Identifiers

    Our TINY scanner recognizes reserved words by first considering them as identifiers and then looking them up in a table of reserved words. This is a common practice in scanners, but it means that the efficiency of the scanner depends on the efficiency of the lookup process in the reserved word table. The lookup can be done in several ways:

    (1) linear search, in which the table is searched sequentially from beginning to end. This is not a problem for very small tables such as that for TINY, with only eight reserved words.

    (2) binary search, which we could have applied had we written the list of reserved words in alphabetic order.

    (3) hash table. In this case we would like to use a hash function that has a very small number of collisions. Such a hash function can be developed in advance, since the reserved words are not going to change (at least not rapidly), and their places in the table will be fixed for every run of the compiler.

    Minimal perfect hash functions are functions that distinguish each reserved word from the others, and that have the minimum number of values, so that a hash table no larger than the number of reserved words can be used.
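    The first two lookup strategies can be sketched as follows. This is an illustration only: the type ReservedWord, the functions reservedLookupLinear and reservedLookupBinary, and the token names are assumptions made for the example rather than names taken from the TINY sources. The binary version relies on the table being written in alphabetical order.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative token names: ID plus one token per reserved word. */
typedef enum { ID, IF, THEN, ELSE, END, REPEAT, UNTIL, READ, WRITE } TokenType;

typedef struct {
    const char *str;
    TokenType tok;
} ReservedWord;

/* The eight TINY reserved words, kept in alphabetical order so that
   binary search can be applied. */
static const ReservedWord reservedWords[] = {
    {"else", ELSE},     {"end", END},   {"if", IF},       {"read", READ},
    {"repeat", REPEAT}, {"then", THEN}, {"until", UNTIL}, {"write", WRITE}
};
#define NUM_RESERVED (sizeof reservedWords / sizeof reservedWords[0])

/* Linear search: scan the table from beginning to end. */
TokenType reservedLookupLinear(const char *s) {
    size_t i;
    for (i = 0; i < NUM_RESERVED; i++)
        if (strcmp(s, reservedWords[i].str) == 0)
            return reservedWords[i].tok;
    return ID;                      /* not reserved: an ordinary identifier */
}

/* Comparison callback for bsearch: key is the lexeme, entry a table row. */
static int compareEntry(const void *key, const void *entry) {
    return strcmp((const char *)key, ((const ReservedWord *)entry)->str);
}

/* Binary search using the standard library's bsearch. */
TokenType reservedLookupBinary(const char *s) {
    const ReservedWord *hit = bsearch(s, reservedWords, NUM_RESERVED,
                                      sizeof reservedWords[0], compareEntry);
    return hit != NULL ? hit->tok : ID;
}
```

    With only eight entries the two functions perform about the same; the ordering requirement of the binary version matters more than its speed at this size.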

    Another option in dealing with reserved words is to use the same table that stores identifiers, that is, the symbol table. Before processing is begun, all reserved words are entered into this table and are marked reserved (so that no redefinition is allowed).
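    The symbol-table approach can be sketched with a small chained hash table. Everything here is hypothetical and chosen for illustration: the Entry structure, the table size of 31, the hash function, and the functions lookup, enterReservedWords, and declareIdentifier are not taken from the TINY compiler.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 31              /* arbitrary small prime for the example */

typedef struct Entry {
    char *name;
    int reserved;                  /* 1 if the name is a reserved word */
    struct Entry *next;            /* next entry in the same bucket */
} Entry;

static Entry *table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Returns the entry for name, or NULL if it is not in the table. */
Entry *lookup(const char *name) {
    Entry *e;
    for (e = table[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* Adds a new entry at the head of its bucket; NULL if memory runs out. */
static Entry *enter(const char *name, int reserved) {
    unsigned h = hash(name);
    Entry *e = malloc(sizeof *e);
    if (e == NULL) return NULL;
    e->name = malloc(strlen(name) + 1);
    if (e->name == NULL) { free(e); return NULL; }
    strcpy(e->name, name);
    e->reserved = reserved;
    e->next = table[h];
    table[h] = e;
    return e;
}

/* Called once before processing begins. */
void enterReservedWords(void) {
    static const char *words[] = { "if", "then", "else", "end",
                                   "repeat", "until", "read", "write" };
    size_t i;
    for (i = 0; i < sizeof words / sizeof words[0]; i++)
        if (lookup(words[i]) == NULL)
            enter(words[i], 1);
}

/* Returns 1 if name may be used as an identifier (entering it if new),
   0 if it is reserved or memory runs out. */
int declareIdentifier(const char *name) {
    Entry *e = lookup(name);
    if (e != NULL) return !e->reserved;
    return enter(name, 0) != NULL;
}
```

    The scanner then needs only one lookup per identifier-shaped lexeme: the reserved flag on the entry tells it whether to return a reserved-word token or ID.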



    2.6 Use of Lex to generate a scanner automatically

    Now we will use the Lex scanner generator to generate a scanner from a description of the tokens of TINY as regular expressions. Since there are a number of different versions of Lex in existence, we confine our discussion to those features that are common to all or most of the versions.

    The most popular version of Lex is called flex (for Fast Lex). It is distributed as part of the Gnu compiler package produced by the Free Software Foundation, and is also freely available at many Internet sites.

    Lex is a program that takes as its input a text file containing regular expressions, together with the actions to be taken when each expression is matched, and produces an output file of C source code defining a procedure yylex.

    The Lex output file, usually called lex.yy.c or lexyy.c, is then compiled and linked to a main program to get a running program.

    2.6.1 Lex conventions for regular expressions

    1 Lex allows the matching of single characters, or strings of characters, simply by writing the characters in sequence. Lex also allows metacharacters to be matched as actual characters by surrounding the characters in quotes. Quotes can also be written around characters that are not metacharacters, where they have no effect.

    For example: if and "if" have the same meaning. To match a left parenthesis, we must write "(". An alternative is to use the backslash metacharacter \, but this works only for single metacharacters: to match the character sequence (* we would have to write \(\* or "(*".

    The backslash also gives some characters a special meaning: \n matches a newline and \t matches a tab.

    2 Metacharacters: *, +, (, ), |, ?

    For example: (aa|bb)(a|b)*c? can also be written ("aa"|"bb")("a"|"b")*"c"?

    3 The Lex convention for character classes (sets of characters) is to write them between square brackets.

    For example: [abcd] matches any one of a, b, c, or d, so the expression above can be written (aa|bb)[ab]*c?

    4 Ranges of characters can also be written using a hyphen. Thus, the expression [


    6 Complementary sets, that is, sets that do not contain certain characters, can also be written in this notation, using the caret (^) as the first character inside the brackets. Thus, [
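    The unquoted forms of the example expressions can be tried outside Lex with the POSIX regex library, whose extended syntax agrees with Lex on alternation, grouping, *, ?, and character classes. This is only a sketch: it assumes a POSIX system that provides regex.h, the helper name matchesWhole is invented for the example, and Lex's quoting convention is not part of POSIX syntax, so the quoted forms are not tested here.

```c
#include <regex.h>
#include <stdio.h>

/* Returns 1 if the whole of text matches pattern, 0 if it does not,
   and -1 if the pattern is too long or fails to compile. */
int matchesWhole(const char *pattern, const char *text) {
    char anchored[256];
    regex_t re;
    int n, rc;

    /* Anchor the pattern so that it must cover the entire string. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored) return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0) return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

    For instance, matchesWhole("(aa|bb)(a|b)*c?", "aabac") and matchesWhole("(aa|bb)[ab]*c?", "aabac") both accept the string, while "abc" is rejected by both because it begins with neither aa nor bb.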


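    A Lex input file consists of three sections, separated by lines containing %%:

    {definitions}
    %%
    {rules}
    %%
    {auxiliary routines}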

    1 The first section: definitions

    The definition section occurs before the first %%.

    (1) Any C code that must be inserted external to any function should appear in this section between the delimiters %{ and %}. (Note the order of these characters.)

    (2) Names for regular expressions must also be defined in this section. A name is defined by writing it on a separate line starting in the first column and following it (after one or more blanks) by the regular expression it represents.

    2 The second section: rules

    These consist of a sequence of regular expressions followed by the C code that is to be executed when the corresponding regular expression is matched.

    3 The third section: auxiliary routines

    This section contains C code for routines that are called in the second section and not defined elsewhere. It may also contain a main program, if we want to compile the Lex output as a standalone program. This section can also be missing, in which case the second %% need not be written. The first %% is always necessary.

    Example:

    %{
    /* a Lex program that adds line numbers
       to lines of text, printing the
       new text to the standard output
    */
    #include <stdio.h>
    int lineno = 1;
    %}
    line .*\n
    %%
    {line} { printf("%5d %s",lineno++,yytext); }
    %%
    main()
    { yylex(); return 0; }

    Running the program obtained from Lex on this input file itself gives the following output:

        1 %{
        2 /* a Lex program that adds line numbers
        3    to lines of text, printing the
        4    new text to the standard output
        5 */
        6 #include <stdio.h>
        7 int lineno = 1;
        8 %}
        9 line .*\n
       10 %%
       11 {line} { printf("%5d %s",lineno++,yytext); }
       12 %%
       13 main()
       14 { yylex(); return 0; }
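    To make the behavior of the generated scanner concrete, the effect of the single rule can be imitated by hand in C. This is a sketch under assumptions, not what Lex actually generates: the function name numberLines and the string-to-buffer interface are invented for the example, and text after the last newline is copied unchanged on the assumption that it matches no rule and falls to Lex's default action of echoing unmatched input.

```c
#include <stdio.h>
#include <string.h>

/* Copies text into out, prefixing each newline-terminated line with a
   right-aligned line number as in the "%5d %s" format. Returns the
   number of characters written (not counting the terminating '\0');
   stops early if out is too small. */
size_t numberLines(const char *text, char *out, size_t size) {
    int lineno = 1;
    size_t used = 0;

    if (size == 0) return 0;
    out[0] = '\0';
    while (*text != '\0') {
        const char *nl = strchr(text, '\n');
        int n;
        if (nl != NULL) {
            /* One match of the pattern .*\n, newline included. */
            int len = (int)(nl - text) + 1;
            n = snprintf(out + used, size - used, "%5d %.*s",
                         lineno++, len, text);
            text = nl + 1;
        } else {
            /* Unterminated tail: copied through without a number. */
            n = snprintf(out + used, size - used, "%s", text);
            text += strlen(text);
        }
        if (n < 0 || (size_t)n >= size - used) break;   /* out is full */
        used += (size_t)n;
    }
    return used;
}
```

    Each pass through the loop corresponds to one call of the rule's action, with the matched line playing the role of yytext.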


    Example: the following Lex input file:

    %{
    /* Selects only lines that end or
       begin with the letter 'a'.
       Deletes everything else.
    */
    #include <stdio.h>
    %}
    endswitha .*a\n
    beginswitha a.*\n
    %%
    {endswitha} ECHO;
    {beginswitha} ECHO;
    .*\n ;
    %%
    main()
    { yylex(); return 0; }
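    The combined effect of the three rules can be imitated by hand in C. This is a sketch only: the names keepsLine and filterLines and the string-to-buffer interface are invented for the example. It assumes that text after the last newline matches none of the rules and is therefore copied through by Lex's default action.

```c
#include <string.h>

/* A line (len characters, newline excluded) is echoed if it begins
   or ends with the letter 'a'. */
int keepsLine(const char *line, size_t len) {
    return len > 0 && (line[0] == 'a' || line[len - 1] == 'a');
}

/* Copies the selected lines of text into out and returns the number
   of characters written (not counting the terminating '\0'); stops
   early if out is too small. */
size_t filterLines(const char *text, char *out, size_t size) {
    size_t used = 0;

    if (size == 0) return 0;
    out[0] = '\0';
    while (*text != '\0') {
        const char *nl = strchr(text, '\n');
        /* len covers the line and its newline, or the unterminated tail. */
        size_t len = nl != NULL ? (size_t)(nl - text) + 1 : strlen(text);
        int keep = nl != NULL ? keepsLine(text, len - 1) : 1;

        if (keep) {
            if (len >= size - used) break;      /* out is full */
            memcpy(out + used, text, len);
            used += len;
            out[used] = '\0';
        }
        text += len;                            /* the empty action: skip */
    }
    return used;
}
```

    A line that both begins and ends with a is still echoed only once, just as in the Lex version, where the first of two equally long matches is the one whose action runs.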


    Example 2.23 In this example, Lex generates a program that will convert all uppercase letters to lowercase, except for letters inside C-style comments (that is, anything inside the delimiters /*...*/):

    %{
    /* Lex program to convert uppercase to
       lowercase except inside comments
    */
    #include <stdio.h>
    #ifndef FALSE
    #define FALSE