Compiler Notes - Ullman


    Compiler Construction Lecture Notes

    Introduction

    o Lecture 1 (printable)

    Lexical Analysis

    o Lecture 2 (printable)
    o Lecture 3 (printable)
    o Lecture 4 (printable)
    o Lecture 5 (printable)
    o Lecture 6 (printable)
    o Lecture 7 (printable)

    Syntax Analysis

    o Lecture 8 (printable)
    o Lecture 9 (printable)
    o Lecture 10 (printable)
    o Lecture 11 (printable)
    o Lecture 12 (printable)
    o Lecture 13 (printable)

    Semantic Analysis

    o Lecture 14 (printable)
    o Lecture 15 (printable)
    o Lecture 16 (printable)
    o Lecture 17 (printable)

    Intermediate Code Generation

    o Lecture 18 (printable)
    o Lecture 19 (printable)
    o Lecture 20 (printable)
    o Lecture 21 (printable)
    o Lecture 22 (printable)

    Final Code Generation

    o Lecture 23 (printable)
    o Lecture 24 (printable)
    o Lecture 25 (printable)
    o Lecture 26 (printable)

    lecture #1 began here

    Why study compilers?

    Most CS students do not go on to write a commercial compiler someday, but that's not why we study compilers. We study compiler construction for the following reasons:

    Writing a compiler gives experience with large-scale applications development. Your compiler program may be the largest program you write as a student. Experience working with really big data structures and complex interactions between algorithms will help you out on your next big programming project.

    Compiler writing is one of the shining triumphs of CS theory. It demonstrates the value of theory over the impulse to just "hack up" a solution.

    Compiler writing is a basic element of programming language research. Many language researchers write compilers for the languages they design.

    Many applications have similar properties to one or more phases of a compiler, and compiler expertise and tools can help an application programmer working on other projects besides compilers.

    CS 370 is labor intensive. Famous computer scientist Dan Berry of the University of Waterloo has argued convincingly that there is no software development method for writing large programs that doesn't involve pain: pain is inevitable in software development (Berry's Theorem). From my own experience as a student, I postulate Jeffery's Corollary: there is no way to learn the skills necessary for writing big programs without pain. A good CS course includes pain, and teaches pain management and minimization.

    The questions we should ask, then, are: (a) should CS majors be required to spend a lot of time becoming really good programmers? and (b) are we providing students with the assistance and access to the tools and information they need to accomplish their goals with the minimal doses of inevitable pain that are required?

    Some Tools we will use


    Labs and lectures will discuss all of these, but if you do not know them already, the sooner you go learn them, the better.

    C and "make"
        If you are not expert with these yet, you will be a lot closer by the time you pass this class.
    lex and yacc
        These are compiler-writers' tools, but they are useful for other kinds of applications; almost anything with a complex file format to read in can benefit from them.
    gdb
        If you do not know a source-level debugger well, start learning. You will need one to survive this class.
    e-mail
        Regularly e-mailing your instructor is a crucial part of class participation. If you aren't asking questions, you aren't doing your job as a student.
    web
        This is where you get your lecture notes, homeworks, and labs, and turn in all your work.
    virtual environment
        We have a 3D video game / chat tool available that can help us handle questions when one of us is not on campus.

    Compilers - What Are They and What Kinds of Compilers are Out There?

    The purpose of a compiler is: to translate a program in some language (the source language) into a lower-level language (the target language). The compiler itself is written in some language, called the implementation language. To write a compiler you have to be very good at programming in the implementation language, and have to think about and understand the source language and target language.

    There are several major kinds of compilers:

    Native Code Compiler
        Translates source code into hardware (assembly or machine code) instructions. Example: gcc.
    Virtual Machine Compiler
        Translates source code into an abstract machine code, for execution by a virtual machine interpreter. Example: javac.
    JIT Compiler
        Translates virtual machine code to native code. Operates within a virtual machine. Example: Sun's HotSpot Java virtual machine.
    Preprocessor
        Translates source code into simpler or slightly lower-level source code, for compilation by another compiler. Examples: cpp, m4.
    Pure Interpreter
        Executes source code on the fly, without generating machine code. Example: Lisp.

    Phases of a Compiler

    Lexical Analysis:
        Converts a sequence of characters into words, or tokens.
    Syntax Analysis:
        Converts a sequence of tokens into a parse tree.
    Semantic Analysis:
        Manipulates the parse tree to verify symbol and type information.
    Intermediate Code Generation:
        Converts the parse tree into a sequence of intermediate code instructions.
    Optimization:
        Manipulates intermediate code to produce a more efficient program.
    Final Code Generation:
        Translates intermediate code into final (machine/assembly) code.

    Example of the Compilation Process

    Consider the example statement; its translation to machine code illustrates some of the issues involved in compiling.

    position = initial + rate * 60

    30 or so characters, from a single line of source code, are first transformed by lexical analysis into a sequence of 7 tokens. Those tokens are then used to build a tree of height 4 during syntax analysis. Semantic analysis may transform the tree into one of height 5, that includes a type conversion necessary for real addition on an integer operand. Intermediate code generation uses a simple traversal algorithm to linearize the tree back into a sequence of machine-independent three-address-code instructions.

    t1 = inttoreal(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3


    Optimization of the intermediate code allows the four instructions to be reduced to two machine-independent instructions. Final code generation might implement these two instructions using 5 machine instructions, in which the actual registers and addressing modes of the CPU are utilized.

    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1

    lecture #2 began here

    Announcements

    Reading!

    I hope you have already been reading! Make sure you read the class lecture notes, the related sections of the text, and please ask questions about whatever is not totally clear. You can ask questions in class, via e-mail, in the virtual environment, or on the class message board.

    Note: although last year's CS 370 lecture notes are ALL available to you up front, I generally revise each lecture's notes, making additions, corrections and adaptations to this year's homeworks, the night before each lecture. The best time to print hard copies of the lecture notes is one day at a time, right before the lecture is given.

    Overview of Lexical Analysis


    A lexical analyzer, also called a scanner, typically has the following functionality and characteristics.

    Its primary function is to convert from a (often very long) sequence of characters into a (much shorter, perhaps 10X shorter) sequence of tokens. This means less work for subsequent phases of the compiler.

    The scanner must identify and categorize specific character sequences into tokens. It must know whether every two adjacent characters in the file belong together in the same token, or whether the second character must be in a different token.

    Most lexical analyzers discard comments and whitespace. In most languages these characters serve to separate tokens from each other, but once lexical analysis is completed they serve no purpose. On the other hand, the exact line number and/or column number may be useful in reporting errors, so some record of what whitespace has occurred may be retained. Note: in some languages, even popular ones, whitespace is significant.

    The scanner must handle lexical errors (illegal characters, malformed tokens) by reporting them intelligibly to the user.

    Efficiency is crucial; a scanner may perform elaborate input buffering.

    Token categories can be (precisely, formally) specified using regular expressions, e.g. IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*

    Lexical analyzers can be written by hand, or implemented automatically using finite automata.

    What is a "token" ?


    In compilers, a "token" is:

    1. a single word of source code input (a.k.a. "lexeme")
    2. an integer code that refers to a single word of input
    3. a set of lexical attributes computed from a single word of input

    Programmers think about all this in terms of #1. Syntax checking uses #2. Error reporting, semantic analysis, and code generation require #3. In a compiler written in C, you allocate a C struct to store #3 for each token.

    Worth Mentioning

    Here are the names of several important tools closely related to compilers. You should learn those of these terms that you don't already know.

    interpreter
        a language processor program that translates and executes source code directly, without compiling it to machine code.
    assembler
        a translator from human-readable (ASCII text) files of machine instructions into the actual binary code (object files) of a machine.
    linker
        a program that combines (multiple) object files to make an executable. Converts names of variables and functions to numbers (machine addresses).
    loader
        a program to load code. On some systems, different executables start at different base addresses, so the loader must patch the executable with the actual base address of the executable.


    preprocessor
        a program that processes the source code before the compiler sees it. Usually, it implements macro expansion, but it can do much more.
    editor
        Editors may operate on plain text, or they may be wired into the rest of the compiler, highlighting syntax errors as you go, or allowing you to insert or delete entire syntax constructs at a time.
    debugger
        a program to help you see what's going on when your program runs. Can print the values of variables, show what procedure called what procedure to get where you are, run up to a particular line, run until a particular variable gets a special value, etc.
    profiler
        a program to help you see where your program is spending its time, so you can tell where you need to speed it up.

    Auxiliary data structures

    You were presented with the phases of the compiler, from lexical and syntax analysis, through semantic analysis, and intermediate and final code generation. Each phase has an input and an output to the next phase. But there are a few data structures we will build that survive across multiple phases: the literal table, the symbol table, and the error handler.

    lexeme table
        a table that stores lexeme values, such as strings and variable names, that may occur in many places. Only one copy of each unique string and name needs to be allocated in memory.

    symbol table
        a table that stores the names defined (and visible within) each particular scope. Scopes include: global, and procedure (local). More advanced languages have more scopes such as class (or record) and package.
    error handler
        errors in lexical, syntax, or semantic analysis all need a common reporting mechanism, that shows where the error occurred (filename, line number, and maybe column number are useful).

    Reading Named Files in C using stdio

    In this class you are opening and reading files. Hopefully this is review for you; if not, you will need to learn it quickly. To do any "standard I/O" file processing, you start by including the header:

    #include <stdio.h>

    This defines a data type (FILE *) and gives prototypes for relevant functions. The following code opens a file using a string filename, and reads the first character (into an int variable, not a char, so that it can detect end-of-file; EOF is not a legal char value).

    FILE *f = fopen(filename, "r");
    int i = fgetc(f);
    if (i == EOF) /* empty file... */
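    Fleshing that fragment out, here is a minimal complete sketch (mine, not from the original notes; the function name is made up) that also checks whether fopen() failed, which the fragment above glosses over:

    #include <stdio.h>

    /* count the characters in a named file; returns -1 if it can't be opened */
    int countchars(char *filename)
    {
       FILE *f = fopen(filename, "r");
       int i, n = 0;

       if (f == NULL) return -1;
       while ((i = fgetc(f)) != EOF)   /* int, not char, so EOF is detectable */
          n++;
       fclose(f);
       return n;                       /* 0 means the file was empty */
    }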

    Command line argument handling and file processing in C


    The following example is from Kernighan & Ritchie's "The C Programming Language", page 162.

    #include <stdio.h>

    /* cat: concatenate files, version 1 */
    int main(int argc, char *argv[])
    {
       FILE *fp;
       void filecopy(FILE *, FILE *);

       if (argc == 1)   /* no args; copy standard input */
          filecopy(stdin, stdout);
       else
          while (--argc > 0)
             if ((fp = fopen(*++argv, "r")) == NULL) {
                printf("cat: can't open %s\n", *argv);
                return 1;
                }
             else {
                filecopy(fp, stdout);
                fclose(fp);
                }
       return 0;
    }

    /* filecopy: copy file ifp to file ofp */
    void filecopy(FILE *ifp, FILE *ofp)
    {
       int c;

       while ((c = getc(ifp)) != EOF)
          putc(c, ofp);
    }

    Warning: while using and adapting the above code is fair game in this class, the yylex() function is very different than the filecopy() function! It takes no parameters! It returns an integer every time it finds a token! So if you "borrow" from this example, delete filecopy() and write yylex() from scratch. Multiple students have fallen into this trap before you.

    A Brief Introduction to Make

    It is not a good idea to write a large program like a compiler as a single source file. For one thing, every time you make a small change, you would need to recompile the whole program, which will end up being many thousands of lines. For another thing, parts of your compiler may be generated by "compiler construction tools" which will write separate files. In any case, this class will require you to use multiple source files, compiled separately, and linked together to form your executable program.

    This would be a pain, except we have "make" which takes care of it for us. Make uses an input file named "makefile", which stores in ASCII text form a collection of rules for how to build a program from its pieces. Each rule shows how to build a file from its source files, or dependencies. For example, to compile a file under C:

    foo.o : foo.c
    	gcc -c foo.c

    The first line says that to build foo.o you need foo.c, and the second line, which must begin with a tab, gives a command line to execute whenever foo.o should be rebuilt, i.e. when it is missing or when foo.c has been changed and needs to be recompiled.

    The first rule in the makefile is what "make" builds by default, but note that make dependencies are recursive: before it checks whether it needs to rebuild foo.o from foo.c, it will check whether foo.c needs to be rebuilt using some other rule. Because of this post-order traversal of the "dependency graph", the first rule in your makefile is usually the last one that executes when you type "make". For a C program, the first rule in your makefile would usually be the "link" step that assembles object files into an executable, as in:

    compiler: foo.o bar.o baz.o
    	gcc -o compiler foo.o bar.o baz.o

    There is a lot more to "make" but we will take it one step at a time. This article on Make may be useful to you. You can find other useful on-line documentation on "make" (manual page, Internet reference guides, etc.) if you look.
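    Putting the two kinds of rules together, a complete makefile for a small three-file project might look like the following sketch (the file names foo.c, bar.c, baz.c are placeholders, not part of this course's assignments; each command line must begin with a tab):

    compiler: foo.o bar.o baz.o
    	gcc -o compiler foo.o bar.o baz.o

    foo.o: foo.c
    	gcc -c foo.c

    bar.o: bar.c
    	gcc -c bar.c

    baz.o: baz.c
    	gcc -c baz.c

    clean:
    	rm -f compiler foo.o bar.o baz.o

    Typing "make" builds the compiler executable, because the link rule comes first and is therefore the default target; "make clean" runs the last rule by name.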

    A couple finer points for HW#1

    extern vs. #include: when do you use the one, when the other?

    public interface to yylex(): no, you can't add your own parameters

    Regular Expressions

    The notation we use to precisely capture all the variations that a given category of token may take is called "regular expressions" (or, less formally, "patterns"; the word "pattern" is really vague, and there are lots of other notations for patterns besides regular expressions). Regular expressions are a shorthand notation for sets of strings. In order to even talk about "strings" you have to first define an alphabet, the set of characters which can appear.

    1. Epsilon (ε) is a regular expression denoting the set containing the empty string.
    2. Any letter in the alphabet is also a regular expression, denoting the set containing a one-letter string consisting of that letter.

    3. For regular expressions r and s, r | s is a regular expression denoting the union of r and s.
    4. For regular expressions r and s, r s is a regular expression denoting the set of strings consisting of a member of r followed by a member of s.
    5. For regular expression r, r* is a regular expression denoting the set of strings consisting of zero or more occurrences of r.
    6. You can parenthesize a regular expression to specify operator precedence (otherwise, alternation is like plus, concatenation is like times, and closure is like exponentiation).

    Although these operators are sufficient to describe all regular languages, in practice everybody uses extensions:

    For regular expression r, r+ is a regular expression denoting the set of strings consisting of one or more occurrences of r. Equivalent to rr*.

    For regular expression r, r? is a regular expression denoting the set of strings consisting of zero or one occurrence of r. Equivalent to r|ε.

    The notation [abc] is short for a|b|c. [a-z] is short for a|b|...|z. [^abc] is short for: any character other than a, b, or c.
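    For example (an added illustration), the IDENTIFIER pattern [a-zA-Z][a-zA-Z0-9]* from earlier reads: one letter, then zero or more letters or digits. Using only the core operators it would have to be spelled out as (a|b|...|z|A|B|...|Z)(a|b|...|z|A|B|...|Z|0|1|...|9)*, which is exactly what the character-set shorthand abbreviates.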

    lecture #3 began here

    What is a "lexical attribute" ?

    A lexical attribute is a piece of information about a token. These typically include:

    category
        an integer code used to check syntax
    lexeme
        the actual string contents of the token
    line, column, file
        where the lexeme occurs in the source code
    value
        for literals, the binary data they represent

    Homework #2

    Avoid These Common Bugs in Your Homeworks!

    1. yytext or yyinput were not declared global
    2. main() does not have its required argc, argv parameters!
    3. main() does not call yylex() in a loop or check its return value
    4. getc() EOF handling is missing or wrong! Check EVERY call to getc() for EOF!
    5. opened files not (all) closed! file handle leak!
    6. end-of-comment code doesn't check for */
    7. yylex() is not doing the file reading
    8. yylex() does not skip multiple spaces, mishandles spaces at the front of input, or requires certain spaces in order to function OK
    9. extra or bogus output not in assignment spec
    10. = instead of ==

    Some Regular Expression Examples

    In a previous lecture we saw regular expressions, the preferred notation for specifying patterns of characters that define token categories. The best way to get a feel for regular expressions is to see examples. Note that regular expressions form the basis for pattern matching in many UNIX tools such as grep, awk, perl, etc.

    What is the regular expression for each of the different lexical items that appear in C programs? How does this compare with another, possibly simpler programming language such as BASIC?

    lexical category: operators
        BASIC: the characters themselves
        C: For operators that are regular expression operators we need to mark them with double quotes or backslashes to indicate you mean the character, not the regular expression operator. Note several operators have a common prefix. The lexical analyzer needs to look ahead to tell whether an = is an assignment, or is followed by another =, for example.

    lexical category: reserved words
        BASIC: the concatenation of characters; case insensitive
        C: Reserved words are also matched by the regular expression for identifiers, so a disambiguating rule is needed.

    lexical category: identifiers
        BASIC: no _; $ at ends of some; 2 significant letters!?; case insensitive
        C: [a-zA-Z_][a-zA-Z0-9]*

    lexical category: numbers
        BASIC: ints and reals, starting with [0-9]+
        C: 0x[0-9a-fA-F]+ etc.

    lexical category: comments
        BASIC: REM.*
        C: C's comments are tricky regexp's

    lexical category: strings
        BASIC: almost ".*"; no escapes
        C: escaped quotes

    what else?

    lex(1) and flex(1)

    These programs generally take a lexical specification given in a .l file and create a corresponding C language lexical analyzer in a file named lex.yy.c. The lexical analyzer is then linked with the rest of your compiler.

    The C code generated by lex has the following public interface. Note the use of global variables instead of parameters, and the use of the prefix yy to distinguish scanner names from your program names. This prefix is also used in the YACC parser generator.


    FILE *yyin;     /* set this variable prior to calling yylex() */
    int yylex();    /* call this function once for each token */
    char yytext[];  /* yylex() writes the token's lexeme to an array */
                    /* note: with flex, I believe extern declarations must read:
                       extern char *yytext; */
    int yywrap();   /* called by lex when it hits end-of-file; see below */

    The .l file format consists of a mixture of lex syntax and C code fragments. The percent sign (%) is used to signify lex elements. The whole file is divided into three sections separated by %%:

    header
    %%
    body
    %%
    helper functions

    The header consists of C code fragments enclosed in %{ and %} as well as macro definitions consisting of a name and a regular expression denoted by that name. Lex macros are invoked explicitly by enclosing the macro name in curly braces. Following are some example lex macros.

    letter [a-zA-Z]
    digit  [0-9]
    ident  {letter}({letter}|{digit})*


    The body consists of a sequence of regular expressions for different token categories and other lexical entities. Each regular expression can have a C code fragment enclosed in curly braces that executes when that regular expression is matched. For most of the regular expressions this code fragment (also called a semantic action) consists of returning an integer that identifies the token category to the rest of the compiler, particularly for use by the parser to check syntax. Some typical regular expressions and semantic actions might include:

    " "      { /* no-op, discard whitespace */ }
    {ident}  { return IDENTIFIER; }
    "*"      { return ASTERISK; }
    "."      { return PERIOD; }

    You also need regular expressions for lexical errors such as unterminated character constants, or illegal characters.

    The helper functions in a lex file typically compute lexical attributes, such as the actual integer or string values denoted by literals. One helper function you have to write is yywrap(), which is called when lex hits end of file. If you just want lex to quit, have yywrap() return 1. If your yywrap() switches yyin to a different file and you want lex to continue processing, have yywrap() return 0. The lex or flex library (-ll or -lfl) has a default yywrap() function which returns a 1, and flex has the directive %option noyywrap which allows you to skip writing this function.
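    As an illustration, here is a sketch (mine, not from the notes) of a yywrap() that moves on to a next input file; the globals holding the remaining file names are hypothetical, and would be set up in main():

    #include <stdio.h>

    extern FILE *yyin;      /* lex's input stream */
    char **morefiles;       /* hypothetical: remaining file names */
    int nmorefiles;         /* hypothetical: how many remain */

    int yywrap()
    {
       if (nmorefiles > 0) {
          yyin = fopen(*morefiles, "r");
          morefiles++; nmorefiles--;
          if (yyin != NULL)
             return 0;      /* keep scanning, now from the new file */
       }
       return 1;            /* no more input; yylex() will return 0 */
    }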

    A Short Comment on Lexing C Reals


    .
        The dot operator matches any one character except newline: [^\n]
    r*
        match r 0 or more times.
    r+
        match r 1 or more times.
    r?
        match r 0 or 1 time.
    r{m,n}
        match r between m and n times.
    r1r2
        concatenation. match r1 followed by r2.
    r1|r2
        alternation. match r1 or r2.
    (r)
        parentheses specify precedence but do not match anything.
    r1/r2
        lookahead. match r1 when r2 follows, without consuming r2.
    ^r
        match r only when it occurs at the beginning of a line.
    r$
        match r only when it occurs at the end of a line.

    lecture #4 began here

    Announcements

    Next homework I promise: I will ask the TA to run your program with a nonexistent file as a command-line argument!

    Lexical Attributes and Token Objects


    Besides the token's category, the rest of the compiler may need several pieces of information about a token in order to perform semantic analysis, code generation, and error handling. These are stored in an object instance of class Token, or in C, a struct. The fields are generally something like:

    struct token {
       int category;
       char *text;
       int linenumber;
       int column;
       char *filename;
       union literal value;
    };

    The union literal will hold computed values of integers, real numbers, and strings. In your homework assignment, I am requiring you to compute column #'s; not all compilers require them, but they are easy. Also: in our compiler project we are not worrying about optimizing our use of memory, so I am not requiring you to use a union.
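    For concreteness, here is one plausible way (a sketch, not the assigned design) to fill in such a struct from inside a scanner semantic action. The helper name is made up; yylineno is a flex feature enabled with %option yylineno:

    #include <stdlib.h>
    #include <string.h>

    extern char *yytext;    /* the lexeme just matched by yylex() */
    extern int yylineno;    /* current line number, maintained by flex */

    /* hypothetical helper: build the struct token declared above for the
       current lexeme; column tracking is omitted here */
    struct token *alloctoken(int category, char *filename)
    {
       struct token *t = (struct token *)malloc(sizeof(struct token));
       if (t == NULL) exit(1);       /* out of memory */
       t->category = category;
       t->text = strdup(yytext);     /* keep our own copy of the lexeme */
       t->linenumber = yylineno;
       t->column = 0;                /* compute the real column yourself */
       t->filename = filename;
       return t;
    }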

    Flex Manpage Examplefest

    To read a UNIX "man page", or manual page, you type "man command" where command is the UNIX program or library function you need information on. Read the man page for man to learn more advanced uses ("man man").

    It turns out the flex man page is intended to be pretty complete, enough so that we can draw our examples from it. Perhaps what you should figure out from these examples is that flex is actually... flexible. The first several examples use flex as a filter from standard input to standard output.

    sneaky string removal tool:

    %%
    "zap me"

    excess whitespace trimmer:

    %%
    [ \t]+    putchar( ' ' );
    [ \t]+$   /* ignore this token */

    sneaky string substitution tool:

    %%
    username  printf( "%s", getlogin() );

    Line Counter/Word Counter:

    int num_lines = 0, num_chars = 0;
    %%
    \n    ++num_lines; ++num_chars;
    .     ++num_chars;
    %%
    main()
    {
       yylex();
       printf( "# of lines = %d, # of chars = %d\n", num_lines, num_chars );
    }

    Toy compiler example

    /* scanner for a toy Pascal-like language */

    %{
    /* need this for the call to atof() below */
    #include <math.h>
    %}

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    %%

    {DIGIT}+             { printf( "An integer: %s (%d)\n", yytext,
                                   atoi( yytext ) ); }

    {DIGIT}+"."{DIGIT}*  { printf( "A float: %s (%g)\n", yytext,
                                   atof( yytext ) ); }

    if|then|begin|end|procedure|function {
                           printf( "A keyword: %s\n", yytext ); }

    {ID}                 printf( "An identifier: %s\n", yytext );

    "+"|"-"|"*"|"/"      printf( "An operator: %s\n", yytext );

    "{"[^}\n]*"}"        /* eat up one-line comments */

    [ \t\n]+             /* eat up whitespace */

    .                    printf( "Unrecognized character: %s\n", yytext );

    %%

    main( argc, argv )
    int argc;
    char **argv;
    {
       ++argv, --argc;   /* skip over program name */
       if ( argc > 0 )
          yyin = fopen( argv[0], "r" );
       else
          yyin = stdin;
       yylex();
    }


    On the use of character sets (square brackets) in lex and similar tools

    A student recently sent me an example regular expression for comments that read:

    COMMENT [/*][[^*/]*[*]*]]*[*/]

    One problem here is that square brackets are not parentheses; they do not nest, and they do not support concatenation or other regular expression operators. They mean exactly: "match any one of these characters" or, for ^: "match any one character that is not one of these characters". Note also that you can't use ^ as a "not" operator outside of square brackets: you can't write the expression for "stuff that isn't */" by saying (^ "*/").

    lecture #5 began here

    Finite Automata

    A finite automaton (FA) is an abstract, mathematical machine, also known as a finite state machine, with the following components:

    1. A set of states S
    2. A set of input symbols E (the alphabet)
    3. A transition function move(state, symbol) : new state(s)
    4. A start state S0
    5. A set of final states F

    The word finite refers to the set of states: there is a fixed size to this machine. No "stacks", no "virtual memory", just a known number of states. The word automaton refers to the execution mode: there is no instruction set, there is no sequence of instructions, there is just a hardwired short loop that executes the same instruction over and over:

    while ((c=getchar()) != EOF)
       S := move(S, c);

    DFAs

    The type of finite automata that is easiest to understand and simplest to implement (say, even in hardware) is called a deterministic finite automaton (DFA). The word deterministic here refers to the return value of function move(state, symbol), which goes to at most one state. Example:

    S = {s0, s1, s2}
    E = {a, b, c}
    move = { (s0,a):s1; (s1,b):s2; (s2,c):s2 }
    S0 = s0
    F = {s2}

    Finite automata correspond in a 1:1 relationship to transition diagrams; from any transition diagram one can write down the formal automaton in terms of items #1-#5 above, and vice versa. To draw the transition diagram for a finite automaton:

    draw a circle for each state s in S; put a label inside the circles to identify each state by number or name

    draw an arrow between Si and Sj, labeled with x, whenever the transition says to move(Si, x) : Sj

    draw a "wedgie" into the start state S0 to identify it

    draw a second circle inside each of the final states in F

    The Automaton Game


    If I give you a transition diagram of a finite automaton, you can hand-simulate the operation of that automaton on any input I give you.

    DFA Implementation

    The nice part about DFA's is that they are efficiently implemented on computers. What DFA does the following code correspond to? What is the corresponding regular expression? You can speed this code fragment up even further if you are willing to use goto's or write it in assembler.

    state := S0
    input := getchar()
    for(;;)
       switch (state) {
       case 0:
          switch (input) {
          'a': state = 1; input = getchar(); break;
          'b': input = getchar(); break;
          default: printf("dfa error\n"); exit(1);
          }
       case 1:
          switch (input) {
          EOF: printf("accept\n"); exit(0);
          default: printf("dfa error\n"); exit(1);
          }
       }
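    The nested-switch style above hardwires the transition function into code. An alternative is to make move() an array lookup; the following runnable sketch (my illustration, not from the notes) encodes the same two states in a table:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
       int delta[2][256];         /* delta[state][input]; -1 means reject */
       int s = 0, c, i;

       for (i = 0; i < 256; i++)
          delta[0][i] = delta[1][i] = -1;
       delta[0]['a'] = 1;         /* state 0 moves to state 1 on 'a' */
       delta[0]['b'] = 0;         /* state 0 loops on 'b' */

       while ((c = getchar()) != EOF) {
          s = delta[s][c];
          if (s == -1) { printf("dfa error\n"); exit(1); }
       }
       if (s == 1) printf("accept\n");   /* end of input in the final state */
       else { printf("dfa error\n"); exit(1); }
       return 0;
    }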

    Deterministic Finite Automata Examples

    A lexical analyzer might associate different final states with different token categories:


    C Comments:

    Nondeterministic Finite Automata (NFA's)

    Notational convenience motivates more flexible machines in which function move() can go to more than one state on a given input symbol, and some states can move to other states even without consuming an input symbol (ε-transitions).

    Fortunately, one can prove that for any NFA, there is an equivalent DFA. They are just a notational convenience. So, finite automata help us get from a set of regular expressions to a computer program that recognizes them efficiently.

    NFA Examples

    ε-transitions make it simpler to merge automata:


    multiple transitions on the same symbol handle common prefixes:

    factoring may optimize the number of states. Is this picture OK/correct?


    C Pointers, malloc, and your future

    For most of you, success as a computer scientist may boil down to whether you can master the concept of dynamically allocated memory. In C this means pointers and the malloc() family of functions. Here are some tips:

    Draw "memory box" pictures of your variables. Pencil and paper understanding of memory leads to correct running programs.

    Always initialize local pointer variables. Consider this code:

    void f() {
       int i = 0;
       struct tokenlist *current, *head;
       ...
       foo(current);
    }

    Here, current is passed in as a parameter to foo, but it is a pointer that hasn't been pointed at anything. I cannot tell you how many times I personally have written bugs myself, or fixed bugs in student code, caused by reading or writing to pointers that weren't pointing at anything in particular. Local variables that weren't initialized point at random garbage. If you are lucky this is a coredump, but you might not be lucky; you might not find out where the mistake was, you might just get a wrong answer. This can all be fixed by

    struct tokenlist *current = NULL, *head = NULL;

    Avoid this common C bug:

    struct token *t = (struct token *)malloc(sizeof(struct token *));

    This compiles, but causes coredumps during program execution. Why?

    Check your malloc() return value to be sure it is not NULL. Sure, modern programs will "never run out of memory". Wrong. malloc() can return NULL even on big machines. Operating systems often place limits on memory so as to protect themselves from runaway programs or hacker attacks.

    Regular expression examples

    Can you draw an NFA corresponding to the following?

    (a|c)*b(a|c)*


    (a|c)*|(a|c)*b(a|c)*

    (a|c)*(b|ε)(a|c)*

    Regular expressions can be converted automatically to NFA's

    Each rule in the definition of regular expressions has a corresponding NFA; NFA's are composed using ε-transitions. (This is called "Thompson's construction".) We will work examples such as (a|b)*abb in class and during lab.

    1. For ε, draw two states with a single ε transition.
    2. For any letter in the alphabet, draw two states with a single transition labeled with that letter.
    3. For regular expressions r and s, draw r | s by adding a new start state with ε transitions to the start states of r and s, and a new final state with ε transitions from each final state in r and s.
    4. For regular expressions r and s, draw rs by adding ε transitions from the final states of r to the start state of s.
    5. For regular expression r, draw r* by adding new start and final states, and ε transitions
       o from the start state to the final state,
       o from the final state back to the start state,
       o from the new start to the old start and from the old final states to the new final state.
    6. For parenthesized regular expression (r) you can use the NFA for r.

    lecture #6 began here

    NFA's can be converted automatically to DFA's

    In: NFA N
    Out: DFA D
    Method: Construct transition table Dtran (a.k.a. the "move function"). Each DFA state is a set of NFA states. Dtran simulates in parallel all possible moves N can make on a given string.

    Operations to keep track of sets of NFA states:

    ε-closure(s)
        set of states reachable from state s via ε
    ε-closure(T)
        set of states reachable from any state in set T via ε
    move(T,a)
        set of states to which there is an NFA transition from states in T on symbol a


    NFA to DFA Algorithm:

    Dstates := { ε-closure(start_state) }
    while T := unmarked_member(Dstates) do {
       mark(T)
       for each input symbol a do {
          U := ε-closure(move(T,a))
          if not member(Dstates, U) then
             insert(Dstates, U)
          Dtran[T,a] := U
       }
    }
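    The ε-closure operation is the workhorse of this algorithm. Here is a small C sketch of it (my illustration, not from the notes), for an NFA small enough that a set of states fits in the bits of an unsigned int; the eps[] table, holding each state's one-step ε successors as a bitset, is an assumed input:

    /* compute the epsilon-closure of the state set T (a bitset) */
    unsigned eclosure(unsigned T, unsigned eps[], int nstates)
    {
       unsigned result = T;
       int s, changed = 1;

       while (changed) {                  /* iterate to a fixed point */
          changed = 0;
          for (s = 0; s < nstates; s++)
             if ((result & (1u << s)) && (eps[s] & ~result)) {
                result |= eps[s];         /* add s's epsilon successors */
                changed = 1;
             }
       }
       return result;
    }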

    Practice converting NFA to DFA

    OK, you've seen the algorithm, now can you use it?

    ...


    ...did you get:

    OK, how about this one:

    lecture #7 began here

    Some Remarks

    I have a collection of compiler textbooks in my office, which I will make available as "loaners" from class period to class period; all you have to do is sign a return contract in blood.

    If you checked out the class web page, you saw a solution to HW#1 was posted awhile ago... I will try to do this for future assignments also, but not immediately, so as to allow students a few days of lateness without a heavy penalty.

    Whether we return the same or a different category for integer constants and for line numbers depends very much on the grammar we use to parse our language.

    Lexical Analysis and the Literal Table

    In many compilers, the memory management components of the compiler interact with several phases of compilation, starting with lexical analysis.

    Efficient storage is necessary to handle large input files.

    There is a colossal amount of duplication in lexical data: variable names, strings and other literal values duplicate frequently.

    What token type to use may depend on previous declarations.

    A hash table or other efficient data structure can avoid this duplication. The software engineering design pattern to use is called the "flyweight".

    Major Data Structures in a Compiler

    token
        contains an integer category, lexeme, line #, column #, filename... We could build these into a linked list, but instead we'll use them as leaves in a tree structure.
    syntax tree
        contains grammar information about a sequence of related tokens. Leaves contain lexical information (tokens). Internal nodes contain grammar rules and pointers to tokens or other tree nodes.
    symbol table
        contains variable names, types, and information needed to generate code for a name (such as its address, or constant value). Lookups are by name, so we'll need a hash table.
    intermediate & final code
        we'll need linked lists or similar structures to hold sequences of machine instructions

    Literal Table: Usage Example

    Example abbreviated from [ASU86]: Figure 3.18, p. 109. Use "install_id()" instead of "strdup()" to avoid duplication in the lexical data.

    %{
    /* #define's for token categories LT, LE, etc. */
    %}

    white [ \t\n]+
    digit [0-9]
    id    [a-zA-Z_][a-zA-Z_0-9]*
    num   {digit}+(\.{digit}+)?

    %%

    {white} { /* discard */ }
    if      { return IF; }
    then    { return THEN; }
    else    { return ELSE; }
    {id}    { yylval.id = install_id(); return ID; }
    {num}   { yylval.num = install_num(); return NUMBER; }
    ">"     { yylval.op = GT; return RELOP; }

    %%

    install_id()
    {
       /* insert yytext into the literal table */
    }

    install_num()
    {
       /* insert (binary number corresponding to?) yytext into the literal table */
    }

    So how would you implement a literal table using a hash table?

    We will see more hash tables when it comes time to construct the symbol tables with which variable names and scopes are managed, so you had better become fluent.
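    One plausible answer, as a sketch (the names, the hash function, and the table size here are my own choices, not part of the notes): chain the strings into hash buckets, and have the install function return the single stored copy.

    #include <stdlib.h>
    #include <string.h>

    #define TBLSIZE 211

    struct entry { char *s; struct entry *next; };
    static struct entry *tbl[TBLSIZE];

    static unsigned hash(char *s)
    {
       unsigned h = 0;
       while (*s) h = h * 65599 + *s++;
       return h % TBLSIZE;
    }

    /* return the unique stored copy of s, inserting it on first use */
    char *install_string(char *s)
    {
       unsigned h = hash(s);
       struct entry *e;
       for (e = tbl[h]; e != NULL; e = e->next)
          if (strcmp(e->s, s) == 0) return e->s;   /* already interned */
       e = (struct entry *)malloc(sizeof(struct entry));
       e->s = strdup(s);
       e->next = tbl[h];
       tbl[h] = e;
       return e->s;
    }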

    lecture #8 began here

    Constructing your Token inside yylex()

    A student recently asked if it was OK to allocate a token structure inside main() after yylex() returns the token. This is not OK, because in the next phase of your compiler you are not calling yylex(); the automatically generated parser will call yylex(). There is a way for the parser to grab your token if you've stored it in a global variable, but there is not a way for the parser to build the token structure itself.

    Syntax Analysis

    Parsing is the act of performing syntax analysis to verify an input program's compliance with the source language. A by-product of this process is typically a tree that represents the structure of the program.

    Context Free Grammars

    A context free grammar G has:

    A set of terminal symbols, T
    A set of nonterminal symbols, N
    A start symbol, s, which is a member of N
    A set of production rules of the form A -> w, where A is a nonterminal and w is a string of terminal and nonterminal symbols

    A context free grammar can be used to generate strings in the corresponding language as follows:

    let X = the start symbol s
    while there is some nonterminal Y in X do
       apply any one production rule using Y, e.g. Y -> w

    When X consists only of terminal symbols, it is a string of the language denoted by the grammar. Each iteration of the loop is a derivation step. If an iteration has several nonterminals to choose from at some point, the rules of derivation would allow any of these to be applied. In practice, parsing algorithms tend to always choose the leftmost nonterminal, or the rightmost nonterminal, resulting in strings that are leftmost derivations or rightmost derivations.


    Context Free Grammar Examples

    Well, OK, so how much of the C language grammar can we come up with in class today? Start with expressions, work on up to statements, and from there work up to entire functions, and programs.

    lecture #9 began here

    Dr. Pontelli is looking for a web developer; did everyone see that ad? I too am looking for student research assistants.

    Grammar Ambiguity

    The grammar

    E -> E + E
    E -> E * E
    E -> ( E )
    E -> ident

    allows two different derivations for strings such as "x + y * z". The grammar is ambiguous, but the semantics of the language dictate a particular operator precedence that should be used. One way to eliminate such ambiguity is to rewrite the grammar. For example, we can force the precedence we want by adding some nonterminals and production rules.

    E -> E + T
    E -> T
    T -> T * F
    T -> F
    F -> ( E )
    F -> ident

    Given the arithmetic expression grammar from last lecture:

    How can a program figure out that x + y * z is legal?

    How can a program figure out that x + y (* z) is illegal?
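    As a worked illustration (added here, not in the original notes): the unambiguous grammar admits the leftmost derivation E => E + T => T + T => F + T => ident + T => ident + T * F => ident + F * F => ident + ident * F => ident + ident * ident, which matches x + y * z. By contrast, no production in this grammar ever places a parenthesized expression immediately after an identifier without an intervening operator, so no derivation can yield x + y (* z).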

    A brief aside on casting your mallocs

    If you don't put a prototype for malloc(), C thinks it returns an int.

    #include <stdlib.h>

    includes prototypes for malloc(), free(), etc. malloc() returns a void *.

    void * means "pointer that points at nothing", or "pointer that points at anything". You need to cast it to what you are really pointing at, as in:

    union lexval *l = (union lexval *)malloc(sizeof(union lexval));

    Note the stupid duplication of type information; no language is perfect! Anyhow, always cast your mallocs. The program may work without the cast, but you need to fix every warning, so you don't accidentally let a serious one through.

    Recursive Descent Parsing

    Perhaps the simplest parsing method, for a large subset of context free grammars, is called recursive descent. It is simple because the algorithm closely follows the production rules of nonterminal symbols.

    Write 1 procedure per nonterminal rule

    Within each procedure, a) match terminals at appropriate positions, and b) call procedures for non-terminals.

    Pitfalls:

    1. left recursion is FATAL
    2. must distinguish between several production rules, or potentially, one has to try all of them via backtracking.

    Recursive Descent Parsing Example #1

    Consider the grammar we gave above. There will be functions for E, T, and F. The function for F() is the "easiest" in some sense: based on a single token it can decide which production rule to use. The parsing functions return 0 (failed to parse) if the nonterminal in question cannot be derived from the tokens at the current point. A nonzero return value of N would indicate success in parsing using production rule #N.

    int F()
    {
       int t = yylex();
       if (t == IDENT) return 6;
       else if (t == LP) {
          if (E() && (yylex() == RP)) return 5;
          }
       return 0;
    }

    Comment #1: if F() is in the middle of a larger parse of E() or T(), F() may succeed, but the subsequent parsing may fail. The parse may have to backtrack, which would mean we'd have to be able to put tokens back for later parsing. Add a memory (say, a gigantic array or linked list, for example) of already-parsed tokens to the lexical analyzer, plus backtracking logic to E() or T() as needed. The call to F() may get repeated following a different production rule for a higher nonterminal.

    Comment #2: in a real compiler we need more than "yes it parsed" or "no it didn't": we need a parse tree if it succeeds, and we need a useful error message if it didn't.

    Question: for E() and T(), how do we know which production rule to try? Option A: just blindly try each one in turn. Option B: look at the first (current) token, and only try those rules that start with that token (1 character lookahead). If you are lucky, that one character will uniquely select a production rule. If that is always true through the whole grammar, no backtracking is needed.

    Question: how do we know which rules start with whatever token we are looking at? Can anyone suggest a solution, or are we stuck?

    lecture #10 began here

    Announcements

    Homework #3: minor extension

    Midterm exam: Thursday March 16

    The first midterm exam will cover lexical analysis and syntax analysis

    Removing Left Recursion

    E -> E + T | T
    T -> T * F | F
    F -> ( E ) | ident

    We can remove the left recursion by introducing new nonterminals and new production rules.

    E  -> T E'
    E' -> + T E' | ε
    T  -> F T'
    T' -> * F T' | ε
    F  -> ( E ) | ident

    Getting rid of such immediate left recursion is not enough; one must get rid of indirect left recursion, where two or more nonterminals are mutually left-recursive. One can rewrite any CFG to remove left recursion (Algorithm 4.1).

    for i := 1 to n do begin
       for j := 1 to i-1 do
          replace each production Ai -> Aj γ with the productions
             Ai -> δ1 γ | δ2 γ | ... | δk γ
          where Aj -> δ1 | δ2 | ... | δk are the current Aj productions
       eliminate immediate left recursion among the Ai productions
    end

    Removing Left Recursion, part 2

    Left recursion can be broken into three cases.

    case 1: trivial

    A : A α | β

    The recursion must always terminate by A finally deriving β, so you can rewrite it to the equivalent

    A  : β A'
    A' : α A' | ε

    Example:

    E : E op T | T

    can be rewritten

    E  : T E'
    E' : op T E' | ε

    case 2: non-trivial, but immediate

    In the more general case, there may be multiple recursive productions and/or multiple non-recursive productions.

    A : A α1 | A α2 | ... | β1 | β2 | ...

    As in the trivial case, you get rid of left-recursing A and introduce an A'

    A  : β1 A' | β2 A' | ...
    A' : α1 A' | α2 A' | ... | ε

    case 3: mutual recursion

    1. Order the nonterminals in some order 1 to N.
    2. Rewrite production rules to eliminate all nonterminals in leftmost positions that refer to a "previous" nonterminal. When finished, all productions' right hand sides start with a terminal or a nonterminal that is numbered equal or higher than the nonterminal on the left hand side.
    3. Eliminate the direct left recursion as per cases 1-2.


    Left Recursion Versus Right Recursion: When does it Matter?

    A student came to me once with what they described as an operator precedence problem, where 5-4+3 was computing the wrong value (-2 instead of 4). What it really was, was an associativity problem due to the grammar:

    E : T + E | T - E | T

    The problem here is that right recursion is forcing right associativity, but normal arithmetic requires left associativity. Several solutions are: (a) rewrite the grammar to be left recursive, or (b) rewrite the grammar with more nonterminals to force the correct precedence/associativity, or (c) if using YACC or Bison, there are "cheat codes" we will discuss later to allow it to be majorly ambiguous and specify associativity separately (look for %left and %right in YACC manuals).
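    As a preview of option (c), here is a sketch of what those declarations look like in a yacc grammar (an illustration of mine, not this course's assigned grammar; IDENT is assumed to be a token declared for the example):

    %token IDENT
    %left '+' '-'    /* lower precedence; both left associative */
    %left '*' '/'    /* higher precedence; both left associative */
    %%
    expr : expr '+' expr
         | expr '-' expr
         | expr '*' expr
         | expr '/' expr
         | IDENT
         ;

    The grammar itself is ambiguous, but the %left declarations tell the generated parser how to resolve every conflict, giving left associativity and the usual precedence.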

    Recursive Descent Parsing Example #2

    The grammar

    S -> A B C
    A -> a A
    A -> ε
    B -> b
    C -> c

    maps to pseudocode like the following. (:= is an assignment operator)

    procedure S()
       if A() & B() & C() then succeed   # matched S, we win
    end

    procedure A()
       if yychar == a then {   # use production 2
          yychar := scan()
          return A()
          }
       else
          succeed   # production rule 3, match ε
    end

    procedure B()
       if yychar == b then {
          yychar := scan()
          succeed
          }
       else fail
    end

    procedure C()
       if yychar == c then {
          yychar := scan()
          succeed
          }
       else fail
    end

    Backtracking?


    Could your current token begin more than one of your possible production rules? Try all of them, remembering and resetting state for each try.

    S -> c A d
    A -> a b
    A -> a

    Left factoring can often solve such problems:

    S  -> c A d
    A  -> a A'
    A' -> b
    A' -> ε

    One can also perform left factoring to reduce or eliminate the lookahead or backtracking needed to tell which production rule to use. If the end result has no lookahead or backtracking needed, the resulting CFG can be solved by a "predictive parser" and coded easily in a conventional language. If backtracking is needed, a recursive descent parser takes more work to implement, but is still feasible. As a more concrete example:

    S -> if E then S
    S -> if E then S1 else S2

    can be factored to:

    S  -> if E then S S'
    S' -> else S2 | ε

    Some More Parsing Theory


Automatic techniques for constructing parsers start with computing some basic functions for symbols in the grammar. These functions are useful in understanding both recursive descent and bottom-up LR parsers.

First(a)

First(a) is the set of terminals that begin strings derived from a, which can include ε.

1. First(X) starts with the empty set.
2. If X is a terminal, First(X) is {X}.
3. If X -> ε is a production, add ε to First(X).
4. If X is a non-terminal and X -> Y1 Y2 ... Yk is a production, add First(Y1) to First(X).
5. for (i = 1; Yi can derive ε; i++)
      add First(Yi+1) to First(X)

First(a) examples

By the way, this stuff is all in section 4.3 in your text.

Last time we looked at an example with E, T, and F, and + and *. The first-set computation was not too exciting and we need more examples.

stmt : if-stmt | OTHER
if-stmt : IF LP expr RP stmt else-part
else-part : ELSE stmt | ε
expr : IDENT | INTLIT

What are the First() sets of each nonterminal?
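Working them out with the rules above: First(expr) = {IDENT, INTLIT}; First(else-part) = {ELSE, ε}; First(if-stmt) = {IF}; and First(stmt) = First(if-stmt) plus {OTHER} = {IF, OTHER}.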


    Follow(A)

Follow(A) for nonterminal A is the set of terminals x that can appear immediately to the right of A in some sentential form aAxb derivable from the start symbol (where a and b denote strings of grammar symbols). To compute Follow, apply these rules to all nonterminals in the grammar:

1. Add $ to Follow(S), where S is the start symbol.
2. If A -> aBb is a production, then add First(b) - {ε} to Follow(B).
3. If A -> aB is a production, or A -> aBb is a production where ε is in First(b), then add Follow(A) to Follow(B).
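Applying these rules to the if-stmt grammar above (with stmt as the start symbol): Follow(expr) = {RP}. Follow(stmt) starts with {$}; the if-stmt production adds First(else-part) - {ε} = {ELSE}, and because else-part can derive ε, Follow(if-stmt) flows in as well. Iterating to a fixed point gives Follow(stmt) = Follow(if-stmt) = Follow(else-part) = {$, ELSE}.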

On resizing arrays in C

The sval attribute in homework #2 is a perfect example of a problem which a BCS major might not be expected to manage, but a CS major should be able to do by the time they graduate. This is not to encourage any of you to consider BCS, but rather, to encourage you to learn how to solve problems like these.

The problem can be summarized as: step through yytext, copying each piece out to sval, removing doublequotes and plusses between the pieces, and evaluating CHR$() constants.

Space allocated with malloc() can be increased in size by realloc(). realloc() is awesome. But, it COPIES and MOVES the old chunk of space you had to the new, resized chunk of space, and frees the old space, so you had better not have any other pointers pointing at that space if you realloc(), and you have to update your pointer to point at the new location realloc() returns.

i = 0; j = 0;
while (yytext[i] != '\0') {
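   /* a hedged sketch of one way to finish this loop: copy characters
      into sval, dropping the doublequotes and the plusses between
      pieces (the CHR$() constant handling is omitted here) */
   if (yytext[i] != '"' && yytext[i] != '+')
      sval[j++] = yytext[i];
   i++;
}
sval[j] = '\0';   /* NUL-terminate */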


char *appendstring(char *s, char c)
{
   int i = strlen(s);   /* index of the current NUL terminator */
   s = realloc(s, i+2);
   s[i] = c;
   s[i+1] = '\0';
   return s;
}

Note: it is very inefficient to grow your array one character at a time; in real life people grow arrays in large chunks at a time.
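A sketch of that chunked-growth idea (the function name and its parameters are made up for illustration): grow the buffer by doubling, so that n appends cost O(n) total copying instead of O(n^2):

char *appendchar(char *buf, int *len, int *avail, char c)
{
   if (*len + 2 > *avail) {     /* need room for c plus the NUL */
      *avail *= 2;
      buf = realloc(buf, *avail);
   }
   buf[(*len)++] = c;
   buf[*len] = '\0';
   return buf;
}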

Solution #3: use solution one and then shrink your array when you find out how big it actually needs to be.

sval = malloc(strlen(yytext)+1);
/* ... do the code copying into sval; be sure to NUL-terminate ... */
sval = realloc(sval, strlen(sval)+1);

    lecture #11 began here

    YACC


    YACC ("yet another compiler compiler") is a popular toolwhich originated at

    AT&T Bell Labs. YACC takes a context free grammar as input,and generates aparser as output. Several independent, compatibleimplementations (AT&Tyacc, Berkeley yacc, GNU Bison) for C exist, as well as manyimplementationsfor other popular languages.

YACC files end in .y and take the form

declarations
%%
grammar
%%
subroutines

The declarations section defines the terminal symbols (tokens) and nonterminal symbols. The most useful declarations are:

%token a
   declares terminal symbol a; YACC can generate a set of #define's that map these symbols onto integers, in a y.tab.h file. Note: don't #include your y.tab.h file from your grammar .y file; YACC generates the same definitions and declarations directly in the .c file, and including the .tab.h file will cause duplication errors.

%start A
   specifies the start symbol for the grammar (defaults to the nonterminal on the left side of the first production rule).

The grammar gives the production rules, interspersed with program code fragments called semantic actions that let the programmer do what's desired when the grammar productions are reduced. They follow the syntax

A : body ;

where body is a sequence of 0 or more terminals, nonterminals, or semantic actions (code, in curly braces) separated by spaces. As a notational convenience, multiple production rules may be grouped together using the vertical bar (|).
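Putting these pieces together, a minimal .y file might look like the following sketch (the token names, and the yylex() that must be supplied separately, are assumptions for illustration):

%token NUMBER PLUS
%%
expr : expr PLUS NUMBER { $$ = $1 + $3; }
     | NUMBER
     ;
%%
int main() { return yyparse(); }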

    Bottom Up Parsing


Bottom up parsers start from the sequence of terminal symbols and work their way back up to the start symbol by repeatedly replacing grammar rules' right hand sides by the corresponding non-terminal. This is the reverse of the derivation process, and is called "reduction".

    Example. For the grammar

(1) S -> aABe
(2) A -> Abc
(3) A -> b
(4) B -> d

the string "abbcde" can be parsed bottom-up by the following reduction steps:

abbcde
aAbcde
aAde
aABe
S


    Handles

Definition: a handle is a substring that

1. matches a right hand side of a production rule in the grammar, and
2. whose reduction to the nonterminal on the left hand side of that grammar rule is a step along the reverse of a rightmost derivation.

    Shift Reduce Parsing

A shift-reduce parser performs its parsing using the following structure:

Stack    Input
$        w$

At each step, the parser performs one of the following actions.

1. Shift one symbol from the input onto the parse stack.
2. Reduce one handle on the top of the parse stack. The symbols from the right hand side of a grammar rule are popped off the stack, and the nonterminal symbol is pushed on the stack in their place.
3. Accept is the operation performed when the start symbol is alone on the parse stack and the input is empty.
4. Error actions occur when no successful parse is possible.
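As a worked example, here is the sequence of moves a shift-reduce parser makes on "abbcde" with grammar (1)-(4) from the Bottom Up Parsing section:

Stack     Input      Action
$         abbcde$    shift
$a        bbcde$     shift
$ab       bcde$      reduce A -> b
$aA       bcde$      shift
$aAb      cde$       shift
$aAbc     de$        reduce A -> Abc
$aA       de$        shift
$aAd      e$         reduce B -> d
$aAB      e$         shift
$aABe     $          reduce S -> aABe
$S        $          accept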

    The YACC Value Stack

YACC's parse stack contains only "states". YACC maintains a parallel stack of values; $1, $2, etc. are used in semantic actions to name elements on the value stack.


The value yylex() leaves in yylval each time it returns to the parser will get copied over to the top of the value stack when the token is shifted onto the parse stack.

You can either declare that struct token may appear in the %union, and put a mixture of struct node and struct token on the value stack, or you can allocate a "leaf" tree node and point it at your struct token. Or you can use a tree type that allows tokens to include their lexical information directly in the tree nodes. If you have more than one %union type possible, be prepared to see type conflicts, and to declare the types of all your nonterminals.

Getting all this straight takes some time; you can plan on it. Your best bet is to draw pictures of how you want the trees to look, and then make the code match the pictures. No pictures == "Dr. J will ask to see your pictures and not be able to help if you can't describe your trees."

Declaring value stack types for terminal and nonterminal symbols


might write:

%token <tokenptr> SEMICOL
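Such a declaration presumes a %union along these lines (a sketch; the member names tokenptr and nodeptr are just examples):

%union {
   struct token *tokenptr;
   struct node *nodeptr;
}

%token <tokenptr> SEMICOL then says that the value stack entry for a SEMICOL token uses the tokenptr member; a %type <nodeptr> declaration does the same job for a nonterminal.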

    Announcements

Having trouble debugging your grammar? "bison -v" generates a .output file that gives the gory details of conflicts and such.

    lecture #12 began here

    Announcements

In honor of Dr. Jeffery's 10th anniversary, a minor extension in Homework #3.

    Conflicts in Shift-Reduce Parsing


    "Conflicts" occur when an ambiguity in the grammar creates asituationwhere the parser does not know which step to perform at a given

    pointduring parsing. There are two kinds of conflicts that occur.

    shift-reducea shift reduce conflict occurs when the grammar indicatesthat

    different successful parses might occur with either a

    shift or a reduceat a given point during parsing. The vast majority of

    situations wherethis conflict occurs can be correctly resolved by shifting.

    reduce-reducea reduce reduce conflict occurs when the parser has two ormore

    handles at the same time on the top of thestack. Whatever choice

    the parser makes is just as likely to be wrong as not. Inthis case

    it is usually best to rewrite the grammar to eliminate theconflict,

    possibly by factoring.

    Example shift reduce conflict:

S -> if E then S
S -> if E then S else S


In many languages two nested "if" statements produce a situation where an "else" clause could legally belong to either "if". The usual rule (to shift) attaches the else to the nearest (i.e. inner) if statement.

    Example reduce reduce conflict:

(1) S -> id LP plist RP
(2) S -> E GETS E
(3) plist -> plist , p
(4) plist -> p
(5) p -> id
(6) E -> id LP elist RP
(7) E -> id
(8) elist -> elist , E
(9) elist -> E

By the point the stack holds ...id LP id, the parser will not know which rule to use to reduce the id: (5) or (7).

Further Discussion of Reduce Reduce and Shift Reduce Conflicts


T : F T2 g ;
T2 : t F T2 ;
T2 : ;
F : l T r ;
F : v ;

This grammar is not much different than before, and has the same problem, but the surrounding context (the "calling environments") of F causes the grammar to have a shift-reduce instead of a reduce-reduce conflict. Once again, the trouble comes after you have seen an F, and dwells on the question of whether to reduce the epsilon production, or instead to shift, upon seeing a token g.

The .output file generated by "bison -v" explains these conflicts in considerable detail. Part of what you need to interpret them are the concepts of "items" and "sets of items" discussed below.

    YACC precedence and associativity declarations


YACC headers can specify precedence and associativity rules for otherwise heavily ambiguous grammars. Precedence is determined by increasing order of these declarations. Example:

%right ASSIGN
%left PLUS MINUS
%left TIMES DIVIDE
%right POWER
%%
expr : expr ASSIGN expr
     | expr PLUS expr
     | expr MINUS expr
     | expr TIMES expr
     | expr DIVIDE expr
     | expr POWER expr
     ;

    YACC error handling and recovery


Use the special predefined token error where errors are expected. On an error, the parser pops states until it enters one that has an action on the error token. For example:

statement : error ';' ;

The parser must see 3 good tokens before it decides it has recovered.

yyerrok tells the parser to skip the 3 token recovery rule.
yyclearin throws away the current (error-causing?) token.
yyerror(s) is called when a syntax error occurs (s is the error message).

    Improving YACC's Error Reporting

yyerror(s) overrides the default error message, which usually just says either "syntax error" or "parse error", or "stack overflow".


You can easily add information in your own yyerror() function. For example, GCC emits messages that look like:

goof.c:1: parse error before '}' token

using a yyerror function that looks like

void yyerror(char *s)
{
   fprintf(stderr, "%s:%d: %s before '%s' token\n",
           yyfilename, yylineno, s, yytext);
}

You could instead use the error recovery mechanism to produce better messages. For example:

lbrace : LBRACE | { error_code = MISSING_LBRACE; } error ;

where LBRACE is the expected token {. This uses a global variable error_code to pass parse information to yyerror().


Another related option is to call yyerror() explicitly with a better message string, and tell the parser to recover explicitly:

package_declaration: PACKAGE_TK error
   { yyerror("Missing name"); yyerrok; } ;

But using error recovery to perform better error reporting runs against conventional wisdom that you should use error tokens very sparingly. What information from the parser determined we had an error in the first place? Can we use that information to produce a better error message?

    LR Syntax Error Messages: Advanced Methods

The pieces of information that YACC/Bison use to determine that there is an error in the first place are the parse state (yystate) and the current input token (yychar). These are exactly the pieces of information one might use to produce better diagnostic error messages without relying on the error recovery mechanism and mucking up the grammar with a lot of extra production rules that feature the error token.

Even just the parse state is enough to do pretty good error messages. yystate is not part of YACC's public interface, though, so you may have to play some tricks to pass it as a parameter into yyerror() from yyparse(). Say, for example:

#define yyerror(s) __yyerror(s,yystate)

Inside __yyerror(msg, yystate) you can use a switch statement or a global array to associate messages with specific parse states. But figuring out which parse state means which syntax error message would be by trial and error.
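A hedged sketch of what such a function might look like (the state number 42 and its message are hypothetical, found by trial and error or by reading the bison .output file):

void __yyerror(char *msg, int state)
{
   switch (state) {
   case 42:   /* hypothetical state reached on a missing semicolon */
      fprintf(stderr, "%d: missing semicolon?\n", yylineno);
      break;
   default:
      fprintf(stderr, "%d: %s\n", yylineno, msg);
   }
}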

A tool called Merr is available that lets you generate this yyerror function from examples: you supply the sample syntax errors and messages, and Merr figures out which parse state integer goes with which message.


For HW3, test your work on as many test cases as possible.

Midterm Exam is coming up, March 16. Midterm review March 14. Three more lectures before that.

    LR vs. LL vs. LR(0) vs. LR(1) vs. LALR(1)

    The first char ("L") means input tokens are read from the left(left to right). The second char ("R" or "L") means parsingfinds the rightmost, or leftmost, derivation. Relevantif there is ambiguity in the grammar. (0) or (1) or (k) afterthe main lettering indicates how many lookahead characters areused. (0) means you only look at the parse stack, (1) means you

    use the current token in deciding what to do, shift or reduce.(k) means you look at the next k tokens before deciding whatto do at the current position.

    LR Parsers

LR denotes a class of bottom up parsers that is capable of handling virtually all programming language constructs. LR is efficient; it runs in linear time with no backtracking needed. The class of languages handled by LR is a proper superset of the class of languages handled by top down "predictive parsers". LR parsing detects an error as soon as it is possible to do so. Generally, building an LR parser is too big and complicated a job to do by hand; we use tools to generate LR parsers.

The LR parsing algorithm is given below.

ip = first symbol of input
repeat {
   s = state on top of parse stack
   a = *ip
   case action[s,a] of {
      SHIFT s': { push(a); push(s'); advance ip }
      REDUCE A->beta: {
         pop 2*|beta| symbols; s' = new state on top
         push A
         push goto(s', A)
         }
      ACCEPT: return 0 /* success */
      ERROR: { error("syntax error", s, a); halt }
      }
}


    Constructing SLR Parsing Tables:

Note: in Spring 2006 this material is FYI but you will not be examined on it.

Definition: An LR(0) item of a grammar G is a production of G with a dot at some position of the RHS.

Example: The production A -> aAb gives the items:

A -> . a A b
A -> a . A b
A -> a A . b
A -> a A b .

Note: A production A -> ε generates only one item:

A -> .

Intuition: an item A -> α . β denotes:

1. we have already seen a string derivable from α
2. we hope to see a string derivable from β

    Functions on Sets of Items

Closure: if I is a set of items for a grammar G, then closure(I) is the set of items constructed as follows:

1. Every item in I is in closure(I).
2. If A -> α . B β is in closure(I) and B -> γ is a production, then add B -> . γ to closure(I).

These two rules are applied repeatedly until no new items can be added.

Intuition: If A -> α . B β is in closure(I) then we hope to see a string derivable from B in the input. So if B -> γ is a production, we should hope to see a string derivable from γ. Hence, B -> . γ is in closure(I).

Goto: if I is a set of items and X is a grammar symbol, then goto(I,X) is defined to be:

goto(I,X) = closure({[A -> αX.β] | [A -> α.Xβ] is in I})

Intuition: [A -> α.Xβ] is in I means we have seen a string derivable from α, and we hope to see a string derivable from Xβ.


If β2 = ε, then A -> β1 is the handle, and we should reduce by this production.

Note: two valid items may tell us to do different things for the same viable prefix. Some of these conflicts can be resolved using lookahead on the input string.

    Constructing an SLR Parsing Table

1. Given a grammar G, construct the augmented grammar G' by adding the production S' -> S.
2. Construct C = {I0, I1, ..., In}, the set of sets of LR(0) items for G'.
3. State i is constructed from Ii, with its parsing actions determined as follows:
   o if [A -> α.aβ] is in Ii, where a is a terminal, and goto(Ii,a) = Ij: set action[i,a] = "shift j"
   o if [A -> α.] is in Ii: set action[i,a] to "reduce A -> α" for all a in FOLLOW(A), where A != S'
   o if [S' -> S.] is in Ii: set action[i,$] to "accept"
4. The goto transitions are constructed as follows: for all nonterminals A, if goto(Ii, A) = Ij, then goto[i,A] = j.
5. All entries not defined by (3) and (4) are made "error". If there are any multiply defined entries, the grammar is not SLR.
6. The initial state of the parser is the one constructed from I0, the set containing the item [S' -> .S].

    Example:

S -> aABe    FIRST(S)  = {a}    FOLLOW(S)  = {$}
A -> Abc     FIRST(A)  = {b}    FOLLOW(A)  = {b,d}
A -> b       FIRST(B)  = {d}    FOLLOW(B)  = {e}
B -> d       FIRST(S') = {a}    FOLLOW(S') = {$}

I0 = closure([S'->.S]) = closure([S'->.S], [S->.aABe])
goto(I0,S) = closure([S'->S.]) = I1
goto(I0,a) = closure([S->a.ABe])
           = closure([S->a.ABe], [A->.Abc], [A->.b]) = I2
goto(I2,A) = closure([S->aA.Be], [A->A.bc])
           = closure([S->aA.Be], [A->A.bc], [B->.d]) = I3
goto(I2,b) = closure([A->b.]) = I4
goto(I3,B) = closure([S->aAB.e]) = I5
goto(I3,b) = closure([A->Ab.c]) = I6
goto(I3,d) = closure([B->d.]) = I7
goto(I5,e) = closure([S->aABe.]) = I8
goto(I6,c) = closure([A->Abc.]) = I9

    lecture #14 began here

    On Tree Traversals


struct tree {
   int label;
   int nkids;
   struct tree *child[1]; /* array of children, size varies 0..k */
};

struct tree *alctree(int label, int nkids, ...)
{
   int i;
   va_list ap;
   struct tree *ptr = malloc(sizeof(struct tree) +
                             (nkids-1)*sizeof(struct tree *));
   if (ptr == NULL) {
      fprintf(stderr, "alctree out of memory\n");
      exit(1);
      }
   ptr->label = label;
   ptr->nkids = nkids;
   va_start(ap, nkids);
   for (i=0; i < nkids; i++)
      ptr->child[i] = va_arg(ap, struct tree *);
   va_end(ap);
   return ptr;
}

Besides a function to allocate trees, you need to write one or more recursive functions to visit each node in the tree, either top to bottom (preorder), or bottom to top (postorder). You might do many different traversals on the tree in order to write a whole compiler: check types, generate machine-independent intermediate code, analyze the code to make it shorter, etc. You can write 4 or more different traversal functions, or you can write 1 traversal function that does different work at each node, determined by passing in a function pointer, to be called for each node.

void postorder(struct tree *t, void (*f)(struct tree *))
{
   /* postorder means visit each child, then do work at the parent */
   int i;
   if (t == NULL) return;

   /* visit each child */
   for (i=0; i < t->nkids; i++)
      postorder(t->child[i], f);

   /* do work at parent */
   f(t);
}

You would then be free to write as many little helper functions as you want, for different tree traversals, for example:

void printer(struct tree *t)
{
   if (t == NULL) return;
   printf("%p: %d, %d children\n", t, t->label, t->nkids);
}
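With these pieces, printing a whole tree bottom-up is just postorder(root, printer); where root is your topmost tree node.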

    Semantic Analysis

    Semantic ("meaning") analysis refers to a phase of compilationin which theinput program is studied in order to determine what operationsare to becarried out. The two primary components of a classic semantic

    analysisphase are variable reference analysis and type checking. Thesecomponentsboth rely on an underlying symbol table.

    What we haveat the start of semantic analysis is a syntax tree

    thatcorresponds to the source program as parsed using the contextfree grammar.Semantic information is added by annotating grammar symbolswith

  • 8/10/2019 Compiler Notes - Ullman

    90/182

    semantic attributes, which are defined bysemantic rules.A semantic rule is a specification of how to calculate a semanticattribute

    that is to be added to the parse tree.

    So the input is a syntax tree...and the output is the same tree,only"fatter" in the sense that nodes carry more information.Another output of semantic analysis are error messagesdetecting manytypes of semantic errors.

Two typical examples of semantic analysis include:

variable reference analysis
   the compiler must determine, for each use of a variable, which variable declaration corresponds to that use. This depends on the semantics of the source language being translated.

type checking
   the compiler must determine, for each operation in the source code, the types of the operands and resulting value, if any.

    Notations used in semantic analysis:


inherited attributes
   attributes computed from information obtained from one's parent or siblings. These are generally harder to compute. Compilers may be able to jump through hoops to compute some inherited attributes during parsing, but depending on the semantic rules this may not be possible in general. Compilers resort to tree traversals to move semantic information around the tree to where it will be used.

    Attribute Examples

    Isconst and Value

Not all expressions have constant values; the ones that do may allow various optimizations.

CFG             Semantic Rule
E1 : E2 + T     E1.isconst = E2.isconst && T.isconst
                if (E1.isconst)
                   E1.value = E2.value + T.value


    Symbol Table Module

Symbol tables are used to resolve names within name spaces. Symbol tables are generally organized hierarchically according to the scope rules of the language. Although initially concerned with simply storing the names of the various symbols that are visible in each scope, symbol tables take on additional roles in the remaining phases of the compiler. In semantic analysis, they store type information. And for code generation, they store memory addresses and sizes of variables.

mktable(parent)
   creates a new symbol table, whose scope is local to (or inside) parent
enter(table, symbolname, type, offset)
   insert a symbol into a table
lookup(table, symbolname)
   lookup a symbol in a table; returns structure pointer including type and offset. lookup operations are often chained together progressively from most local scope on out to global scope.
addwidth(table)
   totals the widths of all the entries in the table
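A hedged C sketch of one way to organize such a module (the field names are illustrative; a real table would likely hash its entries rather than search a linked list):

struct sym_entry {
   char *name;
   struct type *type;          /* filled in during semantic analysis */
   int offset;                 /* filled in for code generation */
   struct sym_entry *next;
};

struct sym_table {
   struct sym_table *parent;   /* enclosing scope; NULL for global */
   struct sym_entry *entries;
};

/* chained lookup: search this scope, then the enclosing scopes */
struct sym_entry *lookup(struct sym_table *t, char *name)
{
   struct sym_entry *e;
   for (; t != NULL; t = t->parent)
      for (e = t->entries; e != NULL; e = e->next)
         if (strcmp(e->name, name) == 0)
            return e;
   return NULL;
}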


In order to work with your tree, you must be able to tell, preferably trivially easily, which nodes are tree leaves and which are internal nodes, and for the leaves, how to access the lexical attributes.

Options:

1. encode in the parent what the types of children are
2. encode in each child what its own type is (better)

How do you do option #2 here?

Perhaps the best approach to all this is to unify the tokens and parse tree nodes with something like the following, where perhaps an nkids value of -1 is treated as a flag that tells the reader to use lexical information instead of pointers to children:

struct node {
   int code; /* terminal or nonterminal symbol */
   int nkids;
   union {
      struct token { ... } leaf;
      struct node *kids[9];
      } u;
};


There are actually nonterminal symbols with 0 children (a nonterminal with a righthand side with 0 symbols), so you don't necessarily want to use an nkids of 0 as your flag to say that you are a leaf.

    Type Checking

Perhaps the primary component of semantic analysis in many traditional compilers consists of the type checker. In order to check types, one first must have a representation of those types (a type system), and then one must implement comparison and composition operators on those types using the semantic rules of the source language being compiled. Lastly, type checking will involve adding (mostly-)synthesized attributes through those parts of the language grammar that involve expressions and values.

    Type Systems


Types are defined recursively according to rules defined by the source language being compiled. A type system might start with rules like:

   Base types (int, char, etc.) are types
   Named types (via typedef, etc.) are types
   Types composed using other types are types, for example:
      o array(T, indices) is a type. In some languages indices always start with 0, so array(T, size) works.
      o T1 x T2 is a type (specifying, more or less, the tuple or sequence T1 followed by T2; x is a so-called cross-product operator).
      o record((f1 x T1) x (f2 x T2) x ... x (fn x Tn)) is a type
      o in languages with pointers, pointer(T) is a type
      o (T1 x ... x Tn) -> Tn+1 is a type denoting a function mapping parameter types to a return type
   In some languages type expressions may contain variables whose values are types.

In addition, a type system includes rules for assigning these types to the various parts of the program; usually this will be performed using attributes assigned to grammar symbols.
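One plausible C encoding of such recursive type expressions (a sketch; the names are made up for illustration, not prescribed by the text):

struct type {
   int basetype;   /* INT, CHAR, ARRAY, RECORD, POINTER, FUNC, ... */
   union {
      struct { struct type *elemtype; int size; } a;   /* array(T, size) */
      struct { char **fields; struct type **ftypes;
               int nfields; } r;                       /* record type */
      struct type *ptrto;                              /* pointer(T) */
      struct { struct type **params; int nparams;
               struct type *ret; } f;                  /* (T1 x ... x Tn) -> Tn+1 */
   } u;
};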

    lecture #16 began here

    Midterm Exam Review

The Midterm will cover lexical analysis, finite automata, context free grammars, syntax analysis, and parsing. Sample problems:


1. Write a regular expression for numeric quantities of U.S. money that start with a dollar sign, followed by one or more digits. Require a comma between every three digits, as in $7,321,212. Also, allow but do not require a decimal point followed by two digits at the end, as in $5.99.

2. Use Thompson's construction to write a non-deterministic finite automaton for the following regular expression, an abstraction of the expression used for real number literal values in C:

   (d+pd*|d*pd+)(ed+)?

3. Write a regular expression, or explain why you can't write a regular expression, for Modula-2 comments which use (* *) as their boundaries. Unlike C, Modula-2 comments may be nested, as in (* this is a (* nested *) comment *).

4. Write a context free grammar for the subset of C expressions that include identifiers and function calls with parameters. Parameters may themselves be function calls, as in f(g(x)), or h(a,b,i(j(k,l))).

5. What are the FIRST(E) and FOLLOW(T) in the grammar:

   E : E + T | T
   T : T * F | F
   F : ( E ) | ident

6. What is the ε-closure(move({2,4},b)) in the following NFA? That is, suppose you might be in either state 2 or 4 at the time you see a symbol b: what NFA states might you find yourself in after consuming b?

   (automata to be written on the board)

Q: What else is likely to appear on the midterm?

A: questions that allow you to demonstrate that you know the difference between a DFA and an NFA, questions about lex and flex and tokens and lexical attributes, questions about context free grammars: ambiguity, factoring, removing left recursion, etc.

    On the mysterious TYPE_NAME

The C language typedef construct is an example where all the beautiful theory we've used up to this point breaks down. Once a typedef is introduced (which can first be recognized at the syntax level), certain identifiers should be legal type names instead of identifiers. To make things worse, they are still legal variable names: the lexical analyzer has to know whether the syntactic context needs a type name or an identifier at each point in which it runs into one of the