Compiler Notes - Ullman


    Compiler Construction Lecture Notes

    Introduction

    o Lecture 1 (printable)

    Lexical Analysis

    o Lecture 2 (printable)
    o Lecture 3 (printable)
    o Lecture 4 (printable)
    o Lecture 5 (printable)
    o Lecture 6 (printable)
    o Lecture 7 (printable)

    Syntax Analysis

    o Lecture 8 (printable)
    o Lecture 9 (printable)
    o Lecture 10 (printable)
    o Lecture 11 (printable)
    o Lecture 12 (printable)
    o Lecture 13 (printable)

    Semantic Analysis

    o Lecture 14 (printable)
    o Lecture 15 (printable)
    o Lecture 16 (printable)
    o Lecture 17 (printable)

    Intermediate Code Generation

    o Lecture 18 (printable)
    o Lecture 19 (printable)
    o Lecture 20 (printable)
    o Lecture 21 (printable)
    o Lecture 22 (printable)

    Final Code Generation

    o Lecture 23 (printable)
    o Lecture 24 (printable)
    o Lecture 25 (printable)
    o Lecture 26 (printable)

    lecture #1 began here

    Why study compilers?

    Most CS students do not go on to write a commercial compiler someday, but that's not why we study compilers. We study compiler construction for the following reasons:

    Writing a compiler gives experience with large-scale applications development. Your compiler program may be the largest program you write as a student. Experience working with really big data structures and complex interactions between algorithms will help you out on your next big programming project.

    Compiler writing is one of the shining triumphs of CS theory. It demonstrates the value of theory over the impulse to just "hack up" a solution.

    Compiler writing is a basic element of programming language research. Many language researchers write compilers for the languages they design.

    Many applications have similar properties to one or more phases of a compiler, and compiler expertise and tools can help an application programmer working on other projects besides compilers.

    CS 370 is labor intensive. Famous computer scientist Dan Berry of the University of Waterloo has argued convincingly that there is no software development method for writing large programs that doesn't involve pain: pain is inevitable in software development (Berry's Theorem). From my own experience as a student, I postulate Jeffery's Corollary: there is no way to learn the skills necessary for writing big programs without pain. A good CS course includes pain, and teaches pain management and minimization.

    The questions we should ask, then, are: (a) should CS majors be required to spend a lot of time becoming really good programmers? and (b) are we providing students with the assistance and access to the tools and information they need to accomplish their goals with the minimal doses of inevitable pain that are required?

    Some Tools we will use


    Labs and lectures will discuss all of these, but if you do not know them already, the sooner you go learn them, the better.

    C and "make"
        If you are not expert with these yet, you will be a lot closer by the time you pass this class.
    lex and yacc
        These are compiler-writers' tools, but they are useful for other kinds of applications; almost anything with a complex file format to read in can benefit from them.
    gdb
        If you do not know a source-level debugger well, start learning. You will need one to survive this class.
    e-mail
        Regularly e-mailing your instructor is a crucial part of class participation. If you aren't asking questions, you aren't doing your job as a student.
    web
        This is where you get your lecture notes, homeworks, and labs, and turn in all your work.
    virtual environment
        We have a 3D video game / chat tool available that can help us handle questions when one of us is not on campus.

    Compilers - What Are They and What Kinds of Compilers are Out There?

    The purpose of a compiler is: to translate a program in some language (the source language) into a lower-level language (the target language). The compiler itself is written in some language, called the implementation language. To write a compiler you have to be very good at programming in the implementation language, and have to think about and understand the source language and target language.

    There are several major kinds of compilers:

    Native Code Compiler
        Translates source code into hardware (assembly or machine code) instructions. Example: gcc.
    Virtual Machine Compiler
        Translates source code into an abstract machine code, for execution by a virtual machine interpreter. Example: javac.
    JIT Compiler
        Translates virtual machine code to native code. Operates within a virtual machine. Example: Sun's HotSpot Java virtual machine.
    Preprocessor
        Translates source code into simpler or slightly lower-level source code, for compilation by another compiler. Examples: cpp, m4.
    Pure Interpreter
        Executes source code on the fly, without generating machine code. Example: Lisp.

    Phases of a Compiler

    Lexical Analysis:
        Converts a sequence of characters into words, or tokens.
    Syntax Analysis:
        Converts a sequence of tokens into a parse tree.
    Semantic Analysis:
        Manipulates the parse tree to verify symbol and type information.
    Intermediate Code Generation:
        Converts the parse tree into a sequence of intermediate code instructions.
    Optimization:
        Manipulates intermediate code to produce a more efficient program.
    Final Code Generation:
        Translates intermediate code into final (machine/assembly) code.

    Example of the Compilation Process

    Consider the example statement; its translation to machine code illustrates some of the issues involved in compiling.

    position = initial + rate * 60

    30 or so characters, from a single line of source code, are first transformed by lexical analysis into a sequence of 7 tokens. Those tokens are then used to build a tree of height 4 during syntax analysis. Semantic analysis may transform the tree into one of height 5, that includes a type conversion necessary for real addition on an integer operand. Intermediate code generation uses a simple traversal algorithm to linearize the tree back into a sequence of machine-independent three-address-code instructions.

    t1 = inttoreal(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3


    Optimization of the intermediate code allows the four instructions to be reduced to two machine-independent instructions. Final code generation might implement these two instructions using 5 machine instructions, in which the actual registers and addressing modes of the CPU are utilized.

    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1

    lecture #2 began here

    Announcements

    Reading!

    I hope you have already been reading! Make sure you read the class lecture notes, the related sections of the text, and please ask questions about whatever is not totally clear. You can ask questions in class, via e-mail, in the virtual environment, or on the class message board.

    Note: although last year's CS 370 lecture notes are ALL available to you up front, I generally revise each lecture's notes, making additions, corrections and adaptations to this year's homeworks, the night before each lecture. The best time to print hard copies of the lecture notes is one day at a time, right before the lecture is given.

    Overview of Lexical Analysis


    A lexical analyzer, also called a scanner, typically has the following functionality and characteristics.

    Its primary function is to convert from a (often very long) sequence of characters into a (much shorter, perhaps 10X shorter) sequence of tokens. This means less work for subsequent phases of the compiler.

    The scanner must identify and categorize specific character sequences into tokens. It must know whether every two adjacent characters in the file belong together in the same token, or whether the second character must be in a different token.

    Most lexical analyzers discard comments and whitespace. In most languages these characters serve to separate tokens from each other, but once lexical analysis is completed they serve no purpose. On the other hand, the exact line number and/or column number may be useful in reporting errors, so some record of what whitespace has occurred may be retained. Note: in some languages, even popular ones, whitespace is significant.

    The scanner must handle lexical errors (illegal characters, malformed tokens) by reporting them intelligibly to the user.

    Efficiency is crucial; a scanner may perform elaborate input buffering.

    Token categories can be (precisely, formally) specified using regular expressions, e.g. IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*

    Lexical analyzers can be written by hand, or implemented automatically using finite automata.

    What is a "token" ?


    In compilers, a "token" is:

    1. a single word of source code input (a.k.a. "lexeme")
    2. an integer code that refers to a single word of input
    3. a set of lexical attributes computed from a single word of input

    Programmers think about all this in terms of #1. Syntax checking uses #2. Error reporting, semantic analysis, and code generation require #3. In a compiler written in C, you allocate a C struct to store #3 for each token.

    Worth Mentioning

    Here are the names of several important tools closely related to compilers. You should learn those of these terms that you don't already know.

    interpreter
        a language processor program that translates and executes source code directly, without compiling it to machine code.
    assembler
        a translator from human-readable (ASCII text) files of machine instructions into the actual binary code (object files) of a machine.
    linker
        a program that combines (multiple) object files to make an executable. Converts names of variables and functions to numbers (machine addresses).
    loader
        a program to load code. On some systems, different executables start at different base addresses, so the loader must patch the executable with the actual base address of the executable.


    preprocessor
        a program that processes the source code before the compiler sees it. Usually, it implements macro expansion, but it can do much more.
    editor
        Editors may operate on plain text, or they may be wired into the rest of the compiler, highlighting syntax errors as you go, or allowing you to insert or delete entire syntax constructs at a time.
    debugger
        a program to help you see what's going on when your program runs. Can print the values of variables, show what procedure called what procedure to get where you are, run up to a particular line, run until a particular variable gets a special value, etc.
    profiler
        a program to help you see where your program is spending its time, so you can tell where you need to speed it up.

    Auxiliary data structures

    You were presented with the phases of the compiler, from lexical and syntax analysis, through semantic analysis, and intermediate and final code generation. Each phase has an input and an output to the next phase. But there are a few data structures we will build that survive across multiple phases: the literal table, the symbol table, and the error handler.

    lexeme table
        a table that stores lexeme values, such as strings and variable names, that may occur in many places. Only one copy of each unique string and name needs to be allocated in memory.

    symbol table
        a table that stores the names defined (and visible within) each particular scope. Scopes include: global, and procedure (local). More advanced languages have more scopes such as class (or record) and package.
    error handler
        errors in lexical, syntax, or semantic analysis all need a common reporting mechanism, that shows where the error occurred (filename, line number, and maybe column number are useful).

    Reading Named Files in C using stdio

    In this class you are opening and reading files. Hopefully this is review for you; if not, you will need to learn it quickly. To do any "standard I/O" file processing, you start by including the header:

    #include <stdio.h>

    This defines a data type (FILE *) and gives prototypes for relevant functions. The following code opens a file using a string filename, and reads the first character (into an int variable, not a char, so that it can detect end-of-file; EOF is not a legal char value).

    FILE *f = fopen(filename, "r");
    int i = fgetc(f);
    if (i == EOF) /* empty file... */
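    Fleshing that fragment out, here is a minimal complete sketch (mine, not from the original notes; the function name is made up) that also checks whether fopen() failed, which the fragment above glosses over:

    #include <stdio.h>

    /* count the characters in a named file; returns -1 if it can't be opened */
    int countchars(char *filename)
    {
       FILE *f = fopen(filename, "r");
       int i, n = 0;

       if (f == NULL) return -1;
       while ((i = fgetc(f)) != EOF)   /* int, not char, so EOF is detectable */
          n++;
       fclose(f);
       return n;                       /* 0 means the file was empty */
    }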

    Command line argument handling and file processing in C


    The following example is from Kernighan & Ritchie's "The C Programming Language", page 162.

    #include <stdio.h>

    /* cat: concatenate files, version 1 */
    int main(int argc, char *argv[])
    {
       FILE *fp;
       void filecopy(FILE *, FILE *);

       if (argc == 1)   /* no args; copy standard input */
          filecopy(stdin, stdout);
       else
          while (--argc > 0)
             if ((fp = fopen(*++argv, "r")) == NULL) {
                printf("cat: can't open %s\n", *argv);
                return 1;
                }
             else {
                filecopy(fp, stdout);
                fclose(fp);
                }
       return 0;
    }

    /* filecopy: copy file ifp to file ofp */
    void filecopy(FILE *ifp, FILE *ofp)
    {
       int c;

       while ((c = getc(ifp)) != EOF)
          putc(c, ofp);
    }

    Warning: while using and adapting the above code is fair game in this class, the yylex() function is very different than the filecopy() function! It takes no parameters! It returns an integer every time it finds a token! So if you "borrow" from this example, delete filecopy() and write yylex() from scratch. Multiple students have fallen into this trap before you.

    A Brief Introduction to Make

    It is not a good idea to write a large program like a compiler as a single source file. For one thing, every time you make a small change, you would need to recompile the whole program, which will end up being many thousands of lines. For another thing, parts of your compiler may be generated by "compiler construction tools" which will write separate files. In any case, this class will require you to use multiple source files, compiled separately, and linked together to form your executable program.

    This would be a pain, except we have "make" which takes care of it for us. Make uses an input file named "makefile", which stores in ASCII text form a collection of rules for how to build a program from its pieces. Each rule shows how to build a file from its source files, or dependencies. For example, to compile a file under C:

    foo.o : foo.c
    	gcc -c foo.c

    The first line says that to build foo.o you need foo.c, and the second line, which must begin with a tab, gives a command line to execute whenever foo.o should be rebuilt, i.e. when it is missing or when foo.c has been changed and needs to be recompiled.

    The first rule in the makefile is what "make" builds by default, but note that make dependencies are recursive: before it checks whether it needs to rebuild foo.o from foo.c, it will check whether foo.c needs to be rebuilt using some other rule. Because of this post-order traversal of the "dependency graph", the first rule in your makefile is usually the last one that executes when you type "make". For a C program, the first rule in your makefile would usually be the "link" step that assembles object files into an executable, as in:

    compiler: foo.o bar.o baz.o
    	gcc -o compiler foo.o bar.o baz.o

    There is a lot more to "make" but we will take it one step at a time. This article on Make may be useful to you. You can find other useful on-line documentation on "make" (manual page, Internet reference guides, etc.) if you look.
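    Putting the two kinds of rules together, a complete makefile for a small three-file project might look like the following sketch (the file names foo.c, bar.c, baz.c are placeholders, not part of this course's assignments; each command line must begin with a tab):

    compiler: foo.o bar.o baz.o
    	gcc -o compiler foo.o bar.o baz.o

    foo.o: foo.c
    	gcc -c foo.c

    bar.o: bar.c
    	gcc -c bar.c

    baz.o: baz.c
    	gcc -c baz.c

    clean:
    	rm -f compiler foo.o bar.o baz.o

    Typing "make" builds the compiler executable, because the link rule comes first and is therefore the default target; "make clean" runs the last rule by name.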

    A couple finer points for HW#1

    extern vs. #include: when do you use the one, when the other?

    public interface to yylex(): no, you can't add your own parameters

    Regular Expressions

    The notation we use to precisely capture all the variations that a given category of token may take is called "regular expressions" (or, less formally, "patterns"; the word "pattern" is really vague, and there are lots of other notations for patterns besides regular expressions). Regular expressions are a shorthand notation for sets of strings. In order to even talk about "strings" you have to first define an alphabet, the set of characters which can appear.

    1. Epsilon (ε) is a regular expression denoting the set containing the empty string.
    2. Any letter in the alphabet is also a regular expression, denoting the set containing a one-letter string consisting of that letter.

    3. For regular expressions r and s, r | s is a regular expression denoting the union of r and s.
    4. For regular expressions r and s, r s is a regular expression denoting the set of strings consisting of a member of r followed by a member of s.
    5. For regular expression r, r* is a regular expression denoting the set of strings consisting of zero or more occurrences of r.
    6. You can parenthesize a regular expression to specify operator precedence (otherwise, alternation is like plus, concatenation is like times, and closure is like exponentiation).

    Although these operators are sufficient to describe all regular languages, in practice everybody uses extensions:

    For regular expression r, r+ is a regular expression denoting the set of strings consisting of one or more occurrences of r. Equivalent to rr*.

    For regular expression r, r? is a regular expression denoting the set of strings consisting of zero or one occurrence of r. Equivalent to r|ε.

    The notation [abc] is short for a|b|c. [a-z] is short for a|b|...|z. [^abc] is short for: any character other than a, b, or c.
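    For example (an added illustration), the IDENTIFIER pattern [a-zA-Z][a-zA-Z0-9]* from earlier reads: one letter, then zero or more letters or digits. Using only the core operators it would have to be spelled out as (a|b|...|z|A|B|...|Z)(a|b|...|z|A|B|...|Z|0|1|...|9)*, which is exactly what the character-set shorthand abbreviates.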

    lecture #3 began here

    What is a "lexical attribute" ?

    A lexical attribute is a piece of information about a token. These typically include:

    category
        an integer code used to check syntax
    lexeme
        the actual string contents of the token
    line, column, file
        where the lexeme occurs in the source code
    value
        for literals, the binary data they represent

    Homework #2

    Avoid These Common Bugs in Your Homeworks!

    1. yytext or yyinput were not declared global
    2. main() does not have its required argc, argv parameters!
    3. main() does not call yylex() in a loop or check its return value
    4. getc() EOF handling is missing or wrong! Check EVERY call to getc() for EOF!
    5. opened files not (all) closed! file handle leak!
    6. end-of-comment code doesn't check for */
    7. yylex() is not doing the file reading
    8. yylex() does not skip multiple spaces, mishandles spaces at the front of input, or requires certain spaces in order to function OK
    9. extra or bogus output not in assignment spec
    10. = instead of ==

    Some Regular Expression Examples

    In a previous lecture we saw regular expressions, the preferred notation for specifying patterns of characters that define token categories. The best way to get a feel for regular expressions is to see examples. Note that regular expressions form the basis for pattern matching in many UNIX tools such as grep, awk, perl, etc.

    What is the regular expression for each of the different lexical items that appear in C programs? How does this compare with another, possibly simpler programming language such as BASIC?

    lexical category: operators
        BASIC: the characters themselves
        C: For operators that are regular expression operators we need to mark them with double quotes or backslashes to indicate you mean the character, not the regular expression operator. Note several operators have a common prefix. The lexical analyzer needs to look ahead to tell whether an = is an assignment, or is followed by another =, for example.

    lexical category: reserved words
        BASIC: the concatenation of characters; case insensitive
        C: Reserved words are also matched by the regular expression for identifiers, so a disambiguating rule is needed.

    lexical category: identifiers
        BASIC: no _; $ at ends of some; 2 significant letters!?; case insensitive
        C: [a-zA-Z_][a-zA-Z0-9]*

    lexical category: numbers
        BASIC: ints and reals, starting with [0-9]+
        C: 0x[0-9a-fA-F]+ etc.

    lexical category: comments
        BASIC: REM.*
        C: C's comments are tricky regexp's

    lexical category: strings
        BASIC: almost ".*"; no escapes
        C: escaped quotes

    what else?

    lex(1) and flex(1)

    These programs generally take a lexical specification given in a .l file and create a corresponding C language lexical analyzer in a file named lex.yy.c. The lexical analyzer is then linked with the rest of your compiler.

    The C code generated by lex has the following public interface. Note the use of global variables instead of parameters, and the use of the prefix yy to distinguish scanner names from your program names. This prefix is also used in the YACC parser generator.


    FILE *yyin;     /* set this variable prior to calling yylex() */
    int yylex();    /* call this function once for each token */
    char yytext[];  /* yylex() writes the token's lexeme to an array */
                    /* note: with flex, I believe extern declarations must read:
                       extern char *yytext; */
    int yywrap();   /* called by lex when it hits end-of-file; see below */

    The .l file format consists of a mixture of lex syntax and C code fragments. The percent sign (%) is used to signify lex elements. The whole file is divided into three sections separated by %%:

    header
    %%
    body
    %%
    helper functions

    The header consists of C code fragments enclosed in %{ and %} as well as macro definitions consisting of a name and a regular expression denoted by that name. Lex macros are invoked explicitly by enclosing the macro name in curly braces. Following are some example lex macros.

    letter [a-zA-Z]
    digit  [0-9]
    ident  {letter}({letter}|{digit})*


    The body consists of a sequence of regular expressions for different token categories and other lexical entities. Each regular expression can have a C code fragment enclosed in curly braces that executes when that regular expression is matched. For most of the regular expressions this code fragment (also called a semantic action) consists of returning an integer that identifies the token category to the rest of the compiler, particularly for use by the parser to check syntax. Some typical regular expressions and semantic actions might include:

    " "      { /* no-op, discard whitespace */ }
    {ident}  { return IDENTIFIER; }
    "*"      { return ASTERISK; }
    "."      { return PERIOD; }

    You also need regular expressions for lexical errors such as unterminated character constants, or illegal characters.

    The helper functions in a lex file typically compute lexical attributes, such as the actual integer or string values denoted by literals. One helper function you have to write is yywrap(), which is called when lex hits end of file. If you just want lex to quit, have yywrap() return 1. If your yywrap() switches yyin to a different file and you want lex to continue processing, have yywrap() return 0. The lex or flex library (-ll or -lfl) has a default yywrap() function which returns a 1, and flex has the directive %option noyywrap which allows you to skip writing this function.
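    As an illustration, here is a sketch (mine, not from the notes) of a yywrap() that moves on to a next input file; the globals holding the remaining file names are hypothetical, and would be set up in main():

    #include <stdio.h>

    extern FILE *yyin;      /* lex's input stream */
    char **morefiles;       /* hypothetical: remaining file names */
    int nmorefiles;         /* hypothetical: how many remain */

    int yywrap()
    {
       if (nmorefiles > 0) {
          yyin = fopen(*morefiles, "r");
          morefiles++; nmorefiles--;
          if (yyin != NULL)
             return 0;      /* keep scanning, now from the new file */
       }
       return 1;            /* no more input; yylex() will return 0 */
    }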

    A Short Comment on Lexing C Reals


    .
        The dot operator matches any one character except newline: [^\n]
    r*
        match r 0 or more times.
    r+
        match r 1 or more times.
    r?
        match r 0 or 1 time.
    r{m,n}
        match r between m and n times.
    r1r2
        concatenation. match r1 followed by r2.
    r1|r2
        alternation. match r1 or r2.
    (r)
        parentheses specify precedence but do not match anything.
    r1/r2
        lookahead. match r1 when r2 follows, without consuming r2.
    ^r
        match r only when it occurs at the beginning of a line.
    r$
        match r only when it occurs at the end of a line.

    lecture #4 began here

    Announcements

    Next homework I promise: I will ask the TA to run your program with a nonexistent file as a command-line argument!

    Lexical Attributes and Token Objects


    Besides the token's category, the rest of the compiler may need several pieces of information about a token in order to perform semantic analysis, code generation, and error handling. These are stored in an object instance of class Token, or in C, a struct. The fields are generally something like:

    struct token {
       int category;
       char *text;
       int linenumber;
       int column;
       char *filename;
       union literal value;
    };

    The union literal will hold computed values of integers, real numbers, and strings. In your homework assignment, I am requiring you to compute column #'s; not all compilers require them, but they are easy. Also: in our compiler project we are not worrying about optimizing our use of memory, so I am not requiring you to use a union.
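    For concreteness, here is one plausible way (a sketch, not the assigned design) to fill in such a struct from inside a scanner semantic action. The helper name is made up; yylineno is a flex feature enabled with %option yylineno:

    #include <stdlib.h>
    #include <string.h>

    extern char *yytext;    /* the lexeme just matched by yylex() */
    extern int yylineno;    /* current line number, maintained by flex */

    /* hypothetical helper: build the struct token declared above for the
       current lexeme; column tracking is omitted here */
    struct token *alloctoken(int category, char *filename)
    {
       struct token *t = (struct token *)malloc(sizeof(struct token));
       if (t == NULL) exit(1);       /* out of memory */
       t->category = category;
       t->text = strdup(yytext);     /* keep our own copy of the lexeme */
       t->linenumber = yylineno;
       t->column = 0;                /* compute the real column yourself */
       t->filename = filename;
       return t;
    }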

    Flex Manpage Examplefest

    To read a UNIX "man page", or manual page, you type "man command" where command is the UNIX program or library function you need information on. Read the man page for man to learn more advanced uses ("man man").

    It turns out the flex man page is intended to be pretty complete, enough so that we can draw our examples from it. Perhaps what you should figure out from these examples is that flex is actually... flexible. The first several examples use flex as a filter from standard input to standard output.

    sneaky string removal tool:

    %%
    "zap me"

    excess whitespace trimmer:

    %%
    [ \t]+    putchar( ' ' );
    [ \t]+$   /* ignore this token */

    sneaky string substitution tool:

    %%
    username  printf( "%s", getlogin() );

    Line Counter/Word Counter:

    int num_lines = 0, num_chars = 0;
    %%
    \n    ++num_lines; ++num_chars;
    .     ++num_chars;
    %%
    main()
    {
       yylex();
       printf( "# of lines = %d, # of chars = %d\n", num_lines, num_chars );
    }

    Toy compiler example

    /* scanner for a toy Pascal-like language */

    %{
    /* need this for the call to atof() below */
    #include <math.h>
    %}

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    %%

    {DIGIT}+             { printf( "An integer: %s (%d)\n", yytext,
                                   atoi( yytext ) ); }

    {DIGIT}+"."{DIGIT}*  { printf( "A float: %s (%g)\n", yytext,
                                   atof( yytext ) ); }

    if|then|begin|end|procedure|function {
                           printf( "A keyword: %s\n", yytext ); }

    {ID}                 printf( "An identifier: %s\n", yytext );

    "+"|"-"|"*"|"/"      printf( "An operator: %s\n", yytext );

    "{"[^}\n]*"}"        /* eat up one-line comments */

    [ \t\n]+             /* eat up whitespace */

    .                    printf( "Unrecognized character: %s\n", yytext );

    %%

    main( argc, argv )
    int argc;
    char **argv;
    {
       ++argv, --argc;   /* skip over program name */
       if ( argc > 0 )
          yyin = fopen( argv[0], "r" );
       else
          yyin = stdin;
       yylex();
    }


    On the use of character sets (square brackets) in lex and similar tools

    A student recently sent me an example regular expression for comments that read:

    COMMENT [/*][[^*/]*[*]*]]*[*/]

    One problem here is that square brackets are not parentheses; they do not nest, and they do not support concatenation or other regular expression operators. They mean exactly: "match any one of these characters" or, for ^: "match any one character that is not one of these characters". Note also that you can't use ^ as a "not" operator outside of square brackets: you can't write the expression for "stuff that isn't */" by saying (^ "*/").

    lecture #5 began here

    Finite Automata

    A finite automaton (FA) is an abstract, mathematical machine, also known as a finite state machine, with the following components:

    1. A set of states S
    2. A set of input symbols E (the alphabet)
    3. A transition function move(state, symbol) : new state(s)
    4. A start state S0
    5. A set of final states F

    The word finite refers to the set of states: there is a fixed size to this machine. No "stacks", no "virtual memory", just a known number of states. The word automaton refers to the execution mode: there is no instruction set, there is no sequence of instructions, there is just a hardwired short loop that executes the same instruction over and over:

    while ((c=getchar()) != EOF)
       S := move(S, c);

    DFAs

    The type of finite automata that is easiest to understand and simplest to implement (say, even in hardware) is called a deterministic finite automaton (DFA). The word deterministic here refers to the return value of function move(state, symbol), which goes to at most one state. Example:

    S = {s0, s1, s2}
    E = {a, b, c}
    move = { (s0,a):s1; (s1,b):s2; (s2,c):s2 }
    S0 = s0
    F = {s2}

    Finite automata correspond in a 1:1 relationship to transition diagrams; from any transition diagram one can write down the formal automaton in terms of items #1-#5 above, and vice versa. To draw the transition diagram for a finite automaton:

    draw a circle for each state s in S; put a label inside the circles to identify each state by number or name

    draw an arrow between Si and Sj, labeled with x, whenever the transition says to move(Si, x) : Sj

    draw a "wedgie" into the start state S0 to identify it

    draw a second circle inside each of the final states in F

    The Automaton Game


    If I give you a transition diagram of a finite automaton, you can hand-simulate the operation of that automaton on any input I give you.

    DFA Implementation

    The nice part about DFA's is that they are efficiently implemented on computers. What DFA does the following code correspond to? What is the corresponding regular expression? You can speed this code fragment up even further if you are willing to use goto's or write it in assembler.

    state := S0
    input := getchar()
    for(;;)
       switch (state) {
       case 0:
          switch (input) {
          'a': state = 1; input = getchar(); break;
          'b': input = getchar(); break;
          default: printf("dfa error\n"); exit(1);
          }
       case 1:
          switch (input) {
          EOF: printf("accept\n"); exit(0);
          default: printf("dfa error\n"); exit(1);
          }
       }
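    The nested-switch style above hardwires the transition function into code. An alternative is to make move() an array lookup; the following runnable sketch (my illustration, not from the notes) encodes the same two states in a table:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
       int delta[2][256];         /* delta[state][input]; -1 means reject */
       int s = 0, c, i;

       for (i = 0; i < 256; i++)
          delta[0][i] = delta[1][i] = -1;
       delta[0]['a'] = 1;         /* state 0 moves to state 1 on 'a' */
       delta[0]['b'] = 0;         /* state 0 loops on 'b' */

       while ((c = getchar()) != EOF) {
          s = delta[s][c];
          if (s == -1) { printf("dfa error\n"); exit(1); }
       }
       if (s == 1) printf("accept\n");   /* end of input in the final state */
       else { printf("dfa error\n"); exit(1); }
       return 0;
    }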

    Deterministic Finite Automata Examples

    A lexical analyzer might associate different final states with different token categories:


    C Comments:

    Nondeterministic Finite Automata (NFA's)

    Notational convenience motivates more flexible machines in which function move() can go to more than one state on a given input symbol, and some states can move to other states even without consuming an input symbol (ε-transitions).

    Fortunately, one can prove that for any NFA, there is an equivalent DFA. They are just a notational convenience. So, finite automata help us get from a set of regular expressions to a computer program that recognizes them efficiently.

    NFA Examples

    ε-transitions make it simpler to merge automata:


    multiple transitions on the same symbol handle common prefixes:

    factoring may optimize the number of states. Is this picture OK/correct?


    C Pointers, malloc, and your future

    For most of you, success as a computer scientist may boil down to whether you can master the concept of dynamically allocated memory. In C this means pointers and the malloc() family of functions. Here are some tips:

    Draw "memory box" pictures of your variables. Pencil and paper understanding of memory leads to correct running programs.

    Always initialize local pointer variables. Consider this code:

    void f() {
       int i = 0;
       struct tokenlist *current, *head;
       ...
       foo(current);
    }

    Here, current is passed in as a parameter to foo, but it is a pointer that hasn't been pointed at anything. I cannot tell you how many times I personally have written bugs myself, or fixed bugs in student code, caused by reading or writing to pointers that weren't pointing at anything in particular. Local variables that weren't initialized point at random garbage. If you are lucky this is a coredump, but you might not be lucky; you might not find out where the mistake was, you might just get a wrong answer. This can all be fixed by

    struct tokenlist *current = NULL, *head = NULL;

    Avoid this common C bug:

    struct token *t = (struct token *)malloc(sizeof(struct token *));

    This compiles, but causes coredumps during program execution. Why?

    Check your malloc() return value to be sure it is not NULL. Sure, modern programs will "never run out of memory". Wrong. malloc() can return NULL even on big machines. Operating systems often place limits on memory so as to protect themselves from runaway programs or hacker attacks.

    Regular expression examples

    Can you draw an NFA corresponding to the following?

    (a|c)*b(a|c)*


    (a|c)*|(a|c)*b(a|c)*

    (a|c)*(b|ε)(a|c)*

    Regular expressions can be converted automatically to NFA's

    Each rule in the definition of regular expressions has a corresponding NFA; NFA's are composed using ε-transitions. (This is called "Thompson's construction".) We will work examples such as (a|b)*abb in class and during lab.

    1. For ε, draw two states with a single ε transition.
    2. For any letter in the alphabet, draw two states with a single transition labeled with that letter.
    3. For regular expressions r and s, draw r | s by adding a new start state with ε transitions to the start states of r and s, and a new final state with ε transitions from each final state in r and s.
    4. For regular expressions r and s, draw rs by adding ε transitions from the final states of r to the start state of s.
    5. For regular expression r, draw r* by adding new start and final states, and ε transitions
       o from the start state to the final state,
       o from the final state back to the start state,
       o from the new start to the old start and from the old final states to the new final state.
    6. For parenthesized regular expression (r) you can use the NFA for r.

    lecture #6 began here

    NFA's can be converted automatically to DFA's

    In: NFA N
    Out: DFA D
    Method: Construct transition table Dtran (a.k.a. the "move function"). Each DFA state is a set of NFA states. Dtran simulates in parallel all possible moves N can make on a given string.

    Operations to keep track of sets of NFA states:

    ε-closure(s)
        set of states reachable from state s via ε
    ε-closure(T)
        set of states reachable from any state in set T via ε
    move(T,a)
        set of states to which there is an NFA transition from states in T on symbol a


    NFA to DFA Algorithm:

    Dstates := { ε-closure(start_state) }
    while T := unmarked_member(Dstates) do {
       mark(T)
       for each input symbol a do {
          U := ε-closure(move(T,a))
          if not member(Dstates, U) then
             insert(Dstates, U)
          Dtran[T,a] := U
       }
    }
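    The ε-closure operation is the workhorse of this algorithm. Here is a small C sketch of it (my illustration, not from the notes), for an NFA small enough that a set of states fits in the bits of an unsigned int; the eps[] table, holding each state's one-step ε successors as a bitset, is an assumed input:

    /* compute the epsilon-closure of the state set T (a bitset) */
    unsigned eclosure(unsigned T, unsigned eps[], int nstates)
    {
       unsigned result = T;
       int s, changed = 1;

       while (changed) {                  /* iterate to a fixed point */
          changed = 0;
          for (s = 0; s < nstates; s++)
             if ((result & (1u << s)) && (eps[s] & ~result)) {
                result |= eps[s];         /* add s's epsilon successors */
                changed = 1;
             }
       }
       return result;
    }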

    Practice converting NFA to DFA

    OK, you've seen the algorithm, now can you use it?

    ...


    ...did you get:

    OK, how about this one:

    lecture #7 began here

    Some Remarks

    I have a collection of compiler textbooks in my office, which I will make available as "loaners" from class period to class period; all you have to do is sign a return contract in blood.

    If you checked out the class web page, you saw a solution to HW#1 was posted awhile ago... I will try to do this for future assignments also, but not immediately, so as to allow students a few days of lateness without a heavy penalty.

    Whether we return the same or a different category for integer constants and for line numbers depends very much on the grammar we use to parse our language.

    Lexical Analysis and the Literal Table

    In many compilers, the memory management components of the compiler interact with several phases of compilation, starting with lexical analysis.

    Efficient storage is necessary to handle large input files.

    There is a colossal amount of duplication in lexical data: variable names, strings and other literal values duplicate frequently.

    What token type to use may depend on previous declarations.

    A hash table or other efficient data structure can avoid this duplication. The software engineering design pattern to use is called the "flyweight".

    Major Data Structures in a Compiler

    token
        contains an integer category, lexeme, line #, column #, filename... We could build these into a linked list, but instead we'll use them as leaves in a tree structure.
    syntax tree
        contains grammar information about a sequence of related tokens. Leaves contain lexical information (tokens). Internal nodes contain grammar rules and pointers to tokens or other tree nodes.
    symbol table
        contains variable names, types, and information needed to generate code for a name (such as its address, or constant value). Lookups are by name, so we'll need a hash table.
    intermediate & final code
        we'll need linked lists or similar structures to hold sequences of machine instructions

    Literal Table: Usage Example

    Example abbreviated from [ASU86]: Figure 3.18, p. 109. Use "install_id()" instead of "strdup()" to avoid duplication in the lexical data.

    %{
    /* #define's for token categories LT, LE, etc. */
    %}

    white [ \t\n]+
    digit [0-9]
    id    [a-zA-Z_][a-zA-Z_0-9]*
    num   {digit}+(\.{digit}+)?

    %%

    {white} { /* discard */ }
    if      { return IF; }
    then    { return THEN; }
    else    { return ELSE; }
    {id}    { yylval.id = install_id(); return ID; }
    {num}   { yylval.num = install_num(); return NUMBER; }
    ">"     { yylval.op = GT; return RELOP; }

    %%

    install_id()
    {
       /* insert yytext into the literal table */
    }

    install_num()
    {
       /* insert (binary number corresponding to?) yytext into the literal table */
    }

    So how would you implement a literal table using a hash table?

    We will see more hash tables when it comes time to construct the symbol tables with which variable names and scopes are managed, so you had better become fluent.
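    One plausible answer, as a sketch (the names, the hash function, and the table size here are my own choices, not part of the notes): chain the strings into hash buckets, and have the install function return the single stored copy.

    #include <stdlib.h>
    #include <string.h>

    #define TBLSIZE 211

    struct entry { char *s; struct entry *next; };
    static struct entry *tbl[TBLSIZE];

    static unsigned hash(char *s)
    {
       unsigned h = 0;
       while (*s) h = h * 65599 + *s++;
       return h % TBLSIZE;
    }

    /* return the unique stored copy of s, inserting it on first use */
    char *install_string(char *s)
    {
       unsigned h = hash(s);
       struct entry *e;
       for (e = tbl[h]; e != NULL; e = e->next)
          if (strcmp(e->s, s) == 0) return e->s;   /* already interned */
       e = (struct entry *)malloc(sizeof(struct entry));
       e->s = strdup(s);
       e->next = tbl[h];
       tbl[h] = e;
       return e->s;
    }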

    lecture #8 began here

    Constructing your Token inside yylex()

    A student recently asked if it was OK to allocate a token structure inside main() after yylex() returns the token. This is not OK, because in the next phase of your compiler you are not calling yylex(); the automatically generated parser will call yylex(). There is a way for the parser to grab your token if you've stored it in a global variable, but there is not a way for the parser to build the token structure itself.

    Syntax Analysis

    Parsing is the act of performing syntax analysis to verify an input program's compliance with the source language. A by-product of this process is typically a tree that represents the structure of the program.

    Context Free Grammars

    A context free grammar G has:

    A set of terminal symbols, T
    A set of nonterminal symbols, N
    A start symbol, s, which is a member of N
    A set of production rules of the form A -> w, where A is a nonterminal and w is a string of terminal and nonterminal symbols

    A context free grammar can be used to generate strings in the corresponding language as follows:

    let X = the start symbol s
    while there is some nonterminal Y in X do
       apply any one production rule using Y, e.g. Y -> w

    When X consists only of terminal symbols, it is a string of the language denoted by the grammar. Each iteration of the loop is a derivation step. If an iteration has several nonterminals to choose from at some point, the rules of derivation would allow any of these to be applied. In practice, parsing algorithms tend to always choose the leftmost nonterminal, or the rightmost nonterminal, resulting in strings that are leftmost derivations or rightmost derivations.


    Context Free Grammar Examples

    Well, OK, so how much of the C language grammar can we come up with in class today? Start with expressions, work on up to statements, and from there work up to entire functions, and programs.

    lecture #9 began here

    Dr. Pontelli is looking for a web developer; did everyone see that ad? I too am looking for student research assistants.

    Grammar Ambiguity

    The grammar

    E -> E + E
    E -> E * E
    E -> ( E )
    E -> ident

    allows two different derivations for strings such as "x + y * z". The grammar is ambiguous, but the semantics of the language dictate a particular operator precedence that should be used. One way to eliminate such ambiguity is to rewrite the grammar. For example, we can force the precedence we want by adding some nonterminals and production rules.

    E -> E + T
    E -> T
    T -> T * F
    T -> F
    F -> ( E )
    F -> ident

    Given the arithmetic expression grammar from last lecture:

    How can a program figure out that x + y * z is legal?

    How can a program figure out that x + y (* z) is illegal?
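    As a worked illustration (added here, not in the original notes): the unambiguous grammar admits the leftmost derivation E => E + T => T + T => F + T => ident + T => ident + T * F => ident + F * F => ident + ident * F => ident + ident * ident, which matches x + y * z. By contrast, no production in this grammar ever places a parenthesized expression immediately after an identifier without an intervening operator, so no derivation can yield x + y (* z).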

    A brief aside on casting your mallocs

    If you don't put a prototype for malloc(), C thinks it returns an int.

    #include <stdlib.h>

    includes prototypes for malloc(), free(), etc. malloc() returns a void *.

    void * means "pointer that points at nothing", or "pointer that points at anything". You need to cast it to what you are really pointing at, as in:

    union lexval *l = (union lexval *)malloc(sizeof(union lexval));

    Note the stupid duplication of type information; no language is perfect! Anyhow, always cast your mallocs. The program may work without the cast, but you need to fix every warning, so you don't accidentally let a serious one through.

    Recursive Descent Parsing

    Perhaps the simplest parsing method, for a large subset of context free grammars, is called recursive descent. It is simple because the algorithm closely follows the production rules of nonterminal symbols.

    Write 1 procedure per nonterminal rule

    Within each procedure, a) match terminals at appropriate positions, and b) call procedures for non-terminals.

    Pitfalls:

    1. left recursion is FATAL
    2. must distinguish between several production rules, or potentially, one has to try all of them via backtracking.

    Recursive Descent Parsing Example #1

    Consider the grammar we gave above. There will be functions for E, T, and F. The function for F() is the "easiest" in some sense: based on a single token it can decide which production rule to use. The parsing functions return 0 (failed to parse) if the nonterminal in question cannot be derived from the tokens at the current point. A nonzero return value of N would indicate success in parsing using production rule #N.

    int F()
    {
       int t = yylex();
       if (t == IDENT) return 6;
       else if (t == LP) {
          if (E() && (yylex() == RP)) return 5;
          }
       return 0;
    }

    Comment #1: if F() is in the middle of a larger parse of E() or T(), F() may succeed, but the subsequent parsing may fail. The parse may have to backtrack, which would mean we'd have to be able to put tokens back for later parsing. Add a memory (say, a gigantic array or linked list, for example) of already-parsed tokens to the lexical analyzer, plus backtracking logic to E() or T() as needed. The call to F() may get repeated following a different production rule for a higher nonterminal.

    Comment #2: in a real compiler we need more than "yes it parsed" or "no it didn't": we need a parse tree if it succeeds, and we need a useful error message if it didn't.

    Question: for E() and T(), how do we know which production rule to try? Option A: just blindly try each one in turn. Option B: look at the first (current) token, and only try those rules that start with that token (1 character lookahead). If you are lucky, that one character will uniquely select a production rule. If that is always true through the whole grammar, no backtracking is needed.

    Question: how do we know which rules start with whatever token we are looking at? Can anyone suggest a solution, or are we stuck?

    lecture #10 began here

    Announcements

    Homework #3: minor extension

    Midterm exam: Thursday March 16

    The first midterm exam will cover lexical analysis and syntax analysis

    Removing Left Recursion

    E -> E + T | T
    T -> T * F | F
    F -> ( E ) | ident

    We can remove the left recursion by introducing new nonterminals and new production rules.

    E  -> T E'
    E' -> + T E' | ε
    T  -> F T'
    T' -> * F T' | ε
    F  -> ( E ) | ident

    Getting rid of such immediate left recursion is not enough; one must get rid of indirect left recursion, where two or more nonterminals are mutually left-recursive. One can rewrite any CFG to remove left recursion (Algorithm 4.1).

    for i := 1 to n do begin
       for j := 1 to i-1 do
          replace each production Ai -> Aj γ with the productions
             Ai -> δ1 γ | δ2 γ | ... | δk γ
          where Aj -> δ1 | δ2 | ... | δk are the current Aj productions
       eliminate immediate left recursion among the Ai productions
    end

    Removing Left Recursion, part 2

    Left recursion can be broken into three cases.

    case 1: trivial

    A : A α | β

    The recursion must always terminate by A finally deriving β, so you can rewrite it to the equivalent

    A  : β A'
    A' : α A' | ε

    Example:

    E : E op T | T

    can be rewritten

    E  : T E'
    E' : op T E' | ε

    case 2: non-trivial, but immediate

    In the more general case, there may be multiple recursive productions and/or multiple non-recursive productions.

    A : A α1 | A α2 | ... | β1 | β2 | ...

    As in the trivial case, you get rid of left-recursing A and introduce an A'

    A  : β1 A' | β2 A' | ...
    A' : α1 A' | α2 A' | ... | ε

    case 3: mutual recursion

    1. Order the nonterminals in some order 1 to N.
    2. Rewrite production rules to eliminate all nonterminals in leftmost positions that refer to a "previous" nonterminal. When finished, all productions' right hand sides start with a terminal or a nonterminal that is numbered equal or higher than the nonterminal on the left hand side.
    3. Eliminate the direct left recursion as per cases 1-2.


    Left Recursion Versus Right Recursion: When does it Matter?

    A student came to me once with what they described as an operator precedence problem, where 5-4+3 was computing the wrong value (-2 instead of 4). What it really was, was an associativity problem due to the grammar:

    E : T + E | T - E | T

    The problem here is that right recursion is forcing right associativity, but normal arithmetic requires left associativity. Several solutions are: (a) rewrite the grammar to be left recursive, or (b) rewrite the grammar with more nonterminals to force the correct precedence/associativity, or (c) if using YACC or Bison, there are "cheat codes" we will discuss later to allow it to be majorly ambiguous and specify associativity separately (look for %left and %right in YACC manuals).
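    As a preview of option (c), here is a sketch of what those declarations look like in a yacc grammar (an illustration of mine, not this course's assigned grammar; IDENT is assumed to be a token declared for the example):

    %token IDENT
    %left '+' '-'    /* lower precedence; both left associative */
    %left '*' '/'    /* higher precedence; both left associative */
    %%
    expr : expr '+' expr
         | expr '-' expr
         | expr '*' expr
         | expr '/' expr
         | IDENT
         ;

    The grammar itself is ambiguous, but the %left declarations tell the generated parser how to resolve every conflict, giving left associativity and the usual precedence.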

    Recursive Descent Parsing Example #2

    The grammar

    S -> A B C
    A -> a A
    A -> ε
    B -> b
    C -> c

    maps to pseudocode like the following. (:= is an assignment operator)

    procedure S()
       if A() & B() & C() then succeed   # matched S, we win
    end

    procedure A()
       if yychar == a then {   # use production 2
          yychar := scan()
          return A()
          }
       else
          succeed   # production rule 3, match ε
    end

    procedure B()
       if yychar == b then {
          yychar := scan()
          succeed
          }
       else fail
    end

    procedure C()
       if yychar == c then {
          yychar := scan()
          succeed
          }
       else fail
    end

    Backtracking?


    Could your current token begin more than one of your possible production rules? Try all of them, remembering and resetting state for each try.

    S -> c A d
    A -> a b
    A -> a

    Left factoring can often solve such problems:

    S  -> c A d
    A  -> a A'
    A' -> b
    A' -> ε

    One can also perform left factoring to reduce or eliminate the lookahead or backtracking needed to tell which production rule to use. If the end result has no lookahead or backtracking needed, the resulting CFG can be solved by a "predictive parser" and coded easily in a conventional language. If backtracking is needed, a recursive descent parser takes more work to implement, but is still feasible. As a more concrete example:

    S -> if E then S
    S -> if E then S1 else S2

    can be factored to:

    S  -> if E then S S'
    S' -> else S2 | ε

    Some More Parsing Theory


Automatic techniques for constructing parsers start with computing some basic functions for symbols in the grammar. These functions are useful in understanding both recursive descent and bottom-up LR parsers.

First(a)

First(a) is the set of terminals that begin strings derived from a, which can include ε.

1. First(X) starts with the empty set.
2. If X is a terminal, First(X) is {X}.
3. If X -> ε is a production, add ε to First(X).
4. If X is a non-terminal and X -> Y1 Y2 ... Yk is a production, add First(Y1) to First(X).
5. for (i = 1; Yi can derive ε; i++)
      add First(Yi+1) to First(X)

First(a) examples

By the way, this stuff is all in section 4.3 in your text.

Last time we looked at an example with E, T, and F, and + and *. The first-set computation was not too exciting and we need more examples.

stmt : if-stmt | OTHER
if-stmt : IF LP expr RP stmt else-part
else-part : ELSE stmt | ε
expr : IDENT | INTLIT

What are the First() sets of each nonterminal?
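Working them out with the rules above: First(expr) = {IDENT, INTLIT}; First(else-part) = {ELSE, ε}; First(if-stmt) = {IF}; and First(stmt) = First(if-stmt) plus {OTHER} = {IF, OTHER}.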


    Follow(A)

Follow(A) for nonterminal A is the set of terminals x that can appear immediately to the right of A in some sentential form aAxb derivable from the start symbol (where a and b denote strings of grammar symbols). To compute Follow, apply these rules to all nonterminals in the grammar:

1. Add $ to Follow(S), where S is the start symbol.
2. If A -> aBb is a production, then add First(b) - {ε} to Follow(B).
3. If A -> aB is a production, or A -> aBb is a production where ε is in First(b), then add Follow(A) to Follow(B).
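Applying these rules to the if-stmt grammar above (with stmt as the start symbol): Follow(expr) = {RP}. Follow(stmt) starts with {$}; the if-stmt production adds First(else-part) - {ε} = {ELSE}, and because else-part can derive ε, Follow(if-stmt) flows in as well. Iterating to a fixed point gives Follow(stmt) = Follow(if-stmt) = Follow(else-part) = {$, ELSE}.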

On resizing arrays in C

The sval attribute in homework #2 is a perfect example of a problem which a BCS major might not be expected to manage, but a CS major should be able to do by the time they graduate. This is not to encourage any of you to consider BCS, but rather, to encourage you to learn how to solve problems like these.

The problem can be summarized as: step through yytext, copying each piece out to sval, removing doublequotes and plusses between the pieces, and evaluating CHR$() constants.

Space allocated with malloc() can be increased in size by realloc(). realloc() is awesome. But, it COPIES and MOVES the old chunk of space you had to the new, resized chunk of space, and frees the old space, so you had better not have any other pointers pointing at that space if you realloc(), and you have to update your pointer to point at the new location realloc() returns.

i = 0; j = 0;
while (yytext[i] != '\0') {
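   /* a hedged sketch of one way to finish this loop: copy characters
      into sval, dropping the doublequotes and the plusses between
      pieces (the CHR$() constant handling is omitted here) */
   if (yytext[i] != '"' && yytext[i] != '+')
      sval[j++] = yytext[i];
   i++;
}
sval[j] = '\0';   /* NUL-terminate */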


char *appendstring(char *s, char c)
{
   int i = strlen(s);   /* index of the current NUL terminator */
   s = realloc(s, i+2);
   s[i] = c;
   s[i+1] = '\0';
   return s;
}

Note: it is very inefficient to grow your array one character at a time; in real life people grow arrays in large chunks at a time.
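A sketch of that chunked-growth idea (the function name and its parameters are made up for illustration): grow the buffer by doubling, so that n appends cost O(n) total copying instead of O(n^2):

char *appendchar(char *buf, int *len, int *avail, char c)
{
   if (*len + 2 > *avail) {     /* need room for c plus the NUL */
      *avail *= 2;
      buf = realloc(buf, *avail);
   }
   buf[(*len)++] = c;
   buf[*len] = '\0';
   return buf;
}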

Solution #3: use solution one and then shrink your array when you find out how big it actually needs to be.

sval = malloc(strlen(yytext)+1);
/* ... do the code copying into sval; be sure to NUL-terminate ... */
sval = realloc(sval, strlen(sval)+1);

    lecture #11 began here

    YACC


    YACC ("yet another compiler compiler") is a popular toolwhich originated at

    AT&T Bell Labs. YACC takes a context free grammar as input,and generates aparser as output. Several independent, compatibleimplementations (AT&Tyacc, Berkeley yacc, GNU Bison) for C exist, as well as manyimplementationsfor other popular languages.

YACC files end in .y and take the form

declarations
%%
grammar
%%
subroutines

The declarations section defines the terminal symbols (tokens) and nonterminal symbols. The most useful declarations are:

%token a
   declares terminal symbol a; YACC can generate a set of #define's that map these symbols onto integers, in a y.tab.h file. Note: don't #include your y.tab.h file from your grammar .y file; YACC generates the same definitions and declarations directly in the .c file, and including the .tab.h file will cause duplication errors.

%start A
   specifies the start symbol for the grammar (defaults to the nonterminal on the left side of the first production rule).

The grammar gives the production rules, interspersed with program code fragments called semantic actions that let the programmer do what's desired when the grammar productions are reduced. They follow the syntax

A : body ;

where body is a sequence of 0 or more terminals, nonterminals, or semantic actions (code, in curly braces) separated by spaces. As a notational convenience, multiple production rules may be grouped together using the vertical bar (|).
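Putting these pieces together, a minimal .y file might look like the following sketch (the token names, and the yylex() that must be supplied separately, are assumptions for illustration):

%token NUMBER PLUS
%%
expr : expr PLUS NUMBER { $$ = $1 + $3; }
     | NUMBER
     ;
%%
int main() { return yyparse(); }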

    Bottom Up Parsing


Bottom up parsers start from the sequence of terminal symbols and work their way back up to the start symbol by repeatedly replacing grammar rules' right hand sides by the corresponding non-terminal. This is the reverse of the derivation process, and is called "reduction".

    Example. For the grammar

(1) S -> aABe
(2) A -> Abc
(3) A -> b
(4) B -> d

the string "abbcde" can be parsed bottom-up by the following reduction steps:

abbcde
aAbcde
aAde
aABe
S


    Handles

Definition: a handle is a substring that

1. matches a right hand side of a production rule in the grammar, and
2. whose reduction to the nonterminal on the left hand side of that grammar rule is a step along the reverse of a rightmost derivation.

    Shift Reduce Parsing

A shift-reduce parser performs its parsing using the following structure:

Stack    Input
$        w$

At each step, the parser performs one of the following actions.

1. Shift one symbol from the input onto the parse stack.
2. Reduce one handle on the top of the parse stack. The symbols from the right hand side of a grammar rule are popped off the stack, and the nonterminal symbol is pushed on the stack in their place.
3. Accept is the operation performed when the start symbol is alone on the parse stack and the input is empty.
4. Error actions occur when no successful parse is possible.
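As a worked example, here is the sequence of moves a shift-reduce parser makes on "abbcde" with grammar (1)-(4) from the Bottom Up Parsing section:

Stack     Input      Action
$         abbcde$    shift
$a        bbcde$     shift
$ab       bcde$      reduce A -> b
$aA       bcde$      shift
$aAb      cde$       shift
$aAbc     de$        reduce A -> Abc
$aA       de$        shift
$aAd      e$         reduce B -> d
$aAB      e$         shift
$aABe     $          reduce S -> aABe
$S        $          accept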

    The YACC Value Stack

YACC's parse stack contains only "states". YACC maintains a parallel stack of values; $1, $2, etc. are used in semantic actions to name elements on the value stack.


The value yylex() leaves in yylval each time it returns to the parser will get copied over to the top of the value stack when the token is shifted onto the parse stack.

You can either declare that struct token may appear in the %union, and put a mixture of struct node and struct token on the value stack, or you can allocate a "leaf" tree node and point it at your struct token. Or you can use a tree type that allows tokens to include their lexical information directly in the tree nodes. If you have more than one %union type possible, be prepared to see type conflicts, and to declare the types of all your nonterminals.

Getting all this straight takes some time; you can plan on it. Your best bet is to draw pictures of how you want the trees to look, and then make the code match the pictures. No pictures == "Dr. J will ask to see your pictures and not be able to help if you can't describe your trees."

Declaring value stack types for terminal and nonterminal symbols


might write:

%token <tokenptr> SEMICOL
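Such a declaration presumes a %union along these lines (a sketch; the member names tokenptr and nodeptr are just examples):

%union {
   struct token *tokenptr;
   struct node *nodeptr;
}

%token <tokenptr> SEMICOL then says that the value stack entry for a SEMICOL token uses the tokenptr member; a %type <nodeptr> declaration does the same job for a nonterminal.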

    Announcements

Having trouble debugging your grammar? "bison -v" generates a .output file that gives the gory details of conflicts and such.

    lecture #12 began here

    Announcements

In honor of Dr. Jeffery's 10th anniversary, a minor extension in Homework #3.

    Conflicts in Shift-Reduce Parsing


    "Conflicts" occur when an ambiguity in the grammar creates asituationwhere the parser does not know which step to perform at a given

    pointduring parsing. There are two kinds of conflicts that occur.

    shift-reducea shift reduce conflict occurs when the grammar indicatesthat

    different successful parses might occur with either a

    shift or a reduceat a given point during parsing. The vast majority of

    situations wherethis conflict occurs can be correctly resolved by shifting.

    reduce-reducea reduce reduce conflict occurs when the parser has two ormore

    handles at the same time on the top of thestack. Whatever choice

    the parser makes is just as likely to be wrong as not. Inthis case

    it is usually best to rewrite the grammar to eliminate theconflict,

    possibly by factoring.

    Example shift reduce conflict:

S -> if E then S
S -> if E then S else S


In many languages two nested "if" statements produce a situation where an "else" clause could legally belong to either "if". The usual rule (to shift) attaches the else to the nearest (i.e. inner) if statement.

    Example reduce reduce conflict:

(1) S -> id LP plist RP
(2) S -> E GETS E
(3) plist -> plist , p
(4) plist -> p
(5) p -> id
(6) E -> id LP elist RP
(7) E -> id
(8) elist -> elist , E
(9) elist -> E

By the point the stack holds ...id LP id, the parser will not know which rule to use to reduce the id: (5) or (7).

Further Discussion of Reduce Reduce and Shift Reduce Conflicts


T : F T2 g ;
T2 : t F T2 ;
T2 : ;
F : l T r ;
F : v ;

This grammar is not much different than before, and has the same problem, but the surrounding context (the "calling environments") of F causes the grammar to have a shift-reduce instead of a reduce-reduce conflict. Once again, the trouble comes after you have seen an F, and dwells on the question of whether to reduce the epsilon production, or instead to shift, upon seeing a token g.

The .output file generated by "bison -v" explains these conflicts in considerable detail. Part of what you need to interpret them are the concepts of "items" and "sets of items" discussed below.

    YACC precedence and associativity declarations


YACC headers can specify precedence and associativity rules for otherwise heavily ambiguous grammars. Precedence is determined by increasing order of these declarations. Example:

%right ASSIGN
%left PLUS MINUS
%left TIMES DIVIDE
%right POWER
%%
expr : expr ASSIGN expr
     | expr PLUS expr
     | expr MINUS expr
     | expr TIMES expr
     | expr DIVIDE expr
     | expr POWER expr
     ;

    YACC error handling and recovery


Use the special predefined token error where errors are expected. On an error, the parser pops states until it enters one that has an action on the error token. For example:

statement : error ';' ;

The parser must see 3 good tokens before it decides it has recovered.

yyerrok tells the parser to skip the 3 token recovery rule.
yyclearin throws away the current (error-causing?) token.
yyerror(s) is called when a syntax error occurs (s is the error message).

    Improving YACC's Error Reporting

yyerror(s) overrides the default error message, which usually just says either "syntax error" or "parse error", or "stack overflow".


You can easily add information in your own yyerror() function. For example, GCC emits messages that look like:

goof.c:1: parse error before '}' token

using a yyerror function that looks like

void yyerror(char *s)
{
   fprintf(stderr, "%s:%d: %s before '%s' token\n",
           yyfilename, yylineno, s, yytext);
}

You could instead use the error recovery mechanism to produce better messages. For example:

lbrace : LBRACE | { error_code = MISSING_LBRACE; } error ;

where LBRACE is the expected token {. This uses a global variable error_code to pass parse information to yyerror().


Another related option is to call yyerror() explicitly with a better message string, and tell the parser to recover explicitly:

package_declaration: PACKAGE_TK error
   { yyerror("Missing name"); yyerrok; } ;

But using error recovery to perform better error reporting runs against conventional wisdom that you should use error tokens very sparingly. What information from the parser determined we had an error in the first place? Can we use that information to produce a better error message?

    LR Syntax Error Messages: Advanced Methods

The pieces of information that YACC/Bison use to determine that there is an error in the first place are the parse state (yystate) and the current input token (yychar). These are exactly the pieces of information one might use to produce better diagnostic error messages without relying on the error recovery mechanism and mucking up the grammar with a lot of extra production rules that feature the error token.

Even just the parse state is enough to do pretty good error messages. yystate is not part of YACC's public interface, though, so you may have to play some tricks to pass it as a parameter into yyerror() from yyparse(). Say, for example:

#define yyerror(s) __yyerror(s,yystate)

Inside __yyerror(msg, yystate) you can use a switch statement or a global array to associate messages with specific parse states. But figuring out which parse state means which syntax error message would be by trial and error.
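A hedged sketch of what such a function might look like (the state number 42 and its message are hypothetical, found by trial and error or by reading the bison .output file):

void __yyerror(char *msg, int state)
{
   switch (state) {
   case 42:   /* hypothetical state reached on a missing semicolon */
      fprintf(stderr, "%d: missing semicolon?\n", yylineno);
      break;
   default:
      fprintf(stderr, "%d: %s\n", yylineno, msg);
   }
}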

A tool called Merr is available that lets you generate this yyerror function from examples: you supply the sample syntax errors and messages, and Merr figures out which parse state integer goes with which message.


For HW3, test your work on as many test cases as possible.

Midterm Exam is coming up, March 16. Midterm review March 14. Three more lectures before that.

    LR vs. LL vs. LR(0) vs. LR(1) vs. LALR(1)

    The first char ("L") means input tokens are read from the left(left to right). The second char ("R" or "L") means parsingfinds the rightmost, or leftmost, derivation. Relevantif there is ambiguity in the grammar. (0) or (1) or (k) afterthe main lettering indicates how many lookahead characters areused. (0) means you only look at the parse stack, (1) means you

    use the current token in deciding what to do, shift or reduce.(k) means you look at the next k tokens before deciding whatto do at the current position.

    LR Parsers

LR denotes a class of bottom up parsers that is capable of handling virtually all programming language constructs. LR is efficient; it runs in linear time with no backtracking needed. The class of languages handled by LR is a proper superset of the class of languages handled by top down "predictive parsers". LR parsing detects an error as soon as it is possible to do so. Generally, building an LR parser is too big and complicated a job to do by hand; we use tools to generate LR parsers.

The LR parsing algorithm is given below.

ip = first symbol of input
repeat {
   s = state on top of parse stack
   a = *ip
   case action[s,a] of {
      SHIFT s': { push(a); push(s'); advance ip }
      REDUCE A->beta: {
         pop 2*|beta| symbols; s' = new state on top
         push A
         push goto(s', A)
         }
      ACCEPT: return 0 /* success */
      ERROR: { error("syntax error", s, a); halt }
      }
}


    Constructing SLR Parsing Tables:

Note: in Spring 2006 this material is FYI but you will not be examined on it.

Definition: An LR(0) item of a grammar G is a production of G with a dot at some position of the RHS.

Example: The production A -> aAb gives the items:

A -> . a A b
A -> a . A b
A -> a A . b
A -> a A b .

Note: A production A -> ε generates only one item:

A -> .

Intuition: an item A -> α . β denotes:

1. we have already seen a string derivable from α
2. we hope to see a string derivable from β

    Functions on Sets of Items

Closure: if I is a set of items for a grammar G, then closure(I) is the set of items constructed as follows:

1. Every item in I is in closure(I).
2. If A -> α . B β is in closure(I) and B -> γ is a production, then add B -> . γ to closure(I).

These two rules are applied repeatedly until no new items can be added.

Intuition: If A -> α . B β is in closure(I) then we hope to see a string derivable from B in the input. So if B -> γ is a production, we should hope to see a string derivable from γ. Hence, B -> . γ is in closure(I).

Goto: if I is a set of items and X is a grammar symbol, then goto(I,X) is defined to be:

goto(I,X) = closure({[A -> αX.β] | [A -> α.Xβ] is in I})

Intuition: [A -> α.Xβ] is in I means we have seen a string derivable from α, and we hope to see a string derivable from Xβ.


If β2 = ε, then A -> β1 is the handle, and we should reduce by this production.

Note: two valid items may tell us to do different things for the same viable prefix. Some of these conflicts can be resolved using lookahead on the input string.

    Constructing an SLR Parsing Table

1. Given a grammar G, construct the augmented grammar G' by adding the production S' -> S.
2. Construct C = {I0, I1, ..., In}, the set of sets of LR(0) items for G'.
3. State i is constructed from Ii, with its parsing actions determined as follows:
   o if [A -> α.aβ] is in Ii, where a is a terminal, and goto(Ii,a) = Ij: set action[i,a] = "shift j"
   o if [A -> α.] is in Ii: set action[i,a] to "reduce A -> α" for all a in FOLLOW(A), where A != S'
   o if [S' -> S.] is in Ii: set action[i,$] to "accept"
4. The goto transitions are constructed as follows: for all nonterminals A, if goto(Ii, A) = Ij, then goto[i,A] = j.
5. All entries not defined by (3) and (4) are made "error". If there are any multiply defined entries, the grammar is not SLR.
6. The initial state of the parser is the one constructed from I0, the set containing the item [S' -> .S].

    Example:

S -> aABe    FIRST(S)  = {a}    FOLLOW(S)  = {$}
A -> Abc     FIRST(A)  = {b}    FOLLOW(A)  = {b,d}
A -> b       FIRST(B)  = {d}    FOLLOW(B)  = {e}
B -> d       FIRST(S') = {a}    FOLLOW(S') = {$}

I0 = closure([S'->.S]) = closure([S'->.S], [S->.aABe])
goto(I0,S) = closure([S'->S.]) = I1
goto(I0,a) = closure([S->a.ABe])
           = closure([S->a.ABe], [A->.Abc], [A->.b]) = I2
goto(I2,A) = closure([S->aA.Be], [A->A.bc])
           = closure([S->aA.Be], [A->A.bc], [B->.d]) = I3
goto(I2,b) = closure([A->b.]) = I4
goto(I3,B) = closure([S->aAB.e]) = I5
goto(I3,b) = closure([A->Ab.c]) = I6
goto(I3,d) = closure([B->d.]) = I7
goto(I5,e) = closure([S->aABe.]) = I8
goto(I6,c) = closure([A->Abc.]) = I9

    lecture #14 began here

    On Tree Traversals


struct tree {
   int label;
   int nkids;
   struct tree *child[1]; /* array of children, size varies 0..k */
};

struct tree *alctree(int label, int nkids, ...)
{
   int i;
   va_list ap;
   struct tree *ptr = malloc(sizeof(struct tree) +
                             (nkids-1)*sizeof(struct tree *));
   if (ptr == NULL) {
      fprintf(stderr, "alctree out of memory\n");
      exit(1);
      }
   ptr->label = label;
   ptr->nkids = nkids;
   va_start(ap, nkids);
   for (i=0; i < nkids; i++)
      ptr->child[i] = va_arg(ap, struct tree *);
   va_end(ap);
   return ptr;
}

Besides a function to allocate trees, you need to write one or more recursive functions to visit each node in the tree, either top to bottom (preorder), or bottom to top (postorder). You might do many different traversals on the tree in order to write a whole compiler: check types, generate machine-independent intermediate code, analyze the code to make it shorter, etc. You can write 4 or more different traversal functions, or you can write 1 traversal function that does different work at each node, determined by passing in a function pointer, to be called for each node.

void postorder(struct tree *t, void (*f)(struct tree *))
{
   /* postorder means visit each child, then do work at the parent */
   int i;
   if (t == NULL) return;

   /* visit each child */
   for (i=0; i < t->nkids; i++)
      postorder(t->child[i], f);

   /* do work at parent */
   f(t);
}

You would then be free to write as many little helper functions as you want, for different tree traversals, for example:

void printer(struct tree *t)
{
   if (t == NULL) return;
   printf("%p: %d, %d children\n", t, t->label, t->nkids);
}
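With these pieces, printing a whole tree bottom-up is just postorder(root, printer); where root is your topmost tree node.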

    Semantic Analysis

    Semantic ("meaning") analysis refers to a phase of compilationin which theinput program is studied in order to determine what operationsare to becarried out. The two primary components of a classic semantic

    analysisphase are variable reference analysis and type checking. Thesecomponentsboth rely on an underlying symbol table.

    What we haveat the start of semantic analysis is a syntax tree

    thatcorresponds to the source program as parsed using the contextfree grammar.Semantic information is added by annotating grammar symbolswith

  • 8/10/2019 Compiler Notes - Ullman

    90/182

    semantic attributes, which are defined bysemantic rules.A semantic rule is a specification of how to calculate a semanticattribute

    that is to be added to the parse tree.

    So the input is a syntax tree...and the output is the same tree,only"fatter" in the sense that nodes carry more information.Another output of semantic analysis are error messagesdetecting manytypes of semantic errors.

Two typical examples of semantic analysis include:

variable reference analysis
   the compiler must determine, for each use of a variable, which variable declaration corresponds to that use. This depends on the semantics of the source language being translated.

type checking
   the compiler must determine, for each operation in the source code, the types of the operands and resulting value, if any.

    Notations used in semantic analysis:


inherited attributes
   attributes computed from information obtained from one's parent or siblings. These are generally harder to compute. Compilers may be able to jump through hoops to compute some inherited attributes during parsing, but depending on the semantic rules this may not be possible in general. Compilers resort to tree traversals to move semantic information around the tree to where it will be used.

    Attribute Examples

    Isconst and Value

Not all expressions have constant values; the ones that do may allow various optimizations.

CFG             Semantic Rule
E1 : E2 + T     E1.isconst = E2.isconst && T.isconst
                if (E1.isconst)
                   E1.value = E2.value + T.value


    Symbol Table Module

Symbol tables are used to resolve names within name spaces. Symbol tables are generally organized hierarchically according to the scope rules of the language. Although initially concerned with simply storing the names of the various symbols that are visible in each scope, symbol tables take on additional roles in the remaining phases of the compiler. In semantic analysis, they store type information. And for code generation, they store memory addresses and sizes of variables.

mktable(parent)
   creates a new symbol table, whose scope is local to (or inside) parent
enter(table, symbolname, type, offset)
   insert a symbol into a table
lookup(table, symbolname)
   lookup a symbol in a table; returns structure pointer including type and offset. lookup operations are often chained together progressively from most local scope on out to global scope.
addwidth(table)
   totals the widths of all the entries in the table
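A hedged C sketch of one way to organize such a module (the field names are illustrative; a real table would likely hash its entries rather than search a linked list):

struct sym_entry {
   char *name;
   struct type *type;          /* filled in during semantic analysis */
   int offset;                 /* filled in for code generation */
   struct sym_entry *next;
};

struct sym_table {
   struct sym_table *parent;   /* enclosing scope; NULL for global */
   struct sym_entry *entries;
};

/* chained lookup: search this scope, then the enclosing scopes */
struct sym_entry *lookup(struct sym_table *t, char *name)
{
   struct sym_entry *e;
   for (; t != NULL; t = t->parent)
      for (e = t->entries; e != NULL; e = e->next)
         if (strcmp(e->name, name) == 0)
            return e;
   return NULL;
}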


In order to work with your tree, you must be able to tell, preferably trivially easily, which nodes are tree leaves and which are internal nodes, and for the leaves, how to access the lexical attributes.

Options:

1. encode in the parent what the types of children are
2. encode in each child what its own type is (better)

How do you do option #2 here?

Perhaps the best approach to all this is to unify the tokens and parse tree nodes with something like the following, where perhaps an nkids value of -1 is treated as a flag that tells the reader to use lexical information instead of pointers to children:

struct node {
   int code; /* terminal or nonterminal symbol */
   int nkids;
   union {
      struct token { ... } leaf;
      struct node *kids[9];
      } u;
};


There are actually nonterminal symbols with 0 children (a nonterminal with a righthand side with 0 symbols), so you don't necessarily want to use an nkids of 0 as your flag to say that you are a leaf.

    Type Checking

Perhaps the primary component of semantic analysis in many traditional compilers consists of the type checker. In order to check types, one first must have a representation of those types (a type system), and then one must implement comparison and composition operators on those types using the semantic rules of the source language being compiled. Lastly, type checking will involve adding (mostly-)synthesized attributes through those parts of the language grammar that involve expressions and values.

    Type Systems


Types are defined recursively according to rules defined by the source language being compiled. A type system might start with rules like:

   Base types (int, char, etc.) are types
   Named types (via typedef, etc.) are types
   Types composed using other types are types, for example:
      o array(T, indices) is a type. In some languages indices always start with 0, so array(T, size) works.
      o T1 x T2 is a type (specifying, more or less, the tuple or sequence T1 followed by T2; x is a so-called cross-product operator).
      o record((f1 x T1) x (f2 x T2) x ... x (fn x Tn)) is a type
      o in languages with pointers, pointer(T) is a type
      o (T1 x ... x Tn) -> Tn+1 is a type denoting a function mapping parameter types to a return type
   In some languages type expressions may contain variables whose values are types.

In addition, a type system includes rules for assigning these types to the various parts of the program; usually this will be performed using attributes assigned to grammar symbols.
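One plausible C encoding of such recursive type expressions (a sketch; the names are made up for illustration, not prescribed by the text):

struct type {
   int basetype;   /* INT, CHAR, ARRAY, RECORD, POINTER, FUNC, ... */
   union {
      struct { struct type *elemtype; int size; } a;   /* array(T, size) */
      struct { char **fields; struct type **ftypes;
               int nfields; } r;                       /* record type */
      struct type *ptrto;                              /* pointer(T) */
      struct { struct type **params; int nparams;
               struct type *ret; } f;                  /* (T1 x ... x Tn) -> Tn+1 */
   } u;
};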

    lecture #16 began here

    Midterm Exam Review

The Midterm will cover lexical analysis, finite automata, context free grammars, syntax analysis, and parsing. Sample problems:


1. Write a regular expression for numeric quantities of U.S. money that start with a dollar sign, followed by one or more digits. Require a comma between every three digits, as in $7,321,212. Also, allow but do not require a decimal point followed by two digits at the end, as in $5.99.

2. Use Thompson's construction to write a non-deterministic finite automaton for the following regular expression, an abstraction of the expression used for real number literal values in C:

   (d+pd*|d*pd+)(ed+)?

3. Write a regular expression, or explain why you can't write a regular expression, for Modula-2 comments which use (* *) as their boundaries. Unlike C, Modula-2 comments may be nested, as in (* this is a (* nested *) comment *).

4. Write a context free grammar for the subset of C expressions that include identifiers and function calls with parameters. Parameters may themselves be function calls, as in f(g(x)), or h(a,b,i(j(k,l))).

5. What are the FIRST(E) and FOLLOW(T) in the grammar:

   E : E + T | T
   T : T * F | F
   F : ( E ) | ident

6. What is the ε-closure(move({2,4},b)) in the following NFA? That is, suppose you might be in either state 2 or 4 at the time you see a symbol b: what NFA states might you find yourself in after consuming b?

   (automata to be written on the board)

Q: What else is likely to appear on the midterm?

A: questions that allow you to demonstrate that you know the difference between a DFA and an NFA, questions about lex and flex and tokens and lexical attributes, questions about context free grammars: ambiguity, factoring, removing left recursion, etc.

    On the mysterious TYPE_NAME

The C language typedef construct is an example where all the beautiful theory we've used up to this point breaks down. Once a typedef is introduced (which can first be recognized at the syntax level), certain identifiers should be legal type names instead of identifiers. To make things worse, they are still legal variable names: the lexical analyzer has to know whether the syntactic context needs a type name or an identifier at each point in which it runs into one of the