
  • A COURSE MATERIAL ON

    COMPILER DESIGN – ECGS22

    2017-2018

    BY

    Dr.M.Deepamalar., MCA.,M.Phil.,Ph.D

    ASSISTANT PROFESSOR

    DEPARTMENT OF COMPUTER SCIENCE

    PARVATHY'S ARTS AND SCIENCE COLLEGE

    DINDIGUL


    PARVATHY’S ARTS AND SCIENCE COLLEGE, DINDIGUL

    DEPARTMENT OF COMPUTER SCIENCE

    COMPILER DESIGN – ECGS22

    I M.SC (CS) – (2017 -2018)

    SYLLABUS

    UNIT – I

    Compilers and Translators – Why Do We Need Translators? – The Structure of a Compiler – Lexical Analysis – Syntax Analysis – Intermediate Code Generation – Optimization – Code Generation – Book-keeping – Error Handling – Compiler-Writing Tools – Getting Started.

    The Role of the Lexical Analyzer – A Simple Approach to the Design of a Lexical Analyzer – Regular Expressions – Finite Automata – From Regular Expressions to Finite Automata – Minimizing the Number of States of a DFA – A Language for Specifying Lexical Analyzers – Implementing a Lexical Analyzer – The Scanner Generator as Swiss Army Knife.

    UNIT – II

    The Syntactic Specification of Programming Languages – Derivations and Parse Trees – Capability of Context-Free Grammars. Parsers: Shift-Reduce Parsing – Operator-Precedence Parsing – Top-Down Parsing – Predictive Parsers.

    UNIT – III

    LR Parsers – The Canonical Collection of LR(0) Items – Constructing SLR Parsing Tables – Constructing Canonical LR Parsing Tables – Constructing LALR Parsing Tables – Using Ambiguous Grammars – An Automatic Parser Generator – Implementation of LR Parsing Tables – Constructing LALR Sets of Items. Syntax-Directed Translation Schemes – Implementation of Syntax-Directed Translation Schemes – Intermediate Code – Parse Trees and Syntax Trees – Three-Address Code, Quadruples, and Triples – Translation of Assignment Statements – Boolean Expressions – Statements That Alter the Flow of Control – Postfix Translations – Translations with a Top-Down Parser.

    UNIT – IV

    The Contents of a Symbol Table – Data Structures for a Symbol Table – Representing Scope Information. Errors: Lexical-Phase Errors – Syntactic-Phase Errors – Semantic Errors. The Principal Sources of Optimization – Loop Optimization – The DAG Representation of Basic Blocks – Value Numbers and Algebraic Laws – Global Data-Flow Analysis.

    UNIT – V

    Dominators – Reducible Flow Graphs – Depth-First Search – Loop-Invariant Computations – Induction Variable Elimination – Some Other Loop Optimizations. Code Generation – Object Programs – A Machine Model – A Simple Code Generator – Register Allocation and Assignment – Code Generation from DAGs – Peephole Optimization.

    Text Book: Principles of Compiler Design, Alfred V. Aho & Jeffrey D. Ullman, 25th Reprint, 2002.


    UNIT – I : COMPILERS AND TRANSLATORS

    Translator

    • A translator is a program that takes as input a program written in one language and produces as output a program in another language. Besides program translation, the translator performs another very important role: error detection. Any violation of the HLL specification is detected and reported to the programmer.

    • The important roles of a translator are:
      1. Translating the HLL program input into an equivalent ML program.
      2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.

    Type of Translators

    INTERPRETER

    Converts the source code into machine code one line at a time.

    The program therefore runs very slowly.

    The main reason why an interpreter is used is at the testing/development stage: programmers can quickly identify errors and fix them.

    The translator must be present on the computer for the program to be run.

    COMPILER

    Converts the whole code into one file (often a .exe file).

    The file can then be run on any computer without the translator needing to be present.

    Can take a long time to compile source code, as the translator will often have to convert the instructions into various sets of machine code, because different CPUs understand instructions with different machine code from one another.

    ASSEMBLER

    This type of translator is used for assembly language (not high-level languages).

    It converts mnemonic assembly language instructions into machine code.

    Why do We Need Translators?

    Translators are programs that convert high-level language commands (print, IF, FOR, etc.)

    …into a set of machine code commands:

    1011, 11001, 11000011110, etc.

    …so that the CPU can process the data!

    There are 2 ways in which translators work:


    1. Take the whole code and convert it into machine code before running it (known as compiling).

    2. Take the code one instruction at a time, translate and run the instruction, before translating the next instruction (known as interpreting).

    Compiler

    • A compiler is a program that reads a program written in one language, called the source language, and translates it into an equivalent program in another language, called the target language. The target program is then provided with input to produce output. C, Java, and Pascal are all compiled.

    • A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in machine-level language (MLL), the target program. An important part of a compiler's job is reporting errors to the programmer.

    • Executing a program written in an HLL programming language basically has two parts: the source program must first be compiled (translated) into an object program; then the resulting object program is loaded into memory and executed.

    List of Compilers

    1. Ada compilers
    2. ALGOL compilers
    3. BASIC compilers
    4. C# compilers
    5. C compilers
    6. C++ compilers
    7. COBOL compilers
    8. D compilers
    9. Common Lisp compilers
    10. ECMAScript interpreters


    11. Eiffel compilers
    12. Felix compilers
    13. Fortran compilers
    14. Haskell compilers
    15. Java compilers
    16. Pascal compilers
    17. PL/I compilers
    18. Python compilers
    19. Scheme compilers
    20. Smalltalk compilers
    21. CIL compilers

    Why do we need Compilers?

    Compilers are important – they are responsible for many aspects of system performance, and attaining performance has become more difficult over time.

    Compilers are interesting – they include many applications of theory to practice, and writing a compiler exposes algorithmic and engineering issues.

    Compilers are everywhere – many practical applications have embedded languages: commands, macros, formatting tags.

    Challenges of Compiler Construction

    Compiler construction poses challenging and interesting problems:
    o Compilers must process large inputs and perform complex algorithms, but also run quickly.
    o Compilers have primary responsibility for run-time performance.
    o Compilers are responsible for making it acceptable to use the full power of the programming language.
    o Computer architects perpetually create new challenges for the compiler by building more complex machines.
    o Compilers must hide that complexity from the programmer.

    A successful compiler requires mastery of the many complex interactions between its constituent parts.

    The Structure of a Compiler

    Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation. The phases of a compiler are shown below. There are two parts of compilation:

    a. Analysis (Machine Independent/Language Dependent)

    b. Synthesis (Machine Dependent/Language Independent)


    The compilation process is partitioned into a number of sub-processes called 'phases'.

    Lexical Analysis:-

    The LA, or scanner, reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens.

    Token

    A token has two parts:
    1. Type of the token.
    2. Value of the token.

    Type: variable, operator, keyword, constant.
    Value: name of the variable, current variable, or pointer to the symbol table.

    If the symbols are given in the standard format, the LA accepts them and produces tokens as output. Each token is a sub-string of the program that is to be treated as a single unit. Tokens are of two types:
    1. Specific strings such as IF or a semicolon.
    2. Classes of strings such as identifiers, labels, constants.
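    For illustration, here is a minimal C sketch of the <type, value> pair the LA emits for each token (the type names and layout are assumed for illustration, not taken from the text):

        #include <stdio.h>

        /* Type of the token: the classes named above. */
        enum token_type { TOK_ID, TOK_KEYWORD, TOK_OPERATOR, TOK_CONSTANT };

        /* Value of the token: here just the lexeme string; a real scanner
           would often store a pointer into the symbol table instead. */
        struct token {
            enum token_type type;
            const char *lexeme;
        };

        int main(void) {
            /* tokens produced for the fragment "IF x" */
            struct token t1 = { TOK_KEYWORD, "IF" };
            struct token t2 = { TOK_ID, "x" };
            printf("<%d,%s> <%d,%s>\n", t1.type, t1.lexeme, t2.type, t2.lexeme);
            return 0;
        }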


    Syntax Analysis:-

    The second stage of translation is called syntax analysis or parsing. In this phase expressions, statements, declarations, etc. are identified by using the results of lexical analysis. Syntax analysis is aided by using techniques based on the formal grammar of the programming language.

    Intermediate Code Generations:-

    An intermediate representation of the final machine language code is produced. This phase

    bridges the analysis and synthesis phases of translation.

    Code Optimization :-

    This is an optional phase designed to improve the intermediate code so that the output runs faster

    and takes less space.

    Code Generation:-

    The last phase of translation is code generation. A number of optimizations to reduce the length

    of machine language program are carried out during this phase. The output of the code

    generator is the machine language program of the specified computer.

    Table Management (or) Book-keeping:- This is the portion that keeps the names used by the program and records essential information about each. The data structure used to record this information is called a 'Symbol Table'.

    Error Handlers:-

    It is invoked when a flaw in the source program is detected. The output of the LA is a stream of tokens, which is passed to the next phase, the syntax analyzer or parser. The SA groups the tokens together into syntactic structures called expressions. Expressions may further be combined to form statements. The syntactic structure can be regarded as a tree whose leaves are the tokens; such trees are called parse trees.

    The parser has two functions. It checks whether the tokens from the lexical analyzer occur in patterns that are permitted by the specification for the source language. It also imposes on the tokens a tree-like structure that is used by the subsequent phases of the compiler.

    For example, if a program contains the expression A+/B, then after lexical analysis this expression might appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer should detect an error situation, because the presence of these two adjacent binary operators violates the formation rules of an expression. Syntax analysis makes explicit the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped together.

    For example, A/B*C has two possible interpretations:

    1. divide A by B and then multiply by C, or
    2. multiply B by C and then use the result to divide A.

    Each of these two interpretations can be represented in terms of a parse tree.


    Intermediate Code Generation:-

    The intermediate code generator uses the structure produced by the syntax analyzer to create a stream of simple instructions. Many styles of intermediate code are possible. One common style uses instructions with one operator and a small number of operands. The output of the syntax analyzer is some representation of a parse tree. The intermediate code generation phase transforms this parse tree into an intermediate language representation of the source program.

    Code Optimization

    This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space. Its output is another intermediate code program that does the same job as the original, but in a way that saves time and/or space.

    a. Local Optimization:-

    There are local transformations that can be applied to a program to make an improvement. For

    example,

    If A > B goto L2

    Goto L3

    L2 :

    This can be replaced by the single statement

    If A ≤ B goto L3

    Another important local optimization is the elimination of common sub-expressions

    A := B + C + D

    E := B + C + F

    Might be evaluated as

    T1 := B + C

    A := T1 + D

    E := T1 + F

    This takes advantage of the common sub-expression B + C.

    b. Loop Optimization:-

    Another important source of optimization concerns increasing the speed of loops. A typical loop improvement is to move a computation that produces the same result each time around the loop to a point in the program just before the loop is entered.
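    A minimal C sketch of this transformation (the function names and the example computation are assumed for illustration):

        #include <stddef.h>

        /* Before: x * y is loop-invariant but is recomputed on every iteration. */
        void fill_naive(int *a, size_t n, int x, int y) {
            for (size_t i = 0; i < n; i++)
                a[i] = x * y + (int)i;
        }

        /* After: the invariant computation is moved to just before the loop. */
        void fill_hoisted(int *a, size_t n, int x, int y) {
            int t = x * y;                 /* computed once, outside the loop */
            for (size_t i = 0; i < n; i++)
                a[i] = t + (int)i;
        }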

    Code generator :-

    Code Generator produces the object code by deciding on the memory locations for data,

    selecting code to access each datum and selecting the registers in which each computation is to

    be done. Many computers have only a few high speed registers in which computations can be

    performed quickly. A good code generator would attempt to utilize registers as efficiently as

    possible.

    Table Management OR Book-keeping :-

    A compiler needs to collect information about all the data objects that appear in the source

    program. The information about data objects is collected by the early phases of the compiler-

    lexical and syntactic analyzers. The data structure used to record this information is called a Symbol Table.


    Error Handling :-

    One of the most important functions of a compiler is the detection and reporting of errors in the source program. The error messages should allow the programmer to determine exactly where the errors have occurred. Errors may occur in any of the phases of a compiler. Whenever a phase of the compiler discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Both the table-management and error-handling routines interact with all phases of the compiler.

    Example: Compilation Process of a source code through phases


    Compiler-Construction Tools or Compiler Writing Tools

    Software development tools are available to implement one or more compiler phases:
    – Scanner generators (Lex and Flex)
    – Parser generators (Yacc and Bison)
    – Syntax-directed translation engines
    – Automatic code generators
    – Data-flow engines

    The role of lexical analyzer

    It is the first phase of compiler

    Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis

    Reasons to make it a separate phase are:
    – Simplifies the design of the compiler
    – Provides efficient implementation
    – Improves portability

    Interaction of the Lexical Analyzer with the Parser

    Lexical Analysis Vs Parsing

    Lexical analysis:
    A scanner simply turns an input string (say, a file) into a list of tokens. These tokens represent things like identifiers, parentheses, operators, etc. The lexical analyzer (the "lexer") parses individual symbols from the source code file into tokens.

    Parsing:
    A parser converts this list of tokens into a tree-like object representing how the tokens fit together to form a cohesive whole (sometimes referred to as a sentence); the "parser" proper turns whole tokens into sentences of your grammar. A parser does not give the nodes any meaning beyond structural cohesion; the next thing to do is extract meaning from this structure (sometimes called contextual analysis).


    Tokens, Patterns, and Lexemes

    • A token is a classification of lexical units.
      – For example: id and num
    • Lexemes are the specific character strings that make up a token.
      – For example: abc and 123
    • Patterns are rules describing the set of lexemes belonging to a token.
      – For example: "letter followed by letters and digits" and "non-empty sequence of digits"

    Difference between Token, Lexeme and Pattern

    Token      Lexeme                  Pattern
    if         if                      if
    relation   <, <=, =, <>, >, >=     < or <= or = or <> or > or >=
    id         y, x                    letter followed by letters and digits
    num        31, 28                  any numeric constant
    operator   +, *, -, /              any arithmetic operator: + or * or - or /

    Attributes of Tokens


    Specification of Tokens

    • Alphabet: a finite, nonempty set of symbols.
      Example: ∑ = {0,1}, the binary alphabet.
    • String: a finite sequence of symbols from an alphabet, e.g. 0011001.
    • Empty string: the string with zero occurrences of symbols from the alphabet; the empty string is denoted by ε.
    • Length of a string: the number of positions for symbols in the string; |w| denotes the length of string w.
      Example: |0110| = 4; |ε| = 0.
    • Powers of an alphabet: ∑^k = the set of strings of length k with symbols from ∑.
    • The set of all strings over ∑ is denoted ∑*.
    • Language: a specific set of strings over some fixed alphabet.
      Examples: the set of legal English words; the set of strings consisting of n 0's followed by n 1's, {ε, 01, 0011, 000111, …}; L_P = the set of binary numbers whose value is prime, {10, 11, 101, 111, 1011, …}.

    Concatenation and Exponentiation

    • The concatenation of two strings x and y is denoted by xy.
    • The exponentiation of a string s is defined by
      s^0 = ε
      s^i = s^(i-1) s for i > 0
      Note that s^1 = s and εs = sε = s.


    Language Operations

    • Union: L ∪ M = { s | s ∈ L or s ∈ M }

    • Concatenation: LM = { xy | x ∈ L and y ∈ M }

    • Exponentiation: L^0 = {ε}; L^i = L^(i-1) L

    • Kleene closure: L* = the union of L^i for i = 0, …, ∞

    • Positive closure: L+ = the union of L^i for i = 1, …, ∞

    Regular Expressions

    • Basis symbols:

      – ε is a regular expression denoting the language {ε}

      – a ∈ ∑ is a regular expression denoting {a}

    • If r and s are regular expressions denoting languages L(r) and M(s) respectively, then

      – r | s is a regular expression denoting L(r) ∪ M(s)

      – rs is a regular expression denoting L(r)M(s)

      – r* is a regular expression denoting L(r)*

      – (r) is a regular expression denoting L(r)

    • A language defined by a regular expression is called a Regular Set or a Regular Language.

    Regular Definitions

    • Regular definitions introduce a naming convention:

      d1 → r1
      d2 → r2
      …
      dn → rn

      where each ri is a regular expression over ∑ ∪ {d1, d2, …, di-1}


    • Example:

      letter → A | B | … | Z | a | b | … | z
      digit → 0 | 1 | … | 9
      id → letter ( letter | digit )*

    • The following shorthands are often used:

      r+ = rr*
      r? = r | ε
      [a-z] = a | b | c | … | z

    • Examples:

      digit → [0-9]
      num → digit+ (. digit+)? ( E (+|-)? digit+ )?

    Regular Definitions and Grammars

    Grammar

    Regular definitions


    Coding Regular Definitions in Transition Diagrams

    relop → < | <= | = | <> | > | >=

    id → letter ( letter | digit )*

    Finite Automata

    • Finite automata are used as a model for:

      – software for designing digital circuits
      – the lexical analyzer of a compiler
      – searching for keywords in a file or on the web
      – software for verifying finite-state systems, such as communication protocols

    Design of a Lexical Analyzer Generator

    • Translate regular expressions to an NFA
    • Translate the NFA to an efficient DFA


    Nondeterministic Finite Automata

    • An NFA is a 5-tuple (S, ∑, δ, s0, F) where

      S is a finite set of states

      ∑ is a finite set of symbols, the alphabet

      δ is a mapping from S × (∑ ∪ {ε}) to sets of states

      s0 ∈ S is the start state

      F ⊆ S is the set of accepting (or final) states

    Transition Graph

    • An NFA can be diagrammatically represented by a labeled directed graph called a

    transition graph

    Transition Table

    • The mapping of an NFA can be represented in a transition table


    The Language Defined by an NFA

    • An NFA accepts an input string x if and only if there is some path with edges

    labeled with symbols from x in sequence from the start state to some accepting

    state in the transition graph

    • A state transition from one state to another on the path is called a move

    • The language defined by an NFA is the set of input strings it accepts, such as

    (a|b)*abb for the example NFA

    Converting RE to NFA

    • This is one way to convert a regular expression into an NFA.
    • There can be other (more efficient) ways to do the conversion.
    • Thompson's Construction is a simple and systematic method.
    • It guarantees that the resulting NFA will have exactly one final state and one start state.
    • Construction starts from the simplest parts (alphabet symbols).
    • To create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined to create its NFA.

    From Regular Expression to ε-NFA


    Example:

    For the RE (a|b)*a, the NFA construction is shown below.
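    As a sketch of how the construction can be coded (the data layout and helper names below are assumptions for illustration, not part of the text), each helper builds an NFA fragment with exactly one start state and one final state, and ε-edges glue the sub-NFAs together:

        #include <stdio.h>
        #include <stdlib.h>

        #define EPS 0                       /* label used for epsilon edges */

        struct state {
            int c1, c2;                     /* labels of up to two out-edges */
            struct state *out1, *out2;      /* NULL means the edge is absent */
        };
        struct frag { struct state *start, *final; };

        static struct state *new_state(void) {
            return calloc(1, sizeof(struct state));
        }

        /* NFA for a single alphabet symbol a:  start --a--> final */
        static struct frag sym(int a) {
            struct frag f = { new_state(), new_state() };
            f.start->c1 = a; f.start->out1 = f.final;
            return f;
        }

        /* rs: the final state of r reaches the start of s by an epsilon edge */
        static struct frag cat(struct frag r, struct frag s) {
            r.final->c1 = EPS; r.final->out1 = s.start;
            return (struct frag){ r.start, s.final };
        }

        /* r|s: fresh start/final states with epsilon edges around both branches */
        static struct frag alt(struct frag r, struct frag s) {
            struct frag f = { new_state(), new_state() };
            f.start->c1 = EPS; f.start->out1 = r.start;
            f.start->c2 = EPS; f.start->out2 = s.start;
            r.final->c1 = EPS; r.final->out1 = f.final;
            s.final->c1 = EPS; s.final->out1 = f.final;
            return f;
        }

        /* r*: epsilon edges allow zero or more passes through r */
        static struct frag star(struct frag r) {
            struct frag f = { new_state(), new_state() };
            f.start->c1 = EPS; f.start->out1 = r.start;
            f.start->c2 = EPS; f.start->out2 = f.final;
            r.final->c1 = EPS; r.final->out1 = r.start;
            r.final->c2 = EPS; r.final->out2 = f.final;
            return f;
        }

        int main(void) {
            /* build the NFA for (a|b)*a bottom-up, as in the example above */
            struct frag n = cat(star(alt(sym('a'), sym('b'))), sym('a'));
            printf("built NFA: start=%p final=%p\n", (void *)n.start, (void *)n.final);
            return 0;
        }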

    Combining the NFAs of a Set of Regular Expressions

    Deterministic Finite Automata

    • A deterministic finite automaton is a special case of an NFA

      – No state has an ε-transition

      – For each state s and input symbol a there is at most one edge labeled a leaving s


    • Each entry in the transition table is a single state

    – At most one path exists to accept a string

    – Simulation algorithm is simple

    Example DFA

    A DFA that accepts (a|b)*abb
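    A table-driven C sketch of simulating this DFA (the state numbering is an assumption; state 3 is the accepting state):

        #include <stdio.h>

        /* nxt[state][input], where input 0 means 'a' and input 1 means 'b' */
        static const int nxt[4][2] = {
            {1, 0},   /* state 0 (start) */
            {1, 2},   /* state 1 */
            {1, 3},   /* state 2 */
            {1, 0},   /* state 3 (accepting) */
        };

        static int accepts(const char *s) {
            int state = 0;
            for (; *s; s++) {
                if (*s != 'a' && *s != 'b') return 0;  /* not in the alphabet */
                state = nxt[state][*s == 'b'];
            }
            return state == 3;
        }

        int main(void) {
            printf("%d %d\n", accepts("abababb"), accepts("abab"));  /* 1 0 */
            return 0;
        }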

    Conversion of an NFA into a DFA

    • The subset construction algorithm converts an NFA into a DFA using:

      ε-closure(s) = {s} ∪ { t | s can reach t by ε-transitions alone }

      ε-closure(T) = the union of ε-closure(s) for all s ∈ T

      move(T,a) = { t | s →a t for some s ∈ T }

    • The algorithm produces:

      Dstates, the set of states of the new DFA, consisting of sets of states of the NFA

      Dtran, the transition table of the new DFA

    ε-closure and move Examples

    ε-closure({0}) = {0,1,3,7}

    move({0,1,3,7},a) = {2,4,7}

    ε-closure({2,4,7}) = {2,4,7}

    move({2,4,7},a) = {7}

    ε-closure({7}) = {7}

    move({7},b) = {8}

    ε-closure({8}) = {8}

    move({8},a) = ∅
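    Both operations are easy to code with bitmasks. The sketch below is an illustration under assumptions (NFA states 0..10 for the Thompson NFA of (a|b)*abb, with state 10 accepting — a different NFA from the one behind the numbers above); it computes ε-closure as a fixed point and move as a union of per-state transitions:

        #include <stdio.h>

        #define NSTATES 11
        typedef unsigned set;                 /* bit i set <=> NFA state i in the set */

        /* eps[i] = states reachable from i by one epsilon edge */
        static const set eps[NSTATES] = {
            [0] = 1u<<1 | 1u<<7, [1] = 1u<<2 | 1u<<4, [3] = 1u<<6,
            [5] = 1u<<6,         [6] = 1u<<1 | 1u<<7,
        };
        /* on[i][c]: labeled transitions, c 0 = 'a', c 1 = 'b' */
        static const set on[NSTATES][2] = {
            [2] = {1u<<3, 0}, [4] = {0, 1u<<5},
            [7] = {1u<<8, 0}, [8] = {0, 1u<<9}, [9] = {0, 1u<<10},
        };

        static set eps_closure(set T) {
            set prev = 0;
            while (T != prev) {               /* grow until a fixed point */
                prev = T;
                for (int i = 0; i < NSTATES; i++)
                    if (T & (1u << i)) T |= eps[i];
            }
            return T;
        }

        static set move(set T, int c) {
            set r = 0;
            for (int i = 0; i < NSTATES; i++)
                if (T & (1u << i)) r |= on[i][c];
            return r;
        }

        int main(void) {
            set A = eps_closure(1u << 0);     /* the DFA start state */
            set B = eps_closure(move(A, 0));  /* Dtran[A, a] */
            set D = eps_closure(move(eps_closure(move(B, 1)), 1)); /* then b, b */
            printf("A=%#x B=%#x accepting=%d\n", A, B, (int)((D >> 10) & 1));
            return 0;
        }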


    Subset Construction Example 1

    Subset Construction Example 2


    Minimizing the number of states of a DFA

    Hopcroft’s Algorithm

    • Input: A DFA M with set of states S, set of inputs ∑, transition function δ, start state s0, and set of accepting states F.

    • Output: A DFA M' accepting the same language as M and having as few states as possible.

    • Method:

    • Step 1: Construct an initial partition P of the states with two groups: the accepting states (F) and the non-accepting states (S-F).

    • Step 2: Apply the following procedure (construction of Pnew) to construct a new partition Pnew.

    Procedure for Pnew construction

    • For each group G of P, partition G into subgroups such that two states s and t are in the same subgroup if and only if, for all input symbols a, states s and t have transitions on a to states in the same group of P.

    • Replace G in Pnew by the set of all subgroups formed.

    • Step 3: If Pnew = P, proceed to step 4. Otherwise repeat step 2 with P = Pnew.

    • Step 4: Choose one state in each group as the representative of that group; these representatives are the states of M'.

    • Step 5: If M' has a dead state or an unreachable state, remove those states. (A dead state is a non-accepting state that has transitions to itself on all inputs. An unreachable state is any state not reachable from the start state.)

    • Step 6: Complete.


    Example:

    • The DFA for (a|b)*abb

    Applying Minimization

    Lexical Errors

    Lexical errors are the errors thrown by your lexer when it is unable to continue, which means that there is no way to recognise a lexeme as a valid token for your lexer. Syntax errors, on the other hand, are thrown by your parser when a given set of already recognised valid tokens does not match any of the right sides of your grammar rules. A simple panic-mode error handling system requires that we return to a high-level parsing function when a parsing or lexical error is detected.

    Error-recovery actions are:

    i. Delete one character from the remaining input.

    ii. Insert a missing character into the remaining input.

    iii. Replace a character by another character.

    iv. Transpose two adjacent characters.

    Definition Of Context Free Grammar (CFG)

    A CFG contains terminals, non-terminals, a start symbol, and productions.

    Terminals are the basic symbols from which strings are formed.

    Non-terminals are syntactic variables that denote sets of strings.

    In a grammar, one non-terminal is distinguished as the start symbol, and the set of strings it denotes is the language defined by the grammar.

    The productions of the grammar specify the manner in which the terminals and non-terminals can be combined to form strings.

    Each production consists of a non-terminal, followed by an arrow, followed by a string of non-terminals and terminals.

    Definition of Symbol Table

    A symbol table is an extensible array of records: each identifier is stored with an associated record containing collected information about that identifier.

    FUNCTION identify (identifier name)

    RETURNING a pointer to identifier information, which contains:

    The actual string

    A macro definition

    A keyword definition

    A list of type, variable & function definitions

    A list of structure and union name definitions

    A list of structure and union field selector definitions
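    A minimal C sketch of such a table (the record layout and growth policy are assumptions for illustration): an extensible array of records plus an identify() that looks an identifier up, inserting it on first sight:

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct id_info {
            char *name;      /* the actual string */
            int   kind;      /* e.g. variable, function, keyword, ... */
        };

        static struct id_info *table;         /* the extensible array */
        static size_t used, cap;

        /* Return the record for name, inserting a new one on first sight. */
        static struct id_info *identify(const char *name) {
            for (size_t i = 0; i < used; i++)
                if (strcmp(table[i].name, name) == 0)
                    return &table[i];
            if (used == cap) {                /* grow the array when full */
                cap = cap ? 2 * cap : 8;
                table = realloc(table, cap * sizeof *table);
            }
            table[used].name = malloc(strlen(name) + 1);
            strcpy(table[used].name, name);
            table[used].kind = 0;
            return &table[used++];
        }

        int main(void) {
            identify("x")->kind = 1;          /* first sight: inserted */
            printf("kind=%d same=%d\n", identify("x")->kind,
                   identify("x") == identify("x"));
            return 0;
        }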


    A language for specifying lexical analyzer

    Lex specifications

    A Lex program (the .l file ) consists of three parts:

    declarations

    %%

    translation rules

    %%

    auxiliary procedures

    1. The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier that is declared to represent a constant, e.g. #define PIE 3.14), and regular definitions.

    2. The translation rules of a Lex program are statements of the form :

    p1 {action 1}

    p2 {action 2}

    p3 {action 3}

    … …

    … …

    where each p is a regular expression and each action is a program fragment describing

    what action the lexical analyzer should take when a pattern p matches a lexeme. In

    Lex the actions are written in C.

    3. The third section holds whatever auxiliary procedures are needed by the actions.

    Alternatively these procedures can be compiled separately and loaded with the lexical

    analyzer.
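    Putting the three parts together, a minimal Lex specification in this layout might look as follows (a sketch: the token codes ID and NUM are assumed for illustration; the actions are ordinary C, as described above):

        %{
        /* declarations section: a literal C block plus assumed token codes */
        #include <stdlib.h>
        #define ID  1
        #define NUM 2
        int yylval;                     /* holds the value of a NUM token */
        %}
        letter  [A-Za-z]
        digit   [0-9]
        %%
        {letter}({letter}|{digit})*   { return ID; }
        {digit}+                      { yylval = atoi(yytext); return NUM; }
        [ \t\n]                       { /* skip white space */ }
        %%
        int yywrap(void) { return 1; }  /* auxiliary procedure: stop at EOF */

    The generated yylex() returns ID or NUM each time it matches a lexeme, and can be linked with a parser that consumes these token codes.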

    Input Buffering

    The LA scans the characters of the source program one at a time to discover tokens. Because a large amount of time can be consumed scanning characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.

    Buffering techniques:

    1. Buffer pairs

    2. Sentinels


    The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. The figure shows a buffer divided into two halves of, say, 100 characters each. One pointer marks the beginning of the token being discovered. A look-ahead pointer scans ahead of the beginning point until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read. In practice each buffering scheme adopts one convention: either a pointer is at the symbol last read, or at the symbol it is ready to read.

    The distance which the look-ahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see

    DECLARE (ARG1, ARG2, …, ARGn)

    without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case, the token itself ends at the second E. If the look-ahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file. Since the buffer shown in the figure is of limited size, there is an implied constraint on how much look-ahead can be used before the next token is discovered. In the above example, if the look-ahead traveled to the left half and all the way through the left half to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that the amount of look-ahead is limited.
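    A C sketch of the buffer-pair scheme with sentinels (the buffer size, names, and the choice of '\0' as the sentinel are assumptions; the scheme also assumes the sentinel byte does not occur inside the source text):

        #include <stdio.h>

        #define HALF 4096
        #define SENTINEL '\0'               /* marks the end of a buffer half */

        static char buf[2 * HALF + 2];      /* two halves, each ending in a sentinel */
        static char *forward = buf;         /* the look-ahead pointer */
        static FILE *src;

        static void fill(char *half) {
            size_t n = fread(half, 1, HALF, src);
            half[n] = SENTINEL;             /* sentinel right after the data */
        }

        /* Advance the look-ahead pointer; reload a half when its sentinel is hit.
           The common case costs a single comparison per character. */
        static int next_char(void) {
            char c = *forward++;
            if (c != SENTINEL) return (unsigned char)c;
            if (forward == buf + HALF + 1) {        /* end of the first half */
                fill(buf + HALF + 1);
                return next_char();
            }
            if (forward == buf + 2 * HALF + 2) {    /* end of the second half */
                fill(buf);                          /* wrap around */
                forward = buf;
                return next_char();
            }
            return EOF;                     /* sentinel inside data: real end */
        }

        int main(void) {
            src = stdin;
            fill(buf);                      /* prime the first half */
            long count = 0;
            while (next_char() != EOF) count++;
            printf("%ld characters\n", count);
            return 0;
        }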


    Implementing a lexical analyzer with Lex

    Lex is a popular scanner (lexical analyzer) generator
    o Developed by M.E. Lesk and E. Schmidt of AT&T Bell Labs
    o Other versions of Lex exist, most notably flex (for Fast Lex)

    Input to Lex is called a Lex specification or Lex program
    o Lex generates a scanner module in C from a Lex specification file
    o The scanner module can be compiled and linked with other C/C++ modules

    Commands:
    o lex filename.l
    o cc -c lex.yy.c
    o cc lex.yy.o other.o -o scan
    o scan infile outfile

    Lex Specification

    A Lex specification file consists of three sections:

    definition section

    %%

    rules section

    %%

    auxiliary functions

    – The definition section contains a literal block and regular definitions.

    – The literal block is C code delimited by %{ and %}; it contains variable declarations and function prototypes.

    – A regular definition gives a name to a regular expression.

    – A regular definition has the form: name expression

    – A regular definition can be used by writing its name in braces: {name}

    – The rules section contains regular expressions and C code; it has the form:

      r1 action1
      r2 action2
      . . .
      rn actionn

      where each ri is a regular expression and each actioni is a C code fragment. When ri matches an input string, actioni is executed. An action should be enclosed in { } if it consists of more than one statement.

    Lex Operators

    \ C escape sequence

    \n is newline, \t is tab, \\ is backslash, \" is double quote, etc.

    * Matches zero or more of the preceding expression: x* matches ε, x, xx, ...

    + Matches one or more of the preceding expression:

    (ab)+ matches ab, abab, ababab, ...

    ? Matches zero or one occurrence of the preceding expression:

    (ab)? matches ε or ab

    | Matches the preceding or the subsequent expression: a|b matches a or b

    ( ) Used for grouping sub-expressions in a regular expression

    [ ] Matches any one of the characters within brackets

    [xyz] means (x|y|z)

    A range of characters is indicated with the dash operator (–)

    [0-9] matches any decimal digit, [A-Za-z] matches any letter

    If first character after [ is ^, it complements the character class

    [^A-Za-z] matches all characters which are NOT letters

    Meta-characters other than \ lose their meaning inside [ ]

    . Matches any single character except the newline character

    " " Matches everything within the quotation marks literally

    "x*" matches exactly x*

    Meta-characters, other than \ , lose their meaning inside " "

    C escape sequences retain their meaning inside " "


    { } {name} refers to a regular definition from the first section

    [A-Z]{3} matches strings of exactly 3 capital letters

    [A-Z]{1,3} matches strings of 1, 2, or 3 capital letters

    / The lookahead operator

    matches the left expression but only if followed by the right expression

    0/1 matches 0 in 01, but not in 02

    Only one slash is permitted per regular expression

    ^ As first character of a regular expression, ^ matches beginning of a line

    $ As last character of a regular expression, $ matches end of a line

    Same as /\n

    The scanner generator as Swiss Army Knife

    The scanner generator is the "Swiss Army knife" among compiler-writing tools.

    Features of the Swiss Army Knife approach:

    Subject it to serious scrutiny

    Strive for simplicity

    Reusable components should be a design goal

    Avoid futurities

    Avoid digressions

    Avoid quantum leaps


    UNIT II- THE SYNTACTIC SPECIFICATION OF THE PROGRAMMING

    LANGUAGES

    Programming Language Definition

    Appearance of a programming language:

    Vocabulary : Regular expressions
    Syntax : Backus-Naur Form (BNF) or Context-Free Grammar (CFG)
    Semantics : Informal language or some examples

    The Syntax and Semantics of Programming Language

    A programming language must include the specification of syntax (structure) and semantics (meaning).

    Syntax typically means the context-free syntax, because of the almost universal use of context-free grammars (CFGs)

    Ex. a = b + c is syntactically legal b + c = a is illegal

    The semantics of a programming language are commonly divided into two classes:

    Static semantics

    Semantic rules that can be checked at compile time.

    Ex. The type and number of a function’s arguments

    Runtime semantics

    Semantic rules that can be checked only at run time

    The Difference Between Syntax And Semantic

    • Syntax is the way in which we construct sentences by following principles and rules.

    • Semantics is the interpretation of, and the meaning derived from, the sentence: the transmission and understanding of the message; in other words, whether the logical sentences make sense or not.


    Syntax Definition

    To specify the syntax of a language: CFG and BNF
    o Example: the if-else statement in C has the form
      statement → if ( expression ) statement else statement

    An alphabet of a language is a set of symbols.
    o Examples: {0,1} for a binary number system (language) = {0, 1, 100, 101, ...}
    o {a,b,c} for a language = {a, b, c, ac, abcc, ...}
    o {if, (, ), else, ...} for if-statements = {if (a==1) goto 10, ...}

    A string over an alphabet
    o is a sequence of zero or more symbols from the alphabet.
    o Examples: 0, 1, 10, 00, 11, 111, ... are strings over the alphabet {0,1}
    o The null string is a string which does not have any symbols of the alphabet.

    Language
    o is a subset of all the strings over a given alphabet.
    o Alphabets Ai and languages Li for Ai:
      A0 = {0,1}            L0 = {0, 1, 100, 101, ...}
      A1 = {a,b,c}          L1 = {a, b, c, ac, abcc, ...}
      A2 = {all C tokens}   L2 = {all sentences of C programs}

    Example 2.1. Grammar for expressions consisting of digits and plus and minus signs.
    o Language of expressions L = {9-5+2, 3-1, ...}
    o The productions of the grammar for this language L are:
      list → list + digit
      list → list - digit
      list → digit
      digit → 0|1|2|3|4|5|6|7|8|9
    o list, digit : grammar variables (grammar symbols)
    o 0,1,2,3,4,5,6,7,8,9,-,+ : tokens (terminal symbols)

    Conventions for specifying a grammar
    o Terminal symbols: boldface strings, e.g. if, num, id
    o Nonterminal symbols (grammar variables): italicized names, e.g. list, digit, A, B

    Grammar G = (N, T, P, S)
    o N : a set of nonterminal symbols
    o T : a set of terminal symbols (tokens)
    o P : a set of production rules
    o S : a start symbol, S ∈ N

    Grammar G for the language L = {9-5+2, 3-1, ...}
    o G = (N, T, P, S)
    o N = {list, digit}
    o T = {0,1,2,3,4,5,6,7,8,9,-,+}
    o P : list → list + digit
          list → list - digit
          list → digit
          digit → 0|1|2|3|4|5|6|7|8|9
    o S = list

    Some definitions for a language L and its grammar G
    o Derivation: a sequence of replacements S ⇒ α1 ⇒ α2 ⇒ … ⇒ αn is a derivation of αn.
    o Example: a derivation of 1+9 from the grammar G
      left-most derivation: list ⇒ list + digit ⇒ digit + digit ⇒ 1 + digit ⇒ 1 + 9
      right-most derivation: list ⇒ list + digit ⇒ list + 9 ⇒ digit + 9 ⇒ 1 + 9

    Language of a grammar, L(G)
    o L(G) is the set of sentences that can be generated from the grammar G.
    o L(G) = { x | S ⇒* x }, where x is a sequence of terminal symbols

    Example: Consider a grammar G = (N,T,P,S):
    o N = {S}, T = {a,b}, S = S, P = { S → aSb | ε }
    o Is aabb a sentence of L(G)? (derivation of the string aabb)
    o S ⇒ aSb ⇒ aaSbb ⇒ aaεbb ⇒ aabb (i.e., S ⇒* aabb), so aabb ∈ L(G)
    o There is no derivation for aa, so aa ∉ L(G)
    o Note L(G) = { a^n b^n | n ≥ 0 }, where a^n b^n means n a's followed by n b's.
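    A tiny C sketch of this grammar in action (hypothetical, for illustration): a recursive-descent recognizer for L(G) = { a^n b^n | n ≥ 0 }, with one function for the non-terminal S mirroring S → aSb | ε:

        #include <stdio.h>

        static const char *p;            /* cursor into the input string */

        static int S(void) {
            if (*p == 'a') {             /* try S -> a S b */
                p++;
                if (!S()) return 0;
                if (*p != 'b') return 0;
                p++;
                return 1;
            }
            return 1;                    /* otherwise S -> epsilon */
        }

        static int accepts(const char *s) {
            p = s;
            return S() && *p == '\0';    /* the whole input must be consumed */
        }

        int main(void) {
            /* aabb and ab are in L(G); aab is not */
            printf("%d %d %d\n", accepts("aabb"), accepts("ab"), accepts("aab"));
            return 0;
        }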

    Syntax Analysis

    • Syntax Analysis is also called Parsing or Hierarchical Analysis.
    • A parser implements the grammar of the language, be it C, C++, etc.
    • The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language.
    • The grammar that a parser implements is called a Context-Free Grammar, or CFG.

    The Syntactic Specification of Programming Language

    Program Aspects

    Syntax: what valid programs look like.
    Semantics: what valid programs mean; what they should compute.
    A compiler must contain both kinds of information.

    A programming language must include the specification of syntax (structure) and semantics (meaning).

    Syntax typically means the context-free syntax, because of the almost universal use of context-free grammars (CFGs).

    Ex. a = b + c is syntactically legal; b + c = a is illegal.

    The semantics of a programming language are commonly divided into two classes:

    Static semantics: semantic rules that can be checked at compile time. Ex. the type and number of a function's arguments.

    Runtime semantics: semantic rules that can be checked only at run time.

    Basics of Syntax Analysis

    Syntax analysis or parsing is the second phase of a compiler.

    Syntax analyzer creates the syntactic structure of the given source program.

    This syntactic structure is mostly a parse tree.

    The syntax analyzer or parser checks whether a given source program satisfies the rules

    implied by a context-free grammar or not.

    o If it satisfies, the parser creates the parse tree of that program.

    o Otherwise the parser gives the error message.

    A context free grammar

    o Gives a precise syntactic specification of a programming language.

    o The design of the grammar is an initial phase of the design of a compiler

    o A grammar can be directly converted into a parser by some tools.

    Syntax analysis is done by the parser. o Detects whether the program is written following the grammar rules and reports

    syntax errors.

    o Produces a parse tree from which intermediate code can be generated.

    Limitations of Syntax Analyzers

    Syntax analyzers or parsers receive their inputs, in the form of tokens, from lexical analyzers. Lexical analyzers are responsible for the validity of the tokens supplied to the syntax analyzer. Syntax analyzers have the following drawbacks:

    they cannot determine if a token is valid,

    they cannot determine if a token is declared before it is being used,

    they cannot determine if a token is initialized before it is being used,

    they cannot determine if an operation performed on a token type is valid or not.

    These tasks are accomplished by the semantic analyzer, and are defined under Semantic Analysis.


    Capability of Context Free Grammar

    Context Free Grammar (CFG)

    A lexical analyzer can identify tokens with the help of regular expressions and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence, due to the limitations of regular expressions: regular expressions cannot check balanced tokens, such as parentheses. Therefore, this phase uses a context-free grammar (CFG), which is recognized by pushdown automata. A CFG, on the other hand, is a superset of regular grammar. This implies that every regular grammar is also context-free, but there exist some problems that are beyond the scope of regular grammar. CFG is a helpful tool in describing the syntax of programming languages.

    • The syntax of a programming language is described by a context-free grammar (Backus-Naur Form (BNF)).

    – Similar to the languages specified by regular expressions, but more general.
    – A grammar gives a precise syntactic specification of a language.
    – From some classes of grammars, tools exist that can automatically construct an efficient parser. These tools can also detect syntactic ambiguities and other problems automatically.
    – A compiler based on a grammatical description of a language is more easily maintained and updated.

    A context-free grammar has four components:

    1. A set of non-terminals V. Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated

    by the grammar.

    2. A set of tokens, known as terminal symbols Σ. Terminals are the basic symbols from which strings are formed.

    3. A set of productions P. The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of

    a non-terminal called the left side of the production, an arrow, and a sequence of tokens

    and/or non-terminals, called the right side of the production.

    4. One of the non-terminals is designated as the start symbol S; from where the production begins.


    The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start symbol) by the right side of a production for that non-terminal.

    • A grammar G = (N, T, P, S)
      o N is a finite set of non-terminal symbols
      o T is a finite set of terminal symbols
      o P is a finite subset of (N ∪ T)* N (N ∪ T)* × (N ∪ T)*
        • An element (α, β) ∈ P is written as α → β
      o S is a distinguished symbol in N and is called the start symbol.
      o Inherently recursive structures of a programming language are defined by a context-free grammar.
      o In a context-free grammar, we have:
        • A finite set of terminals (in our case, this will be the set of tokens)
        • A finite set of non-terminals (syntactic variables)
        • A finite set of production rules of the following form:
          A → α, where A is a non-terminal and α is a string of terminals and non-terminals (including the empty string)
        • A start symbol (one of the non-terminal symbols)
      o L(G) is the language of G (the language generated by G), which is a set of sentences.
      o A sentence of L(G) is a string of terminal symbols of G.
      o If S is the start symbol of G, then ω is a sentence of L(G) iff S ⇒* ω, where ω is a string of terminals of G.
      o If G is a context-free grammar, L(G) is a context-free language.

    • Language defined by a grammar
      o "αAβ derives αγβ in one step", denoted αAβ ⇒ αγβ, if A → γ is a production and α and β are arbitrary strings of terminal or non-terminal symbols.
      o α1 derives αm if α1 ⇒ α2 ⇒ … ⇒ αm, written α1 ⇒* αm.

    The language L(G) defined by G is the set of strings of terminals w such that S ⇒* w.

    Example


    The palindrome language cannot be described by means of a regular expression; that is, L = { w | w = wR } is not a regular language. But it can be described by means of a CFG, as illustrated below:

    G = ( V, Σ, P, S )

    Where:

    V = { Q, Z, N }

    Σ = { 0, 1 }

    P = { Q → Z | Q → N | Q → 0 | Q → 1 | Q → ℇ | Z → 0Q0 | N → 1Q1 }

    S = { Q }

    This grammar describes the palindrome language, with strings such as: 1001, 11100111, 00100, 1010101, 11111, …

    Chomsky Hierarchy (classification of grammars)

    i) A grammar is said to be regular if it is

    • right-linear, where each production in P has the form A → wB or A → w, with A and B non-terminals and w a string of terminals,

    • or left-linear.

    ii) Context-free, if each production in P is of the form A → α, where 𝐴𝜖𝑁 and 𝛼 ∈ (𝑁 ∪ 𝑇)∗

    iii) Context-sensitive, if each production in P is of the form α → β where |α| ≤ |β|

    iv) Unrestricted, if each production in P is of the form α → β where α ≠ ɛ

    • Context-free grammar is sufficient to describe most programming languages. • Example: a grammar for arithmetic expressions.

    ->

    -> ( )

    -> -

    -> id

    -> + | - | * | /

    derive -(id) from the grammar:

    => - => - () =>-(id)

    sentence: a strings of terminals that can be derived from S

    sentential form: a strings of terminals or none terminals that can be derived from S.

    derive id + id * id from the grammar: E=>E+E=>E+E*E=>E+E*id=>E+id*id=>id+id*id

    leftmost/rightmost derivation -- each step replaces leftmost/rightmost non-terminal. E=>E+E=>id+E=>id+E*E=>id+id*E=>id+id*id


    Derivation and Parse Trees

    Derivations

    Starting with start symbol

    At each step: a nonterminal replaced with the body of a production

    A derivation is basically a sequence of production rule applications, in order to get the input string. During parsing, we make two decisions for some sentential form of the input:

    o Deciding which non-terminal is to be replaced.
    o Deciding the production rule by which the non-terminal will be replaced.

    To decide which non-terminal to replace with a production rule, we have two options.

    Left-most Derivation

    If the sentential form of an input is scanned and replaced from left to right, it is called a left-most derivation. The sentential form derived by the left-most derivation is called the left-sentential form.

    Right-most Derivation

    If we scan and replace the input with production rules from right to left, it is known as a right-most derivation. The sentential form derived from the right-most derivation is called the right-sentential form.

    Example

    Production rules:

    E → E + E
    E → E * E
    E → id

    Input string: id + id * id

    The left-most derivation is:

    E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

    Notice that the left-most non-terminal is always processed first.

    The right-most derivation is:

    E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id

    Definition : Derivation

    o In general a derivation step is αAβ ⇒ αγβ if there is a production rule A→γ in our grammar where α and β are arbitrary strings of terminal and non-terminal

    symbols.


    o α1 ⇒ α2 ⇒ ... ⇒ αn (αn derives from α1, or α1 derives αn)
    o At each derivation step, we can choose any of the non-terminals in the sentential form of G for the replacement.

    o If we always choose the left-most non-terminal in each derivation step, this derivation is called as left-most derivation.

    Example:

    E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)

    o If we always choose the right-most non-terminal in each derivation step, this derivation is called as right-most derivation.

    Example:

    E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)

    o We will see that the top-down parsers try to find the left-most derivation of the given source program.

    o We will see that the bottom-up parsers try to find the right-most derivation of the given source program in the reverse order.

    More on Derivations

    Example


    Parse Tree

    A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived from the start symbol.

    A parse tree pictorially shows how the start symbol of a grammar derives a specific string in the language.

    It filters out the order in which non-terminals are replaced; there is a many-to-one relationship between derivations and parse trees.

    Given a context-free grammar, a parse tree has the following properties:
    o The root is labeled by the start symbol.
    o Each leaf is labeled by a token or the empty string.
    o Each interior node is labeled by a non-terminal.
    o If A is the non-terminal labeling some interior node and X1, X2, …, Xn are the labels of the children of that node from left to right, then A → X1X2…Xn is a production of the grammar.

    Example 1: Construct the Parse Tree for –(id+id)

    The left-most derivation for –(id+id):

    E ⇒ −E ⇒ −(E) ⇒ −(E + E) ⇒ −(id + E) ⇒ −(id + id)


    Parse Tree for –(id+id)

    Example 2: Construct a Parse Tree for id+id*id

    The left-most derivation for id+id*id:

    E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

    Parse Tree for id+id*id


    Ambiguity

    A grammar that produces more than one parse tree for some sentence is called an ambiguous grammar.

    For most parsers, the grammar must be unambiguous.

    An unambiguous grammar gives a unique selection of the parse tree for a sentence.

    We should eliminate the ambiguity in the grammar during the design phase of the compiler: an unambiguous grammar should be written to eliminate the ambiguity.

    We have to prefer one of the parse trees of a sentence (generated by an ambiguous grammar) to disambiguate that grammar, restricting it to this choice.

    Ambiguous grammars (ambiguous because of their operators) can be disambiguated according to precedence and associativity rules.

    Example Production Rules:

    E → E + E
    E → E * E
    E → id

    For the string id+id*id, the above grammar produces two parse trees.

    Two parse trees for id+id*id for the above grammar

    A language is said to be inherently ambiguous if every grammar for it is ambiguous.

    Ambiguity in a grammar is not good for compiler construction.

    No method can detect and remove ambiguity automatically, but it can be removed by either

    i) re-writing the whole grammar without ambiguity, or ii) by setting and following associativity and precedence constraints.

    Associativity

    If an operand has operators on both sides, the side on which the operator takes this operand is decided by the associativity of those operators. If the operation is left-associative, the operand will be taken by the left operator; if the operation is right-associative, the right operator will take the operand.

    i) Left Associative: operations such as Addition, Multiplication, Subtraction, and Division are left-associative. If the expression contains id op id op id, it will be evaluated as (id op id) op id. For example, id + id + id is evaluated as (id + id) + id.

    ii) Right Associative: operations like Exponentiation are right-associative, i.e., the order of evaluation in the same expression will be id op (id op id). For example, id ^ id ^ id is evaluated as id ^ (id ^ id).


    Precedence

    o If two different operators share a common operand, the precedence of the operators decides which will take the operand.
    o Use the precedence of operators as follows:

      ^ (right to left)
      * (left to right)
      + (left to right)

    Both the Associativity and Precedence decrease the chances of ambiguity in a language or its grammar.

    Example To disambiguate the grammar E → E+E | E*E | E^E | id | (E), use precedence of

    operators as follows:

    ^ (right to left)

    * (left to right)

    + (left to right)

    We get the following unambiguous grammar:

    E → E+T | T

    T → T*F | F

    F → G^F | G

    G → id | (E)

    Left Recursion

    o A grammar becomes left-recursive if it has any non-terminal 'A' whose derivation contains 'A' itself as the left-most symbol.
    o A left-recursive grammar is considered to be a problematic situation for top-down parsers.
    o Top-down parsers start parsing from the start symbol, which in itself is a non-terminal.
    o So, when the parser encounters the same non-terminal in its derivation, it becomes hard for it to judge when to stop parsing the left non-terminal, and it goes into an infinite loop.
    o A grammar is left-recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α.
    o Top-down parsing techniques cannot handle left-recursive grammars.
    o So, we have to convert our left-recursive grammar into an equivalent grammar which is not left-recursive.
    o The left-recursion may appear in a single step of the derivation (immediate left-recursion), or may appear in more than one step of the derivation.

    Example:

    (1) A => Aα | β

    (2) S => Aα | β

    A => Sd


    (1) is an example of immediate left recursion, where A is a non-terminal symbol and α represents a string of terminals and non-terminals.

    (2) is an example of indirect left recursion.

    A top-down parser will first parse A, which in turn will yield a string consisting of A itself, and the parser may go into a loop forever.

    Immediate Left Recursion and Its Elimination

    A → Aα | β (where β does not start with A)

    ⇓ eliminate immediate left recursion

    A → β A'
    A' → α A' | ε (an equivalent grammar)

    In general,

    A → Aα1 | ... | Aαm | β1 | ... | βn (where β1 ... βn do not start with A)

    ⇓ eliminate immediate left recursion

    A → β1 A' | ... | βn A'
    A' → α1 A' | ... | αm A' | ε (an equivalent grammar)


    Example:

    E → E+T | T

    T → T*F | F

    F → id | (E)

    ⇓ Eliminate immediate left recursion

    E → T E’

    E’ → +T E’ | ε

    T → F T’

    T’ → *F T’ | ε

    F → id | (E)
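    Because this grammar is no longer left-recursive, a top-down parser can be written directly from it. Here is a C sketch of a recursive-descent (predictive) parser for the grammar above, one function per non-terminal ('i' stands for the token id; the names are assumptions for illustration):

        #include <stdio.h>

        static const char *p;            /* next input token */
        static int ok;

        static void E(void);
        static void match(char t) { if (*p == t) p++; else ok = 0; }

        static void F(void) {                       /* F -> id | ( E ) */
            if (*p == '(') { match('('); E(); match(')'); }
            else match('i');
        }
        static void Tp(void) {                      /* T' -> *F T' | epsilon */
            if (*p == '*') { match('*'); F(); Tp(); }
        }
        static void T(void)  { F(); Tp(); }         /* T -> F T' */
        static void Ep(void) {                      /* E' -> +T E' | epsilon */
            if (*p == '+') { match('+'); T(); Ep(); }
        }
        static void E(void)  { T(); Ep(); }         /* E -> T E' */

        static int parse(const char *s) { p = s; ok = 1; E(); return ok && *p == '\0'; }

        int main(void) {
            printf("%d %d\n", parse("i+i*i"), parse("i+*i"));   /* 1 0 */
            return 0;
        }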

    A grammar may contain no immediate left recursion and still be left-recursive.

    By just eliminating the immediate left-recursion, we may not get a grammar which is not left-recursive.

    Example:

    S → Aa | b

    A → Sc | d

    This grammar is not immediately left-recursive, but it is still left-recursive.

    S ⇒ Aa ⇒ Sca, or

    A ⇒ Sc ⇒ Aac, causes a left-recursion. So, we have to eliminate all left-recursions from the grammar.


    Elimination of all left recursion

    Arrange non-terminals in some order: A1 ... An

    for i from 1 to n do

    {

    for j from 1 to i-1 do

    {

    replace each production

    Ai → Aj γ

    by

    Ai → α1 γ | ... | αk γ

    where Aj → α1 | ... | αk

    }

    Eliminate immediate left-recursions among Ai productions

    }

    Example:

    S → Aa | b

    A → Ac | Sd | f

    Case 1: Order of non-terminals: S, A

    for S:

    We do not enter the inner loop; there is no immediate left recursion in S.

    for A:

    Replace A → Sd with A → Aad | bd.
    So, we will have A → Ac | Aad | bd | f.
    Eliminate the immediate left recursion in A:

    A → bdA’ | fA’

    A’ → cA’ | adA’ | ε

    So, the resulting equivalent grammar which is not left-recursive is:

    S → Aa | b

    A → bdA’ | fA’

    A’ → cA’ | adA’ | ε

    Case 2: Order of non-terminals: A, S

    for A:

    We do not enter the inner loop. Eliminate the immediate left recursion in A:

    A → SdA’ | fA’

    A’ → cA’ | ε

    for S:

    Replace S → Aa with S → SdA’a | fA’a


    So, we will have S → SdA'a | fA'a | b.
    Eliminate the immediate left recursion in S:

    S → fA’aS’ | bS’

    S’ → dA’aS’ | ε

    So, the resulting equivalent grammar which is not left-recursive is:

    S → fA’aS’ | bS’

    S’ → dA’aS’ | ε

    A → SdA’ | fA’

    A’ → cA’ | ε
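    The full algorithm can be sketched in Python as well (our own code, not from the text, reusing eliminate_immediate_left_recursion from the sketch above):

    def eliminate_left_recursion(grammar, order):
        # `order` lists the non-terminals A1 ... An in the chosen order
        for i, Ai in enumerate(order):
            for Aj in order[:i]:
                # replace each Ai -> Aj g by Ai -> a1 g | ... | ak g,
                # where Aj -> a1 | ... | ak
                new_alts = []
                for alt in grammar[Ai]:
                    if alt and alt[0] == Aj:
                        new_alts += [alpha + alt[1:] for alpha in grammar[Aj]]
                    else:
                        new_alts.append(alt)
                grammar[Ai] = new_alts
            eliminate_immediate_left_recursion(grammar, Ai)

    g = {'S': [['A', 'a'], ['b']],
         'A': [['A', 'c'], ['S', 'd'], ['f']]}
    eliminate_left_recursion(g, ['S', 'A'])      # Case 1 above
    # g now holds S -> Aa | b ; A -> bdA' | fA' ; A' -> cA' | adA' | ε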

    Left Factoring

    If two or more production rules of a grammar have a common prefix string, then a top-down parser cannot make a choice as to which production it should take to parse the string in hand: both productions start with the same terminal or non-terminal. To remove this confusion, we use a technique called left factoring.

    Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we make one production for each common prefix, and the rest of the derivation is added by new productions.

    A predictive parser (a top-down parser without backtracking) insists that the grammar must be left-factored.

    Example: consider the grammar

      stmt → if expr then stmt else stmt
           | if expr then stmt

    When we see if, we cannot know which production rule to choose to re-write stmt in the derivation; we need a new equivalent grammar suitable for predictive parsing.

    In general,

      A → αβ1 | αβ2   where α is non-empty and the first symbols of β1 and β2 (if they have one) are different.

    When processing α, we cannot know whether to expand A to αβ1 or to αβ2.

    But if we re-write the grammar as follows:

      A → αA'
      A' → β1 | β2

    we can immediately expand A to αA'.

    Algorithm: For each non-terminal A with two or more alternatives (production rules) with a common non-empty prefix, say

      A → αβ1 | ... | αβn | γ1 | ... | γm

    convert it into

      A → αA' | γ1 | ... | γm
      A' → β1 | ... | βn

    Example 1:

    A → abB | aB | cdg | cdeB | cdfB


    ⇓ A → aA’ | cdg | cdeB | cdfB

    A’ → bB | B

    ⇓ A → aA’ | cdA’’

    A’ → bB | B

    A’’ → g | eB | fB

    Example 2:

    A → ad | a | ab | abc | b

    ⇓ A → aA’ | b

    A’ → d | ε | b | bc

    ⇓ A → aA’ | b

    A’ → d | ε | bA’’

    A’’ → ε | c
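    A left-factoring sketch in the same style (our own code, not from the text; it groups alternatives by their first symbol and factors out the longest common prefix, exactly as in the examples above):

    def common_prefix(alts):
        # longest sequence of symbols shared by every alternative in `alts`
        prefix = []
        for column in zip(*alts):
            if all(sym == column[0] for sym in column):
                prefix.append(column[0])
            else:
                break
        return prefix

    def left_factor(grammar, A):
        # Repeatedly factor the alternatives that share a common first symbol,
        # introducing fresh non-terminals A', A'', ... as in the examples above.
        ticks = 0
        work = [A]
        while work:
            head = work.pop()
            groups = {}
            for alt in grammar[head]:
                groups.setdefault(alt[0] if alt else None, []).append(alt)
            new_alts = []
            for first, alts in groups.items():
                if first is None or len(alts) == 1:
                    new_alts.extend(alts)           # nothing to factor here
                    continue
                alpha = common_prefix(alts)         # the common prefix to pull out
                ticks += 1
                fresh = A + "'" * ticks
                grammar[fresh] = [alt[len(alpha):] for alt in alts]   # may hold ε = []
                new_alts.append(alpha + [fresh])
                work.append(fresh)                  # the fresh non-terminal may need factoring too
            grammar[head] = new_alts

    g = {'A': [['a','b','B'], ['a','B'], ['c','d','g'], ['c','d','e','B'], ['c','d','f','B']]}
    left_factor(g, 'A')
    # g now holds A -> aA' | cdA'' ; A' -> bB | B ; A'' -> g | eB | fB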



    Unit III

    Parsers

    Parser

    A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams.

    The parser analyzes the source code token stream against the production rules to detect any errors in the code.

    The output of the phase is a parse tree.

    This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and generating a parse tree as the output of the phase.

    Parsers are expected to parse the whole code even if some errors exist in the program.

    Parsers use error recovering strategies.

    Three general types of parsers

    Universal parsing methods:

    o can parse any grammar

    o too inefficient to use in production compilers

    Top-down methods:

    o Parse-trees built from root to leaves.

    o Input to parser scanned from left to right one symbol at a time

    Bottom-up methods:

    o Start from leaves and work their way up to the root.

    o Input to parser scanned from left to right one symbol at a time


    Given a formal syntax specification (typically as a context-free grammar, CFG), the parser reads tokens and groups them into units as specified by the productions of the CFG being used.

    As syntactic structure is recognized, the parser either calls corresponding semantic

    routines directly or builds a syntax tree.

    CFG ( Context-Free Grammar )

    BNF ( Backus-Naur Form )

    GAA ( Grammar Analysis Algorithms )

    LL, LR, SLR, LALR Parsers

    YACC

    TOP-DOWN PARSING

    • Constructing a parse tree for an input string starting from the root.
    • The parse tree is built in preorder (depth-first).
    • Finding a left-most derivation.
    • At each step of a top-down parse:
      o determine the production to be applied;
      o match terminal symbols in the production body with the input string.

    The parse tree is created top to bottom.

    Top-down parser

    Recursive-Descent Parsing
      o Backtracking is needed (if a choice of a production rule does not work, we backtrack to try other alternatives).
      o It is a general parsing technique, but not widely used.
      o Not efficient.

    Predictive Parsing
      o No backtracking.
      o Efficient.
      o Needs a special form of grammar, i.e. LL(1) grammars.


    o Recursive Predictive Parsing is a special form of Recursive Descent parsing without backtracking.

    o Non-Recursive (Table Driven) Predictive Parser is also known as LL (1) parser.

    Top-Down Parsing

    1. Recursive-Descent Parsing (uses backtracking)
       • Backtracking is needed.
       • It tries to find the left-most derivation.
    2. Predictive Parsing (no backtracking)
       i. Recursive Predictive Parser
       ii. Non-recursive Predictive Parser

    Algorithm of Recursive Descent Parsing

    Example 1:

    If the grammar is S → aBc; B → bc | b and the input is abc:

    The parser first expands S to aBc and matches a. For B it tries the first alternative B → bc: b matches, but the c of B consumes the input's final c, and the trailing c of S → aBc then finds no input left, so this attempt fails. The parser backtracks and tries B → b instead; now a, b and c all match, and the successful parse tree is S(a, B(b), c).
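    A minimal backtracking recursive-descent recognizer for this grammar (our own sketch, not part of the text):

    def parse(tokens):
        # Backtracking recursive-descent recognizer for S -> aBc ; B -> bc | b.
        def match(pos, tok):
            # return the next position if tokens[pos] is tok, else None
            return pos + 1 if pos < len(tokens) and tokens[pos] == tok else None

        def B(pos):
            # yield every position where a parse of B starting at pos can end
            p = match(pos, 'b')
            if p is not None:
                q = match(p, 'c')
                if q is not None:
                    yield q          # alternative B -> bc
                yield p              # alternative B -> b

        def S(pos):
            p = match(pos, 'a')
            if p is None:
                return None
            for q in B(p):           # backtracking point: try each way B can match
                r = match(q, 'c')
                if r is not None:
                    return r         # S -> aBc succeeded
            return None

        return S(0) == len(tokens)

    print(parse(list('abc')))        # True: B -> bc fails on the final c, so the
                                     # parser backtracks and uses B -> b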



    2. Predictive Parser

    When re-writing a non-terminal in a derivation step, a predictive parser can uniquely choose a production rule by just looking at the current symbol in the input string.

    Example:

    stmt → if ...... |

    while ...... |

    begin ...... |

    for .....


    If the current token is if, we have to choose the first production rule: when we are trying to re-write the non-terminal stmt, we can uniquely choose the production rule by just looking at the current token.

    Even after we eliminate the left recursion in a grammar and left-factor it, the result may still not be suitable for predictive parsing (it may not be an LL(1) grammar).

    3. Recursive Predictive Parsing

    Each non-terminal corresponds to a procedure.

    Example:

    A → aBb | bAB

    proc A

    {

    case of the current token

    {

    ‘a’: - match the current token with a, and move to the next token;

    - call ‘B’;

    - match the current token with b, and move to the next token;

    ‘b’: - match the current token with b, and move to the next token;

    - call ‘A’;

    - call ‘B’;

    }

    }

    Applying ε-productions

    A → aA | bB | ε

    If all other productions fail, we should apply an ε-production. For example, if the current token is not a or b, we may apply the ε-production.

    The most correct choice: we should apply an ε-production for a non-terminal A only when the current token is in the FOLLOW set of A (the terminals that can follow A in sentential forms).

    Example:

    A → aBe | cBd | C

    B → bB | ε

    C → f

    proc A

    {

    case of the current token

    {

    a: - match the current token with a and move to the next token;


    - call B;

    - match the current token with e and move to the next token;

    c: - match the current token with c and move to the next token;

    - call B;

    - match the current token with d and move to the next token;

    f: - call C //First Set of C

    }

    }

    proc C

    {

    match the current token with f and move to the next token;

    }

    proc B

    {

    case of the current token

    {

    b: - match the current token with b and move to the next token;

    - call B

    e,d: - do nothing //Follow Set of B

    }

    }
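    The same procedures can be written as a runnable recursive predictive parser (our own sketch, not from the text) for A → aBe | cBd | C; B → bB | ε; C → f:

    def parse(tokens):
        tokens = list(tokens) + ['$']
        pos = 0
        def look():
            return tokens[pos]
        def eat(t):
            nonlocal pos
            if look() != t:
                raise SyntaxError(f"expected {t}, found {look()}")
            pos += 1
        def A():
            if look() == 'a':
                eat('a'); B(); eat('e')            # A -> aBe
            elif look() == 'c':
                eat('c'); B(); eat('d')            # A -> cBd
            elif look() == 'f':                    # FIRST(C) = { f }
                C()                                # A -> C
            else:
                raise SyntaxError(f"unexpected {look()}")
        def B():
            if look() == 'b':
                eat('b'); B()                      # B -> bB
            elif look() in ('e', 'd'):             # FOLLOW(B) = { e, d }
                pass                               # B -> ε
            else:
                raise SyntaxError(f"unexpected {look()}")
        def C():
            eat('f')                               # C -> f
        A()
        eat('$')                                   # input must be exhausted
        return True

    print(parse('abbe'))    # True
    print(parse('cd'))      # True: B -> ε chosen because d is in FOLLOW(B)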

    4. Non-Recursive Predictive Parsing - LL(1) Parser

    Non-recursive predictive parsing is table driven.

    • It is a top-down parser.
    • It is also known as an LL(1) parser.

    • Input buffer
      o the string to be parsed; we will assume that its end is marked with a special symbol $.


    • Output
      o a production rule representing a step of the derivation sequence (left-most derivation) of the string in the input buffer.

    • Stack
      o contains the grammar symbols;
      o at the bottom of the stack, there is a special end-marker symbol $;
      o initially the stack contains only the symbol $ and the starting symbol S ($S is the initial stack);
      o when the stack is emptied (i.e. only $ is left in the stack), parsing is complete.

    • Parsing table
      o a two-dimensional array M[A, a];
      o each row is a non-terminal symbol;
      o each column is a terminal symbol or the special symbol $;
      o each entry holds a production rule.

    Parser Actions

    The symbol at the top of the stack (say X) and the current symbol in the input string (say a) determine the parser action. There are four possible parser actions:

    1. If X = a = $, the parser halts and announces successful completion of parsing (accept).
    2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol (match).
    3. If X is a non-terminal, the parser consults entry M[X, a] of the parsing table. If M[X, a] holds a production X → Y1Y2...Yk, the parser replaces X on top of the stack by Yk ... Y2Y1 (with Y1 on top) and outputs the production.
    4. Otherwise (X is a terminal different from a, or M[X, a] is empty), the parser reports an error.

    Example:

    For the grammar S → aBa; B → bB | ε and the following LL(1) parsing table:
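    The table figure is not reproduced here; the entries below are reconstructed by hand from FIRST/FOLLOW (FOLLOW(B) = {a}), and the loop implements the four actions listed above (our own sketch):

    # LL(1) table for S -> aBa ; B -> bB | ε   (entries reconstructed by hand)
    TABLE = {
        ('S', 'a'): ['a', 'B', 'a'],
        ('B', 'b'): ['b', 'B'],
        ('B', 'a'): [],                           # B -> ε
    }
    NONTERMINALS = {'S', 'B'}

    def ll1_parse(tokens):
        stack = ['$', 'S']                        # $ at the bottom, start symbol on top
        tokens = list(tokens) + ['$']
        i = 0
        while stack:
            X, a = stack.pop(), tokens[i]
            if X == '$' and a == '$':
                return True                       # accept
            if X not in NONTERMINALS:             # terminal on top of the stack
                if X != a:
                    return False                  # error
                i += 1                            # match, advance input
            else:
                rhs = TABLE.get((X, a))
                if rhs is None:
                    return False                  # error: empty table entry
                stack.extend(reversed(rhs))       # output X -> rhs; push rhs reversed

    print(ll1_parse('abba'))    # True
    print(ll1_parse('ab'))      # False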


    Construction of Predictive Parsing Tables – LL(1) Parsing Table :

    Definition FIRST

    FIRST(α) is the set of terminals that begin the strings derivable from α; if α ⇒* ε, then ε is also in FIRST(α). FIRST is computed by the rules:
    1. If X is a terminal, then FIRST(X) = { X }.
    2. If X → ε is a production, then add ε to FIRST(X).
    3. If X → Y1Y2...Yk is a production, add FIRST(Y1) except ε to FIRST(X); if Y1 ⇒* ε, also add FIRST(Y2) except ε, and so on; if all Yi derive ε, add ε to FIRST(X).
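    The computation can be sketched as a fixed-point iteration (our own code, not from the text; ε is written as the empty string '' and ε-alternatives as the empty tuple):

    # The left-recursion-free expression grammar from the earlier example.
    G = {
        'E':  [('T', "E'")],
        "E'": [('+', 'T', "E'"), ()],      # () is the ε-alternative
        'T':  [('F', "T'")],
        "T'": [('*', 'F', "T'"), ()],
        'F':  [('(', 'E', ')'), ('id',)],
    }

    def first_sets(g):
        FIRST = {A: set() for A in g}
        changed = True
        while changed:                            # iterate until nothing is added
            changed = False
            for A, alts in g.items():
                for alt in alts:
                    nullable = True
                    for X in alt:
                        fx = FIRST[X] if X in g else {X}   # terminal: FIRST(X) = {X}
                        added = (fx - {''}) - FIRST[A]
                        if added:
                            FIRST[A] |= added
                            changed = True
                        if '' not in fx:
                            nullable = False
                            break
                    if nullable and '' not in FIRST[A]:    # whole alternative can vanish
                        FIRST[A].add('')
                        changed = True
        return FIRST

    print(first_sets(G)['E'])     # {'(', 'id'}
    print(first_sets(G)["E'"])    # {'+', ''}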


    Definition FOLLOW

    FOLLOW(A) is the set of terminals a that can appear immediately to the right of A in some sentential form, i.e. S ⇒* αAaβ. FOLLOW is computed by the rules:
    1. $ is in FOLLOW(S), where S is the start symbol.
    2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
    3. If there is a production A → αB, or a production A → αBβ where ε is in FIRST(β), then everything in FOLLOW(A) is in FOLLOW(B).

    Definition LL(1) Grammar

    A grammar G is LL(1) if and only if, whenever A → α | β are two distinct productions of G:
    1. FIRST(α) and FIRST(β) are disjoint.
    2. At most one of α and β can derive the empty string.
    3. If β ⇒* ε, then FIRST(α) is disjoint from FOLLOW(A).


    LL(1) Grammars


    Definition Parsing Table

    The LL(1) parsing table M is constructed as follows. For each production A → α of the grammar:
    1. For each terminal a in FIRST(α), add A → α to M[A, a].
    2. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A); if ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $] as well.
    All entries of M left undefined are errors.


    Derivation of id+id*id Using Predictive Parsing Table

    Non – LL (1) Grammars


    Bottom Up Parsing

    Given a string of terminals

    Build parse tree starting from leaves and working up toward the root

    Reverse of right-most derivation

    Used for a type of grammar called LR

    LR parsers are difficult to build by hand

    We use automatic parser generators for LR grammars

    A bottom-up parser creates the parse tree of the given input starting from leaves towards the root.

    A bottom-up parser tries to find the right-most derivation of the given input in the reverse order.

    (a) S ⇒ ... ⇒ ω (the right-most derivation of ω)

    (b) The bottom-up parser finds the steps of this right-most derivation in the reverse order.

    Bottom-up parsing is also known as shift-reduce parsing because its two main actions are shift and reduce.

    o At each shift action, the current symbol in the input string is pushed onto a stack.
    o At each reduction step, the symbols at the top of the stack (this symbol sequence is the right side of a production) are replaced by the non-terminal at the left side of that production.
    o There are also two more actions: accept and error.


    Example
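    The original figure is not available; the following hand-worked trace (our own example, for the grammar E → E+T | T; T → T*F | F; F → id and the input id+id) illustrates the shift and reduce actions:

    Stack       Input      Action
    $           id+id$     shift
    $ id        +id$       reduce by F → id
    $ F         +id$       reduce by T → F
    $ T         +id$       reduce by E → T
    $ E         +id$       shift
    $ E +       id$        shift
    $ E + id    $          reduce by F → id
    $ E + F     $          reduce by T → F
    $ E + T     $          reduce by E → E+T
    $ E         $          accept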

    1. Shift – Reduce Parsing

    • A shift-reduce parser tries to reduce the given input string to the starting symbol.
    • At each reduction step, a substring of the input matching the right side of a production rule is replaced by the non-terminal at the left side of that production rule.
    • If the substring is chosen correctly, the right-most derivation of that string is created in the reverse order.
    • It is a form of bottom-up parsing.
    • It consists of:
      – a stack, which holds grammar symbols;
      – an input buffer, which holds the rest of the string to be parsed.
    • The handle always appears on the top of the stack.



    Example

    Handle

    A handle of a right-sentential form γ is a production A → β together with a position in γ where β may be found and replaced by A to produce the previous right-sentential form in a right-most derivation of γ. Shift-reduce parsing works by handle pruning: reducing the handle at each step reverses the right-most derivation.


    Stack Implementation
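    A toy stack implementation (our own sketch, not from the text) for E → E+T | T; T → T*F | F; F → id. It reduces A → β whenever β is on top of the stack and the lookahead is in FOLLOW(A); this naive rule happens to suffice for this tiny grammar, whereas real shift-reduce parsers consult an LR parsing table instead:

    PRODUCTIONS = [                      # listed longest right side first
        ('E', ['E', '+', 'T']),
        ('T', ['T', '*', 'F']),
        ('F', ['id']),
        ('T', ['F']),
        ('E', ['T']),
    ]
    FOLLOW = {'E': {'+', ')', '$'},
              'T': {'+', '*', ')', '$'},
              'F': {'+', '*', ')', '$'}}

    def shift_reduce(tokens):
        tokens = tokens + ['$']
        stack, i = ['$'], 0
        while True:
            la = tokens[i]
            for A, beta in PRODUCTIONS:                  # try to reduce first
                if stack[-len(beta):] == beta and la in FOLLOW[A]:
                    del stack[-len(beta):]               # pop the handle ...
                    stack.append(A)                      # ... push its left side
                    print('reduce by', A, '->', ' '.join(beta))
                    break
            else:                                        # no reduction applied
                if la == '$':
                    return stack == ['$', 'E']           # accept or error
                stack.append(la); i += 1                 # shift
                print('shift', la)

    print(shift_reduce(['id', '+', 'id', '*', 'id']))    # True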


    Conflicts during Shift Reduce Parsing

    • There are context-free grammars for which shift-reduce parsers cannot be used.
    • The stack contents and the next input symbol may not be enough to decide the action:
      – shift/reduce conflict: the parser cannot decide whether to make a shift operation or a reduction;
      – reduce/reduce conflict: the parser cannot decide which of several reductions to make.

    If a shift-reduce parser cannot be used for a grammar, that grammar is called a non-LR(k) grammar.

    Types of Shift-Reduce Parsing

    There are two main types: operator-precedence parsing and LR parsing, discussed in turn below.


    2. Operator Precedence Parsing

    Precedence Relation

    • In operator-precedence parsing, we define three disjoint precedence relations between certain pairs of terminals:
      o a <· b : a yields precedence to b (a has lower precedence than b)
      o a =· b : a has the same precedence as b
      o a ·> b : a takes precedence over b (b has lower precedence than a)

    • The determination of the correct precedence relations between terminals is based on the traditional notions of associativity and precedence of operators. (Unary minus causes a problem.)

    • The intention of the precedence relations is to find the handle of a right-sentential form: <· marks the left end of the handle and ·> marks the right end.

    • In the input string $a1a2...an$, we insert the precedence relation between each pair of adjacent terminals (the relation that holds for that pair).

    • Example

    Using Precedence Relation to Find Handles

    • Scan the string from the left end until the first ·> is encountered.
    • Then scan backwards (to the left) over any =· until a <· is encountered.


    • The handle contains everything to the left of the first ·> and to the right of the <· encountered in the previous step, including any intervening or surrounding non-terminals.


    Creating Operator Precedence Relation From Associativity and Precedence

    Example
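    The original table is not available; the following is our reconstruction of the standard relations for the terminals +, *, id and $, with * having higher precedence than + and both operators left-associative (blank entries are errors):

            +     *     id    $
      +     ·>    <·    <·    ·>
      *     ·>    ·>    <·    ·>
      id    ·>    ·>          ·>
      $     <·    <·    <·

    For the input id+id*id we then obtain

      $ <· id ·> + <· id ·> * <· id ·> $

    so the first handle found is the leftmost id, which is reduced first, exactly as in the shift-reduce trace shown earlier.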


    Operator Precedence Grammar

    There is another more general way to compute precedence relations among terminals:

    1. a =· b if there is a right side of a production of the form αaβbγ, where β is either a single non-terminal or ε.

    2. a <· b if for some non-terminal A there is a right side of the form αaAβ and A derives γbδ, where γ is a single non-terminal or ε.

    3. a ·> b if for some non-terminal A there is a right side of the form αAbβ and A derives γaδ, where δ is a single non-terminal or ε.

    Note that the grammar must be unambiguous for this method. Unlike the previous method, it

    does not take into account any other property and is based purely on grammar productions. An

    ambiguous grammar will result in multiple entries in the table and thus cannot be used.

    Handling Unary Minus

    Operator-precedence parsing cannot handle the unary minus when the grammar also uses the binary minus.

    • The best approach to solve this problem is to let the lexical analyzer handle it: the lexical analyzer returns two different operators for the unary minus and the binary minus.

    • The lexical analyzer needs look-ahead to distinguish the binary minus from the unary minus.

    • Then, we make O <· unary-minus for any operator O if unary minus has higher precedence than O.

    Advantages and Disadvantages

    Advantages:
      o simple
      o powerful enough for expressions in programming languages


    Disadvantages:
      o It cannot handle the unary minus (the lexical analyzer should handle it).
      o It works only for a small class of grammars.
      o It is difficult to decide exactly which language the grammar recognizes.

    3. LR Parser or LR Parsing

    LR parsing is attractive because:

    LR parsing is the most general non-backtracking shift-reduce parsing method, yet it is still efficient.

    The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers: LL(1) grammars ⊂ LR(1) grammars.

    An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.

    Parser Configuration


    Parser Actions

    An LR parser decides among four actions by consulting action[s, a], where s is the state on top of the stack and a is the current input symbol:

    1. shift s': push the input symbol a and the state s' onto the stack, and advance the input.
    2. reduce A → β: pop |β| states off the stack; if s'' is the state now on top, push goto[s'', A]; output the production A → β.
    3. accept: parsing is successfully completed.
    4. error: a syntax error has been discovered; call an error-recovery routine.

    Construction of Parsing Tables

    An LR parser using SLR parsing tables for a grammar G is called the SLR parser for G.

    If a grammar G has an SLR parsing table, it is called an SLR grammar.

    Every SLR grammar is unambiguous, but not every unambiguous grammar is SLR.

    Augmented Grammar: G’ is G with a new production rule S’→S where S’ is the new starting symbol.

    Closure Operation

    If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0) items constructed from I by the two rules:


    1. Initially, every LR(0) item in I is added to closure(I).

    2. If A → α.Bβ is in closure(I) and B → γ is a production rule of G, then B → .γ will be in closure(I). We apply this rule until no more new LR(0) items can be added to closure(I).

    GOTO Operation

    If I is a set of LR(0) items and X is a grammar symbol (terminal or non-terminal), then

    goto(I,X) is defined as follows:

    If A → α.Xβ in I then every item in closure({A → αX.β}) will be in goto(I,X).

    Example:

    I = { E’ → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }

    goto(I,E) = { E’ → E., E → E.+T }

    goto(I,T) = { E → T., T → T.*F }

    goto(I,F) = {T → F. }

    goto(I,() = {F→ (.E), E→ .E+T, E→ .T, T→ .T*F, T→ .F, F→ .(E), F→ .id }

    goto(I,id) = { F → id. }

    Construction of the Canonical LR(0) items

    To create the SLR parsing tables for a grammar G, we will create the canonical LR(0)

    collection of the grammar G’.

    Algorithm:

    C is { closure({S’→.S}) }

    repeat the followings until no more set of LR(0) items can be added to C.

    for each I in C and each grammar symbol X

    if goto(I,X) is not empty and not in C

    add goto(I,X) to C

    GOTO function is a DFA on the sets in C.
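    These operations can be sketched directly in Python (our own code, not from the text); an LR(0) item A → α.β is represented as a triple (A, rhs, dot), and the grammar is the augmented expression grammar used above:

    GRAMMAR = {
        "E'": [('E',)],
        'E':  [('E', '+', 'T'), ('T',)],
        'T':  [('T', '*', 'F'), ('F',)],
        'F':  [('(', 'E', ')'), ('id',)],
    }

    def closure(items):
        # rule 1: every item of I is in closure(I); rule 2: while the dot is in
        # front of a non-terminal B, add B -> .γ for every production B -> γ
        items = set(items)
        while True:
            new = set()
            for A, rhs, dot in items:
                if dot < len(rhs) and rhs[dot] in GRAMMAR:
                    for gamma in GRAMMAR[rhs[dot]]:
                        new.add((rhs[dot], gamma, 0))
            if new <= items:
                return frozenset(items)
            items |= new

    def goto(I, X):
        # move the dot over X in every item of I, then take the closure
        moved = {(A, rhs, dot + 1) for A, rhs, dot in I
                 if dot < len(rhs) and rhs[dot] == X}
        return closure(moved) if moved else None

    def canonical_collection():
        # C := { closure({S' -> .S}) }, then keep adding goto(I, X) sets
        start = closure({("E'", ('E',), 0)})
        C, work = {start}, [start]
        symbols = {s for alts in GRAMMAR.values() for rhs in alts for s in rhs}
        while work:
            I = work.pop()
            for X in symbols:
                J = goto(I, X)
                if J is not None and J not in C:
                    C.add(J)
                    work.append(J)
        return C

    print(len(canonical_collection()))     # 12 item sets (I0..I11) for this grammar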

    Example

    Let the following be the grammar and its LR parsing table:


    Construction of Parsing Table

    1. Construct the canonical collection of sets of LR(0) items for G’.

    C←{I0,...,In}

    2. Create the parsing action table as follows:

    a. If a is a terminal, A → α.aβ is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.

    b. If A → α. is in Ii, then action[i,a] is reduce A → α for all a in FOLLOW(A), where A ≠ S'.

    c. If S' → S. is in Ii, then action[i,$] is accept.

    d. If any conflicting actions are generated by these rules, the grammar is not SLR(1).

    3. Create the parsing goto table

    a. for all non-terminals A, if goto(Ii,A)=Ij then goto[i,A]=j

    4. All entries not defined by (2) and (3) are errors.

    5. Initial state of the parser contains S’→.S
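    Steps 2 and 3 can be sketched as follows (our own code, reusing GRAMMAR, closure and goto from the previous sketch; the FOLLOW sets are assumed to be precomputed by hand, and the conflict check of rule d is omitted for brevity):

    def slr_tables(C, follow):
        state = {I: i for i, I in enumerate(C)}        # number the item sets
        ACTION, GOTO = {}, {}
        for I, i in state.items():
            for A, rhs, dot in I:
                if dot < len(rhs):                     # item A -> α.Xβ
                    X = rhs[dot]
                    j = state[goto(I, X)]
                    if X in GRAMMAR:
                        GOTO[i, X] = j                 # rule 3: goto entry
                    else:
                        ACTION[i, X] = ('shift', j)    # rule 2a: shift
                elif A != "E'":                        # item A -> α.
                    for a in follow[A]:                # rule 2b: reduce on FOLLOW(A)
                        ACTION[i, a] = ('reduce', A, rhs)
                else:
                    ACTION[i, '$'] = ('accept',)       # rule 2c

        return ACTION, GOTO

    # FOLLOW sets for the expression grammar, computed by hand:
    FOLLOW = {'E': {'+', ')', '$'}, 'T': {'+', '*', ')', '$'},
              'F': {'+', '*', ')', '$'}}
    action, goto_table = slr_tables(canonical_collection(), FOLLOW)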


    LALR Parsing tables

    LALR stands for LookAhead LR.

    LALR parsers are often used in practice because LALR parsing tables are smaller than Canonical LR parsing tables.

    The SLR and LALR parsing tables for a grammar G have the same number of states.

    But LALR parsers recognize more grammars than SLR parsers.

    yacc creates an LALR parser for the given grammar.

    A state of the LALR parser is again a set of LR(1) items; it is obtained by merging states of the canonical LR(1) parser that have the same core.

    This shrinking process may introduce a reduce/reduce conflict in the resulting LALR parser.

    In that case the grammar is NOT LALR.

    The shrinking process cannot produce a shift/reduce conflict.

    Constructing LALR set of items

    The core of a set of LR(1) items is the set of their first components, i.e., the underlying LR(0) items with the lookaheads ignored.


    Find the states (sets of LR(1) items) in a canonical LR(1) parser that have the same core, and merge them into a single state.

    Do this for all states of the canonical LR(1) parser to obtain the states of the LALR parser.

    In fact, the number of states of the LALR parser for a grammar will be equal to the number of states of the SLR parser for that grammar.

    Parsing Tables Construction

    Shift / Reduce Conflict


    Reduce / Reduce Conflict

    LALR(1) Items


    Using Ambiguous Grammar

    All grammars used in the construction of LR parsing tables must be unambiguous.

    Can we create LR parsing tables for ambiguous grammars?
      o Yes, but they will have conflicts.
      o We can resolve these conflicts in favor of one of the actions to disambiguate the grammar.
      o At the end, we will again have an unambiguous grammar.

    Why would we want to use an ambiguous grammar?
      o Some ambiguous grammars are more natural, and a corresponding unambiguous grammar can be very complex.
      o Using an ambiguous grammar may eliminate unnecessary reductions.

    Example

    Sets of LR(0) Items for Ambiguous Grammar


    SLR-Parsing Tables for Ambiguous Grammar
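    The table figures are not available; our summary of the standard resolution for the classic ambiguous grammar E → E+E | E*E | (E) | id: the state containing the item E → E+E. has a shift/reduce conflict on the lookaheads + and *. It is resolved by precedence and associativity (this is what yacc's precedence declarations express): on lookahead *, shift, because * has higher precedence than +; on lookahead +, reduce, because + is left-associative. Symmetrically, in the state containing E → E*E. we reduce on both + and *.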


    Syntax Directed Translation Schemes

    Syntax Directed Translation

    Translation guided by syntax

    Used to generate intermediate code, evaluate expressions, and perform type checking.

    Attach rules to productions in the grammar

    Rules (program fragments) executed when the production is used during syntax analysis

    Grammar symbols are associated with attributes to associate information with the

    programming language constructs that they represent.

    Values of these attributes are evaluated by the semantic rules associated with the production rules.

    Evaluation of these semantic rules:
      o may generate intermediate codes;
      o may put information into the symbol table;
      o may perform type checking;
      o may issue error messages;
      o may perform some other activities.
    In fact, they may perform almost any activity.

    An attribute may hold almost anything: a string, a number, a memory location, a complex record.

    Evaluation of a semantic rule defines the value of an attribute, but a semantic rule may also have side effects, such as printing a value.


    Syntax Directed Definition

    Example : Consider the grammar
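    The original figure is not available; a typical syntax-directed definition of this kind is the classical desk-calculator SDD (reconstructed here from standard material), in which every attribute val is synthesized:

    Production        Semantic Rule
    L → E n           L.val = E.val
    E → E1 + T        E.val = E1.val + T.val
    E → T             E.val = T.val
    T → T1 * F        T.val = T1.val * F.val
    T → F             T.val = F.val
    F → (E)           F.val = E.val
    F → digit         F.val = digit.lexval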


    Implementation of Syntax Directed Translation

    The Syntax Directed Translation can be implemented by

    Dependency Graph

    S-Attributed Definitions

    L-Attributed Definitions

    Synthesized Attributes

    SDD to Construct Syntax Tree

    Inherited Attributes


    S-Attributed Definitions

    An SDD is S-attributed if every attribute is synthesized.

    We can use a post-order traversal of the parse tree to evaluate attributes in S-attributed definitions:

    postorder(N)

    {

    for (each child C of N, from the left) postorder(C);

    evaluate the attributes associated with node N;

    }

    S-Attributed definitions can be implemented during bottom-up parsing without the need to explicitly create parse trees
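    A concrete post-order evaluation (our own sketch, not from the text) for the synthesized attribute val of the desk-calculator SDD above, with a parse tree encoded as nested tuples and numbers at the leaves:

    def postorder_val(node):
        # evaluate the synthesized attribute `val` bottom-up
        if isinstance(node, (int, float)):            # leaf: digit.lexval
            return node
        op, children = node
        vals = [postorder_val(c) for c in children]   # children first (post-order)
        if op == '+':
            return vals[0] + vals[1]                  # E -> E + T
        if op == '*':
            return vals[0] * vals[1]                  # T -> T * F
        return vals[0]                                # unit productions, parentheses

    tree = ('+', [3, ('*', [4, 5])])                  # parse tree of 3 + 4 * 5
    print(postorder_val(tree))                        # 23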

    L-Attributed Definitions

    An SDD is L-attributed if each attribute is either (1) synthesized, or (2) an inherited attribute of a symbol Xi on the right side of a production A → X1X2...Xn that depends only on the inherited attributes of A and the attributes of the symbols X1, ..., Xi−1 to the left of Xi.


    Syntax Directed Translation Schemes (SDT)

    Translation schemes are more implementation oriented than syntax-directed definitions, since they indicate the order in which semantic rules and attributes are to be

    evaluated.

    Definition. A translation scheme is a context-free grammar in which:
    1. attributes are associated with grammar symbols;
    2. semantic actions are enclosed between braces {} and are inserted within the right-hand side of productions.

    Yacc uses translation schemes.

    Translation schemes deal with both synthesized and inherited attributes.

    Semantic actions are treated as terminal symbols: annotated parse trees contain semantic actions as children of the node standing for the corresponding production.

    Translation schemes are useful to evaluate L-attributed definitions at parsing time (even if they are a general mechanism).

    An L-Attributed Syntax-Directed Definition can be turned into a Translation Scheme.

    An SDT is a context-free grammar with program fragments embedded within production bodies.

    • Those program fragments are called semantic actions.
    • They can appear at any position within the production body.
    • Any SDT can be implemented by first building a parse tree and then performing the actions in a left-to-right depth-first order.
    • Typically, SDTs are implemented during parsing, without building a parse tree.

    Consider the Translation Scheme for the L-Attributed Definition for “type declarations”:
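    The original figure is not available; the standard translation scheme for type declarations (reconstructed from standard material, with inherited attribute inh carrying the type down the list of identifiers) is:

    D → T { L.inh := T.type } L
    T → int { T.type := integer }
    T → float { T.type := float }
    L → { L1.inh := L.inh } L1 , id { addtype(id.entry, L.inh) }
    L → id { addtype(id.entry, L.inh) }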


    Semantic Analysis

    Check for semantic consistency with the language definition

    Type checking - checks if each operator is applied to the right type of operands

    Checks that cannot be done in syntax analysis, like variables declared before use

    Type conversions - coercions

    Uses intermediate representation (syntax tree for example) and symbol table

    Semantic analysis computes additional information related to the meaning of the program once the syntactic structure of the program is known.