CSCE 330 Programming Language Structures Chapter 3: Lexical and Syntactic Analysis

UNIVERSITY OF SOUTH CAROLINA
Department of Computer Science and Engineering
Fall 2009
Marco Valtorta

"Syntactic sugar causes cancer of the semicolon." (A. Perlis)


Contents

• 3.1 Chomsky Hierarchy
• 3.2 Lexical Analysis
• 3.3 Syntactic Analysis

3.1 Chomsky Hierarchy

• Regular grammar -- least powerful
• Context-free grammar (BNF)
• Context-sensitive grammar
• Unrestricted grammar

Regular Grammar

• Simplest; least powerful
• Equivalent to:
  – Regular expression
  – Finite-state automaton
• Right regular grammar: ω ∈ T*, B ∈ N
    A → ω B
    A → ω

Example

• Integer → 0 Integer | 1 Integer | ... | 9 Integer | 0 | 1 | ... | 9
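
The Integer grammar above can be read almost directly as a recursive recognizer, and it describes the same language as the regular expression [0-9]+. The following is a minimal sketch of that correspondence; the class and method names are illustrative assumptions, not part of the course materials.

```java
import java.util.regex.Pattern;

class IntegerGrammar {
    // The equivalent regular expression: one or more digits.
    static final Pattern INTEGER = Pattern.compile("[0-9]+");

    // Integer → d Integer | d, where d is any digit 0..9.
    static boolean derivesInteger(String s) {
        if (s.isEmpty()) {
            return false;                    // no production derives the empty string
        }
        char c = s.charAt(0);
        if (c < '0' || c > '9') {
            return false;                    // every production starts with a digit
        }
        // Either stop here (Integer → d) or continue (Integer → d Integer).
        return s.length() == 1 || derivesInteger(s.substring(1));
    }
}
```

Both `IntegerGrammar.derivesInteger(s)` and `IntegerGrammar.INTEGER.matcher(s).matches()` should agree on every input string.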

Regular Grammars

• Left regular grammar: equivalent
• Used in construction of tokenizers (scanners, lexers)
• Less powerful than context-free grammars
• Not a regular language: { aⁿ bⁿ | n ≥ 1 }
  i.e., cannot balance: ( ), { }, begin end
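
To see why { aⁿ bⁿ | n ≥ 1 } is beyond a finite-state automaton, it may help to look at a recognizer for it. The sketch below (names are illustrative assumptions) needs a counter that can grow without bound, which is exactly the kind of memory a finite set of states cannot provide.

```java
class AnBn {
    // Recognizes { a^n b^n | n >= 1 } with an unbounded counter.
    static boolean accepts(String s) {
        int i = 0;
        int count = 0;
        // Count the leading a's.
        while (i < s.length() && s.charAt(i) == 'a') {
            count++;
            i++;
        }
        if (count == 0) {
            return false;                    // n must be at least 1
        }
        // Match each b against one previously counted a.
        while (i < s.length() && s.charAt(i) == 'b') {
            count--;
            i++;
        }
        // Accept only if all input was consumed and the counts balance.
        return i == s.length() && count == 0;
    }
}
```

The same counting idea is what balancing ( ), { } or begin/end requires.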

Context-free Grammars

• BNF is a stylized form of CFG
• Equivalent to a pushdown automaton
• For a wide class of unambiguous CFGs, there are table-driven, linear-time parsers

Context-Sensitive Grammars

• Production: α → β, with |α| ≤ |β|
• α, β ∈ (N ∪ T)*
• i.e., the left-hand side can be composed of strings of terminals and nonterminals

Undecidable Properties of CSGs

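• Given a string ω and grammar G: is ω ∈ L(G)? (Decidable for CSGs, since productions never shrink the string, but expensive in general; undecidable for unrestricted grammars.)
• Is L(G) non-empty? (Undecidable for CSGs.)
• Defn: Undecidable means that you cannot write a computer program that is guaranteed to halt to decide the question for all L(G).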

Unrestricted Grammar

• Equivalent to:
  – Turing machine
  – von Neumann machine
  – C++, Java
• That is, can compute any computable function.

Lexical Analysis

• Purpose: transform program representation
• Input: printable ASCII characters
• Output: tokens
• Discard: whitespace, comments
• Defn: A token is a logically cohesive sequence of characters representing a single symbol.

Example Tokens

• Identifiers
• Literals: 123, 5.67, 'x', true
• Keywords: bool char ...
• Operators: + - * / ...
• Punctuation: ; , ( ) { }

Other Sequences

• Whitespace: space tab
• Comments: // any-char* end-of-line
• End-of-line
• End-of-file

Why a Separate Phase?

• Simpler, faster machine model than parser

• 75% of time spent in lexer for non-optimizing compiler

• Differences in character sets
• End of line convention differs

Regular Expressions

RegExpr     Meaning
x           a character x
\x          an escaped character, e.g., \n
{ name }    a reference to a name
M | N       M or N
M N         M followed by N
M*          zero or more occurrences of M

RegExpr     Meaning
M+          one or more occurrences of M
M?          zero or one occurrence of M
[aeiou]     the set of vowels
[0-9]       the set of digits
.           any single character
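
Most of these operators can be tried out with Java's java.util.regex package. The helper below is a small sketch (the class name is an assumption); note that the `{ name }` reference form belongs to lexer generators and is not part of java.util.regex, and that `.` in Java does not match line terminators unless the DOTALL flag is set.

```java
import java.util.regex.Pattern;

class RegexOperators {
    // True if the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("a|b", "b"));       // M | N
        System.out.println(matches("ab*", "abbb"));    // M N and M*
        System.out.println(matches("ab+", "a"));       // M+ needs at least one b
        System.out.println(matches("ab?", "a"));       // M?
        System.out.println(matches("[aeiou]", "e"));   // character set
        System.out.println(matches("[0-9]", "x"));     // character range
    }
}
```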

Clite Lexical Syntax

Category      Definition
anyChar       [ -~]
Letter        [a-zA-Z]
Digit         [0-9]
Whitespace    [ \t]
Eol           \n
Eof           \004

Category      Definition
Keyword       bool | char | else | false | float | if | int | main | true | while
Identifier    {Letter}({Letter} | {Digit})*
integerLit    {Digit}+
floatLit      {Digit}+\.{Digit}+
charLit       '{anyChar}'

Category      Definition
Operator      = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
Separator     ; | , | { | } | ( | )
Comment       // ({anyChar} | {Whitespace})* {Eol}
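
As an illustration, several of the Clite category definitions above can be transcribed into java.util.regex patterns by expanding the {name} references by hand. This is only a sketch: the class name, the `classify` method and its return strings are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

class CliteLexicalSyntax {
    static final String LETTER = "[a-zA-Z]";
    static final String DIGIT = "[0-9]";

    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "bool", "char", "else", "false", "float", "if", "int", "main", "true", "while"));
    static final Pattern IDENTIFIER =
        Pattern.compile(LETTER + "(" + LETTER + "|" + DIGIT + ")*");
    static final Pattern INTEGER_LIT = Pattern.compile(DIGIT + "+");
    static final Pattern FLOAT_LIT = Pattern.compile(DIGIT + "+\\." + DIGIT + "+");
    static final Pattern CHAR_LIT = Pattern.compile("'[ -~]'");

    // Names the category of a complete lexeme, or "unknown".
    static String classify(String lexeme) {
        // Keywords also match Identifier, so they are checked first.
        if (KEYWORDS.contains(lexeme)) return "Keyword";
        if (IDENTIFIER.matcher(lexeme).matches()) return "Identifier";
        if (INTEGER_LIT.matcher(lexeme).matches()) return "integerLit";
        if (FLOAT_LIT.matcher(lexeme).matches()) return "floatLit";
        if (CHAR_LIT.matcher(lexeme).matches()) return "charLit";
        return "unknown";
    }
}
```

Checking keywords before identifiers mirrors the usual practice of handling keywords separately from the identifier rule.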

Generators

• Input: usually regular expression
• Output: table (slow), code
• C/C++: Lex, Flex
• Java: JLex

Finite State Automata

• Set of states: representation – graph nodes
• Input alphabet + unique end symbol
• State transition function: labelled (using alphabet) arcs in graph
• Unique start state
• One or more final states

Deterministic FSA

• Defn: A finite state automaton is deterministic if for each state and each input symbol, there is at most one outgoing arc from the state labeled with the input symbol.

• A Finite State Automaton for Identifiers

Definitions

• A configuration of an FSA consists of a state and the remaining input.
• A move consists of traversing the arc exiting the state that corresponds to the leftmost input symbol, thereby consuming it. If there is no such arc, then:
  – If no input remains and the state is final, then accept.
  – Otherwise, error.

• An input is accepted if, starting with the start state, the automaton consumes all the input and halts in a final state.

Example

• (S, a2i$) ├ (I, 2i$)
            ├ (I, i$)
            ├ (I, $)
            ├ (F, )
• Thus: (S, a2i$) ├* (F, )
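
The sequence of configurations above can be reproduced with a small table-free DFA simulation. The sketch below assumes the automaton implied by the trace: start state S, a letter moves S to I, letters and digits loop on I, and the end symbol $ moves I to the final state F. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class IdentifierDfa {
    enum State { S, I, F }

    // Transition function; returns null when there is no arc (an error).
    static State step(State state, char c) {
        boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case S:
                return letter ? State.I : null;
            case I:
                if (letter || digit) return State.I;
                return c == '$' ? State.F : null;
            default:
                return null;                 // F has no outgoing arcs
        }
    }

    // Returns the list of configurations visited, or null if the input is rejected.
    static List<String> trace(String input) {
        List<String> configs = new ArrayList<>();
        State state = State.S;
        configs.add("(" + state + ", " + input + ")");
        for (int i = 0; i < input.length(); i++) {
            state = step(state, input.charAt(i));    // one move consumes one symbol
            if (state == null) return null;
            configs.add("(" + state + ", " + input.substring(i + 1) + ")");
        }
        return state == State.F ? configs : null;    // must halt in a final state
    }
}
```

Calling `IdentifierDfa.trace("a2i$")` yields the five configurations of the example, from (S, a2i$) to (F, ).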

Some Conventions

• Explicit terminator used only for program as a whole, not each token.

• An unlabeled arc represents any other valid input symbol.

• Recognition of a token ends in a final state.

• Recognition of a non-token transitions back to start state.

• Recognition of end symbol (end of file) ends in a final state.
• Automaton must be deterministic.
  – Drop keywords; handle separately.
  – Must consider all sequences with a common prefix together.

Lexer Code

• Parser calls lexer whenever it needs a new token.
• Lexer must remember where it left off.
• Greedy consumption goes 1 character too far
  – peek function
  – pushback function
  – no symbol consumed by start state

From Design to Code

private char ch = ' ';
public Token next ( ) {
    do {
        switch (ch) {
            ...
        }
    } while (true);
}

Remarks

• Loop only exited when a token is found.
• Loop exited via a return statement.
• Variable ch must be global (declared outside next, so that its value persists between calls). Initialized to a space character.
• Exact nature of a Token irrelevant to design.

Translation Rules

• Traversing an arc from A to B:
  – If labeled with x: test ch == x
  – If unlabeled: else/default part of if/switch. If only arc, no test need be performed.
  – Get next character if A is not start state

• A node with an arc to itself is a do-while.
  – Condition corresponds to whichever arc is labeled.

• Otherwise the move is translated to an if/switch:
  – Each arc is a separate case.
  – Unlabeled arc is default case.
• A sequence of transitions becomes a sequence of translated statements.

• A complex diagram is translated by boxing its components so that each box is one node.
  – Translate each box using an outside-in strategy.

private boolean isLetter(char c) {
    // Test the parameter c, not the lookahead field ch.
    return c >= 'a' && c <= 'z' ||
           c >= 'A' && c <= 'Z';
}

private String concat(String set) {
    StringBuffer r = new StringBuffer("");
    do {
        r.append(ch);
        ch = nextChar( );
    } while (set.indexOf(ch) >= 0);
    return r.toString( );
}

public Token next( ) {
    do {
        if (isLetter(ch)) { // ident or keyword
            String spelling = concat(letters + digits);
            return Token.keyword(spelling);
        } else if (isDigit(ch)) { // int or float literal
            String number = concat(digits);
            if (ch != '.')
                return Token.mkIntLiteral(number);
            number += concat(digits);
            return Token.mkFloatLiteral(number);

        } else switch (ch) {
            case ' ': case '\t': case '\r': case eolnCh:
                ch = nextChar( ); break;
            case eofCh: return Token.eofTok;
            case '+': ch = nextChar( );
                return Token.plusTok;
            …
            case '&': check('&'); return Token.andTok;
            case '=': return chkOpt('=', Token.assignTok,
                                         Token.eqeqTok);
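
The fragments above leave out the Token class and several helpers, so they cannot be run as shown. The following self-contained sketch follows the same design (one character of lookahead in ch, a concat helper, and a loop that is only left by returning a token) but makes simplifying assumptions: tokens are plain strings, comments are not handled, and any unrecognized character is returned as a "Symbol". All names here are illustrative, not the course's actual lexer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MiniLexer {
    private static final char EOF_CH = '\004';   // Clite's Eof character
    private static final String LETTERS =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final String DIGITS = "0123456789";
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "bool", "char", "else", "false", "float", "if", "int", "main", "true", "while"));

    private final String source;
    private int pos = 0;
    private char ch = ' ';                       // lookahead, initially a space

    MiniLexer(String source) { this.source = source; }

    private char nextChar() {
        return pos < source.length() ? source.charAt(pos++) : EOF_CH;
    }

    // Collects the current character plus all following characters that are in set.
    private String concat(String set) {
        StringBuilder r = new StringBuilder();
        do {
            r.append(ch);
            ch = nextChar();
        } while (set.indexOf(ch) >= 0);
        return r.toString();
    }

    String next() {
        while (true) {                           // left only by returning a token
            if (LETTERS.indexOf(ch) >= 0) {      // identifier or keyword
                String spelling = concat(LETTERS + DIGITS);
                return (KEYWORDS.contains(spelling) ? "Keyword " : "Identifier ") + spelling;
            } else if (DIGITS.indexOf(ch) >= 0) {    // int or float literal
                String number = concat(DIGITS);
                if (ch != '.') return "IntLiteral " + number;
                number += concat(DIGITS);        // appends '.' and the fraction digits
                return "FloatLiteral " + number;
            }
            switch (ch) {
                case ' ': case '\t': case '\r': case '\n':
                    ch = nextChar();             // skip whitespace, stay in the loop
                    break;
                case EOF_CH:
                    return "Eof";
                case '=':
                    ch = nextChar();
                    if (ch != '=') return "Operator =";
                    ch = nextChar();
                    return "Operator ==";
                default: {
                    String symbol = "Symbol " + ch;
                    ch = nextChar();
                    return symbol;
                }
            }
        }
    }

    // Convenience wrapper: all tokens of the source, ending with "Eof".
    static List<String> tokenize(String source) {
        MiniLexer lexer = new MiniLexer(source);
        List<String> tokens = new ArrayList<>();
        String token;
        do {
            token = lexer.next();
            tokens.add(token);
        } while (!token.equals("Eof"));
        return tokens;
    }
}
```

Note how each branch leaves ch holding the first character after the token, which is the "greedy consumption goes one character too far" convention in code form.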

Source:

// a first program
// with 2 comments
int main ( ) {
    char c;
    int i;
    c = 'h';
    i = c + 3;
} // main

Tokens:

• int
• main
• (
• )
• {
• char
• Identifier c
• ;

JLex: A Lexical Analyzer Generator for Java

Definition of tokens (regular expressions) → JLex → Java file: scanner class (recognizes tokens)

We will look at an example JLex specification (adapted from the manual).

Consult the manual for details on how to write your own JLex specifications.

The JLex tool

Layout of JLex file:

    user code (added to start of generated file)
    %%
    options
    %{
    user code (added inside the scanner class declaration)
    %}
    macro definitions
    %%
    lexical declaration

• User code is copied directly into the output class.
• JLex directives allow you to include code in the lexical analysis class, change names of various components, switch on character counting, line counting, manage EOF, etc.
• Macro definitions give names to useful regexps.
• Regular expression rules define the tokens to be recognised and the actions to be taken.

java.io.StreamTokenizer

• An alternative to JLex is to use the class StreamTokenizer from java.io
• The class recognizes 4 types of lexical elements (tokens):
  – number (sequence of decimal digits, possibly starting with a minus sign and/or containing a decimal point)
  – word (sequence of letters and digits starting with a letter)
  – line separator
  – end of file
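
A minimal sketch of how the four token types show up when using StreamTokenizer is given below. The wrapper class, the `tokens` method and the output strings are assumptions for illustration; the StreamTokenizer members used (`nextToken`, `ttype`, `nval`, `sval`, `eolIsSignificant` and the `TT_*` constants) are part of java.io. Characters that are neither words nor numbers, such as =, are returned as single-character tokens.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerDemo {
    // Describes each token in the input as a short string.
    static List<String> tokens(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.eolIsSignificant(true);           // report line separators as tokens
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_NUMBER:
                        out.add("number " + st.nval);    // nval is a double
                        break;
                    case StreamTokenizer.TT_WORD:
                        out.add("word " + st.sval);
                        break;
                    case StreamTokenizer.TT_EOL:
                        out.add("eol");
                        break;
                    default:
                        out.add("char " + (char) st.ttype);   // ordinary character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.add("eof");
        return out;
    }
}
```

Because numbers are always delivered as doubles in `nval`, a lexer built this way would need extra work to tell integer literals from float literals.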

Parsing

• Some terminology
• Different types of parsing strategies
  – bottom up
  – top down
• Recursive descent parsing
  – What is it
  – How to implement one given an EBNF specification
  – (How to generate one using tools – later)
• (Bottom up parsing algorithms)

Parsing: Some Terminology

• Recognition: to answer the question "does the input conform to the syntax of the language?"
• Parsing: recognition + determination of phrase structure (for example by generating AST data structures)
• (Un)ambiguous grammar: a grammar is unambiguous if there is at most one way to parse any input (i.e., for a syntactically correct program there is precisely one parse tree)

Different kinds of Parsing Algorithms

• Two big groups of algorithms can be distinguished:
  – bottom up strategies
  – top down strategies
• Example: parsing of "Micro-English"

    Sentence ::= Subject Verb Object .
    Subject  ::= I | a Noun | the Noun
    Object   ::= me | a Noun | the Noun
    Noun     ::= cat | mat | rat
    Verb     ::= like | is | see | sees

    The cat sees the rat.
    The rat sees me.
    I like a cat
    The rat like me.
    I see the rat.
    I sees a rat.

Top-down parsing

The parse tree is constructed starting at the top (root).

[Figure: top-down parse of "The cat sees a rat ." The tree starts at Sentence, which expands to Subject Verb Object . ; Subject then expands to The Noun (cat), Verb to sees, and Object to a Noun (rat).]

Bottom up parsing

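The parse tree "grows" from the bottom (leaves) up to the top (root).

[Figure: bottom-up parse of "The cat sees a rat ." The leaves are reduced first: cat to Noun and The Noun to Subject, sees to Verb, rat to Noun and a Noun to Object; finally Subject Verb Object . is reduced to Sentence.]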

Top-Down vs. Bottom-Up parsing

LL analysis (top-down): Left-to-right scan, Leftmost derivation
• Scans the string left to right
• Builds a leftmost derivation
• Look-ahead selects the next derivation step

LR analysis (bottom-up): Left-to-right scan, Rightmost derivation
• Scans the string left to right
• Builds a rightmost derivation (in reverse)
• Look-ahead selects the next reduction

Recursive Descent Parsing

• Recursive descent parsing is a straightforward top-down parsing algorithm.

• We will now look at how to develop a recursive descent parser from an EBNF specification.

• Idea: the parse tree structure corresponds to the “call graph” structure of parsing procedures that call each other recursively.

Recursive Descent Parsing

    Sentence ::= Subject Verb Object .
    Subject  ::= I | a Noun | the Noun
    Object   ::= me | a Noun | the Noun
    Noun     ::= cat | mat | rat
    Verb     ::= like | is | see | sees

Define a procedure parseN for each non-terminal N:

    private void parseSentence();
    private void parseSubject();
    private void parseObject();
    private void parseNoun();
    private void parseVerb();

Recursive Descent Parsing

public class MicroEnglishParser {

    private TerminalSymbol currentTerminal;

    // Auxiliary methods will go here ...

    // Parsing methods will go here ...
}

Recursive Descent Parsing: Auxiliary Methods

public class MicroEnglishParser {

    private TerminalSymbol currentTerminal;

    private void accept(TerminalSymbol expected) {
        if (currentTerminal matches expected)
            currentTerminal = next input terminal;
        else
            report a syntax error
    }

    ...
}

Recursive Descent Parsing: Parsing Methods

Sentence ::= Subject Verb Object .

private void parseSentence() {
    parseSubject();
    parseVerb();
    parseObject();
    accept('.');
}

Recursive Descent Parsing: Parsing Methods

Subject ::= I | a Noun | the Noun

private void parseSubject() {
    if (currentTerminal matches 'I')
        accept('I');
    else if (currentTerminal matches 'a') {
        accept('a');
        parseNoun();
    }
    else if (currentTerminal matches 'the') {
        accept('the');
        parseNoun();
    }
    else
        report a syntax error
}

Recursive Descent Parsing: Parsing Methods

Noun ::= cat | mat | rat

private void parseNoun() {
    if (currentTerminal matches 'cat')
        accept('cat');
    else if (currentTerminal matches 'mat')
        accept('mat');
    else if (currentTerminal matches 'rat')
        accept('rat');
    else
        report a syntax error
}
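
The parsing methods shown so far are partly pseudocode ("currentTerminal matches", "report a syntax error"). One way to make them runnable is sketched below, under stated assumptions: terminals are plain strings separated by whitespace (so the final period must be written as a separate word), a syntax error is reported by throwing IllegalStateException, and the `recognizes` wrapper is added for convenience. None of these choices come from the course materials.

```java
import java.util.Arrays;
import java.util.List;

class MicroEnglishParser {
    private final List<String> terminals;
    private int position = 0;
    private String currentTerminal;

    MicroEnglishParser(String sentence) {
        terminals = Arrays.asList(sentence.trim().split("\\s+"));
        currentTerminal = terminals.get(0);
    }

    private void accept(String expected) {
        if (!currentTerminal.equals(expected)) {
            throw new IllegalStateException(
                "expected " + expected + " but found " + currentTerminal);
        }
        position++;                          // move to the next input terminal
        currentTerminal = position < terminals.size() ? terminals.get(position) : "<end>";
    }

    // Sentence ::= Subject Verb Object .
    private void parseSentence() {
        parseSubject();
        parseVerb();
        parseObject();
        accept(".");
    }

    // Subject ::= I | a Noun | the Noun
    private void parseSubject() {
        if (currentTerminal.equals("I")) {
            accept("I");
        } else {
            parseArticleAndNoun();
        }
    }

    // Object ::= me | a Noun | the Noun
    private void parseObject() {
        if (currentTerminal.equals("me")) {
            accept("me");
        } else {
            parseArticleAndNoun();
        }
    }

    // Shared alternatives "a Noun | the Noun" of Subject and Object.
    private void parseArticleAndNoun() {
        if (currentTerminal.equals("a") || currentTerminal.equals("the")) {
            accept(currentTerminal);
            parseNoun();
        } else {
            throw new IllegalStateException("syntax error at " + currentTerminal);
        }
    }

    // Noun ::= cat | mat | rat
    private void parseNoun() {
        acceptOneOf("cat", "mat", "rat");
    }

    // Verb ::= like | is | see | sees
    private void parseVerb() {
        acceptOneOf("like", "is", "see", "sees");
    }

    private void acceptOneOf(String... alternatives) {
        if (!Arrays.asList(alternatives).contains(currentTerminal)) {
            throw new IllegalStateException("syntax error at " + currentTerminal);
        }
        accept(currentTerminal);
    }

    // True if the whole input is a Micro-English sentence.
    static boolean recognizes(String sentence) {
        MicroEnglishParser parser = new MicroEnglishParser(sentence);
        try {
            parser.parseSentence();
            return parser.position == parser.terminals.size();
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

Since the grammar spells the article as lowercase "the", this sketch rejects a capitalized "The"; a real front end would normalize case in the lexer. It also accepts "the rat like me ." because the grammar does not model agreement between subject and verb.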

Algorithm to convert EBNF into a RD parser

N ::= X

private void parseN() {
    parse X
}

• The conversion of an EBNF specification into a Java implementation for a recursive descent parser is so "mechanical" that it can easily be automated! => JavaCC "Java Compiler Compiler"
• We can describe the algorithm by a set of mechanical rewrite rules

Algorithm to convert EBNF into a RD parser

parse ε
    ; // a dummy statement

parse N    where N is a non-terminal
    parseN();

parse t    where t is a terminal
    accept(t);

parse XY
    parse X
    parse Y

Algorithm to convert EBNF into a RD parser

parse X*
    while (currentToken.kind is in starters[X]) {
        parse X
    }

parse X|Y
    switch (currentToken.kind) {
        cases in starters[X]:
            parse X
            break;
        cases in starters[Y]:
            parse Y
            break;
        default:
            report syntax error
    }
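
To see the X* and X|Y rules applied, consider a small hypothetical grammar that is not part of the slides: Sum ::= Digit ( ('+' | '-') Digit )*. The sketch below translates it by hand, with the while loop coming from the X* rule and the switch from the X|Y rule; here starters['+' Digit] is { '+' } and starters['-' Digit] is { '-' }. Characters stand in for tokens, and the parser also computes the value of the sum to make the result observable.

```java
class SumParser {
    private final String input;
    private int pos = 0;

    SumParser(String input) { this.input = input; }

    // Current input character, or '\0' once the input is exhausted.
    private char current() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    private void accept(char expected) {
        if (current() != expected) {
            throw new IllegalStateException("expected " + expected + " at position " + pos);
        }
        pos++;
    }

    // Digit ::= 0 | 1 | ... | 9
    private int parseDigit() {
        char c = current();
        if (c < '0' || c > '9') {
            throw new IllegalStateException("digit expected at position " + pos);
        }
        accept(c);
        return c - '0';
    }

    // Sum ::= Digit ( ('+' | '-') Digit )*
    private int parseSum() {
        int value = parseDigit();
        // parse X*: loop while the current symbol is in starters[X] = { '+', '-' }
        while (current() == '+' || current() == '-') {
            // parse X|Y: one case per alternative, selected by its starter
            switch (current()) {
                case '+':
                    accept('+');
                    value += parseDigit();
                    break;
                case '-':
                    accept('-');
                    value -= parseDigit();
                    break;
                default:
                    throw new IllegalStateException("syntax error");
            }
        }
        return value;
    }

    // Parses the whole input as a Sum and returns its value.
    static int evaluate(String input) {
        SumParser parser = new SumParser(input);
        int value = parser.parseSum();
        if (parser.pos != input.length()) {
            throw new IllegalStateException("unexpected input at position " + parser.pos);
        }
        return value;
    }
}
```

The starter sets of the two alternatives are disjoint, which is what lets a single symbol of look-ahead pick the right branch.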