Programming Language Syntax: Scanning and Parsingmilanova/csci4430/Lecture3.pdf · 2020. 9. 12. ·...


Programming Language Syntax: Scanning and Parsing

Read: Scott, Chapter 2.2 and 2.3.1


Programming Languages CSCI 4430, A. Milanova 2

Lecture Outline

• Overview of scanning
• Overview of top-down and bottom-up parsing
• Top-down parsing
  • Recursive descent
  • LL(1) parsing tables


Scanning

• Scanner groups characters into tokens
• Scanner simplifies the job of the parser
• Scanner is essentially a finite automaton
  • Regular expressions specify the syntax of tokens
  • Scanner recognizes the tokens in the program

Character stream: position = initial + rate * 60;
    ↓ Scanner
Token stream: id = id + id * num ;
    ↓ Parser
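The scanner's work on this example can be sketched with Python's `re` module. This is a minimal illustration under assumptions, not the course's scanner: the token names `assign`, `semi`, and `skip` and the function `tokenize` are introduced here for the example only.

```python
import re

# (token kind, regular expression) pairs; names other than id/num/plus/times
# are assumptions made for this sketch.
TOKEN_SPEC = [
    ("num",    r"\d+"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("assign", r"="),
    ("plus",   r"\+"),
    ("times",  r"\*"),
    ("semi",   r";"),
    ("skip",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token kind, lexeme) pairs, dropping white space."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

kinds = [kind for kind, _ in tokenize("position = initial + rate * 60;")]
print(kinds)
```

The parser only ever sees the token kinds, which is how the scanner simplifies its job.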


Question

• Why do most programming languages disallow nested multi-line comments?
• Comments are usually handled by the scanner, which is essentially a DFA. Handling nested multi-line comments would require recognizing (/*)^n (*/)^n, which is beyond the power of a DFA.
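To make the point concrete, here is a minimal sketch of what skipping a nested comment needs: an integer depth counter that can grow without bound, which is exactly the memory a DFA lacks. The function name `skip_nested_comment` is hypothetical.

```python
def skip_nested_comment(text, start):
    """Return the index just past the nested comment that opens at text[start].

    The depth counter is unbounded, so this is not a finite automaton.
    """
    if not text.startswith("/*", start):
        raise ValueError("no comment opener at start")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1          # one more level of nesting
            i += 2
        elif text.startswith("*/", i):
            depth -= 1          # close the innermost open comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated comment")

src = "/* outer /* inner */ still outer */ x"
end = skip_nested_comment(src, 0)
print(repr(src[end:]))
```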


Calculator Language

• Tokens
    times → *
    plus → +
    id → letter ( letter | digit )*
         except for read and write, which are keywords (keywords are tokens as well)


Ad-hoc (By hand) Scanner

skip any initial white space (space, tab, newline)
if current_char in { +, * }
    return corresponding single-character token (plus or times)
if current_char is a letter
    read any additional letters and digits
    check to see if the resulting string is read or write
    if so, then return the corresponding token
    else return id
else announce an ERROR
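The pseudocode above translates almost line for line into Python. This is a sketch under assumptions: the `eof` token and the names `next_token` and `scan` are not on the slide, and `str.isalpha`/`str.isalnum` accept more than ASCII letters and digits.

```python
KEYWORDS = {"read", "write"}

def next_token(text, pos):
    """Return (token kind, lexeme, new position) for the next token."""
    # Skip any initial white space (space, tab, newline).
    while pos < len(text) and text[pos] in " \t\n":
        pos += 1
    if pos == len(text):
        return ("eof", "", pos)  # assumption: end of input is reported as a token
    ch = text[pos]
    if ch in "+*":
        return ("plus" if ch == "+" else "times", ch, pos + 1)
    if ch.isalpha():
        end = pos + 1
        # Read any additional letters and digits.
        while end < len(text) and text[end].isalnum():
            end += 1
        lexeme = text[pos:end]
        # Keywords are tokens of their own; everything else is an id.
        return (lexeme if lexeme in KEYWORDS else "id", lexeme, end)
    raise SyntaxError(f"unexpected character {ch!r} at position {pos}")

def scan(text):
    """Return the list of token kinds in text."""
    tokens, pos = [], 0
    while True:
        kind, _, pos = next_token(text, pos)
        if kind == "eof":
            return tokens
        tokens.append(kind)

print(scan("read a + b2 * write"))
```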


The Scanner as a DFA

[Figure: DFA for the calculator scanner. States 1 (start), 2, 3, and 4; edge labels: "space, tab, newline", "*", "+", "letter", and a "letter, digit" loop.]

Building a Scanner

• Scanners are (usually) automatically generated from regular expressions:
    Step 1: From a regular expression to an NFA
    Step 2: From an NFA to a DFA
    Step 3: Minimizing the DFA
• lex/flex utilities generate scanner code
• Scanner code explicitly captures the states and transitions of the DFA
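Step 2 is usually done with the subset construction; the sketch below assumes that technique and a hand-built NFA for letter ( letter | digit )* whose state numbering (0 to 3) is invented for this example. All function names are hypothetical.

```python
from collections import deque

def epsilon_closure(states, eps):
    """All NFA states reachable from `states` using only epsilon moves."""
    closure = set(states)
    work = list(states)
    while work:
        s = work.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                work.append(t)
    return frozenset(closure)

def subset_construction(start, delta, eps, alphabet):
    """Return (dfa_start, dfa_delta); DFA states are frozensets of NFA states."""
    dfa_start = epsilon_closure({start}, eps)
    dfa_delta = {}
    seen = {dfa_start}
    queue = deque([dfa_start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            moved = set()
            for s in current:
                moved |= delta.get((s, symbol), set())
            if not moved:
                continue  # no transition: the DFA rejects here
            target = epsilon_closure(moved, eps)
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_start, dfa_delta

# NFA for letter ( letter | digit )*; NFA state 2 is accepting.
DELTA = {(0, "letter"): {1}, (2, "letter"): {3}, (2, "digit"): {3}}
EPS = {1: {2}, 3: {2}}
ACCEPTING = 2

dfa_start, dfa_delta = subset_construction(0, DELTA, EPS, ["letter", "digit"])

def accepts(symbols):
    """Run the DFA on a sequence of character classes."""
    state = dfa_start
    for symbol in symbols:
        state = dfa_delta.get((state, symbol))
        if state is None:
            return False
    return ACCEPTING in state

dfa_states = {dfa_start} | set(dfa_delta.values())
print(len(dfa_states), accepts(["letter", "digit", "letter"]))
```

The construction yields three DFA states here; the two accepting ones behave identically, which is the kind of redundancy Step 3 (minimization) removes.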


Table-Driven Scanning

…
cur_state := 1
loop
    read cur_char
    case scan_tab[cur_char, cur_state].action of
        move:
            …
            cur_state := scan_tab[cur_char, cur_state].new_state
        recognize: // emits the token
            tok := token_tab[cur_state]
            unread cur_char // push back char
            exit loop
        error: …


Table-Driven Scanning

Sketch of table: scan_tab and token_tab. See Scott for details.

         scan_tab (next state on cur_char)                          token_tab
state    space,tab,newline    *    +    digit    letter    other
1        5                    2    3    -        4         -        -
2        -                    -    -    -        -         -        times
3        -                    -    -    -        -         -        plus
4        -                    -    -    4        4         -        id
5        5                    -    -    -        -         -        space
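The driver loop and the table sketch can be combined into a small runnable scanner. This is a sketch under assumptions: state 2 (reached on *) is taken to recognize times and state 3 (reached on +) to recognize plus; a missing table entry means "recognize" in a state that has a token and "error" otherwise; keywords are not handled. The names `char_class`, `scan_one`, and `scan_all` are hypothetical.

```python
def char_class(ch):
    """Map a character to its column in scan_tab."""
    if ch in " \t\n":
        return "ws"
    if ch in "*+":
        return ch
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"

# scan_tab: state -> {character class -> next state}; absent entries are "-".
SCAN_TAB = {
    1: {"ws": 5, "*": 2, "+": 3, "letter": 4},
    2: {},
    3: {},
    4: {"digit": 4, "letter": 4},
    5: {"ws": 5},
}
TOKEN_TAB = {2: "times", 3: "plus", 4: "id", 5: "space"}

def scan_one(text, pos):
    """Run the DFA from state 1 at text[pos]; return (token, lexeme, new_pos)."""
    state, start = 1, pos
    while True:
        cls = char_class(text[pos]) if pos < len(text) else "eof"
        nxt = SCAN_TAB[state].get(cls)
        if nxt is not None:            # move
            state = nxt
            pos += 1
        elif state in TOKEN_TAB:       # recognize; cur_char is left unread
            return TOKEN_TAB[state], text[start:pos], pos
        else:                          # error
            raise SyntaxError(f"cannot scan {text[pos:pos + 1]!r} at {pos}")

def scan_all(text):
    """Return all (token, lexeme) pairs in text, dropping white space."""
    tokens, pos = [], 0
    while pos < len(text):
        tok, lexeme, pos = scan_one(text, pos)
        if tok != "space":
            tokens.append((tok, lexeme))
    return tokens

print(scan_all("rate * x1 + y"))
```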


Today’s Lecture Outline

• Overview of scanning
• Overview of top-down and bottom-up parsing
• Top-down parsing
  • Recursive descent
  • LL(1) parsing tables


A Simple Calculator Language

asst_stmt → id = expr    // asst_stmt is the start symbol
expr → expr + expr | expr * expr | id

Character stream: position = initial + rate * time
    ↓ Scanner
Token stream: id = id + id * id
    ↓ Parser
Parse tree:

asst_stmt
├── id
├── =
└── expr
    ├── id
    ├── +
    └── expr
        ├── id
        ├── *
        └── id

(Parse tree simplified to fit on slide.)


A Simple Calculator Language

asst_stmt → id = expr    // asst_stmt is the start symbol
expr → expr + expr | expr * expr | id

Character stream: position + initial = rate * time
    ↓ Scanner
Token stream: id + …
    ↓ Parser
Parse tree: none. The token stream is ill-formed according to our grammar, so parse tree construction fails: syntax error!

Most compiler errors occur in the parser.


Parsing

• For any CFG, one can build a parser that runs in O(n^3) time, where n is the length of the input
  • Well-known algorithms (e.g., Earley's algorithm and CYK)
• But O(n^3) time is unacceptable for a parser in a compiler!


Parsing

• Objective: build a parse tree for an input string of tokens from a single scan of input
  • This is possible only for special subclasses of context-free grammars (LL and LR)
• Two approaches
  • Top-down: builds parse tree from the root to the leaves
  • Bottom-up: builds parse tree from the leaves to the top
• Both are easily automated


Grammar for Comma-separated Lists

list → id list_tail    // list is the start symbol
list_tail → , id list_tail | ;

Generates comma-separated lists of id's.
E.g., id ;   and   id , id , id ;

Example derivation:
list ⇒ id list_tail
     ⇒ id , id list_tail
     ⇒ id , id ;


Top-down Parsing

• Terminals are seen in the order of appearance in the token stream
    id , id , id ;
• The parse tree is constructed
  • From the top to the leaves
  • Corresponds to a left-most derivation
• Look at the left-most nonterminal in the current sentential form and at the lookahead terminal, and "predict" which production to apply

list → id list_tail
list_tail → , id list_tail | ;

Parse tree:

list
├── id
└── list_tail
    ├── ,
    ├── id
    └── list_tail
        ├── ,
        ├── id
        └── list_tail
            └── ;
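The "look at the left-most nonterminal and the lookahead, then predict" step can be sketched for the list grammar as follows. This is an illustration, not the lecture's algorithm as stated: the function name `leftmost_derivation` is hypothetical, and the sketch relies on this grammar having no ε productions, so the position in the sentential form and the position in the token stream coincide.

```python
def leftmost_derivation(tokens):
    """Return the sentential forms of the left-most derivation of tokens."""
    form = ["list"]
    forms = [" ".join(form)]
    pos = 0  # index of the left-most symbol not yet matched against the input
    while pos < len(form):
        sym = form[pos]
        look = tokens[pos] if pos < len(tokens) else None
        if sym == "list" and look == "id":
            rhs = ["id", "list_tail"]          # predict list -> id list_tail
        elif sym == "list_tail" and look == ",":
            rhs = [",", "id", "list_tail"]     # predict list_tail -> , id list_tail
        elif sym == "list_tail" and look == ";":
            rhs = [";"]                        # predict list_tail -> ;
        elif sym == look:
            pos += 1                           # terminal matches the lookahead
            continue
        else:
            raise SyntaxError(f"unexpected {look!r} while expecting {sym!r}")
        form[pos:pos + 1] = rhs
        forms.append(" ".join(form))
    if pos != len(tokens):
        raise SyntaxError("extra input after the list")
    return forms

for step in leftmost_derivation("id , id ;".split()):
    print(step)
```

At every step exactly one branch applies, so the parser never has to back up.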


Bottom-up Parsing

• Terminals are seen in the order of appearance in the token stream
    id , id , id ;
• The parse tree is constructed
  • From the leaves to the top
  • A right-most derivation in reverse

list → id list_tail
list_tail → , id list_tail | ;

Parse tree:

list
├── id
└── list_tail
    ├── ,
    ├── id
    └── list_tail
        ├── ,
        ├── id
        └── list_tail
            └── ;


Today’s Lecture Outline

• Overview of scanning
• Overview of top-down and bottom-up parsing
• Top-down parsing
  • Recursive descent
  • LL(1) parsing tables


Top-down Predictive Parsing

• “Predicts” production to apply based on one or more lookahead token(s)
• Predictive parsers work with LL(k) grammars
  • First L stands for “left-to-right” scan of input
  • Second L stands for “left-most” derivation
    • Parse corresponds to left-most derivation
  • k stands for “need k tokens of lookahead to predict”
• We are interested in LL(1)


Question

• Can we always predict (i.e., for any input) which production to apply, based on one token of lookahead?
    id , id , id ;
• Yes, there is at most one choice (i.e., at most one production applies)
• This grammar is an LL(1) grammar

list → id list_tail
list_tail → , id list_tail | ;

Parse tree:

list
├── id
└── list_tail
    ├── ,
    ├── id
    └── list_tail
        ├── ,
        ├── id
        └── list_tail
            └── ;


Question

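• A new grammar
    list → list_prefix ;
    list_prefix → list_prefix , id | id
• What language does it generate?
  • Same, comma-separated lists
• Can we predict based on one token of lookahead?
    id , id , id ;
  • No: both productions for list_prefix can begin with id (the first one is left-recursive), so one token of lookahead cannot choose between them

Partial parse tree:

list
├── list_prefix
│   └── ?
└── ;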


Top-down Predictive Parsing

• Back to predictive parsing
• “Predicts” production to apply based on one or more lookahead token(s)
  • Parser always gets it right!
  • There is no need to backtrack, undo expansion, then try a different production
• Predictive parsers work with LL(k) grammars


Top-down Predictive Parsing

• Expression grammar:
    expr → expr + expr
         | expr * expr
         | id
  • Not LL(1)
• Unambiguous version:
    expr → expr + term | term
    term → term * id | id
  • Still not LL(1). Why?
• LL(1) version:
    expr → term term_tail
    term_tail → + term term_tail | ε
    term → id factor_tail
    factor_tail → * id factor_tail | ε
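The slide does not say how the LL(1) version was obtained. One standard route, assumed here, is eliminating immediate left recursion: A → A α | β becomes A → β A' and A' → α A' | ε. The sketch below applies that rewrite to the unambiguous grammar; the function name and the list-of-symbols representation are hypothetical.

```python
EPS = "ε"

def eliminate_immediate_left_recursion(nonterminal, productions, tail_name):
    """Rewrite A -> A α | β as A -> β A', A' -> α A' | ε.

    productions is a list of right-hand sides, each a list of symbols;
    tail_name is the name to use for the new nonterminal A'.
    """
    # α parts: right-hand sides that start with A, with the leading A removed.
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == nonterminal]
    # β parts: right-hand sides that do not start with A.
    betas = [rhs for rhs in productions if not rhs or rhs[0] != nonterminal]
    if not alphas:
        return {nonterminal: productions}  # nothing to do
    return {
        nonterminal: [beta + [tail_name] for beta in betas],
        tail_name: [alpha + [tail_name] for alpha in alphas] + [[EPS]],
    }

expr_rules = eliminate_immediate_left_recursion(
    "expr", [["expr", "+", "term"], ["term"]], "term_tail")
term_rules = eliminate_immediate_left_recursion(
    "term", [["term", "*", "id"], ["id"]], "factor_tail")

for rules in (expr_rules, term_rules):
    for lhs, alternatives in rules.items():
        print(lhs, "→", " | ".join(" ".join(rhs) for rhs in alternatives))
```

The result matches the LL(1) version above, with term_tail and factor_tail playing the role of the new A' nonterminals.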


Exercise

• Draw parse tree for expression
    id + id * id + id

expr → term term_tail
term_tail → + term term_tail | ε
term → id factor_tail
factor_tail → * id factor_tail | ε


Recursive Descent

• Each nonterminal has a procedure
• The right-hand-sides (rhs) for the nonterminal form the body of its procedure
• lookahead()
  • Peeks at current token in input stream
• match(t)
  • if lookahead() == t then consume current token, else PARSE_ERROR


Recursive Descent

start → expr $$
expr → term term_tail
term_tail → + term term_tail | ε
term → id factor_tail
factor_tail → * id factor_tail | ε

start()
    case lookahead() of
        id: expr(); match($$)    // $$ is the end-of-input marker
        otherwise: PARSE_ERROR

expr()
    case lookahead() of
        id: term(); term_tail()
        otherwise: PARSE_ERROR

term_tail()
    case lookahead() of
        +: match(‘+’); term(); term_tail()    // predicting production term_tail → + term term_tail
        $$: skip                              // predicting epsilon production term_tail → ε
        otherwise: PARSE_ERROR


Recursive Descent

start → expr $$
expr → term term_tail
term_tail → + term term_tail | ε
term → id factor_tail
factor_tail → * id factor_tail | ε

term()
    case lookahead() of
        id: match(‘id’); factor_tail()
        otherwise: PARSE_ERROR

factor_tail()
    case lookahead() of
        *: match(‘*’); match(‘id’); factor_tail()    // predicting production factor_tail → * id factor_tail
        +, $$: skip                                  // predicting production factor_tail → ε
        otherwise: PARSE_ERROR
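The five procedures can be put together into a runnable recognizer. This is a minimal sketch: the class name `Parser`, the `ParseError` exception, and the representation of tokens as plain strings are assumptions made for the example.

```python
class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$$"]  # append the end-of-input marker
        self.pos = 0

    def lookahead(self):
        """Peek at the current token without consuming it."""
        return self.tokens[self.pos]

    def match(self, t):
        """Consume the current token if it is t, else report a parse error."""
        if self.lookahead() != t:
            raise ParseError(f"expected {t!r}, found {self.lookahead()!r}")
        self.pos += 1

    def start(self):
        if self.lookahead() == "id":
            self.expr()
            self.match("$$")
        else:
            raise ParseError("start: expected id")

    def expr(self):
        if self.lookahead() == "id":
            self.term()
            self.term_tail()
        else:
            raise ParseError("expr: expected id")

    def term_tail(self):
        if self.lookahead() == "+":       # term_tail -> + term term_tail
            self.match("+")
            self.term()
            self.term_tail()
        elif self.lookahead() == "$$":    # term_tail -> ε
            pass
        else:
            raise ParseError("term_tail: expected + or $$")

    def term(self):
        if self.lookahead() == "id":
            self.match("id")
            self.factor_tail()
        else:
            raise ParseError("term: expected id")

    def factor_tail(self):
        if self.lookahead() == "*":       # factor_tail -> * id factor_tail
            self.match("*")
            self.match("id")
            self.factor_tail()
        elif self.lookahead() in ("+", "$$"):  # factor_tail -> ε
            pass
        else:
            raise ParseError("factor_tail: expected *, + or $$")

def accepts(tokens):
    """Return True if tokens form a valid expression."""
    try:
        Parser(tokens).start()
        return True
    except ParseError:
        return False

print(accepts("id + id * id + id".split()), accepts(["id", "+"]))
```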


LL(1) Parsing Table

• But how does the parser “predict”?
  • E.g., how does the parser know to expand a factor_tail by factor_tail → ε on + and $$?
• It uses the LL(1) parsing table
  • One dimension is nonterminal to expand
  • Other dimension is lookahead token
    • We are interested in one token of lookahead
  • Entry “nonterminal on token” contains the production to apply or contains nothing


LL(1) Parsing Table

• One dimension is nonterminal to expand
• Other dimension is lookahead token
• E.g., entry “nonterminal A on terminal a” contains production A → α
  Meaning: when parser is at nonterminal A and lookahead token is a, then parser expands A by production A → α

         a
    A    α


LL(1) Parsing Table

start → expr $$
expr → term term_tail
term_tail → + term term_tail | ε
term → id factor_tail
factor_tail → * id factor_tail | ε

               id                +                   *                   $$
start          expr $$           -                   -                   -
expr           term term_tail    -                   -                   -
term_tail      -                 + term term_tail    -                   ε
term           id factor_tail    -                   -                   -
factor_tail    -                 ε                   * id factor_tail    ε


Question

• Fill in the LL(1) parsing table for the comma-separated list grammar

start → list $$
list → id list_tail
list_tail → , id list_tail | ;

             id              ,                 ;    $$
start        list $$         -                 -    -
list         id list_tail    -                 -    -
list_tail    -               , id list_tail    ;    -


The End
