CS252: Systems Programming Ninghui Li Topic 5: Parsing Prepared by Evan Hanau ehanau@purdue.edu.
CS252: Systems Programming
Ninghui Li
Topic 5: Parsing
Prepared by Evan Hanau
Introduction to Parsing with Yacc
An Introduction to Parsing with Yacc
• Context-Free Grammars
• Yacc Parsing
• An example Infix Calculator Program
Context-Free Grammar
Background: The Context-Free Grammar
• By CS252 you are already somewhat familiar with Regular Expressions.
• Regular expressions can be used to describe regular languages, which belong to a larger classification of language types.
In CS, we classify languages on the Chomsky Hierarchy.
Type-0 Recursively Enumerable
Type-1 Context-Sensitive
Type-2 Context-Free
Type-3 Regular
Context-Free Grammar
Type-(i) is a superset of Type-(i+1)
• Languages generated by regular expressions belong to type 3.
• Note: Your specific regular expression engine (e.g. POSIX REs with back-references) is likely capable of matching more than the strictly regular languages.
• In any case, we need more than regular expressions to parse computer programming languages and shell scripts.
Context-Free Grammar
• You can do a great deal with regular expressions.
• Exercise: Create a regular expression that matches on any English phrase that is a palindrome, for instance the string “some men interpret nine memos”.
Context-Free Grammar
• This is in fact not possible with a regex (by its strict CS definition!). You would be limited to palindromes up to some fixed maximum length.
• An RE cannot express “a^n b^n”, a string with some number of a’s followed by an equal number of b’s.
• The expression a*b* does not require the number of a’s to equal the number of b’s.
• We must use a context-free grammar to describe palindromes and other constructs.
• More powerful than a regular expression, and useful when some notion of “what came before” is required.
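The gap between a*b* and a^n b^n can be seen in a short experiment. This is a minimal sketch using Python's `re` module; the helper names `matches_a_star_b_star` and `is_anbn` are made up for illustration.

```python
import re

def matches_a_star_b_star(s):
    """True if s is any number of a's followed by any number of b's."""
    return re.fullmatch(r"a*b*", s) is not None

def is_anbn(s):
    """True if s is n a's followed by exactly n b's (needs counting)."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

for s in ["aabb", "aaab", "abab"]:
    print(s, matches_a_star_b_star(s), is_anbn(s))
```

The string "aaab" is accepted by a*b* but is not of the form a^n b^n, because the regular expression has no way to remember how many a's it has seen.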
Context-Free Grammar
• BNF, or Backus-Naur form, is used in CS to describe context-free grammars. It is often used to describe the syntax of programming languages. A BNF grammar consists of one or more rules of the following form:
<nonterminal> ::= __expression__
• Where __expression__ consists of one or more terminals and nonterminals or nothing (epsilon).
Backus-Naur form
<postal-address> ::= <name-part> <street-address> <zip-part>
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
| <personal-part> <name-part>
<personal-part> ::= <first-name> | <initial> "."
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""
U.S. postal address in Backus-Naur form (from Wikipedia)
Let’s define a grammar for a primitive add or multiply expression:
<expr> ::= <expr> * <expr> | <expr> + <expr>
| number
In this case, <expr> is a nonterminal and number, *, and + are the terminals.
Context-Free Grammar for Simple Expressions
• Clearly, there is some ambiguity here, because operator precedence (sometimes referred to as binding) is not defined.
• The grammar does not distinguish between (2+2)*(2+2) = 16 (incorrect under normal rules) and 2+(2*2)+2 = 8 (correct); both are valid parses of 2+2*2+2.
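To make the ambiguity concrete, the sketch below enumerates every parse tree that the rule `<expr> ::= <expr> * <expr> | <expr> + <expr> | number` allows for 2+2*2+2 and collects the value each tree produces. The function name `all_values` and the token-list representation are assumptions made for this illustration.

```python
def all_values(tokens):
    """Return the value of every parse tree the ambiguous grammar allows.

    tokens alternates numbers and operators, e.g. [2, "+", 2, "*", 2].
    """
    if len(tokens) == 1:
        return [tokens[0]]          # <expr> ::= number
    results = []
    # Try every operator as the root of the tree.
    for i in range(1, len(tokens), 2):
        op = tokens[i]
        for left in all_values(tokens[:i]):
            for right in all_values(tokens[i + 1:]):
                results.append(left + right if op == "+" else left * right)
    return results

values = all_values([2, "+", 2, "*", 2, "+", 2])
print(len(values), sorted(set(values)))
```

Five distinct parse trees exist for this input, and they do not agree on the result, which is exactly what an ambiguous grammar means in practice.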
Context-Free Grammar
One Solution: Define expressions of different levels:
<expr> ::= <add_expr>
<add_expr> ::= <add_expr> + <mul_expr> | <mul_expr>
<mul_expr> ::= <mul_expr> * number | number
Now, multiplication will bind tighter than addition (this may require a few sample expressions to wrap your head around!)
Context-Free Grammar
Associativity follows from the above example (Hint: What side of the multiply and add operation did we have the “deeper” production on?)
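One way to experiment with the layered grammar is a small hand-written parser. The sketch below is not how Yacc works internally; it is a recursive-descent approximation in which each left-recursive rule becomes a loop, and every function name is hypothetical. It returns the parse tree as nested tuples so that both precedence and associativity are visible.

```python
def parse_expr(tokens):
    """<expr> ::= <add_expr>; returns the parse tree as nested tuples."""
    tree, pos = parse_add(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree

def parse_add(tokens, pos):
    # <add_expr> ::= <add_expr> + <mul_expr> | <mul_expr>
    # The loop builds the tree leftwards, so '+' groups to the left.
    left, pos = parse_mul(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_mul(tokens, pos + 1)
        left = ("+", left, right)
    return left, pos

def parse_mul(tokens, pos):
    # <mul_expr> ::= <mul_expr> * number | number
    left, pos = parse_number(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_number(tokens, pos + 1)
        left = ("*", left, right)
    return left, pos

def parse_number(tokens, pos):
    if pos >= len(tokens) or not isinstance(tokens[pos], int):
        raise SyntaxError("expected a number")
    return tokens[pos], pos + 1

def evaluate(tree):
    """Evaluate a tree produced by parse_expr."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    if op == "+":
        return evaluate(left) + evaluate(right)
    return evaluate(left) * evaluate(right)

tree = parse_expr([2, "+", 2, "*", 2, "+", 2])
print(tree)            # ('+', ('+', 2, ('*', 2, 2)), 2)
print(evaluate(tree))  # 8
```

The multiplication ends up deeper in the tree than either addition, and the first addition is nested inside the second, which is the left-to-right grouping the hint is pointing at.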
Context-Free Grammar
<palindrome> ::= letter
| // empty string
<palindrome> ::= “a” <palindrome> “a”
| “b” <palindrome> “b”
….
Or, <palindrome> ::= letter <palindrome> letter
However, we need to check the two letters are the same.
CFG for a^n b^n:
<anbn> ::= “a” <anbn> “b”
| // empty string
CFG for palindrome
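Both grammars translate almost directly into recursive functions, one branch per production. The following is a minimal sketch with hypothetical function names; it checks membership only and does not build a parse tree.

```python
def derives_palindrome(s):
    """Recognize strings generated by the <palindrome> grammar."""
    # <palindrome> ::= letter | (empty string)
    if len(s) <= 1:
        return True
    # <palindrome> ::= x <palindrome> x, where both x are the same letter
    return s[0] == s[-1] and derives_palindrome(s[1:-1])

def derives_anbn(s):
    """Recognize strings generated by the a^n b^n grammar."""
    # (empty string)
    if s == "":
        return True
    # "a" <anbn> "b"
    return len(s) >= 2 and s[0] == "a" and s[-1] == "b" and derives_anbn(s[1:-1])

phrase = "some men interpret nine memos".replace(" ", "")
print(derives_palindrome(phrase))
print(derives_anbn("aaabbb"), derives_anbn("aab"))
```

The recursion depth grows with the length of the input, which is the "memory" that a finite automaton lacks.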
Chomsky Hierarchy (From Wikipedia Page)
| Grammar | Languages | Automaton | Production rules (constraints) |
| Type-0 | Recursively enumerable | Turing machine | (no restrictions) |
| Type-1 | Context-sensitive | Linear-bounded non-deterministic Turing machine | αAβ → αγβ (equivalently, right side no shorter than left) |
| Type-2 | Context-free | Non-deterministic pushdown automaton | A → γ |
| Type-3 | Regular | Finite state automaton | A → a and A → aB |
Type-0 Recursively Enumerable
Type-1 Context-Sensitive
Cannot encode all strings r1r2 such that r1 and r2 are two regular expressions that are equivalent
Type-2 Context-Free (Pushdown Automaton, i.e., Finite State Automaton with a Stack)
Can encode a^n b^n, but not a^n b^n c^n
Type-3 Regular (Finite State Automaton)
Can encode a*b*, but not a^n b^n
Chomsky Hierarchy Revisited.
Why is Context-Free Grammar Called Context Free?
In a CFG, the left-hand side of each production is a single nonterminal, e.g., <palindrome> ::= “a” <palindrome> “a”
This means that “a”, followed by a <palindrome>, followed by “a” can always be treated as a <palindrome>, no matter what context surrounds it; hence “context-free”.
In a Context-Sensitive Grammar, the left-hand side of a production rule can include other symbols around the nonterminal, so a rule applies only in a particular context.
An Example Context-Sensitive Grammar for a^n b^n c^n
1. S → a B C
2. S → a S B C
3. C B → B C
4. a B → a b
5. b B → b b
6. b C → b c
7. c C → c c
S→2 aSBC →1 aaBCBC
→3 aaBBCC
→4 aabBCC
→5 aabbCC
→6 aabbcC
→7 aabbcc
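The derivation can be replayed mechanically as string rewriting. In the sketch below the rules are written with explicit arrows, and rule 3 is read as CB → BC, which is what step 3 of the derivation (aaBCBC to aaBBCC) implies. The `RULES` table and the `derive` helper are illustrative names, and each step rewrites the leftmost match only.

```python
# Rule number -> (left-hand side, right-hand side)
RULES = {
    1: ("S", "aBC"),
    2: ("S", "aSBC"),
    3: ("CB", "BC"),
    4: ("aB", "ab"),
    5: ("bB", "bb"),
    6: ("bC", "bc"),
    7: ("cC", "cc"),
}

def derive(start, steps):
    """Apply the numbered rules in order; return every sentential form."""
    forms = [start]
    for number in steps:
        lhs, rhs = RULES[number]
        current = forms[-1]
        if lhs not in current:
            raise ValueError(f"rule {number} does not apply to {current!r}")
        forms.append(current.replace(lhs, rhs, 1))  # leftmost match only
    return forms

forms = derive("S", [2, 1, 3, 4, 5, 6, 7])
print(" -> ".join(forms))
```

Rules 3 through 7 have more than one symbol on the left-hand side, which is what separates this grammar from a context-free one.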
Yacc & Parsing
There are many ways to parse BNF grammars, most of which are discussed in a compilers course.
• Recall: A finite state automaton (FSA) is used for regular expressions. (CS182).
• For a context-free grammar, we use a pushdown automaton, which combines features of an FSA with a stack.
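As a rough illustration of "an FSA with a stack", the sketch below recognizes a^n b^n using one state variable and an explicit stack. It is a hand-written toy, not something Yacc would generate, and the state names are invented for this example.

```python
def accepts_anbn(s):
    """Toy pushdown automaton: push for each 'a', pop for each 'b'."""
    stack = []
    state = "reading_a"
    for ch in s:
        if state == "reading_a" and ch == "a":
            stack.append("A")          # remember one unmatched 'a'
        elif ch == "b" and stack:
            state = "reading_b"        # no more a's allowed from here on
            stack.pop()                # match this 'b' with an earlier 'a'
        else:
            return False               # 'a' after 'b', extra 'b', or bad symbol
    return not stack                   # accept only if every 'a' was matched

print([accepts_anbn(s) for s in ["", "ab", "aabb", "aab", "abab"]])
```

The finite part (the state variable) tracks which phase the input is in; the stack supplies the unbounded counting that the finite part cannot do.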
Yacc generates what is known as an LALR(1) parser from the BNF grammar in your Yacc file. This parser is defined in the C source file that Yacc outputs.
We use Lex to make a lexer to generate our terminals, which are matched with regular expressions before being fed into the parser.
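The lexer's job can be approximated in a few lines. This is only a sketch of the idea in Python rather than a Lex specification; the token names (`NUMBER`, `PLUS`, `TIMES`) and the `tokenize` function are assumptions, not taken from the course's calculator.l.

```python
import re

# Each terminal is described by a regular expression, as in a Lex file.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("PLUS", r"\+"),
    ("TIMES", r"\*"),
    ("SKIP", r"[ \t]+"),   # whitespace is matched but not passed on
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Turn input characters into a list of (terminal, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("2 + 2*2"))
```

The parser never sees individual characters, only this stream of terminals.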
Yacc & Parsing
Yacc is capable of generating a powerful parser that will handle many different grammars.
Input Characters → LEX Lexer → terminals → YACC Parser → Rule-Based Behavior
Yacc & Parsing
• Recall that parsing combines a state machine with a stack. States go on a stack to keep track of where parsing is. Yacc uses a parse table which defines possible states.
• Yacc’s parser operates using two primary actions, shift and reduce.
• shift pushes a state onto the stack; reduce pops state(s) off the stack and replaces a combination of nonterminals and terminals with a single nonterminal. After reducing by a rule, Yacc’s parser optionally runs user-defined code attached to that rule.
Yacc & Parsing
A very basic example:
<rule> ::= “hello” “world” “\n”
The parser would shift each token in turn, pushing a state for each (“hello”, “world”, “\n”) onto the stack. At the end of the rule, it would reduce everything to <rule>, popping the three states.
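The shift and reduce steps for this one-rule grammar can be traced with a toy loop. Note the simplification: a real Yacc parser pushes states taken from its parse table, whereas this sketch pushes the tokens themselves, and the names `RULE_LHS`, `RULE_RHS`, and `parse` are invented for the example.

```python
RULE_LHS = "<rule>"
RULE_RHS = ["hello", "world", "\n"]

def parse(tokens):
    """Shift every token; reduce when the top of the stack matches the rule."""
    stack, trace = [], []
    for token in tokens:
        stack.append(token)
        trace.append(("shift", token))
        if stack[-len(RULE_RHS):] == RULE_RHS:
            del stack[-len(RULE_RHS):]      # pop the three entries
            stack.append(RULE_LHS)          # replace them with the nonterminal
            trace.append(("reduce", RULE_LHS))
    return stack, trace

stack, trace = parse(["hello", "world", "\n"])
for action in trace:
    print(action)
print(stack)
```

Any user-defined code attached to the rule would run at the point where the reduce is recorded.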
Yacc & Parsing
A Lex/Yacc Infix Calculator
• Yacc’s parser is powerful, but is not capable of parsing all grammars.
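• Ambiguous grammars (and some unambiguous grammars that are not LALR(1)) produce what is known as a shift/reduce or reduce/reduce conflict. On a shift/reduce conflict, Yacc will, by default, shift instead of reduce; on a reduce/reduce conflict, it reduces by the rule that appears first in the grammar file.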
Consider the classic shift/reduce conflict example:
<ifexp> ::= IF <expr> THEN <stmt> ELSE <stmt> | IF <expr> THEN <stmt>
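Yacc will report a shift/reduce conflict here when a <stmt> can itself be an <ifexp>: after IF <expr> THEN <stmt>, with ELSE as the next token, it could either reduce by the shorter rule or shift the ELSE. By default it shifts, so each ELSE attaches to the nearest IF. It’s greedy!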
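The effect of always preferring shift can be imitated with a small recursive parser that grabs an ELSE whenever one is available. This is a recursive-descent sketch rather than an LALR parser, and it assumes for simplicity that every condition and every non-IF statement is a single token; `parse_stmt` is a hypothetical name.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement at tokens[pos]; return (tree, next position)."""
    if tokens[pos] == "IF":
        cond = tokens[pos + 1]
        if tokens[pos + 2] != "THEN":
            raise SyntaxError("expected THEN")
        then_branch, pos = parse_stmt(tokens, pos + 3)
        # Greedy choice: if an ELSE follows, take it now (the "shift").
        if pos < len(tokens) and tokens[pos] == "ELSE":
            else_branch, pos = parse_stmt(tokens, pos + 1)
            return ("if", cond, then_branch, else_branch), pos
        return ("if", cond, then_branch), pos
    return tokens[pos], pos + 1

tree, _ = parse_stmt("IF c1 THEN IF c2 THEN s1 ELSE s2".split())
print(tree)   # ('if', 'c1', ('if', 'c2', 's1', 's2'))
```

The ELSE ends up inside the inner IF, which matches the convention used by most programming languages.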
Yacc & Parsing
A Lex/Yacc Infix Calculator
• To demonstrate the utility of Lex and Yacc (or in our case, Flex and Bison) we provide an example infix calculator.
• It is similar to several of the examples provided on the Lex and Yacc page at http://dinosaur.compilertools.net, but with added features.
A Lex/Yacc Infix Calculator
• Make sure to read ALL source code comments, particularly those that describe source file organization.
• Lex definition file: calculator.l
• Yacc grammar file: calculator.y
• AST Classes: ast.cc
• Symbol table: symtab.cc
A Lex/Yacc Infix Calculator
The example calculator application uses Lex and Yacc to parse mathematical expressions and produce an Abstract Syntax Tree, which is then used to evaluate those expressions.
It allows the =, +, -, *, /, ^, () and unary minus operators, with appropriate levels of binding and precedence. Examine calculator.y; it is heavily commented.
A Lex/Yacc Infix Calculator
• The symbol table (implemented here as a simple structure with O(n) lookup time) maps variables to values.
• Print the AST after every expression evaluation by running calculator with the -t flag, e.g. “calc -t”.
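A symbol table with O(n) access can be as simple as a list that is searched from the front. The sketch below is an assumption about the general shape only; the actual symtab.cc is not shown here, and the class and method names are hypothetical.

```python
class SymbolTable:
    """Maps variable names to values using a linear search (O(n) lookup)."""

    def __init__(self):
        self._entries = []              # list of [name, value] pairs

    def set(self, name, value):
        for entry in self._entries:
            if entry[0] == name:
                entry[1] = value        # overwrite an existing variable
                return
        self._entries.append([name, value])

    def get(self, name):
        for entry_name, value in self._entries:
            if entry_name == name:
                return value
        raise KeyError(name)

table = SymbolTable()
table.set("x", 2.0)
table.set("y", 3.0)
table.set("x", 5.0)
print(table.get("x"), table.get("y"))
```

A hash table would give faster lookups, but for a handful of calculator variables the linear scan is easy to read and fast enough.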
A calculator example. Type “2*2^3/3” and press enter:
calc> 2*2^3/3
= 5.333333
[AST printout, drawn sideways: the root is (/) with operands (*) and 3.000; (*) has operands 2.000 and (^); (^) has operands 2.000 and 3.000]
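The way an AST is used to evaluate 2*2^3/3 can be sketched with nested tuples standing in for the AST classes. This is not the code from ast.cc; the tuple layout, the `OPS` table, and `evaluate` are assumptions made for illustration.

```python
import operator

# Binary operators supported by this sketch.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "^": operator.pow,
}

def evaluate(node):
    """Evaluate an AST node: a number, or a tuple (op, left, right)."""
    if isinstance(node, (int, float)):
        return float(node)
    op, left, right = node
    return OPS[op](evaluate(left), evaluate(right))

# 2*2^3/3 parsed as (2 * (2 ^ 3)) / 3
ast = ("/", ("*", 2.0, ("^", 2.0, 3.0)), 3.0)
print(f"= {evaluate(ast):.6f}")   # = 5.333333
```

Evaluation is a post-order walk: both operands of a node are evaluated before the operator at that node is applied.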
A Lex/Yacc Infix Calculator
Review
• What is required:
• Able to write simple Context-Free Grammars, similar to those used in implementing FIZ
• Able to determine whether a string of tokens is accepted by a grammar
• Able to show how a string of tokens is parsed into some nonterminal (i.e., draw the parse tree)