@dbolkensteyn @_godin_#parsing
The Art of Parsing
Evgeny Mandrikov @_godin_
Dinesh Bolkensteyn @dbolkensteyn
http://sonarsource.com
The Art of Parsing
// TODO: don't forget to add a huge disclaimer that all opinions hereinbelow are our own and not our employer's (they wish they had them)
Evgeny Mandrikov @_godin_
Dinesh Bolkensteyn @dbolkensteyn
I want to create a parser
«Done»!
Use Yacc, JavaCC, ANTLR, SSLR, …
or hand-written?
What is the plan?
Why
• javac and GCC are hand-written
• do we use parser-generators?
Together we will implement a parser for
• arithmetic expressions
• common constructions from Java
• C++ ;)
Java formal grammar
JLS8
JLS7
Answer is
42
Pill of theory
NUM ➙ 42
Nonterminal: NUM
Production: NUM ➙ 42
Terminal (token): 42
Grammar for numbers
NUM ➙ NUM DIGIT | DIGIT
DIGIT ➙ 0|1|2|3|4|5|6|7|8|9
Alternatives are separated by «|»
Matches: 4, 8, 15, 16, 23, 42, …
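The two productions above can be read as "one digit, then any number of further digits". A minimal sketch of that reading follows; the class name `NumRecognizer` and its methods are illustrative and do not come from the slides.

```java
final class NumRecognizer {
    // DIGIT -> 0|1|2|3|4|5|6|7|8|9
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // NUM -> NUM DIGIT | DIGIT: at least one digit, nothing but digits.
    static boolean isNum(String s) {
        if (s.isEmpty()) return false;
        for (int i = 0; i < s.length(); i++) {
            if (!isDigit(s.charAt(i))) return false;
        }
        return true;
    }

    // The value is built left to right, mirroring NUM -> NUM DIGIT.
    static int value(String s) {
        if (!isNum(s)) throw new IllegalArgumentException("not a NUM: " + s);
        int res = 0;
        for (int i = 0; i < s.length(); i++) {
            res = res * 10 + (s.charAt(i) - '0');
        }
        return res;
    }

    public static void main(String[] args) {
        System.out.println(value("42"));
    }
}
```

The left-recursive production is turned into a loop here, the same trick the expression parser needs later.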
Arithmetic expressions
4 – 3 – 2 = ?
Arithmetic expressions
expr ➙ expr – expr | NUM
4 – 3 – 2 = ?
Arithmetic expressions
expr ➙ expr – expr | NUM
[Parse tree: expr(expr(4 – 3) – 2)]
(4 – 3) – 2 = -1
Arithmetic expressions
expr ➙ expr – expr | NUM
[Parse tree 1: expr(expr(4 – 3) – 2)]  (4 – 3) – 2 = -1
[Parse tree 2: expr(4 – expr(3 – 2))]  4 – (3 – 2) = 3
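That expr ➙ expr – expr | NUM allows both results can be checked mechanically. The sketch below is a brute-force enumeration, not a parsing algorithm from the talk: it tries every way of splitting the operand list into a left and a right expr and collects the value of each resulting tree. The class name `AmbiguousMinus` and its methods are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

final class AmbiguousMinus {
    // All values that expr -> expr - expr | NUM can assign to nums[from..to).
    private static List<Integer> values(int[] nums, int from, int to) {
        List<Integer> out = new ArrayList<>();
        if (to - from == 1) {            // expr -> NUM
            out.add(nums[from]);
            return out;
        }
        // expr -> expr - expr: every split point is a distinct parse tree.
        for (int split = from + 1; split < to; split++) {
            for (int left : values(nums, from, split)) {
                for (int right : values(nums, split, to)) {
                    out.add(left - right);
                }
            }
        }
        return out;
    }

    static List<Integer> allValues(int... nums) {
        return values(nums, 0, nums.length);
    }

    public static void main(String[] args) {
        System.out.println(allValues(4, 3, 2)); // one entry per parse tree
    }
}
```

For 4 – 3 – 2 the list has two entries, 3 and -1, one per tree; a grammar that yields more than one tree for the same input is ambiguous.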
Arithmetic expressions
expr ➙ expr – expr | NUM    (ambiguous: both trees)
expr ➙ NUM – expr | NUM    (right-recursive: only tree 2)
[Parse tree 1: expr(expr(4 – 3) – 2)]  (4 – 3) – 2 = -1
[Parse tree 2: expr(4 – expr(3 – 2))]  4 – (3 – 2) = 3
Arithmetic expressions
expr ➙ expr – expr | NUM    (ambiguous: both trees)
expr ➙ NUM – expr | NUM    (right-recursive: only tree 2)
expr ➙ expr – NUM | NUM    (left-recursive: only tree 1)
[Parse tree 1: expr(expr(4 – 3) – 2)]  (4 – 3) – 2 = -1
[Parse tree 2: expr(4 – expr(3 – 2))]  4 – (3 – 2) = 3
Show me the code
expr ➙ expr – NUM | NUM
int expr() {
  int res = expr();
  if (token == '-') return res - num();
  return num();
}
Show me the right code
expr ➙ expr – NUM | NUM
int expr() {
  int res = expr(); // ??? expr() calls itself before consuming any token
  if (token == '-') return res - num();
  return num();
}
Show me the right code
expr ➙ expr – NUM | NUM
// wrong: the left-recursive rule translated literally never terminates
int expr() {
  int res = expr();
  if (token == '-') return res - num();
  return num();
}
// right: the left recursion rewritten as a loop
int expr() {
  int res = num();
  while (token == '-') res = res - num(); // advancing past '-' is left out for brevity
  return res;
}
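The loop version of expr() can be made runnable with very little extra machinery. The following is a sketch under a few assumptions that are not on the slides: tokens are single characters, the minus sign is ASCII '-', numbers are non-negative integers, and the names `MinusParser`, `token()` and `eval()` are made up for illustration.

```java
final class MinusParser {
    private final String input;
    private int pos = 0;

    MinusParser(String input) {
        this.input = input.replace(" ", ""); // spaces are not part of the grammar
    }

    // Current token: the next character, or '\0' at end of input.
    private char token() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    // NUM: one or more digits.
    private int num() {
        int start = pos;
        while (Character.isDigit(token())) pos++;
        if (start == pos) throw new IllegalStateException("number expected at " + pos);
        return Integer.parseInt(input.substring(start, pos));
    }

    // expr -> expr - NUM | NUM, with the left recursion turned into a loop.
    int expr() {
        int res = num();
        while (token() == '-') {
            pos++;               // consume '-'
            res = res - num();   // fold to the left: ((4 - 3) - 2)
        }
        return res;
    }

    static int eval(String text) {
        MinusParser p = new MinusParser(text);
        int value = p.expr();
        if (p.pos != p.input.length()) throw new IllegalStateException("unexpected input at " + p.pos);
        return value;
    }

    public static void main(String[] args) {
        System.out.println(eval("4 - 3 - 2")); // prints -1
    }
}
```

Accumulating into `res` inside the loop is what keeps subtraction left-associative even though the code no longer recurses on the left.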
Arithmetic expressions
4 – 3 * 2 = ?
Arithmetic expressions
4 – 3 * 2 = -2
expr ➙ expr – NUM | expr * NUM | NUM
Arithmetic expressions
expr ➙ expr – NUM | expr * NUM | NUM
4 – (3 * 2) = -2    (expected)
(4 – 3) * 2 = 2     (what this grammar yields)
Arithmetic expressions
subs ➙ subs – mult | mult
mult ➙ mult * NUM | NUM
4 – (3 * 2) = -2
Show me the code
subs ➙ subs – mult | mult
mult ➙ mult * NUM | NUM
int subs() {
  int res = mult();
  while (token == '-') res = res - mult();
  return res;
}
int mult() {
  int res = num();
  while (token == '*') res = res * num();
  return res;
}
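To see the precedence rule at work, the two functions can be wrapped in a small runnable class. This is a sketch with the same assumptions as a typical toy parser: single-character tokens, ASCII '-' and '*', non-negative integers. The names `PrecedenceParser`, `token()` and `eval()` are illustrative only.

```java
final class PrecedenceParser {
    private final String input;
    private int pos = 0;

    PrecedenceParser(String input) {
        this.input = input.replace(" ", "");
    }

    private char token() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    private int num() {
        int start = pos;
        while (Character.isDigit(token())) pos++;
        if (start == pos) throw new IllegalStateException("number expected at " + pos);
        return Integer.parseInt(input.substring(start, pos));
    }

    // mult -> mult * NUM | NUM
    private int mult() {
        int res = num();
        while (token() == '*') {
            pos++;
            res = res * num();
        }
        return res;
    }

    // subs -> subs - mult | mult: operands of '-' are whole products.
    int subs() {
        int res = mult();
        while (token() == '-') {
            pos++;
            res = res - mult();
        }
        return res;
    }

    static int eval(String text) {
        PrecedenceParser p = new PrecedenceParser(text);
        int value = p.subs();
        if (p.pos != p.input.length()) throw new IllegalStateException("unexpected input at " + p.pos);
        return value;
    }

    public static void main(String[] args) {
        System.out.println(eval("4 - 3 * 2")); // prints -2
    }
}
```

Precedence falls out of the nesting: the rule for the weaker operator calls the rule for the stronger one, so a product is always complete before it is subtracted.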
LL(1)
● back to 1969
● one token lookahead
● no left-recursion
What is the plan?
✔ arithmetic expressions
✔ LL(1)
• a few common constructions from Java
• C++ ;)
The real deal
expr-stmt ➙ expr ;
obj.method();
a = obj.field;
The real deal
expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
obj.method();
a = obj.field;
The real deal
expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
             | id
obj.method();
a = obj.field;
The real deal
expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
             | id
method-call ➙ qualified-id ()
obj.method();
a = obj.field;
The real deal
expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
             | id
method-call ➙ qualified-id ()
assignment ➙ qualified-id = expr
obj.method();
a = obj.field;
Show me the code
expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
             | id
method-call ➙ qualified-id ()
assignment ➙ qualified-id = expr
obj.method();
a = obj.field;
int qualified_id() { /* easy */ }
int field_access() { /* easy */ }
int method_call() { /* easy */ }
int assignment() { /* easy */ }
int expr() {
  // ???
}
The LL(1) way
expr ➙ field-access
     | method-call
     | assignment
obj.method();
a = obj.field;
int expr() {
  String id = qualified_id();
  if (token == '(') return method_call();
  else if (token == '=') return assignment();
  else return field_access();
}
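All three alternatives start with a qualified-id, so one token of lookahead cannot tell them apart up front. The LL(1) way is to parse the shared prefix once and let the token after it decide. Here is a runnable sketch of that idea; it returns a string describing the tree instead of an int, treats every character as a token, and the names `Ll1Expr`, `expect()` and `parseStatement()` are invented for the example.

```java
final class Ll1Expr {
    private final String in;
    private int pos = 0;

    Ll1Expr(String text) {
        in = text.replace(" ", "");
    }

    private char token() {
        return pos < in.length() ? in.charAt(pos) : '\0';
    }

    private void expect(char c) {
        if (token() != c) throw new IllegalStateException("'" + c + "' expected at " + pos);
        pos++;
    }

    private String id() {
        int start = pos;
        while (Character.isLetterOrDigit(token()) || token() == '_') pos++;
        if (start == pos) throw new IllegalStateException("identifier expected at " + pos);
        return in.substring(start, pos);
    }

    // qualified-id -> qualified-id . id | id, written as a loop
    private String qualifiedId() {
        StringBuilder name = new StringBuilder(id());
        while (token() == '.') {
            pos++;
            name.append('.').append(id());
        }
        return name.toString();
    }

    // The common prefix is parsed once; the next token picks the alternative.
    String expr() {
        String name = qualifiedId();
        if (token() == '(') { pos++; expect(')'); return "call(" + name + ")"; }
        if (token() == '=') { pos++; return "assign(" + name + ", " + expr() + ")"; }
        return "field(" + name + ")";
    }

    // expr-stmt -> expr ;
    static String parseStatement(String text) {
        Ll1Expr p = new Ll1Expr(text);
        String tree = p.expr();
        p.expect(';');
        if (p.pos != p.in.length()) throw new IllegalStateException("unexpected input at " + p.pos);
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parseStatement("obj.method();"));
        System.out.println(parseStatement("a = obj.field;"));
    }
}
```

The price is visible in the code: the three productions of the grammar no longer exist as separate functions, which is the kind of grammar change LL(1) forces.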
Reality
http://hg.openjdk.java.net/jdk8/jdk8/langtools/.../JavacParser.java
The better way
expr ➙ field-access
     | method-call
     | assignment
obj.method();
a = obj.field;
int expr() {
  try {
    return field_access();
  } catch (RE e1) {
    try {
      return method_call();
    } catch (RE e2) {
      return assignment();
    }
  }
}
Show me the right code
expr ➙ method-call
     / assignment
     / field-access
obj.method();
a = obj.field;
int expr() {
  try {
    return method_call();
  } catch (RE e1) {
    try {
      return assignment();
    } catch (RE e2) {
      return field_access();
    }
  }
}
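Why the order of alternatives matters can be shown with a small backtracking parser. The sketch below is not the code from the slides: instead of exceptions it signals failure with `null` and rewinds the input position explicitly, and the class `PegExpr`, the `fieldAccessFirst` flag and `parseStatement()` are all hypothetical names.

```java
import java.util.function.Supplier;

final class PegExpr {
    private final String in;
    private final boolean fieldAccessFirst; // true = the order that does not work
    private int pos = 0;

    PegExpr(String text, boolean fieldAccessFirst) {
        this.in = text.replace(" ", "");
        this.fieldAccessFirst = fieldAccessFirst;
    }

    private boolean eat(char c) {
        if (pos < in.length() && in.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private String id() {
        int start = pos;
        while (pos < in.length() && Character.isLetterOrDigit(in.charAt(pos))) pos++;
        return pos > start ? in.substring(start, pos) : null;
    }

    private String qualifiedId() {
        String first = id();
        if (first == null) return null;
        StringBuilder name = new StringBuilder(first);
        while (true) {
            int save = pos;
            if (!eat('.')) break;
            String next = id();
            if (next == null) { pos = save; break; }
            name.append('.').append(next);
        }
        return name.toString();
    }

    // Ordered choice: first alternative that succeeds wins; rewind after a failure.
    @SafeVarargs
    private final String firstOf(Supplier<String>... alternatives) {
        int save = pos;
        for (Supplier<String> alternative : alternatives) {
            String tree = alternative.get();
            if (tree != null) return tree;
            pos = save; // backtrack
        }
        return null;
    }

    private String methodCall() {
        String q = qualifiedId();
        return q != null && eat('(') && eat(')') ? "call(" + q + ")" : null;
    }

    private String assignment() {
        String q = qualifiedId();
        if (q == null || !eat('=')) return null;
        String rhs = expr();
        return rhs == null ? null : "assign(" + q + ", " + rhs + ")";
    }

    private String fieldAccess() {
        String q = qualifiedId();
        return q == null ? null : "field(" + q + ")";
    }

    String expr() {
        return fieldAccessFirst
            ? firstOf(this::fieldAccess, this::methodCall, this::assignment)
            : firstOf(this::methodCall, this::assignment, this::fieldAccess);
    }

    // expr-stmt -> expr ; (returns null when the statement does not parse)
    static String parseStatement(String text, boolean fieldAccessFirst) {
        PegExpr p = new PegExpr(text, fieldAccessFirst);
        String tree = p.expr();
        return tree != null && p.eat(';') && p.pos == p.in.length() ? tree : null;
    }

    public static void main(String[] args) {
        System.out.println(parseStatement("obj.method();", false));
        System.out.println(parseStatement("obj.method();", true));
    }
}
```

With field-access first, `obj.method` is accepted as a field access, the choice is committed, and the statement then fails on `(`; with method-call first the same input parses. Ordered choice never reconsiders an alternative that already succeeded, so the longest or most specific alternative has to come first.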
Parsing Expression Grammars
● 2002
● ordered choice «/»
● backtracking
● no left-recursion
DSL for PEG
expr ➙ method-call
     / assignment
     / field-access
obj.method();
a = obj.field;
enum Nonterminals { EXPR, METHOD_CALL, … }
void grammar() {
  rule(EXPR).is(
    firstOf(
      METHOD_CALL,
      ASSIGNMENT,
      FIELD_ACCESS));
}
What is the plan?
✔ arithmetic expressions
✔ LL(1)
✔ common constructions from Java
✔ PEG
• C++ ;)
Tea Break
Quiz
if (false) if (true) System.out.println("foo"); else System.out.println("bar");
«Dangling else»
if (false) if (true) System.out.println("foo"); else System.out.println("bar");
if-stmt ➙ IF (cond) stmt ELSE stmt
        / IF (cond) stmt
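The quiz can be settled by running it. In Java the else belongs to the nearest if, which is also what the ordered choice above produces: the inner if-stmt tries the alternative with ELSE first and takes the else for itself. The small check below keeps the statement brace-free on purpose; the class `DanglingElse` and the method `run()` are illustrative names.

```java
final class DanglingElse {
    // Same shape as the quiz, with the two conditions as parameters.
    static String run(boolean outer, boolean inner) {
        StringBuilder out = new StringBuilder();
        if (outer) if (inner) out.append("foo"); else out.append("bar");
        return out.toString();
    }

    public static void main(String[] args) {
        // Quiz case: outer condition false, inner condition true.
        System.out.println("[" + run(false, true) + "]"); // prints []
    }
}
```

Nothing is printed for the quiz input. If the else were attached to the outer if, `run(false, true)` would return "bar" instead of the empty string.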
Java is awesome
(A)*B
C++ all the pains of the world
(A)*B
int *B;
typedef int A;
(A)*B; // cast to type 'A' ('int' alias)
       // of dereference of expression 'B'
int A, B;
(A)*B; // multiplication of 'A' and 'B'
       // with redundant parentheses around 'A'
Java is good, because it was influenced by the bad experience of C++
Hit the wall !
(A)*B
rule(MUL_EXPR).is(
  UNARY_EXPR,
  zeroOrMore('*', UNARY_EXPR));
rule(UNARY_EXPR).is(
  firstOf(
    sequence('(', TYPE_ID, ')', UNARY_EXPR),
    PRIMARY,
    sequence('*', UNARY_EXPR)));
rule(PRIMARY).is(
  firstOf(
    sequence('(', EXPR, ')'),
    ID));
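The wall can be reproduced with a hand-rolled version of these three rules. The sketch below is a simplification under stated assumptions: TYPE_ID is any identifier (a grammar alone cannot know about typedefs), EXPR is just MUL_EXPR, and the class `CastOrMultiply` with its `castFirst` flag is a hypothetical construction for swapping the first two alternatives of UNARY_EXPR.

```java
final class CastOrMultiply {
    private final String in;
    private final boolean castFirst; // order of the first two UNARY_EXPR alternatives
    private int pos = 0;

    CastOrMultiply(String text, boolean castFirst) {
        this.in = text.replace(" ", "");
        this.castFirst = castFirst;
    }

    private boolean eat(char c) {
        if (pos < in.length() && in.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private String id() {
        int start = pos;
        while (pos < in.length() && Character.isLetter(in.charAt(pos))) pos++;
        return pos > start ? in.substring(start, pos) : null;
    }

    // MUL_EXPR: UNARY_EXPR ('*' UNARY_EXPR)*
    private String mulExpr() {
        String left = unaryExpr();
        if (left == null) return null;
        while (true) {
            int save = pos;
            if (!eat('*')) break;
            String right = unaryExpr();
            if (right == null) { pos = save; break; }
            left = "mul(" + left + ", " + right + ")";
        }
        return left;
    }

    // '(' TYPE_ID ')' UNARY_EXPR
    private String cast() {
        int save = pos;
        if (eat('(')) {
            String type = id();
            if (type != null && eat(')')) {
                String operand = unaryExpr();
                if (operand != null) return "cast(" + type + ", " + operand + ")";
            }
        }
        pos = save;
        return null;
    }

    // PRIMARY: '(' EXPR ')' / ID
    private String primary() {
        int save = pos;
        if (eat('(')) {
            String inner = mulExpr();
            if (inner != null && eat(')')) return inner;
        }
        pos = save;
        return id();
    }

    // '*' UNARY_EXPR
    private String deref() {
        int save = pos;
        if (eat('*')) {
            String operand = unaryExpr();
            if (operand != null) return "deref(" + operand + ")";
        }
        pos = save;
        return null;
    }

    private String unaryExpr() {
        String tree = castFirst ? cast() : primary();
        if (tree == null) tree = castFirst ? primary() : cast();
        return tree != null ? tree : deref();
    }

    static String parse(String text, boolean castFirst) {
        CastOrMultiply p = new CastOrMultiply(text, castFirst);
        String tree = p.mulExpr();
        return tree != null && p.pos == p.in.length() ? tree : null;
    }

    public static void main(String[] args) {
        System.out.println(parse("(A)*B", true));
        System.out.println(parse("(A)*B", false));
    }
}
```

With the cast alternative first, `(A)*B` is always a cast of a dereference; with PRIMARY first it is always a multiplication. Whichever order is chosen, the other reading becomes unreachable, and neither order looks at how `A` was declared.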
Dream
(A)*B
mul-expr ➙ mul-expr * unary-expr | unary-expr
unary-expr ➙ ( type-id ) unary-expr | * unary-expr | primary
primary ➙ ( expr ) | id
Generalized parsers
● Earley (1968)
  ● slow
● GLR (1984)
  ● complex
Chicken and egg problem
(A)*B
[Figure: (A)*B read either as a unary-expr or as a mul-expr]
mul-expr ➙ mul-expr * unary-expr
         | unary-expr
unary-expr ➙ ( type-id ) unary-expr
           | * unary-expr
           | primary
primary ➙ ( expr )
        | id
Back to the future «dangling else»
if (…) if (…) then-stmt else else-stmt
[Figure: two parse trees for the statement above, with else-stmt attached either to inner-if or to outer-if]
GLL: How does it work?
mul-expr ➙ mul-expr * unary-expr
| unary-expr
Generalized LL
● 2010
● no grammar left behind (left-recursive, ambiguous)
● simpler than GLR
● syntactic ambiguities
Summary
Summary
LL(1)
• trivial
• major grammar changes
• only good for arithmetic expressions
• on steroids, as in JavaCC, usable for real languages
Summary
PEG
• trivial
• fewer grammar changes
• no ambiguities
• usable for real languages
• nice tools such as SSLR
• dead-end for C/C++
Summary
GLL
• any grammar
• relatively simple
• ambiguities
• reasonable performance
• the only clean choice for C/C++
• only «academic» tools for now... ;)
Summary
Hand-written
● based on LL(1)
● precise error-reporting and recovery
● best performance
● maintenance hell
Q & A