Lexing, Parsing
Laure Gonnord & Ludovic Henrio & Matthieu Moyhttps:
//compil-lyon.gitlabpages.inria.fr/compil-lyon/
Master 1, ENS de Lyon
2019-2020
Analysis Phase
source code↓
lexical analysis↓
sequence of “lexems” (tokens)↓
syntactic analysis (Parsing)↓
abstract syntax tree (AST )↓
semantic analysis↓
abstract syntax (+ symbol table)
Goal of this chapter
I Understand the syntaxic structure of a language;
I Separate the different steps of syntax analysis;
I Be able to write a syntax analysis tool for a simplelanguage;
I Remember: syntax6=semantics.
Syntax analysis steps
How do you read text ?
I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens:
I Words: groups of letters;I Punctuation;I Spaces.
I Group tokens into:I Propositions;I Sentences.
I Then proceed with word meanings:I Definition of each word.
ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...
Syntax analysis=Lexical analysis+Parsing
Syntax analysis steps
How do you read text ?
I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens: Lexical analysis
I Words: groups of letters;I Punctuation;I Spaces.
I Group tokens into:I Propositions;I Sentences.
I Then proceed with word meanings:I Definition of each word.
ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...
Syntax analysis=Lexical analysis+Parsing
Syntax analysis steps
How do you read text ?
I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens: Lexical analysis
I Words: groups of letters;I Punctuation;I Spaces.
I Group tokens into:I Propositions;I Sentences.
I Then proceed with word meanings:I Definition of each word.
ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...
Syntax analysis=Lexical analysis+Parsing
Syntax analysis steps
How do you read text ?
I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens: Lexical analysis
I Words: groups of letters;I Punctuation;I Spaces.
I Group tokens into: ParsingI Propositions;I Sentences.
I Then proceed with word meanings:I Definition of each word.
ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...
Syntax analysis=Lexical analysis+Parsing
Syntax analysis steps
How do you read text ?
I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens: Lexical analysis
I Words: groups of letters;I Punctuation;I Spaces.
I Group tokens into: ParsingI Propositions;I Sentences.
I Then proceed with word meanings:I Definition of each word.
ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...
Syntax analysis=Lexical analysis+Parsing
Syntax analysis steps
How do you read text ?
I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens: Lexical analysis
I Words: groups of letters;I Punctuation;I Spaces.
I Group tokens into: ParsingI Propositions;I Sentences.
I Then proceed with word meanings: SemanticsI Definition of each word.
ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...
Syntax analysis=Lexical analysis+Parsing
Syntax analysis steps
How do you read text ?
I Text=a sequence of symbols (letters, spaces, punctuation);I Group symbols into tokens: Lexical analysis
I Words: groups of letters;I Punctuation;I Spaces.
I Group tokens into: ParsingI Propositions;I Sentences.
I Then proceed with word meanings: SemanticsI Definition of each word.
ex: a dog is a hairy mammal, that barks and...I Role in the phrase: verb, subject, ...
Syntax analysis=Lexical analysis+Parsing
What for ?
int y = 12 + 4*x ;
=⇒ [TINT, VAR("y"), EQ, INT(12), PLUS, INT(4), TIMES,VAR("x"), SCOL]
I Group characters into a list of tokens, e.g.:
I The word “int” stands for type integer;
I A sequence of letters stands for a variable;
I A sequence of digits stands for an integer;
I ...
Principle
I Take a lexical description: E = ( E1︸︷︷︸Tokens class
| . . . |En)∗
I Construct an automaton.
Example - lexical description (“lex file”)E = ((0|1)+|(0|1)+.(0|1)+|′+′)∗
What’s behind
Regular languages, regular automata:
I Thompson construction I non-det automaton
I Determinization, completion
I Minimisation
Construction/use of automaton
Let E = ((0|1)+︸ ︷︷ ︸#1
| (0|1)+.(0|1)+︸ ︷︷ ︸#2
| ′+′︸︷︷︸#3
)∗ and
Σ = {0, 1,+, .,#1,#2,#3}.We define E# = ((0|1)+#1|(0|1)+.(0.1)+#2|′ +′ #3).
1start 2 3 4
5 sink
0,1 . 0,1
+
0,10,1
#1
#2
#3
Source C. Alias drawn by former M1 students
Remarks
I Notion of ambiguity.
I Compaction of transition table.
Tools: lexical analyzer constructors
I Lexical analyzer constructor: builds an automaton from aregular language definition;
I Ex: Lex (C), JFlex (Java), OCamllex, ANTLR (multi), ...
I input: a set of regular expressions with actions (Toto.g4);
I output: a file(s) (Toto.java) that contains thecorresponding automaton.
Analyzing text with the compiled lexer
I The input of the lexer is a text file;I Execution:
I Checks that the input is accepted by the compiledautomaton;
I Executes some actions during the “automaton traversal”.
Lexing tool for Java: ANTLR
I The official webpage : www.antlr.org (BSD license);
I ANTLR is both a lexer and a parser;
I ANTLR is multi-language (not only Java).
ANTLR lexer format and compilation
.g4
grammar XX;
@header {
// Some init code ...
}
@members {
// Some global variables
}
// More optional blocks are available
--->> lex rules
Compilation:
antlr4 Toto.g4 // produces several Java files
javac *.java // compiles into xx.class files
grun Toto r // Run analyzer with starting rule r
Lexing with ANTLR: example
Lexing rules:
I Must start with an upper-case letter;
I Follow the usual extended regular-expressions syntax(same as egrep, sed, ...).
A simple example
lexer grammar Tokens;
HELLO : 'hello' ; // beware the single quotes
ID : [a-z]+ ; // match lower -case identifiers
INT : [0-9]+ ;
KEYWORD : 'begin' | 'end' | 'for' ; // perhaps this should
be elsewhere
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Lexing - We can count!
Counting in ANTLR - CountLines2.g4
lexer grammar CountLines2;
// Members can be accessed in any rule
@members {int nbLines =0;}
NEWLINE : [\r\n] {
nbLines ++;
System.out.println("Current lines:"+nbLines);} ;
WS : [ \t]+ -> skip ;
What’s Parsing ?
Relate tokens by structuring them.
Flat tokens[TINT, VAR("y"), EQ, INT(12), PLUS, INT(4), TIMES, VAR("x"),SCOL]
⇒ Parsing⇒
Accept→ Structured tokens=
yint +
12
4 x
*
int
Analysis Phase
source code↓
lexical analysis↓
sequence of “lexems” (tokens)↓
syntactic analysis (Parsing)↓
abstract syntax tree (AST )↓
semantic analysis↓
abstract syntax (+ symbol table)
In this course
Only write acceptors “OK” or “Syntax Error”.
What’s behind ?
From a Context-free Grammar, produce a Stack Automaton(already seen in L3 course?)
Recalling grammar definitions
GrammarA grammar is composed of :
I A finite set N of non terminal symbols
I A finite set Σ of terminal symbols (disjoint from N )
I A finite set of production rules, each rule of the formw → w′ where w is a word on Σ ∪N with at least one letterof N . w′ is a word on Σ ∪N .
I A start symbol S ∈ N .
Example
Example:S → aSb
S → ε
is a grammar with N = .... and ....
Associated Language
DerivationG a grammar defines the relation :
x⇒G y iff ∃u, v, p, qx = upv and y = uqv and (p→ q) ∈ P
I A grammar describes a language (the set of words on Σ thatcan be derived from the start symbol).
Example - associated language
S → aSb
S → ε
The grammar defines the language {anbn, n ∈ N}
S → aBSc
S → abc
Ba→ aB
Bb→ bb
The grammar defines the language {anbncn, n ∈ N}
Context-free grammars
Context-free grammarA CF-grammar is a grammar where all production rules are ofthe form N → (Σ ∪N)∗.
Example:S → S + S|S ∗ S|a
The grammar defines a language of arithmetical expressions.
I Notion of derivation tree.Draw a derivation tree of a*a+a, of S+S !
Recognizing grammars
Grammar type Rules DecidableRegular grammar X → aY,X → b YESContext-free grammar X → u YESContext-sensitive grammar uXv → uwv YESGeneral grammar u→ v NO
Parser construction
There exists algorithms to recognize class of grammars:
I Predictive (descending) analysis (LL)
I Ascending analysis (LR)
I See the Dragon book.
Tools: parser generators
I Parser generator: builds a stack automaton from agrammar definition;
I Ex: yacc(C), javacup (Java), OCamlyacc, ANTLR, ...
I input : a set of grammar rules with actions (Toto.g4);
I output : a file(s) (Toto.java) that contains thecorresponding stack automaton.
Lexing vs Parsing
I Lexing supports (' regular) languages;
I We want more (general) languages⇒ rely on context-freegrammars;
I To that intent, we need a way:I To declare terminal symbols (tokens);I To write grammars.
I Use both Lexing rules and Parsing rules.
From a grammar to a parser
The grammar must be context-free:
S-> aSb
S-> eps
I The grammar rules are specified as Parsing rules;
I a and b are terminal tokens, produced by Lexing rules.
Parsing with ANTLR: example 1
Tokens.g4
lexer grammar Tokens;
HELLO : 'hello' ; // beware the single quotes
ID : [a-z]+ ; // match lower -case identifiers
INT : [0-9]+ ;
KEYWORD : 'begin' | 'end' | 'for' ; // perhaps this should
be elsewhere
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Parsing with ANTLR: example 2
AnBnLexer.g4
l e xe r grammar AnBnLexer;
/ / Lexing r u l e s : recognize tokensA: 'a' ;B: 'b' ;
WS : [ \ t \ r \ n ]+ −> sk ip ; / / sk ip spaces, t abs , newl ines
Parsing with ANTLR: example 2 (cont’)
AnBnParser.g4
parser grammar AnBnParser;options {tokenVocab=AnBnLexer;} / / ex tern tokens d e f i n i t i o n
/ / Parsing r u l e s : s t r u c t u r e tokens toge therprog : s EOF ; / / EOF: predef ined end−of− f i l e tokens : A s B
| ; / / no th ing f o r empty a l t e r n a t i v e
ANTLR expressivity
LL(*)At parse-time, decisions gracefully throttle up from con-ventional fixed k > 1 lookahead to arbitrary lookahead.
Further reading (PLDI’11 paper, T. Parr, K. Fisher)
http://www.antlr.org/papers/LL-star-PLDI11.pdf
Left recursion
ANTLR permits left recursion:
a: a b;
But not indirect left recursion. X1 → . . .→ Xn
There exist algorithms to eliminate indirect recursions.
Lists
ANTLR permits lists:
prog: statement+ ;
Read the documentation!
https:
//github.com/antlr/antlr4/blob/master/doc/index.md
So Far ...
ANTLR has been used to:
I Produce acceptors for context-free languages;
I Do a bit of computation on-the-fly.
⇒ In a classic compiler, parsing produces an Abstract SyntaxTree.I Next course!
Top Related