G52CMP: Lecture 6
Defining Programming Languages II
Henrik Nilsson
University of Nottingham, UK
G52CMP: Lecture 6 – p.1/30
MiniTriangle
In part II of the coursework, we are going to use a language called MiniTriangle:
• Originates from Watt & Brown (defined at pp. 6–20).
• Our version has evolved and is now quite different in some respects.
• We use MiniTriangle in this lecture to:
  - Illustrate the ideas of concrete and abstract syntax
  - Introduce you to the language
This Lecture
• Concrete Syntax
  - Lexical syntax for MiniTriangle
  - Context-free syntax for MiniTriangle
• Abstract Syntax
  - Abstract syntax for MiniTriangle
• Representing Abstract Syntax Trees (ASTs)
A MiniTriangle Program
This is an example of a valid MiniTriangle program:

let
    var y: Integer := 0
in
    begin
        y := y + 1;
        putint(y)
    end
Concrete Syntax
The Concrete Syntax, or surface syntax, of a language is usually defined at two levels:
• The Lexical syntax: the syntax of
  - language symbols or tokens
  - white space
  - comments
• The Context-Free syntax.
Regular Grammars
• Lexical syntax is usually defined as a Regular Language (RL).
• A regular language can be described by
  - a Regular Expression
  - a Context-Free Grammar (as the RLs are a proper subset of the CFLs)
• If a grammar G is left-linear or right-linear, then G is a regular grammar and L(G) is a regular language.
• Regular languages are easy to recognise (DFA).
Right-linear Grammar
A CFG G = (N, T, P, S) is right-linear if all its productions are of the forms
    A → wB
    A → w
where A, B ∈ N and w ∈ T*.
Example: The regular language 0(10)* is generated by the right-linear grammar
    S → 0A
    A → 10A | ε
Left-linear Grammar
A CFG G = (N, T, P, S) is left-linear if all its productions are of the forms
    A → Bw
    A → w
where A, B ∈ N and w ∈ T*.
Example: The regular language 0(10)* is generated by the left-linear grammar
    S → S10 | 0
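The remark that regular languages are easy to recognise can be illustrated with a hand-written DFA for the example language 0(10)*. This is only a minimal sketch: the state names and the functions step and accepts are invented for this illustration and do not come from the lecture.

```haskell
-- States of a DFA for the language 0(10)*.
-- All names here are illustrative assumptions, not part of the lecture.
data State
  = Start    -- nothing read yet
  | Accept   -- read a string of the form 0(10)* so far; accepting
  | SeenOne  -- read 0(10)*1; a 0 must follow
  | Dead     -- no continuation can be accepted
  deriving (Eq, Show)

-- Transition function: one step per input character.
step :: State -> Char -> State
step Start   '0' = Accept
step Accept  '1' = SeenOne
step SeenOne '0' = Accept
step _       _   = Dead

-- Run the DFA from the start state and check that it ends in Accept.
accepts :: String -> Bool
accepts input = foldl step Start input == Accept
```

For example, accepts "010" is True while accepts "01" is False. The recogniser needs one table lookup per character and no stack, which is what makes scanning cheap.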
MiniTriangle Lexical Syntax (1)
Program        → (Token | Separator)*
Token          → Keyword | Identifier | IntegerLiteral | Operator
               | , | ; | : | := | = | ( | ) | eot
Keyword        → begin | const | do | else | end | if | in
               | let | then | var | while
Identifier     → Letter | Identifier Letter | Identifier Digit
                 except Keyword
IntegerLiteral → Digit | IntegerLiteral Digit
Operator       → + | - | * | / | < | <= | == | != | >= | > | && | || | !
Separator      → Comment | space | eol
Comment        → // (any character except eol)* eol
MiniTriangle Lexical Syntax (2)
Notes:
• Essentially a left-linear grammar.
• Not completely formal (e.g. the use of “except” for excluding keywords from identifiers).
• Note! Each individual character of a terminal is actually a terminal symbol! I.e., really:
  Keyword → b e g i n | c o n s t | ...
• Special characters (e.g. space, eol, eot) are written by name. Note! They are single terminal symbols!
MiniTriangle: Tokens
Some valid MiniTriangle tokens:
• const3 (Identifier)
• const (Keyword)
• 42 (Integer-Literal)
• + (Operator)
Q: Is const3 really a single token? The grammar is ambiguous!
A: An implicit “maximal munch rule” used to disambiguate!
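The maximal munch rule can be sketched as a scanner that always consumes the longest possible lexeme before classifying it. The code below is a cut-down illustration, not the coursework scanner: it handles only keywords, identifiers and integer literals, assumes that Letter means an ASCII letter, and reports every other non-space character as Unrecognised. The type and function names are invented for the example.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper, isDigit, isSpace)

-- A cut-down token type (an assumption made for this sketch).
data Token
  = Keyword String
  | Identifier String
  | IntegerLiteral Integer
  | Unrecognised Char  -- anything this sketch does not handle
  deriving (Eq, Show)

keywords :: [String]
keywords =
  [ "begin", "const", "do", "else", "end", "if", "in"
  , "let", "then", "var", "while" ]

-- Assumption: Letter means an ASCII letter.
isLetter' :: Char -> Bool
isLetter' c = isAsciiLower c || isAsciiUpper c

-- Scan a string into tokens, always taking the longest possible lexeme.
scan :: String -> [Token]
scan [] = []
scan s@(c : cs)
  | isSpace c = scan cs
  | isLetter' c =
      -- Maximal munch: take every following letter or digit first,
      -- and only then decide between Keyword and Identifier.
      let (lexeme, rest) = span (\x -> isLetter' x || isDigit x) s
      in classify lexeme : scan rest
  | isDigit c =
      let (digits, rest) = span isDigit s
      in IntegerLiteral (read digits) : scan rest
  | otherwise = Unrecognised c : scan cs
  where
    -- "Identifier ... except Keyword": keywords win over identifiers.
    classify lexeme
      | lexeme `elem` keywords = Keyword lexeme
      | otherwise              = Identifier lexeme
```

With this sketch, scan "const3" yields a single Identifier token, because the whole lexeme is consumed before it is compared against the keyword list, whereas scan "const" yields a Keyword.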
MiniTriangle: Non Tokens
Some non tokens:
• 123abc (two tokens: Integer-Literal 123 and Identifier abc)
• put_x (Identifier put, illegal character “_”, Identifier x)
• 3.14 (Integer-Literal 3, illegal character “.”, Integer-Literal 14)
• 3e8 (two tokens: Integer-Literal 3 and Identifier e8)
MiniTriangle Context-Free Syntax (1)
Program  → Command
Commands → Command
         | Command ; Commands
Command  → VarExpression := Expression
         | VarExpression ( Expressions )
         | if Expression then Command else Command
         | while Expression do Command
         | let Declarations in Command
         | begin Commands end
MiniTriangle Context-Free Syntax (2)
Expressions       → Expression
                  | Expression , Expressions
Expression        → PrimaryExpression
                  | Expression Operator PrimaryExpression
PrimaryExpression → IntegerLiteral
                  | VarExpression
                  | Operator PrimaryExpression
                  | ( Expression )
VarExpression     → Identifier
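The Expression production is left-recursive, so according to this grammar all binary operators have the same precedence and associate to the left. A hand-written parser can follow the productions by replacing the left recursion with iteration, as in the sketch below. The token type and function names are assumptions made for the illustration. The constructor names ExpLitInt, ExpVar and ExpApp are borrowed from the abstract syntax used in this lecture, and representing a unary operator as an application with one argument is likewise an assumption.

```haskell
type Name = String

-- Tokens relevant to expressions (an assumed, cut-down token type).
data Token
  = TInt Integer
  | TId Name
  | TOp Name
  | TLPar
  | TRPar
  deriving (Eq, Show)

data Expression
  = ExpLitInt Integer
  | ExpVar Name
  | ExpApp Expression [Expression]
  deriving (Eq, Show)

-- Expression -> PrimaryExpression
--             | Expression Operator PrimaryExpression
-- Parse one primary expression, then fold in any number of
-- "Operator PrimaryExpression" pairs, building a left-leaning tree.
parseExpression :: [Token] -> Maybe (Expression, [Token])
parseExpression ts = do
  (e, rest) <- parsePrimary ts
  loop e rest
  where
    loop lhs (TOp op : more) = do
      (rhs, more') <- parsePrimary more
      loop (ExpApp (ExpVar op) [lhs, rhs]) more'
    loop lhs more = Just (lhs, more)

-- PrimaryExpression -> IntegerLiteral | VarExpression
--                    | Operator PrimaryExpression | ( Expression )
parsePrimary :: [Token] -> Maybe (Expression, [Token])
parsePrimary (TInt n : rest) = Just (ExpLitInt n, rest)
parsePrimary (TId x : rest)  = Just (ExpVar x, rest)
parsePrimary (TOp op : rest) = do
  (e, rest') <- parsePrimary rest
  Just (ExpApp (ExpVar op) [e], rest')
parsePrimary (TLPar : rest) = do
  (e, rest') <- parseExpression rest
  case rest' of
    TRPar : rest'' -> Just (e, rest'')
    _              -> Nothing
parsePrimary _ = Nothing
```

Parsing the tokens for a - b - c with this sketch groups them as (a - b) - c, mirroring the shape of the parse tree that the left-recursive production produces.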
MiniTriangle Context-Free Syntax (3)
Declarations → Declaration
             | Declaration ; Declarations
Declaration  → const Identifier : TypeDenoter = Expression
             | var Identifier : TypeDenoter
             | var Identifier : TypeDenoter := Expression
TypeDenoter  → Identifier
Another MiniTriangle Program
The following is a syntactically valid MiniTriangle program (slightly changed from earlier to save some space):

let
    var y: Integer
in
    begin
        y := y + 1;
        putint(y)
    end
Parse Tree for the Program
[Figure: parse tree of the program above according to the context-free grammar, rooted at Program.]
Exercise 1
Draw the parse tree for the following MiniTriangle program:

while b do
    n := 0
Why a Lexical Grammar? (1)
Together, the lexical grammar and the context-free grammar specify the concrete syntax.
In our case, both grammars are expressed in (E)BNF and look similar.
So ...
• Why not join them?
• Why not do away with scanning, and just do parsing?
Why a Lexical Grammar? (2)
Answer:
• Simplicity: dealing with white space and comments in the context-free grammar becomes extremely complicated. (Try it!)
• Efficiency:
  - Working on classified groups of characters (tokens) facilitates parsing: may be possible to use a simpler parsing algorithm.
  - Grouping and classifying characters by as simple means as possible increases efficiency.
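To see why handling separators in the scanner is simpler, note that a scanner can discard white space and comments in one place, between tokens, so the context-free grammar never has to mention them. The function below is a minimal sketch of that step, following the Separator and Comment productions of the lexical grammar; the name skipSeparators is an assumption made for this example.

```haskell
import Data.Char (isSpace)

-- Drop leading separators: white space and "//" comments, which run
-- up to the end of the line (or the end of the input).
skipSeparators :: String -> String
skipSeparators ('/' : '/' : rest) =
  -- Skip the comment body; the eol itself is removed by the next
  -- equation, since it counts as white space.
  skipSeparators (dropWhile (/= '\n') rest)
skipSeparators (c : rest)
  | isSpace c = skipSeparators rest
skipSeparators s = s
```

A scanner would call such a function before recognising each token. Expressing the same thing in the context-free grammar would require allowing an optional separator sequence between every pair of adjacent symbols in every production.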
MiniTriangle Abstract Syntax (1)
This grammar specifies the phrase structure of MiniTriangle. In addition, it gives node labels to be used when drawing Abstract Syntax Trees.

Program → Command                                    Program
Command → Expression := Expression                   CmdAssign
        | Expression ( Expression* )                 CmdCall
        | Command*                                   CmdSeq
        | if Expression then Command else Command    CmdIf
        | while Expression do Command                CmdWhile
        | let Declaration* in Command                CmdLet
MiniTriangle Abstract Syntax (2)
Expression  → IntegerLiteral                                 ExpLitInt
            | Name                                           ExpVar
            | Expression ( Expression* )                     ExpApp
Declaration → const Name : TypeDenoter = Expression          DeclConst
            | var Name : TypeDenoter (:= Expression | ε)     DeclVar
TypeDenoter → Name                                           TDBaseType

Note: Keywords and other fixed-spelling terminals serve only to make the connection with the concrete syntax clear.
Identifier ⊆ Name, Operator ⊆ Name
Abstract Syntax Tree for the Program
[Figure: abstract syntax tree of the program above, rooted at Program, with nodes labelled CmdLet, DeclVar, TDBaseType, CmdSeq, CmdAssign, CmdCall, ExpApp, ExpVar and ExpLitInt.]
Note: fixed-spelling terminals are omitted because they are implied by the node labels.
Exercise 2
Draw the Abstract Syntax Tree for the following MiniTriangle program:

while b do
    n := 0
Concrete AST Representation
Mapping of abstract syntax to algebraic datatypes:
• Each non-terminal is mapped to a type.
• Each label is mapped to a constructor for the corresponding type.
• The constructors get one argument for each non-terminal and “variable” terminal in the RHS of the production.
• Sequences are represented by lists.
• Options are represented by values of type Maybe.
• “Literal” terminals are ignored.
Concrete AST Representation (2)
data Command
    = CmdAssign Expression Expression
    | CmdCall   Expression [Expression]
    | CmdSeq    [Command]
    | CmdIf     Expression Command Command
    | CmdWhile  Expression Command
    | CmdLet    [Declaration] Command
Concrete AST Representation (3)
data Expression
    = ExpLitInt Integer
    | ExpVar    Name
    | ExpApp    Expression [Expression]

data Declaration
    = DeclConst Name TypeDenoter Expression
    | DeclVar   Name TypeDenoter (Maybe Expression)
Concrete AST Representation (4)
In fact, the lab code uses labelled fields:
data Command
    = CmdAssign {
          caVar     :: Expression,
          caVal     :: Expression,
          cmdSrcPos :: SrcPos
      }
    | CmdCall {
          ccProc    :: Expression,
          ccArgs    :: [Expression],
          cmdSrcPos :: SrcPos
      }
    ...
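Labelled fields give each constructor argument a name that doubles as a selector function, and a field shared by several constructors, such as cmdSrcPos, can be read without pattern matching on the constructor. The sketch below illustrates this with a cut-down version of the types. The definition of SrcPos and the functions describe, example and moved are hypothetical, introduced only for this example; the lab code's actual SrcPos may look different.

```haskell
type Name = String

-- Hypothetical source position type; the lab code's SrcPos may differ.
data SrcPos = SrcPos { spLine :: Int, spCol :: Int }
  deriving (Eq, Show)

-- Cut-down expression type, enough for the example.
data Expression
  = ExpLitInt Integer
  | ExpVar Name
  deriving (Eq, Show)

data Command
  = CmdAssign
      { caVar     :: Expression
      , caVal     :: Expression
      , cmdSrcPos :: SrcPos
      }
  | CmdCall
      { ccProc    :: Expression
      , ccArgs    :: [Expression]
      , cmdSrcPos :: SrcPos
      }
  deriving (Eq, Show)

-- cmdSrcPos is declared in both constructors, so it works on any Command.
describe :: Command -> String
describe cmd = "command at line " ++ show (spLine (cmdSrcPos cmd))

-- Construction with named fields.
example :: Command
example =
  CmdAssign { caVar = ExpVar "y", caVal = ExpLitInt 0, cmdSrcPos = SrcPos 3 5 }

-- Record update syntax changes one field and keeps the others.
moved :: Command
moved = example { cmdSrcPos = SrcPos 10 1 }
```

One caveat: selectors that exist for only some constructors are partial, so applying caVar to a CmdCall value fails at run time.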
Haskell Representation of the Program
CmdLet
    [DeclVar "y" (TDBaseType "Integer") Nothing]
    (CmdSeq [CmdAssign (ExpVar "y")
                       (ExpApp (ExpVar "+")
                               [ExpVar "y",
                                ExpLitInt 1]),
             CmdCall (ExpVar "putint")
                     [ExpVar "y"]])

Assumption:
type Name = String
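To check that such a term type-checks against the datatypes, the pieces can be collected into one module, as sketched below. Two points are assumptions rather than facts from the slides. First, the TypeDenoter type is not spelled out in the lecture, so it is defined here with a single constructor TDBaseType, following the label in the abstract syntax. Second, since CmdLet takes a list of declarations in the datatype definition, the single declaration is wrapped in a list. The traversal functions cmdVars, expVars and declVars are invented to show how an AST is typically processed by structural recursion.

```haskell
type Name = String

data Command
  = CmdAssign Expression Expression
  | CmdCall Expression [Expression]
  | CmdSeq [Command]
  | CmdIf Expression Command Command
  | CmdWhile Expression Command
  | CmdLet [Declaration] Command
  deriving (Eq, Show)

data Expression
  = ExpLitInt Integer
  | ExpVar Name
  | ExpApp Expression [Expression]
  deriving (Eq, Show)

data Declaration
  = DeclConst Name TypeDenoter Expression
  | DeclVar Name TypeDenoter (Maybe Expression)
  deriving (Eq, Show)

-- Assumed definition, following the label TDBaseType in the abstract syntax.
data TypeDenoter = TDBaseType Name
  deriving (Eq, Show)

-- The example program: let var y: Integer in begin y := y + 1; putint(y) end
program :: Command
program =
  CmdLet
    [DeclVar "y" (TDBaseType "Integer") Nothing]
    (CmdSeq
      [ CmdAssign (ExpVar "y") (ExpApp (ExpVar "+") [ExpVar "y", ExpLitInt 1])
      , CmdCall (ExpVar "putint") [ExpVar "y"]
      ])

-- Collect the names in all ExpVar nodes of a command, left to right.
cmdVars :: Command -> [Name]
cmdVars (CmdAssign v e)  = expVars v ++ expVars e
cmdVars (CmdCall p args) = expVars p ++ concatMap expVars args
cmdVars (CmdSeq cs)      = concatMap cmdVars cs
cmdVars (CmdIf c t e)    = expVars c ++ cmdVars t ++ cmdVars e
cmdVars (CmdWhile c b)   = expVars c ++ cmdVars b
cmdVars (CmdLet ds c)    = concatMap declVars ds ++ cmdVars c

expVars :: Expression -> [Name]
expVars (ExpLitInt _)   = []
expVars (ExpVar x)      = [x]
expVars (ExpApp f args) = expVars f ++ concatMap expVars args

-- Only the initialiser expressions of declarations are inspected.
declVars :: Declaration -> [Name]
declVars (DeclConst _ _ e) = expVars e
declVars (DeclVar _ _ me)  = maybe [] expVars me
```

Because operators are represented as names applied via ExpApp, the traversal reports "+" and "putint" alongside "y"; a later compiler phase would have to distinguish them by looking the names up in an environment.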
Exercise 3
Provide the Haskell representation of the following MiniTriangle fragment:

while b do
    n := 0