Grounded Language Learning Models for Ambiguous Supervision
Joohyun Kim (Supervising Professor: Raymond J. Mooney)
PhD Thesis Defense Talk, August 23, 2013
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
Language Grounding
• The process of acquiring the semantics of natural language with respect to relevant perceptual contexts
• Human children ground language in perceptual contexts through repeated exposure, in a statistical way (Saffran et al. 1999; Saffran 2003)
• Ideally, we want a computational system that learns in a similar way
Language Grounding: Machine
The machine maps a perceived scene and an utterance to a meaning:
"Iran's goalkeeper blocks the ball" → Block(IranGoalKeeper)
(Computer Vision / Language Learning)
Natural Language and Meaning Representation
• Natural Language (NL): a language that arises naturally from the innate nature of human intellect, such as English, German, French, or Korean
• Meaning Representation Language (MRL): a formal language that machines can understand, such as logic or any computer-executable code
Example: "Iran's goalkeeper blocks the ball" → Block(IranGoalKeeper)
Semantic Parsing and Surface Realization
• Semantic Parsing (NL → MRL): maps a natural-language sentence to a complete, detailed semantic representation → the machine understands natural language
• Surface Realization (MRL → NL): generates a natural-language sentence from a meaning representation → the machine communicates in natural language
Example: "Iran's goalkeeper blocks the ball" ↔ Block(IranGoalKeeper)
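As a toy illustration of the NL → MRL direction, a semantic parser can be sketched as a function from sentences to MRL strings. The lexicon and rule below are hypothetical and purely illustrative, not the learned parser discussed in this talk:

```python
# Toy sketch of semantic parsing (NL -> MRL). The lexicon is hypothetical;
# a real system learns these mappings rather than hard-coding them.

TEAMS = {"iran's": "Iran"}            # possessive team mention -> entity prefix
ROLES = {"goalkeeper": "GoalKeeper"}  # role noun -> entity suffix
VERBS = {"blocks": "Block"}           # verb -> MRL predicate

def parse(sentence: str) -> str:
    """Map a simple sentence to an MRL expression like Block(IranGoalKeeper)."""
    tokens = sentence.lower().split()
    team = next((TEAMS[t] for t in tokens if t in TEAMS), "")
    role = next((ROLES[t] for t in tokens if t in ROLES), "")
    pred = next(VERBS[t] for t in tokens if t in VERBS)
    return f"{pred}({team}{role})"

print(parse("Iran's goalkeeper blocks the ball"))  # Block(IranGoalKeeper)
```

A learned parser would induce such lexical mappings from data; the point here is only the shape of the NL → MRL function.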
Conventional Language Learning Systems
• Require manually annotated training corpora (NL/MRL pairs)
• Annotation is time-consuming, hard to acquire, and not scalable
Pipeline: Manually Annotated Training Corpora (NL/MRL pairs) → Semantic Parser Learner → Semantic Parser (NL → MRL)
Learning from the Perceptual Environment
• Motivated by how children learn language from linguistic input in a rich, ambiguous perceptual environment
• Advantages:
  – Naturally obtainable corpora
  – Relatively easy to annotate
  – Mirrors the natural process of human language learning
Navigation Example (slides from David Chen)
Alice instructs Bob in Korean, which Bob does not understand:
• Scenario 1: "식당에서 우회전 하세요" ("Turn right at the restaurant")
• Scenario 2: "병원에서 우회전 하세요" ("Turn right at the hospital")
In both scenarios Bob observes a right turn being made, so the shared phrase can be grounded to the common meaning: "Make a right turn." The words that differ across scenarios are then grounded to what differs in the environments: 식당 (restaurant) in Scenario 1 and 병원 (hospital) in Scenario 2.
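The cross-situational intuition behind this example can be sketched with simple co-occurrence counting. The meaning symbols below (Turn(RIGHT), At(RESTAURANT), At(HOSPITAL)) are hypothetical stand-ins for whatever the perceptual context provides:

```python
from collections import defaultdict

# Sketch of cross-situational learning: each example pairs an utterance with
# the set of candidate meanings observed in its context. A word is grounded
# to the meaning it co-occurs with most consistently.

data = [
    ("식당에서 우회전 하세요", {"Turn(RIGHT)", "At(RESTAURANT)"}),  # Scenario 1
    ("병원에서 우회전 하세요", {"Turn(RIGHT)", "At(HOSPITAL)"}),    # Scenario 2
]

cooc = defaultdict(lambda: defaultdict(int))
seen = defaultdict(int)
for utterance, meanings in data:
    for word in utterance.split():
        seen[word] += 1
        for meaning in meanings:
            cooc[word][meaning] += 1

def best_meaning(word):
    # Estimate P(meaning | word) by relative co-occurrence frequency.
    return max(cooc[word], key=lambda m: cooc[word][m] / seen[word])

print(best_meaning("우회전"))  # Turn(RIGHT): co-occurs in both scenarios
```

With only two scenarios the location words remain ambiguous (each co-occurs with two meanings equally often); more observations sharpen the estimates, which is the statistical-exposure point made earlier.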
Thesis Contributions
• Generative models for grounded language learning from ambiguous perceptual environments
  – A unified probabilistic model incorporating linguistic cues and MR structures (vs. previous approaches)
  – A general framework of probabilistic approaches that learn NL–MR correspondences from ambiguous supervision
• Adapting discriminative reranking to grounded language learning
  – Standard reranking is not applicable: there is no single gold-standard reference for the training data
  – A weak response from the perceptual environment can train a discriminative reranker
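Abstractly, training a reranker from weak environment feedback can be sketched as a perceptron-style update driven by a binary execution signal. The feature dictionaries and update rule here are a hypothetical illustration, not the model of Kim and Mooney (2013):

```python
# Sketch of discriminative reranking with weak supervision: candidates are
# scored by a linear model; when the chosen candidate fails to execute
# correctly in the environment, weight moves toward a candidate that succeeds.

def score(weights, feats):
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def rerank_update(weights, candidates, executes_correctly, lr=1.0):
    """candidates: feature dicts for the base parser's n-best list."""
    chosen = max(candidates, key=lambda f: score(weights, f))
    if executes_correctly(chosen):
        return  # the weak signal says we were right; no update
    good = next((c for c in candidates if executes_correctly(c)), None)
    if good is None:
        return  # no candidate executes correctly; skip this example
    for f, v in good.items():      # promote the successful candidate
        weights[f] = weights.get(f, 0.0) + lr * v
    for f, v in chosen.items():    # demote the failed choice
        weights[f] = weights.get(f, 0.0) - lr * v
```

After one update on an example where the top candidate fails, the successful candidate outranks it, using only the binary success signal rather than a gold-standard parse.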
Navigation Task (Chen and Mooney 2011)
• Learn to interpret and follow navigation instructions
  – e.g., "Go down this hall and make a right when you see an elevator to your left."
• Uses virtual worlds and instructor/follower data from MacMahon et al. (2006)
• No prior linguistic knowledge
• Infers language semantics by observing how humans follow instructions
Sample Environment (MacMahon et al. 2006)
[Map figure] Legend: H – Hat Rack, L – Lamp, E – Easel, S – Sofa, B – Barstool, C – Chair
Executing Test Instruction
bull Future Directionsbull Conclusion
25
bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see
an elevator to your left bull Use virtual worlds and instructorfollower data
from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how
humans follow instructions
Navigation Task (Chen and Mooney 2011)
26
H
C
L
S S
B C
H
E
L
E
H ndash Hat Rack
L ndash Lamp
E ndash Easel
S ndash Sofa
B ndash Barstool
C - Chair
Sample Environment (MacMahon et al 2006)
27
Executing Test Instruction
28
Natural Language and Meaning Representation
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
8
Natural Language and Meaning Representation
Natural Language (NL)
NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc
9
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Natural Language and Meaning Representation
NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc
MRL Formal languages that machine can understand such as logic or any computer-executable code
Meaning Representation Language (MRL)Natural Language (NL)
10
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Semantic Parsing and Surface Realization
NL
Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language
MRL
Semantic Parsing (NL MRL)
11
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Semantic Parsing and Surface Realization
NL
Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language
Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language
MRL
Semantic Parsing (NL MRL)
Surface Realization (NL MRL)
12
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable
Manually Annotated Training Corpora(NLMRL pairs)
Semantic Parser
MRLNL
Semantic Parser Learner
13
Learning from Perceptual Environment
bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input
bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language
learning
14
Navigation Example
식당에서 우회전 하세요Alice
Bob15Slide from David Chen
Navigation Example
Alice
Bob
병원에서 우회전 하세요
16Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
17Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
18Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
Make a right turn
19Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
20Slide from David Chen
Navigation Example
Scenario 1
Scenario 2
식당
21Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
22Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원
23Slide from David Chen
Thesis Contributionsbull Generative models for grounded language learning from
ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR
structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR
correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language
learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train
discriminative reranker
24
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
25
bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see
an elevator to your left bull Use virtual worlds and instructorfollower data
from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how
humans follow instructions
Navigation Task (Chen and Mooney 2011)
26
H
C
L
S S
B C
H
E
L
E
H ndash Hat Rack
L ndash Lamp
E ndash Easel
S ndash Sofa
B ndash Barstool
C - Chair
Sample Environment (MacMahon et al 2006)
27
Executing Test Instruction
28
Natural Language and Meaning Representation
Natural Language (NL)
NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc
9
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Natural Language and Meaning Representation
NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc
MRL Formal languages that machine can understand such as logic or any computer-executable code
Meaning Representation Language (MRL)Natural Language (NL)
10
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Semantic Parsing and Surface Realization
NL
Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language
MRL
Semantic Parsing (NL MRL)
11
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Semantic Parsing and Surface Realization
NL
Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language
Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language
MRL
Semantic Parsing (NL MRL)
Surface Realization (NL MRL)
12
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable
Manually Annotated Training Corpora(NLMRL pairs)
Semantic Parser
MRLNL
Semantic Parser Learner
13
Learning from Perceptual Environment
bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input
bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language
learning
14
Navigation Example
식당에서 우회전 하세요Alice
Bob15Slide from David Chen
Navigation Example
Alice
Bob
병원에서 우회전 하세요
16Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
17Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
18Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
Make a right turn
19Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
20Slide from David Chen
Navigation Example
Scenario 1
Scenario 2
식당
21Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
22Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원
23Slide from David Chen
Thesis Contributionsbull Generative models for grounded language learning from
ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR
structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR
correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language
learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train
discriminative reranker
24
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
25
bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see
an elevator to your left bull Use virtual worlds and instructorfollower data
from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how
humans follow instructions
Navigation Task (Chen and Mooney 2011)
26
H
C
L
S S
B C
H
E
L
E
H ndash Hat Rack
L ndash Lamp
E ndash Easel
S ndash Sofa
B ndash Barstool
C - Chair
Sample Environment (MacMahon et al 2006)
27
Executing Test Instruction
28
Natural Language and Meaning Representation
NL A language that arises naturally by the innate nature of human intellect such as English German French Korean etc
MRL Formal languages that machine can understand such as logic or any computer-executable code
Meaning Representation Language (MRL)Natural Language (NL)
10
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Semantic Parsing and Surface Realization
NL
Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language
MRL
Semantic Parsing (NL MRL)
11
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Semantic Parsing and Surface Realization
NL
Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language
Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language
MRL
Semantic Parsing (NL MRL)
Surface Realization (NL MRL)
12
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable
Manually Annotated Training Corpora(NLMRL pairs)
Semantic Parser
MRLNL
Semantic Parser Learner
13
Learning from Perceptual Environment
bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input
bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language
learning
14
Navigation Example
식당에서 우회전 하세요Alice
Bob15Slide from David Chen
Navigation Example
Alice
Bob
병원에서 우회전 하세요
16Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
17Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
18Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
Make a right turn
19Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
20Slide from David Chen
Navigation Example
Scenario 1
Scenario 2
식당
21Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
22Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원
23Slide from David Chen
Thesis Contributionsbull Generative models for grounded language learning from
ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR
structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR
correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language
learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train
discriminative reranker
24
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
25
bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see
an elevator to your left bull Use virtual worlds and instructorfollower data
from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how
humans follow instructions
Navigation Task (Chen and Mooney 2011)
26
H
C
L
S S
B C
H
E
L
E
H ndash Hat Rack
L ndash Lamp
E ndash Easel
S ndash Sofa
B ndash Barstool
C - Chair
Sample Environment (MacMahon et al 2006)
27
Executing Test Instruction
28
Semantic Parsing and Surface Realization
NL
Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language
MRL
Semantic Parsing (NL MRL)
11
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Semantic Parsing and Surface Realization
NL
Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language
Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language
MRL
Semantic Parsing (NL MRL)
Surface Realization (NL MRL)
12
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable
Manually Annotated Training Corpora(NLMRL pairs)
Semantic Parser
MRLNL
Semantic Parser Learner
13
Learning from Perceptual Environment
bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input
bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language
learning
14
Navigation Example
식당에서 우회전 하세요Alice
Bob15Slide from David Chen
Navigation Example
Alice
Bob
병원에서 우회전 하세요
16Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
17Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
18Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
Make a right turn
19Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
20Slide from David Chen
Navigation Example
Scenario 1
Scenario 2
식당
21Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
22Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원
23Slide from David Chen
Thesis Contributionsbull Generative models for grounded language learning from
ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR
structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR
correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language
learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train
discriminative reranker
24
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
25
bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see
an elevator to your left bull Use virtual worlds and instructorfollower data
from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how
humans follow instructions
Navigation Task (Chen and Mooney 2011)
26
H
C
L
S S
B C
H
E
L
E
H ndash Hat Rack
L ndash Lamp
E ndash Easel
S ndash Sofa
B ndash Barstool
C - Chair
Sample Environment (MacMahon et al 2006)
27
Executing Test Instruction
28
Semantic Parsing and Surface Realization
NL
Semantic Parsing maps a natural-language sentence to a full detailed semantic representationrarr Machine understands natural language
Surface Realization Generates a natural-language sentence from a meaning representationrarr Machine communicates with natural language
MRL
Semantic Parsing (NL MRL)
Surface Realization (NL MRL)
12
Iranrsquos goalkeeper blocks the ball Block(IranGoalKeeper)
Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable
Manually Annotated Training Corpora(NLMRL pairs)
Semantic Parser
MRLNL
Semantic Parser Learner
13
Learning from Perceptual Environment
bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input
bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language
learning
14
Navigation Example
식당에서 우회전 하세요Alice
Bob15Slide from David Chen
Navigation Example
Alice
Bob
병원에서 우회전 하세요
16Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
17Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
18Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
Make a right turn
19Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
20Slide from David Chen
Navigation Example
Scenario 1
Scenario 2
식당
21Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
22Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원
23Slide from David Chen
Thesis Contributionsbull Generative models for grounded language learning from
ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR
structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR
correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language
learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train
discriminative reranker
24
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
25
bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see
an elevator to your left bull Use virtual worlds and instructorfollower data
from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how
humans follow instructions
Navigation Task (Chen and Mooney 2011)
26
H
C
L
S S
B C
H
E
L
E
H ndash Hat Rack
L ndash Lamp
E ndash Easel
S ndash Sofa
B ndash Barstool
C - Chair
Sample Environment (MacMahon et al 2006)
27
Executing Test Instruction
28
Conventional Language Learning Systemsbull Requires manually annotated corporabull Time-consuming hard to acquire and not scalable
Manually Annotated Training Corpora(NLMRL pairs)
Semantic Parser
MRLNL
Semantic Parser Learner
13
Learning from Perceptual Environment
bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input
bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language
learning
14
Navigation Example
식당에서 우회전 하세요Alice
Bob15Slide from David Chen
Navigation Example
Alice
Bob
병원에서 우회전 하세요
16Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
17Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
18Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
Make a right turn
19Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
20Slide from David Chen
Navigation Example
Scenario 1
Scenario 2
식당
21Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
22Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원
23Slide from David Chen
Thesis Contributionsbull Generative models for grounded language learning from
ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR
structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR
correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language
learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train
discriminative reranker
24
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
25
bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see
an elevator to your left bull Use virtual worlds and instructorfollower data
from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how
humans follow instructions
Navigation Task (Chen and Mooney 2011)
26
H
C
L
S S
B C
H
E
L
E
H ndash Hat Rack
L ndash Lamp
E ndash Easel
S ndash Sofa
B ndash Barstool
C - Chair
Sample Environment (MacMahon et al 2006)
27
Executing Test Instruction
28
Learning from Perceptual Environment
bull Motivated by how children learn language in rich ambiguous perceptual environment with linguistic input
bull Advantagesndash Naturally obtainable corporandash Relatively easy to annotatendash Motivated by natural process of human language
learning
14
Navigation Example
식당에서 우회전 하세요Alice
Bob15Slide from David Chen
Navigation Example
Alice
Bob
병원에서 우회전 하세요
16Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
17Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
18Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
Make a right turn
19Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
20Slide from David Chen
Navigation Example
Scenario 1
Scenario 2
식당
21Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
22Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원
23Slide from David Chen
Thesis Contributionsbull Generative models for grounded language learning from
ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR
structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR
correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language
learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train
discriminative reranker
24
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
25
bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see
an elevator to your left bull Use virtual worlds and instructorfollower data
from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how
humans follow instructions
Navigation Task (Chen and Mooney 2011)
26
H
C
L
S S
B C
H
E
L
E
H ndash Hat Rack
L ndash Lamp
E ndash Easel
S ndash Sofa
B ndash Barstool
C - Chair
Sample Environment (MacMahon et al 2006)
27
Executing Test Instruction
28
Navigation Example
식당에서 우회전 하세요Alice
Bob15Slide from David Chen
Navigation Example
Alice
Bob
병원에서 우회전 하세요
16Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
17Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
18Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
Make a right turn
19Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
20Slide from David Chen
Navigation Example
Scenario 1
Scenario 2
식당
21Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
22Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원
23Slide from David Chen
Thesis Contributionsbull Generative models for grounded language learning from
ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR
structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR
correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language
learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train
discriminative reranker
24
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
25
bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see
an elevator to your left bull Use virtual worlds and instructorfollower data
from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how
humans follow instructions
Navigation Task (Chen and Mooney 2011)
26
H
C
L
S S
B C
H
E
L
E
H ndash Hat Rack
L ndash Lamp
E ndash Easel
S ndash Sofa
B ndash Barstool
C - Chair
Sample Environment (MacMahon et al 2006)
27
Executing Test Instruction
28
Navigation Example
Alice
Bob
병원에서 우회전 하세요
16Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
17Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
18Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
Make a right turn
19Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
20Slide from David Chen
Navigation Example
Scenario 1
Scenario 2
식당
21Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
22Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원
23Slide from David Chen
Thesis Contributionsbull Generative models for grounded language learning from
ambiguous perceptual environmentndash Unified probabilistic model incorporating linguistic cues and MR
structures (vs previous approaches)ndash General framework of probabilistic approaches that learn NL-MR
correspondences from ambiguous supervisionbull Adapting discriminative reranking to grounded language
learningndash Standard reranking is not availablendash No single gold-standard reference for training datandash Weak response from perceptual environment can train
discriminative reranker
24
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
25
bull Learn to interpret and follow navigation instructions ndash eg Go down this hall and make a right when you see
an elevator to your left bull Use virtual worlds and instructorfollower data
from MacMahon et al (2006)bull No prior linguistic knowledgebull Infer language semantics by observing how
humans follow instructions
Navigation Task (Chen and Mooney 2011)
26
H
C
L
S S
B C
H
E
L
E
H ndash Hat Rack
L ndash Lamp
E ndash Easel
S ndash Sofa
B ndash Barstool
C - Chair
Sample Environment (MacMahon et al 2006)
27
Executing Test Instruction
28
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
17Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
18Slide from David Chen
Navigation Example
Scenario 1
Scenario 2식당에서 우회전 하세요
병원에서 우회전 하세요
Make a right turn
19Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
20Slide from David Chen
Navigation Example
Scenario 1
Scenario 2
식당
21Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원에서 우회전 하세요
식당에서 우회전 하세요
22Slide from David Chen
Navigation Example
Scenario 1
Scenario 2병원
23Slide from David Chen
Task Objective
• Learn the underlying meanings of instructions by observing human actions for those instructions
  – Learn to map instructions (NL) into correct formal plans of actions (MR)
• Learn from high ambiguity
  – Training input: pairs of an NL instruction and a landmarks plan (Chen and Mooney, 2011)
  – The landmarks plan:
    • describes actions in the environment along with notable objects encountered on the way
    • overestimates the meaning of the instruction, including unnecessary details
    • only a subset of the plan is relevant to the instruction
29
Challenges
Instruction: "at the easel, go left and then take a right onto the blue path at the corner"
Landmarks plan: Travel(steps: 1), Verify(at: EASEL, side: CONCRETE HALLWAY), Turn(LEFT), Verify(front: CONCRETE HALLWAY), Travel(steps: 1), Verify(side: BLUE HALLWAY, front: WALL), Turn(RIGHT), Verify(back: WALL, front: BLUE HALLWAY, front: CHAIR, front: HATRACK, left: WALL, right: EASEL)
Correct plan: only a subset of the landmarks plan is relevant to the instruction.
Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan.
30–32
Previous Work (Chen and Mooney, 2011)
• Circumvents the combinatorial NL-MR correspondence problem
  – Constructs supervised NL-MR training data by refining the landmarks plan with a learned semantic lexicon
    • Greedily selects high-scoring lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
  – Loses information during refinement:
    • Deterministically selects high-scoring lexemes
    • Ignores possibly useful low-scoring lexemes
    • Some relevant MR components are never considered at all
33
Proposed Solution (Kim and Mooney, 2012)
• Learn a probabilistic semantic parser directly from the ambiguous training data
  – Jointly disambiguates the input and learns to map NL instructions to formal MR plans
  – Uses the semantic lexicon of Chen and Mooney (2011) as the basic unit for building NL-MR correspondences
  – Transforms the problem into standard PCFG (Probabilistic Context-Free Grammar) induction, with semantic lexemes as nonterminals and NL words as terminals
34
System Diagram (Chen and Mooney, 2011)
[Diagram: a learning system for parsing navigation instructions. Training: observation (instruction, world state, action trace) → Navigation Plan Constructor → landmarks plan → plan refinement → supervised refined plan (possible information loss) → (supervised) semantic parser learner → semantic parser. Testing: instruction + world state → semantic parser → plan → execution module (MARCO) → action trace.]
35
System Diagram of Proposed Solution
[Diagram: the same pipeline, but the landmarks plan feeds a probabilistic semantic parser learner that learns directly from the ambiguous supervision, with no refinement step and thus no information loss.]
36
PCFG Induction Model for Grounded Language Learning (Borschinger et al., 2011)
• PCFG rules describe the generative process from MR components to the corresponding NL words
37
Hierarchy Generation PCFG Model (Kim and Mooney, 2012)
• Limitations of Borschinger et al. (2011):
  – Works only in low-ambiguity settings: one NL sentence paired with a handful of MRs (on the order of tens)
  – Can only output MRs included in the PCFG constructed from the training data
• Proposed model:
  – Uses semantic lexemes as the units of semantic concepts
  – Disambiguates NL-MR correspondences at the semantic-concept (lexeme) level
  – Handles a much higher degree of ambiguous supervision
  – Outputs novel MRs that never appear in the PCFG, by composing the MR parse from semantic lexeme MRs
38
Semantic Lexicon (Chen and Mooney, 2011)
• A pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable is subgraph g given that phrase w is seen?
• Examples:
  – "to the stool" → Travel(), Verify(at: BARSTOOL)
  – "black easel" → Verify(at: EASEL)
  – "turn left and walk" → Turn(), Travel()
Scoring compares the co-occurrence of g and w against the general occurrence of g without w.
39
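The scoring intuition above, how often g co-occurs with w versus occurring without it, can be sketched as a small counting routine. This is an illustrative approximation of lexicon scoring, not the exact formula of Chen and Mooney (2011); the function name and toy data are hypothetical.

```python
from collections import Counter

def lexicon_scores(examples):
    """Score candidate lexemes (phrase w, MR subgraph g) by how strongly
    g co-occurs with w versus occurring without w.

    `examples` is a list of (phrases, subgraphs) pairs: the set of n-grams
    in one NL instruction and the set of MR subgraphs in its landmarks plan
    (a hypothetical preprocessed input format).
    """
    cooc = Counter()     # examples containing both w and g
    g_count = Counter()  # examples containing g at all
    for phrases, subgraphs in examples:
        for g in subgraphs:
            g_count[g] += 1
            for w in phrases:
                cooc[(w, g)] += 1
    scores = {}
    for (w, g), c in cooc.items():
        without = g_count[g] - c            # occurrences of g without w
        scores[(w, g)] = c / (c + without)  # fraction of g's occurrences seen with w
    return scores

# Toy usage: "stool" reliably co-occurs with Verify(at BARSTOOL)
data = [({"stool"}, {"Verify(at BARSTOOL)"}),
        ({"stool"}, {"Verify(at BARSTOOL)", "Turn()"}),
        ({"easel"}, {"Turn()"})]
s = lexicon_scores(data)
```

High-scoring pairs become lexeme candidates; low-scoring but nonzero pairs are exactly what the probabilistic models below can still exploit.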
Lexeme Hierarchy Graph (LHG)
• A hierarchy of semantic lexemes, organized by subgraph relationships, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, which are in turn connected to NL word groundings
[Figure: an LHG whose root concept Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) decomposes into progressively smaller subgraph lexemes, e.g. Turn(), Travel(), Verify(at: EASEL); Turn(RIGHT), Verify(side: HATRACK), Travel(); down to Turn(), Verify(side: HATRACK).]
40
PCFG Construction
• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe; these are finally connected to the NL instruction
    • Each node generates all k-permutations of its children, since we do not know which subset is correct
  – NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al., 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated
41
PCFG Construction
[Figure: schematic PCFG rules. Child concepts are generated selectively from parent concepts; all semantic concepts generate their relevant NL words; each semantic concept generates at least one NL word.]
42
Parsing New NL Sentences
• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components, which propagate to the top level
  – Able to compose novel MRs never seen in the training data
43
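The CKY step above can be illustrated with a minimal probabilistic CKY recognizer for a PCFG in Chomsky normal form. This is a generic sketch with a toy grammar and illustrative nonterminal names, not the thesis implementation, which uses lexeme nonterminals and unigram word emissions on top of such a chart parser.

```python
import math
from collections import defaultdict

def cky_best_parse(words, lexical, binary, start="S"):
    """Minimal probabilistic CKY for a PCFG in Chomsky normal form.

    `lexical` maps (A, word) -> P(A -> word); `binary` maps
    (A, B, C) -> P(A -> B C). Returns the log-probability of the best
    parse rooted at `start`, or None if no parse exists.
    """
    n = len(words)
    best = defaultdict(dict)  # best[(i, j)][A] = best log-prob of A over words[i:j]
    for i, word in enumerate(words):
        for (A, w), p in lexical.items():
            if w == word:
                cur = best[(i, i + 1)].get(A, float("-inf"))
                best[(i, i + 1)][A] = max(cur, math.log(p))
    for span in range(2, n + 1):          # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):     # split point
                for (A, B, C), p in binary.items():
                    if B in best[(i, k)] and C in best[(k, j)]:
                        score = math.log(p) + best[(i, k)][B] + best[(k, j)][C]
                        if score > best[(i, j)].get(A, float("-inf")):
                            best[(i, j)][A] = score
    return best[(0, n)].get(start)

# Toy grammar: S -> TURN DIR, with unit-probability word emissions
lex = {("TURN", "turn"): 1.0, ("DIR", "left"): 1.0}
bin_rules = {("S", "TURN", "DIR"): 1.0}
lp = cky_best_parse(["turn", "left"], lex, bin_rules)
```

Backpointers (omitted here) would recover the tree itself, from which the responsible lexeme MRs are read off.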
[Figure (slides 44–46): the most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner." Lexeme MRs such as Turn(LEFT), Travel(), Verify(at: SOFA), and Turn(RIGHT) generate the NL words; marking the responsible components from the bottom up composes the final MR Turn(LEFT), Travel(), Verify(at: SOFA), Turn(RIGHT).]
46
Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG model:
  – Complexity caused by the Lexeme Hierarchy Graph and k-permutation rules
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates the relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train
47
PCFG Construction
• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the order in which relevant lexemes appear are thereby already accounted for
48
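The unigram Markov generation of lexemes can be sketched as a rule-construction helper: the context-MR nonterminal repeatedly emits one relevant lexeme and either continues or stops, so every emission order is derivable without explicit k-permutation rules. The nonterminal naming scheme is an illustrative assumption.

```python
def unigram_pcfg_rules(context_mr, lexemes):
    """Sketch of the Unigram Generation rule set: the context-MR
    nonterminal emits its relevant lexemes one at a time via a unigram
    Markov chain (emit-and-continue, or emit-and-stop), avoiding the
    k-permutation blow-up of the hierarchy model.
    """
    ctx = f"CTX[{context_mr}]"
    rules = []
    for lex in lexemes:
        lx = f"LEX[{lex}]"
        rules.append((ctx, (lx, ctx)))  # emit one lexeme, stay in context state
        rules.append((ctx, (lx,)))      # emit the final lexeme and stop
    return rules

rules = unigram_pcfg_rules("Turn(LEFT) Travel()", ["Turn(LEFT)", "Travel()"])
```

With L relevant lexemes this yields 2L rules per context MR, versus the sum over all k-permutations in the hierarchy model.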
PCFG Construction
[Figure: schematic PCFG rules. Each semantic concept is generated by a unigram Markov process; all semantic concepts generate their relevant NL words.]
49
Parsing New NL Sentences
• Follows a similar scheme to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal
50
[Figure (slides 51–54): the most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner" under the Unigram Generation model. The context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(); marking these components in the context MR yields the final parse.]
54
Data
• 3 maps, 6 instructors, 1–15 followers per direction
• Hand-segmented into single-sentence steps to make learning easier (Chen and Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
  – Word-segmented version (Stanford Chinese Word Segmenter)
  – Character-segmented version
[Example: the paragraph "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7." has the full action trace Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward, and is segmented into single sentences, each paired with its own sub-trace (such as Forward, Turn left, Forward, Turn right).]
55
Data Statistics

                            Paragraph        Single-Sentence
  # Instructions            706              3236
  Avg. # sentences          5.0 (±2.8)       1.0 (±0)
  Avg. # actions            10.4 (±5.7)      2.1 (±2.4)
  Avg. # words / sentence
    English                 37.6 (±21.1)     7.8 (±5.1)
    Chinese (Word)          31.6 (±18.1)     6.9 (±4.9)
    Chinese (Character)     48.9 (±28.3)     10.6 (±7.3)
  Vocabulary size
    English                 660              629
    Chinese (Word)          661              508
    Chinese (Character)     448              328

56
Evaluations
• Leave-one-map-out approach
  – 2 maps for training, 1 map for testing
  – Metrics: parse accuracy and plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-scoring lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
    • Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney, 2006), trained on the resulting supervised data
57
Parse Accuracy
• Evaluates how well the learned semantic parsers parse novel sentences in the test data
• Metric: partial parse accuracy
58
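A simplified stand-in for the partial parse accuracy metric is component-level precision/recall/F1 between the predicted and gold formal plans; the exact component-matching procedure in the thesis evaluation may differ, and the plan encoding below is illustrative.

```python
from collections import Counter

def partial_parse_f1(predicted, gold):
    """Precision/recall/F1 over matched MR components, treating each
    plan as a multiset of component strings (a simplified stand-in for
    partial parse accuracy).
    """
    p, g = Counter(predicted), Counter(gold)
    matched = sum((p & g).values())  # multiset intersection
    precision = matched / sum(p.values()) if p else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

pred = ["Turn(LEFT)", "Travel(steps 1)"]
gold = ["Turn(LEFT)", "Travel(steps 1)", "Verify(at SOFA)"]
prec, rec, f1 = partial_parse_f1(pred, gold)
```

This makes the precision/recall trade-off in the result tables concrete: a conservative parser scores high precision but low recall, as the baselines do below.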
Parse Accuracy (English)

  System                            Precision   Recall   F1
  Chen & Mooney (2011)              90.16       55.41    68.59
  Chen (2012)                       88.36       57.03    69.31
  Hierarchy Generation PCFG Model   87.58       65.41    74.81
  Unigram Generation PCFG Model     86.1        68.79    76.44

59
Parse Accuracy (Chinese-Word)

  System                            Precision   Recall   F1
  Chen (2012)                       88.87       58.76    70.74
  Hierarchy Generation PCFG Model   80.56       71.14    75.53
  Unigram Generation PCFG Model     79.45       73.66    76.41

60
Parse Accuracy (Chinese-Character)

  System                            Precision   Recall   F1
  Chen (2012)                       92.48       56.47    70.01
  Hierarchy Generation PCFG Model   79.77       67.38    73.05
  Unigram Generation PCFG Model     79.73       75.52    77.55

61
End-to-End Execution Evaluations
• Tests how well the formal plan output by the semantic parser reaches the destination
• Strict metric: successful only if the final position matches exactly
  – Facing direction is also considered in the single-sentence setting
  – Paragraph execution fails if even one single-sentence execution fails
62
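The strict metric can be sketched as follows: a paragraph counts as successful only if every constituent single-sentence execution succeeds, which is why paragraph accuracy is much lower than single-sentence accuracy in the tables below. The boolean encoding is an illustrative assumption.

```python
def execution_accuracy(paragraphs):
    """Strict end-to-end metric sketch. `paragraphs` is a list of lists
    of per-sentence success booleans (True = the execution reached the
    exact goal position). A paragraph succeeds only if all of its
    single-sentence executions succeed.
    """
    sentences = [ok for para in paragraphs for ok in para]
    sent_acc = sum(sentences) / len(sentences)
    para_acc = sum(all(para) for para in paragraphs) / len(paragraphs)
    return sent_acc, para_acc

# Two paragraphs of two sentences each; one sentence execution fails
sent_acc, para_acc = execution_accuracy([[True, True], [True, False]])
```

A single failed sentence sinks its whole paragraph, so paragraph accuracy decays roughly multiplicatively in paragraph length.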
End-to-End Execution Evaluations (English)

  System                            Single-Sentence   Paragraph
  Chen & Mooney (2011)              54.40             16.18
  Chen (2012)                       57.28             19.18
  Hierarchy Generation PCFG Model   57.22             20.17
  Unigram Generation PCFG Model     67.14             28.12

63
End-to-End Execution Evaluations (Chinese-Word)

  System                            Single-Sentence   Paragraph
  Chen (2012)                       58.7              20.13
  Hierarchy Generation PCFG Model   61.03             19.08
  Unigram Generation PCFG Model     63.4              23.12

64
End-to-End Execution Evaluations (Chinese-Character)

  System                            Single-Sentence   Paragraph
  Chen (2012)                       57.27             16.73
  Hierarchy Generation PCFG Model   55.61             12.74
  Unigram Generation PCFG Model     62.85             23.33

65
Discussion
• Better recall in parse accuracy
  – Our probabilistic models use useful but low-scoring lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexity: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity avoids over-fitting and gives better generalization
• Better than Borschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training
66
Comparison of Grammar Size and EM Training Time

                          Hierarchy Generation        Unigram Generation
  Data                    |Grammar|    Time (hrs)     |Grammar|    Time (hrs)
  English                 20,451       17.26          16,357       8.78
  Chinese (Word)          21,636       15.99          15,459       8.05
  Chinese (Character)     19,792       18.64          13,514       12.58

67
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• An effective approach to improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks:
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – The trained model outputs the single best result: the candidate with maximum probability
[Diagram: testing example → trained generative model → 1-best candidate with maximum probability]
70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: testing example → trained baseline generative model → GEN → n-best candidates (Candidate 1 … Candidate n) → trained secondary discriminative model → best prediction → output]
71
How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
  – No single gold-standard reference exists for each training example
  – Only weak supervision from the surrounding perceptual context (the landmarks plan) is available
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • The same procedure used for evaluating final end-task plan execution
  – A weak indication of whether a candidate is good or bad
  – Multiple candidate parses are used for each parameter update
    • The response signal is weak and distributed over all candidates
72
Reranking Model: Averaged Perceptron (Collins, 2000)
• The parameter weight vector is updated whenever the trained model predicts a wrong candidate
[Diagram: a training example passes through the trained baseline generative model, whose GEN function produces n-best candidates with feature vectors a₁ … aₙ and perceptron scores (e.g., −0.16, 1.21, −1.09, 1.46, 0.59). When the best-scoring candidate (here a₄) differs from the gold-standard reference with feature vector a_g, the weights are updated by the difference a_g − a₄. For our generative models, no gold-standard reference is available.]
73
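A single reranking-perceptron step can be sketched as below. The weight averaging over all updates (the "averaged" part of Collins, 2000) is omitted for brevity, and the feature names are illustrative assumptions.

```python
def perceptron_rerank_update(weights, candidates, gold_index):
    """One reranking-perceptron step: if the highest-scoring candidate is
    not the (pseudo-)gold one, move the weight vector toward the gold
    feature vector and away from the predicted one.

    `candidates` is a list of sparse feature dicts {feature: value}.
    """
    def score(fv):
        return sum(weights.get(f, 0.0) * v for f, v in fv.items())
    pred_index = max(range(len(candidates)), key=lambda i: score(candidates[i]))
    if pred_index != gold_index:
        for f, v in candidates[gold_index].items():   # add gold features
            weights[f] = weights.get(f, 0.0) + v
        for f, v in candidates[pred_index].items():   # subtract predicted features
            weights[f] = weights.get(f, 0.0) - v
    return weights

w = {}
cands = [{"lex:turn_left": 1.0}, {"lex:turn_right": 1.0}]
w = perceptron_rerank_update(w, cands, gold_index=1)
```

The missing ingredient for grounded learning is `gold_index` itself, which the response-based update below supplies.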
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the MR plans composed from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate: whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic, so average over 10 trials
  – Prefer the candidate with the best execution success rate during training
74
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results.

[Diagram: the n-best candidates (Candidate 1 … Candidate n) are composed into derived MRs (MR_1 … MR_n) and run through the MARCO execution module; execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 make Candidate 4 the pseudo-gold reference, while perceptron scores 1.79, 0.21, -1.09, 1.46, 0.59 make Candidate 1 the best prediction; the weights are updated by the feature-vector difference between the two.]

75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates can still mean a correct plan, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but they contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
76
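The multi-parse update described above can be sketched as follows, assuming feature vectors as plain lists of floats and MARCO success rates as numbers in [0, 1]; the function name `multi_parse_update` and the encoding are illustrative assumptions, not the talk's implementation.

```python
# Hedged sketch of the response-based update with multiple parses: every
# candidate whose execution success rate beats the currently best-predicted
# candidate contributes an update, weighted by the success-rate gap.

def multi_parse_update(w, candidates, success):
    """candidates: list of feature vectors; success: execution success rates."""
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in candidates]
    pred = max(range(len(candidates)), key=scores.__getitem__)
    for j, f in enumerate(candidates):
        gap = success[j] - success[pred]
        if gap > 0:  # only candidates executed more successfully than pred
            for i in range(len(w)):
                w[i] += gap * (f[i] - candidates[pred][i])
    return w
```

With this weighting, a candidate that barely outperforms the prediction nudges the weights slightly, while one that executes far more reliably pulls them strongly toward its features.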
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse.

[Diagram: with perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59, Candidate 2 (execution success rate 0.4) is the best prediction; Update (1) applies the feature-vector difference toward a candidate with a higher success rate (Candidate 1, rate 0.6), weighted by the difference in success rates.]

77
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse.

[Diagram, continued: Update (2) applies the same weighted feature-vector difference toward the next candidate with a higher success rate (Candidate 4, rate 0.9).]

78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: "Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front:SOFA, back:EASEL), Travel(steps:2), Verify(at:SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front:SOFA)    L3: Travel(steps:2), Verify(at:SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at:SOFA)    L6: Turn()

Example features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5 → "find") = 1
79
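As a rough illustration of such indicator features, the sketch below walks a parse tree encoded as nested tuples and collects one binary feature per parent → children composition; the tuple encoding and the function name `tree_features` are assumptions for illustration, not the talk's implementation.

```python
# Hedged sketch of binary parse-tree features: an indicator fires when a
# particular parent -> children composition appears anywhere in the tree.
# Trees are encoded as ("label", child, child, ...); leaves are plain strings.

def tree_features(tree, feats=None):
    if feats is None:
        feats = set()
    label, children = tree[0], tree[1:]
    if children:
        # record the parent -> immediate-children composition, e.g. "L1->L3"
        feats.add(label + "->" + ",".join(
            c[0] if isinstance(c, tuple) else c for c in children))
        for c in children:
            if isinstance(c, tuple):
                tree_features(c, feats)  # recurse into subtrees
    return feats
```

For a tree where L1 expands to L3, and L3 expands to L5 and L6 over the words "find" and "turn", this yields the indicator set {"L1->L3", "L3->L5,L6", "L5->find", "L6->turn"}; a reranker then scores a parse by the dot product between such indicators and the learned weights.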
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50 distinct composed MR plans, and the corresponding parses, out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate a sufficiently large 1,000,000-best parse list from the baseline model
80
Response-based Update vs. Baseline (English)
81

                  Parse F1              Single-sentence       Paragraph
                  Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline          74.81      76.44      57.22      67.14      20.17      28.12
Response-based    73.32      77.24      59.65      68.27      22.62      29.2
Response-based Update vs. Baseline (Chinese-Word)
82

                  Parse F1              Single-sentence       Paragraph
                  Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline          75.53      76.41      61.03      63.4       19.08      23.12
Response-based    77.26      77.74      64.12      65.64      21.29      23.74
Response-based Update vs. Baseline (Chinese-Character)
83

                  Parse F1              Single-sentence       Paragraph
                  Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline          73.05      77.55      55.61      62.85      12.74      23.33
Response-based    76.26      79.76      64.08      65.5       22.25      25.35
Response-based Update vs. Baseline
• vs. the baseline models
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model directly for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)
85

          Parse F1              Single-sentence       Paragraph
          Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single    73.32      77.24      59.65      68.27      22.62      29.2
Multi     73.43      77.81      62.81      68.93      26.57      29.1
Response-based Update with Multiple vs. Single Parses (Chinese-Word)
86

          Parse F1              Single-sentence       Paragraph
          Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single    77.26      77.74      64.12      65.64      21.29      23.74
Multi     78.8       78.11      64.15      66.27      21.55      25.95
Response-based Update with Multiple vs. Single Parses (Chinese-Character)
87

          Parse F1              Single-sentence       Paragraph
          Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single    76.26      79.76      64.08      65.5       22.25      25.35
Multi     79.44      79.94      64.08      66.84      22.58      27.16
Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves the performance in general
  – The single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback
88
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences from ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
Challenges
30
Instruction: "at the easel go left and then take a right onto the blue path at the corner"
Landmarks plan: Travel(steps:1), Verify(at:EASEL, side:CONCRETE HALLWAY), Turn(LEFT), Verify(front:CONCRETE HALLWAY), Travel(steps:1), Verify(side:BLUE HALLWAY, front:WALL), Turn(RIGHT), Verify(back:WALL, front:BLUE HALLWAY, front:CHAIR, front:HATRACK, left:WALL, right:EASEL)
Challenges
32
Instruction: "at the easel go left and then take a right onto the blue path at the corner"
Correct plan: Travel(steps:1), Verify(at:EASEL, side:CONCRETE HALLWAY), Turn(LEFT), Verify(front:CONCRETE HALLWAY), Travel(steps:1), Verify(side:BLUE HALLWAY, front:WALL), Turn(RIGHT), Verify(back:WALL, front:BLUE HALLWAY, front:CHAIR, front:HATRACK, left:WALL, right:EASEL)
Exponential number of possibilities: a combinatorial matching problem between the instruction and the landmarks plan
Previous Work (Chen and Mooney, 2011)
• Circumvents the combinatorial NL–MR correspondence problem
  – Constructs supervised NL–MR training data by refining the landmarks plan with a learned semantic lexicon
    • Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map a novel instruction (NL) to the correct formal plan (MR)
  – Loses information during refinement
    • Deterministically selects high-score lexemes
    • Ignores possibly useful low-score lexemes
    • Some relevant MR components are not considered at all
33
Proposed Solution (Kim and Mooney, 2012)
• Learn a probabilistic semantic parser directly from the ambiguous training data
  – Disambiguate the input + learn to map NL instructions to formal MR plans
  – Semantic lexicon (Chen and Mooney, 2011) as the basic unit for building NL–MR correspondences
  – Transforms the problem into standard PCFG (Probabilistic Context-Free Grammar) induction, with semantic lexemes as nonterminals and NL words as terminals
34
System Diagram (Chen and Mooney, 2011)
35
[Diagram: a learning system for parsing navigation instructions. Training: an observation (instruction, world state, action trace) feeds the Navigation Plan Constructor, which produces a landmarks plan; Plan Refinement turns it into a supervised refined plan (possible information loss), from which the (Supervised) Semantic Parser Learner induces a semantic parser. Testing: a new instruction and world state are parsed by the semantic parser and executed by the Execution Module (MARCO) to produce an action trace.]
System Diagram of Proposed Solution
36
[Diagram: as in the previous slide, but without the plan-refinement step. Training: an observation (instruction, world state, action trace) feeds the Navigation Plan Constructor, and the Probabilistic Semantic Parser Learner (from ambiguous supervision) learns directly from the landmarks plan. Testing: a new instruction and world state are parsed by the semantic parser and executed by the Execution Module (MARCO) to produce an action trace.]
PCFG Induction Model for Grounded Language Learning (Börschinger et al., 2011)
37
• PCFG rules describe the generative process from MR components to the corresponding NL words
Hierarchy Generation PCFG Model (Kim and Mooney, 2012)
• Limitations of Börschinger et al. (2011)
  – Only works in low-ambiguity settings: 1 NL – a handful of MRs (on the order of 10s)
  – Only outputs MRs included in the PCFG constructed from the training data
• Proposed model
  – Uses semantic lexemes as the units of semantic concepts
  – Disambiguates NL–MR correspondences at the semantic-concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG, by composing the MR parse from semantic lexeme MRs
38
Semantic Lexicon (Chen and Mooney, 2011)
• A pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable graph g is, given that phrase w is seen
• Examples
  – "to the stool": Travel(), Verify(at:BARSTOOL)
  – "black easel": Verify(at:EASEL)
  – "turn left and walk": Turn(), Travel()
[Formula annotations: cooccurrence of g and w; general occurrence of g without w]
39
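One plausible reading of the slide's annotated formula, consistent with the lexicon scoring in Chen and Mooney (2011), contrasts how often g occurs with w against how often g occurs without w. The sketch below is an illustration under that assumption; the function name and the toy counts are not from the corpus.

```python
# Hedged sketch of a lexeme score: how likely MR subgraph g is when
# phrase w is present versus absent, score(w, g) = p(g|w) - p(g|not w).

def lexeme_score(n_wg, n_w, n_g, n_total):
    """n_wg: examples containing both w and g; n_w, n_g: marginal counts;
    n_total: total number of training examples."""
    p_g_given_w = n_wg / n_w                        # cooccurrence of g and w
    p_g_without_w = (n_g - n_wg) / (n_total - n_w)  # occurrence of g without w
    return p_g_given_w - p_g_without_w
```

A high score means g appears almost only in contexts where w was uttered, which is exactly what makes (w, g) a good lexical entry.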
Lexeme Hierarchy Graph (LHG)
• A hierarchy of semantic lexemes, organized by the subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, which are further connected to NL word groundings
40
[Diagram: an example LHG. The root lexeme Turn(RIGHT), Verify(side:HATRACK, front:SOFA), Travel(steps:3), Verify(at:EASEL) decomposes into smaller lexemes such as Turn(), Travel(), Verify(at:EASEL); Travel(), Verify(at:EASEL); Verify(at:EASEL); Turn(RIGHT), Verify(side:HATRACK), Travel(); and Turn(), Verify(side:HATRACK).]
PCFG Construction
• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe, which will finally be connected to the NL instruction
    • Each node generates all k-permutations of its children nodes – we do not know which subset is correct
  – NL words are generated from lexeme nodes by a unigram Markov process (Börschinger et al., 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated
41
PCFG Construction
42
[Diagram annotations: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word.]
Parsing New NL Sentences
• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data
43
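Test-time decoding as described above can be illustrated with a minimal CKY Viterbi parser. This is a sketch for a toy grammar in Chomsky normal form, not the talk's actual grammar (whose rules are built from lexeme nonterminals); the dictionary formats `{(parent, (child1, child2)): prob}` and `{(nonterminal, word): prob}` are assumptions for illustration.

```python
# Minimal CKY sketch: recover the most probable parse of a sentence
# under a PCFG in CNF, using log probabilities and backpointers.
import math

def cky(words, binary, lexical, root):
    n = len(words)
    best = {}  # (i, j, nonterminal) -> (log prob, backpointer)
    # fill in the length-1 spans from lexical rules
    for i, w in enumerate(words):
        for (nt, word), p in lexical.items():
            if word == w:
                best[i, i + 1, nt] = (math.log(p), w)
    # combine adjacent spans bottom-up with binary rules
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (nt, (b, c)), p in binary.items():
                    if (i, k, b) in best and (k, j, c) in best:
                        lp = math.log(p) + best[i, k, b][0] + best[k, j, c][0]
                        if (i, j, nt) not in best or lp > best[i, j, nt][0]:
                            best[i, j, nt] = (lp, (k, b, c))
    return best.get((0, n, root))  # None if the sentence is not covered
```

Following the backpointers from the root cell reproduces the Viterbi tree; in the models above, the lexeme MRs at its nodes are then composed into the final MR plan.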
[Diagram, slides 44–46: the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner". The context MR Turn(LEFT), Verify(front:BLUE HALL, front:SOFA), Travel(steps:2), Verify(at:SOFA), Turn(RIGHT) is parsed into lexeme MRs such as Turn(LEFT), Verify(front:SOFA); Turn(LEFT); Travel(steps:2), Verify(at:SOFA), Turn(RIGHT); and Travel(), Verify(at:SOFA), Turn(). Marking the responsible components bottom-up composes the final MR parse.]
46
Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and the k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generate relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train
47
PCFG Construction
• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are thereby already accounted for
48
PCFG Construction
49
[Diagram annotations: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words.]
Parsing New NL Sentences
• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal
50
[Diagram, slides 51–54: the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner" under the Unigram Generation model. The context MR Turn(LEFT), Verify(front:BLUE HALL, front:SOFA), Travel(steps:2), Verify(at:SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT); Travel(), Verify(at:SOFA); and Turn(), whose components are marked in the context MR to compose the final MR parse.]
54
Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version
55
[Example (Paragraph vs. Single sentence): paragraph "Take the wood path towards the easel. At the easel go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7." Single sentences: "Take the wood path towards the easel." / "At the easel go left and then take a right on the the blue path at the corner." Action traces: Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward / Forward, Turn left, Forward, Turn right / Turn]
Data Statistics
56

                               Paragraph       Single-Sentence
# Instructions                 706             3236
Avg. # sentences               5.0 (±2.8)      1.0 (±0)
Avg. # actions                 10.4 (±5.7)     2.1 (±2.4)
Avg. # words
  English                      37.6 (±21.1)    7.8 (±5.1)
  Chinese-Word                 31.6 (±18.1)    6.9 (±4.9)
  Chinese-Character            48.9 (±28.3)    10.6 (±7.3)
Vocabulary
  English                      660             629
  Chinese-Word                 661             508
  Chinese-Character            448             328
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
    • Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney, 2006), trained on the resulting supervised data
57
Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy
58
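Partial parse accuracy can be sketched as precision/recall/F1 over the MR components shared between a predicted plan and the gold plan. Treating the plans as multisets of component strings, as below, is an illustrative simplification of the papers' tree-based partial matching; the function name `parse_f1` is an assumption.

```python
# Hedged sketch of partial parse accuracy: F1 over shared MR components.
from collections import Counter

def parse_f1(predicted, gold):
    """predicted, gold: lists of MR component strings."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    p = overlap / len(predicted) if predicted else 0.0
    r = overlap / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

Credit for partially correct plans is exactly what lets the later slides report recall gains from using low-score lexemes: a plan can miss an argument yet still share most components with the gold plan.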
Parse Accuracy (English)

                                   Precision  Recall  F1
Chen & Mooney (2011)               90.16      55.41   68.59
Chen (2012)                        88.36      57.03   69.31
Hierarchy Generation PCFG Model    87.58      65.41   74.81
Unigram Generation PCFG Model      86.1       68.79   76.44

59
Parse Accuracy (Chinese-Word)

                                   Precision  Recall  F1
Chen (2012)                        88.87      58.76   70.74
Hierarchy Generation PCFG Model    80.56      71.14   75.53
Unigram Generation PCFG Model      79.45      73.66   76.41

60
Parse Accuracy (Chinese-Character)

                                   Precision  Recall  F1
Chen (2012)                        92.48      56.47   70.01
Hierarchy Generation PCFG Model    79.77      67.38   73.05
Unigram Generation PCFG Model      79.73      75.52   77.55

61
End-to-End Execution Evaluations
• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers the facing direction in single-sentence evaluation
  – Paragraph execution is affected by even one failed single-sentence execution
62
End-to-End Execution Evaluations (English)

                                   Single-Sentence  Paragraph
Chen & Mooney (2011)               54.4             16.18
Chen (2012)                        57.28            19.18
Hierarchy Generation PCFG Model    57.22            20.17
Unigram Generation PCFG Model      67.14            28.12

63
End-to-End Execution Evaluations (Chinese-Word)

                                   Single-Sentence  Paragraph
Chen (2012)                        58.7             20.13
Hierarchy Generation PCFG Model    61.03            19.08
Unigram Generation PCFG Model      63.4             23.12

64
End-to-End Execution Evaluations (Chinese-Character)

                                   Single-Sentence  Paragraph
Chen (2012)                        57.27            16.73
Hierarchy Generation PCFG Model    55.61            12.74
Unigram Generation PCFG Model      62.85            23.33

65
Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-score lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Börschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training
66
Comparison of Grammar Size and EM Training Time
67

                     Hierarchy Generation PCFG   Unigram Generation PCFG
Data                 |Grammar|   Time (hrs)      |Grammar|   Time (hrs)
English              20,451      17.26           16,357      8.78
Chinese (Word)       21,636      15.99           15,459      8.05
Chinese (Character)  19,792      18.64           13,514      12.58
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• An effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – The trained model outputs the best result with the maximum probability

[Diagram: Testing Example → Trained Generative Model → Candidate 1, the 1-best candidate with maximum probability]

70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
Challenges
31
Instruc-tion
at the easel go left and then take a right onto the blue path at the corner
Landmarks plan
Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )
Challenges
32
Instruc-tion
at the easel go left and then take a right onto the blue path at the corner
Correctplan
Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )
Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan
Previous Work (Chen and Mooney 2011)
• Circumvents the combinatorial NL-MR correspondence problem
  – Constructs supervised NL-MR training data by refining the landmarks plan with a learned semantic lexicon
    ◦ Greedily selects high-score lexemes to choose probable MR components out of the landmarks plan
  – Trains a supervised semantic parser to map novel instructions (NL) to correct formal plans (MR)
  – Loses information during refinement
    ◦ Deterministically selects high-score lexemes; ignores possibly useful low-score lexemes
    ◦ Some relevant MR components are not considered at all
33
Proposed Solution (Kim and Mooney 2012)
• Learn a probabilistic semantic parser directly from ambiguous training data
  – Disambiguates the input + learns to map NL instructions to formal MR plans
  – Semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL-MR correspondences
  – Transforms the task into a standard PCFG (Probabilistic Context-Free Grammar) induction problem, with semantic lexemes as nonterminals and NL words as terminals
34
35
System Diagram (Chen and Mooney 2011)
[Diagram: learning system for parsing navigation instructions. Training: observed (instruction, world state, action trace) triples feed a Navigation Plan Constructor, which produces a landmarks plan; plan refinement turns it into a supervised refined plan (possible information loss) for a (supervised) semantic parser learner. Testing: the learned semantic parser maps an instruction and world state to a plan executed by the MARCO execution module, yielding an action trace.]
36
System Diagram of Proposed Solution
[Diagram: the same learning system for parsing navigation instructions, but the landmarks plan goes directly into a probabilistic semantic parser learner that learns from ambiguous supervision — no plan-refinement step, so no intermediate information loss.]
PCFG Induction Model for Grounded Language Learning (Börschinger et al. 2011)
37
• PCFG rules describe the generative process from MR components to the corresponding NL words
Hierarchy Generation PCFG Model (Kim and Mooney 2012)
• Limitations of Börschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL – a handful of MRs (order of 10s)
  – Only outputs MRs included in the PCFG constructed from the training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL-MR correspondences at the semantic-concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG by composing the MR parse with semantic lexeme MRs
38
Semantic Lexicon (Chen and Mooney 2011)
• Pair of NL phrase w and MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable graph g is given that phrase w is seen
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()
39
[Score: cooccurrence of g and w vs. general occurrence of g without w]
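The scoring idea on this slide can be illustrated with a toy counting scheme. This is a simplified sketch, not the exact GILL/SGOLL scoring function; the `lexicon_scores` helper and the toy training pairs are hypothetical.

```python
from collections import Counter

def lexicon_scores(examples):
    """Score each (phrase w, MR subgraph g) pair by how often g cooccurs
    with w, discounted by how often g occurs without w.
    examples: list of (set of phrases, set of MR subgraphs) per sentence."""
    pair_count = Counter()   # cooccurrence of g and w
    graph_count = Counter()  # overall occurrence of g
    for phrases, graphs in examples:
        for g in graphs:
            graph_count[g] += 1
            for w in phrases:
                pair_count[(w, g)] += 1
    scores = {}
    for (w, g), c in pair_count.items():
        without_w = graph_count[g] - c  # general occurrence of g without w
        scores[(w, g)] = c / (c + without_w)
    return scores

# Toy data in the spirit of the slide's examples.
examples = [
    ({"to the stool"}, {"Travel()", "Verify(at BARSTOOL)"}),
    ({"to the stool"}, {"Travel()"}),
    ({"turn left"}, {"Turn(LEFT)", "Travel()"}),
]
scores = lexicon_scores(examples)
```

Here "Travel()" also appears without "to the stool", so its score (2/3) is lower than that of "Verify(at BARSTOOL)" (1.0), which only ever cooccurs with the phrase.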
Lexeme Hierarchy Graph (LHG)
• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts that are further connected to NL word groundings
40
[Figure: example LHG — the root MR Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) decomposes into smaller lexeme subgraphs such as Turn(), Travel(), Verify(at: EASEL); Turn(RIGHT), Verify(side: HATRACK), Travel(); and Turn(), Verify(side: HATRACK).]
PCFG Construction
• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe that will finally be connected to the NL instruction
    ◦ Each node generates all k-permutations of its children nodes — we do not know which subset is correct
  – NL words are generated from lexeme nodes by a unigram Markov process (Börschinger et al. 2011)
  – PCFG rule weights are optimized by EM
    ◦ The most probable MR components out of all possible combinations are estimated
41
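The k-permutation rule explosion mentioned above can be made concrete. A sketch, assuming k ranges over all non-empty ordered selections of children; the `k_permutation_rules` helper and the node names are hypothetical illustrations, not the thesis's actual grammar construction code.

```python
from itertools import permutations

def k_permutation_rules(parent, children):
    """Enumerate PCFG rules parent -> (ordered selection of k children)
    for every k >= 1, since we do not know which subset is correct."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, perm))
    return rules

# A node with 3 child concepts already yields 3 + 6 + 6 = 15 rules.
rules = k_permutation_rules("Turn_Travel_Verify", ["Turn", "Travel", "Verify"])
```

The count grows super-exponentially with the number of children, which is one source of the grammar-size and over-fitting issues discussed later for the Hierarchy Generation model.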
PCFG Construction
42
[Figure: constructed PCFG — child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word.]
Parsing New NL Sentences
• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data
43
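The CKY step above can be sketched on a toy grammar. This is a minimal Viterbi-CKY over a hand-made two-rule grammar in Chomsky normal form — the symbols `S`, `TURN`, `TRAVEL` and all probabilities are invented for illustration, not the induced grammar of the model.

```python
import math
from collections import defaultdict

# Toy PCFG in CNF: lexeme-like nonterminals, NL words as terminals.
binary = {("S", ("TURN", "TRAVEL")): 0.9, ("S", ("TRAVEL", "TURN")): 0.1}
unary = {("TURN", "turn"): 0.6, ("TURN", "left"): 0.4,
         ("TRAVEL", "walk"): 1.0}

def cky(words):
    """Return (log-prob, backpointer) of the best S parse, or None."""
    n = len(words)
    best = defaultdict(dict)  # (i, j) -> symbol -> (logprob, backpointer)
    for i, w in enumerate(words):
        for (sym, term), p in unary.items():
            if term == w:
                best[(i, i + 1)][sym] = (math.log(p), w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for (lhs, (b, c)), p in binary.items():
                    if b in best[(i, k)] and c in best[(k, j)]:
                        lp = (math.log(p) + best[(i, k)][b][0]
                              + best[(k, j)][c][0])
                        if lhs not in best[(i, j)] or lp > best[(i, j)][lhs][0]:
                            best[(i, j)][lhs] = (lp, (b, k, c))
    return best[(0, n)].get("S")

result = cky(["turn", "walk"])
```

The returned backpointers are what a real system would walk to collect the lexeme MRs responsible for generating words and compose the final MR plan.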
[Figure: most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner" — lexeme nonterminals such as Turn(LEFT), Travel(), Verify(at: SOFA), and Turn(RIGHT) generate the NL words, and the responsible MR components are propagated from the bottom of the tree to the top to compose the final MR plan.]
46
Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train
47
PCFG Construction
• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the order in which relevant lexemes appear are thereby already accounted for
48
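The contrast with the k-permutation construction can be sketched as follows. This is an illustrative rule generator, assuming a chain where the context nonterminal emits one lexeme and either continues or stops; the `unigram_lexeme_rules` helper and symbol names are hypothetical.

```python
def unigram_lexeme_rules(context, lexemes):
    """Unigram Markov generation: the context nonterminal emits one
    relevant lexeme at a time and either keeps generating or stops, so
    orderings are handled by the chain itself, not by extra rules."""
    rules = []
    for lex in lexemes:
        rules.append((context, (lex, context)))  # emit lex, continue
        rules.append((context, (lex,)))          # emit lex, stop
    return rules

rules = unigram_lexeme_rules(
    "CTX", ["Turn(LEFT)", "Travel()", "Verify(at SOFA)"])
```

Three lexemes give only 2 × 3 = 6 rules here, versus 15 k-permutation rules for a three-child node in the Hierarchy Generation construction — consistent with the smaller grammars and faster EM training reported later.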
PCFG Construction
49
[Figure: constructed PCFG — each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words.]
Parsing New NL Sentences
• Follows a similar scheme as the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal
50
[Figure: most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner" under the Unigram Generation model — the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT), Travel(), Verify(at: SOFA), and Turn() one by one; these ground the NL words, and the relevant lexeme components are then marked in the context MR to give the final plan.]
54
Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version
55
[Example — Paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7." with action trace: Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward. Single sentences: "Take the wood path towards the easel." (Turn, Forward); "At the easel, go left and then take a right on the blue path at the corner." (Turn left, Forward, Turn right).]
Data Statistics
56
                          Paragraph        Single-Sentence
# Instructions            706              3236
Avg. # sentences          5.0 (±2.8)       1.0 (±0)
Avg. # actions            10.4 (±5.7)      2.1 (±2.4)
Avg. # words / sentence
  English                 37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word            31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character       48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                 660              629
  Chinese-Word            661              508
  Chinese-Character       448              328
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon-learning algorithms:
    ◦ Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    ◦ Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data
57
Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy (precision, recall, F1)
58
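Partial parse accuracy can be sketched as set overlap between gold and predicted MR components. A simplified illustration, assuming components are compared as a flat set; the `prf1` helper and the example components are hypothetical, not KRISP's exact scorer.

```python
def prf1(gold, predicted):
    """Precision / recall / F1 over atomic MR components (set overlap)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# A parse that recovers 2 of 3 gold components with no spurious ones.
p, r, f = prf1({"Turn(LEFT)", "Travel(steps 2)", "Verify(at SOFA)"},
               {"Turn(LEFT)", "Travel(steps 2)"})
```

This shape of metric explains the recall/precision trade-off in the result tables: a parser that emits more (possibly low-score) components gains recall at some cost in precision.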
Parse Accuracy (English)
System                            Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.10       68.79    76.44
59
Parse Accuracy (Chinese-Word)
System                            Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41
60
Parse Accuracy (Chinese-Character)
System                            Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55
61
End-to-End Execution Evaluations
• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered in the single-sentence setting
  – Paragraph execution is affected by even one failed single-sentence execution
62
End-to-End Execution Evaluations(English)
System                            Single-Sentence   Paragraph
Chen & Mooney (2011)              54.40             16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12
63
End-to-End Execution Evaluations(Chinese-Word)
System                            Single-Sentence   Paragraph
Chen (2012)                       58.70             20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.40             23.12
64
End-to-End Execution Evaluations(Chinese-Character)
System                            Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33
65
Discussion
• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes the PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity; avoids over-fitting; better generalization
• Better than Börschinger et al. 2011
  – Overcomes intractability in a complex MRL
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training
66
Comparison of Grammar Size and EM Training Time
67
                      Hierarchy Generation       Unigram Generation
Data                  |Grammar|   Time (hrs)     |Grammar|   Time (hrs)
English               20451       17.26          16357        8.78
Chinese (Word)        21636       15.99          15459        8.05
Chinese (Character)   19792       18.64          13514       12.58
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• Effective approach to improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – The trained model outputs the best result, with maximum probability
[Diagram: Testing Example → Trained Generative Model → 1-best candidate with maximum probability]
70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → best prediction → output]
71
How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied to grounded language learning directly
  – No single gold-standard reference for each training example
  – Instead, only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    ◦ Also used in evaluating final end-task plan execution
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for the parameter update
    ◦ The response signal is weak and distributed over all candidates
72
Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: a training example goes through the trained baseline generative model (GEN) to produce n-best candidates a₁ … aₙ with perceptron scores (e.g. −0.16, 1.21, −1.09, 1.46, 0.59); when the best prediction a₄ differs from the gold-standard reference a_g, the weights are updated by the feature-vector difference a_g − a₄. For our generative models, such a gold-standard reference is not available.]
73
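The averaged perceptron reranker on this slide can be sketched over sparse feature dictionaries. A minimal illustration, assuming candidates are given as feature dicts and a gold index per example; the helper names and the toy training set are hypothetical.

```python
def dot(w, feats):
    """Sparse dot product of weight vector and feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def averaged_perceptron(train, epochs=5):
    """train: list of (candidate_feature_dicts, gold_index).
    Update by the feature-vector difference (gold - predicted) on
    mistakes; return the average of all intermediate weight vectors."""
    w, w_sum, t = {}, {}, 0
    for _ in range(epochs):
        for candidates, gold in train:
            pred = max(range(len(candidates)),
                       key=lambda i: dot(w, candidates[i]))
            if pred != gold:
                for f, v in candidates[gold].items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in candidates[pred].items():
                    w[f] = w.get(f, 0.0) - v
            for f, v in w.items():          # accumulate for averaging
                w_sum[f] = w_sum.get(f, 0.0) + v
            t += 1
    return {f: v / t for f, v in w_sum.items()}

train = [([{"good": 1.0}, {"bad": 1.0}], 0),
         ([{"bad": 1.0}, {"good": 1.0}], 1)]
w_avg = averaged_perceptron(train)
```

Averaging the weight vectors over all updates (rather than keeping only the final one) is the standard Collins-style trick to reduce variance from late updates.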
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
    ◦ Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    ◦ Whether each candidate MR reaches the intended destination
    ◦ MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training
74
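The pseudo-gold selection above can be sketched directly. The per-trial outcome lists below are invented so that the averages match the example rates on the next slide (0.6, 0.4, 0.0, 0.9, 0.2); MARCO itself is not modeled here.

```python
def success_rate(trial_outcomes):
    """Average success over nondeterministic execution trials
    (the slides average over 10 MARCO runs)."""
    return sum(trial_outcomes) / len(trial_outcomes)

def pick_pseudo_gold(success_rates):
    """Index of the candidate whose derived MR plan has the highest
    execution success rate; ties keep the earlier (higher-ranked) parse."""
    return max(range(len(success_rates)), key=lambda i: success_rates[i])

# 1 = reached the destination in that trial, 0 = did not.
rates = [success_rate(t) for t in [
    [1, 1, 0, 1, 1, 1, 0, 1, 0, 0],   # candidate 1 -> 0.6
    [0, 1, 0, 1, 0, 0, 1, 0, 1, 0],   # candidate 2 -> 0.4
    [0] * 10,                          # candidate 3 -> 0.0
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],   # candidate 4 -> 0.9
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 0],   # candidate 5 -> 0.2
]]
gold = pick_pseudo_gold(rates)  # candidate 4 becomes the pseudo-gold
```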
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: the derived MRs of the n-best candidates (MR₁ … MRₙ) are run by the MARCO execution module, giving execution success rates (e.g. 0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate becomes the pseudo-gold reference, and the perceptron weights are updated by the feature-vector difference between it and the best prediction.]
75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    ◦ MR plans may be underspecified or have ignorable details attached
    ◦ Sometimes inaccurate, but containing the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
76
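The multi-candidate update can be sketched as follows. A simplified illustration of the idea on this slide — every candidate that beats the current prediction's execution success rate contributes an update scaled by its margin; the helper name and toy features are hypothetical.

```python
def multi_parse_update(w, candidates, rates, pred):
    """Update weight dict w toward every candidate whose execution
    success rate exceeds that of the predicted parse `pred`, weighting
    each feature-vector difference by the success-rate margin."""
    for i, feats in enumerate(candidates):
        gain = rates[i] - rates[pred]
        if gain > 0:
            for f, v in feats.items():           # pull toward candidate i
                w[f] = w.get(f, 0.0) + gain * v
            for f, v in candidates[pred].items():  # push away from pred
                w[f] = w.get(f, 0.0) - gain * v
    return w

candidates = [{"a": 1.0}, {"b": 1.0}, {"c": 1.0}]
rates = [0.6, 0.9, 0.4]
w = multi_parse_update({}, candidates, rates, pred=2)
```

Here candidate 1 (margin 0.2) and candidate 2 (margin 0.5) both update the weights, so the weak execution signal is spread over several preferable parses instead of a single pseudo-gold.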
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram, shown in two steps: among the n-best candidates with execution success rates 0.6, 0.4, 0.0, 0.9, 0.2, every candidate whose rate exceeds that of the currently predicted parse contributes an update — Update (1) and Update (2) — each weighted by its success-rate margin over the prediction.]
78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)
Example — "Turn left and find the sofa then turn around the corner":
  L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()
  f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1
79
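Extracting such binary indicator features can be sketched with a tree walk. A simplified illustration: parent–child nonterminal compositions and lexeme-to-word emissions become string-keyed features; the `tree_features` helper and tuple tree encoding are hypothetical, not the thesis's exact feature set.

```python
def tree_features(tree):
    """Binary features over a parse tree given as (label, [children]),
    where leaves are plain word strings."""
    feats = set()
    def walk(node):
        if isinstance(node, str):
            return
        label, children = node
        for child in children:
            if isinstance(child, str):
                feats.add(f"{label}->{child}")     # lexeme emits a word
            else:
                feats.add(f"{label}=>{child[0]}")  # parent-child concepts
                walk(child)
    walk(tree)
    return feats

# Fragment mirroring the slide's lexeme labels L1, L3, L5, L6.
tree = ("L1", [("L3", [("L5", ["find"]), ("L6", ["turn"])])])
feats = tree_features(tree)
```

Each feature fires with value 1 if present, matching the f(·) = 1 indicators on the slide; the perceptron weight vector is indexed by these feature strings.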
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – We take the 50 best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses
    ◦ Many parse trees differ insignificantly, leading to the same derived MR plans
    ◦ A sufficiently large 1,000,000-best parse list is generated from the baseline model
80
Response-based Update vs Baseline(English)
81
                 Parse F1               Single-Sentence        Paragraph
                 Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Baseline         74.81      76.44       57.22      67.14       20.17      28.12
Response-based   73.32      77.24       59.65      68.27       22.62      29.20
Response-based Update vs Baseline (Chinese-Word)
82
                 Parse F1               Single-Sentence        Paragraph
                 Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Baseline         75.53      76.41       61.03      63.40       19.08      23.12
Response-based   77.26      77.74       64.12      65.64       21.29      23.74
Response-based Update vs Baseline(Chinese-Character)
83
                 Parse F1               Single-Sentence        Paragraph
                 Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Baseline         73.05      77.55       55.61      62.85       12.74      23.33
Response-based   76.26      79.76       64.08      65.50       22.25      25.35
Response-based Update vs. Baseline
• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
         Parse F1               Single-Sentence        Paragraph
         Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Single   73.32      77.24       59.65      68.27       22.62      29.20
Multi    73.43      77.81       62.81      68.93       26.57      29.10
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
         Parse F1               Single-Sentence        Paragraph
         Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Single   77.26      77.74       64.12      65.64       21.29      23.74
Multi    78.80      78.11       64.15      66.27       21.55      25.95
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
         Parse F1               Single-Sentence        Paragraph
         Hierarchy  Unigram     Hierarchy  Unigram     Hierarchy  Unigram
Single   76.26      79.76       64.08      65.50       22.25      25.35
Multi    79.44      79.94       64.08      66.84       22.58      27.16
Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves the performance in general
  – A single best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates may produce underspecified plans, or plans with ignorable details attached, while still capturing the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback
88
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
Challenges
32
Instruc-tion
at the easel go left and then take a right onto the blue path at the corner
Correctplan
Travel ( steps 1 ) Verify ( at EASEL side CONCRETE HALLWAY ) Turn ( LEFT ) Verify ( front CONCRETE HALLWAY ) Travel ( steps 1 ) Verify ( side BLUE HALLWAY front WALL ) Turn ( RIGHT ) Verify ( back WALL front BLUE HALLWAY front CHAIR front HATRACK left WALL right EASEL )
Exponential Number of Possibilities Combinatorial matching problem between instruction and landmarks plan
Previous Work (Chen and Mooney 2011)
bull Circumvent combinatorial NL-MR correspondence problemndash Constructs supervised NL-MR training data by refining
landmarks plan with learned semantic lexiconGreedily select high-score lexemes to choose probable MR
components out of landmarks planndash Trains supervised semantic parser to map novel instruction
(NL) to correct formal plan (MR)ndash Loses information during refinement
Deterministically select high-score lexemesIgnores possibly useful low-score lexemesSome relevant MR components are not considered at all
33
Proposed Solution (Kim and Mooney 2012)
bull Learn probabilistic semantic parser directly from ambiguous training datandash Disambiguate input + learn to map NL instructions to
formal MR planndash Semantic lexicon (Chen and Mooney 2011) as basic unit for
building NL-MR correspondencesndash Transforms into standard PCFG (Probabilistic
Context-Free Grammar) induction problem with semantic lexemes as nonterminals and NL words as terminals
34
35
Learning system for parsing navigation instructions
Observation
Instruction
World State
Execution Module (MARCO)
Instruction
World State
TrainingTesting
Action TraceNavigation Plan Constructor
(Supervised) Semantic Parser Learner
Plan Refinement
Semantic Parser
Action Trace
System Diagram (Chen and Mooney 2011)
Landmarks Plan
Supervised Refined Plan
LearningInference
Possibleinformationloss
36
Learning system for parsing navigation instructions
Observation
Instruction
World State
Execution Module (MARCO)
Instruction
World State
TrainingTesting
Action TraceNavigation Plan Constructor
Probabilistic Semantic Parser Learner (from
ambiguous supervison)
Semantic Parser
Action Trace
System Diagram of Proposed Solution
Landmarks Plan
PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)
37
bull PCFG rules to describe generative process from MR components to corresponding NL words
Hierarchy Generation PCFG Model (Kim and Mooney 2012)
bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings
1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training
databull Proposed model
ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept
(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR
parse with semantic lexeme MRs38
bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and
context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w
bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()
Semantic Lexicon (Chen and Mooney 2011)
39
cooccurrenceof g and w
general occurrenceof g without w
Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes
by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic
conceptsndash Lexeme hierarchy = semantic
concept hierarchyndash Shows how complicated
semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings
40
Turn
RIGHT sideHATRACK
frontSOFA
steps3
atEASEL
Verify Travel Verify
Turn
atEASEL
Travel Verify
atEASEL
Verify
Turn
RIGHT sideHATRACK
Verify Travel
Turn
sideHATRACK
Verify
PCFG Construction
bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to
describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes
ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram
Markov process (Borschinger et al 2011)
ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible
combinations are estimated41
PCFG Construction
42
m
Child concepts are generated from parent concepts selec-tively
All semantic concepts gen-erate relevant NL words
Each semantic concept generates at least one NL word
Parsing New NL Sentences
bull PCFG rule weights are optimized by Inside-Outside algorithm with training data
bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating
NL wordsndash From the bottom of the tree mark only responsible MR
components that propagate to the top levelndash Able to compose novel MRs never seen in the training data
43
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify
Turn
LEFT frontSOFA
Verify
Turn
LEFT
Turn
RIGHT
steps2
atSOFA
Travel Verify Turn
RIGHT
atSOFA
Travel Verify Turn
Turn left and find the sofa then turn around
the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify
Turn
LEFT frontSOFA
Verify
Turn
LEFT
Turn
RIGHT
steps2
atSOFA
Travel Verify Turn
RIGHT
atSOFA
Travel Verify Turn
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
46
Unigram Generation PCFG Model
bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph
and k-permutationsndash Tend to over-fit to the training data
bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train
47
PCFG Construction
bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one
by onendash Permutations of the appearing orders of relevant
lexemes are already considered
48
PCFG Construction
49
Each semantic concept is generated by unigram Markov process
All semantic concepts gen-erate relevant NL words
Parsing New NL Sentences
bull Follows the similar scheme as in Hierarchy Generation PCFG model
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for
generating NL wordsndash Mark relevant lexeme MR components in the
context MR appearing in the top nonterminal
50
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT
atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn left and find the sofa then turn around the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Context MR
Context MR
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
54
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered for single-sentence evaluation
  – A paragraph execution fails if even one of its single-sentence executions fails
62
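The strict metric above can be sketched directly. This is an illustrative stand-in for the evaluation logic, with a hypothetical state representation (`pos`, `dir` keys).

```python
def sentence_success(final_state, goal_state, check_heading=True):
    """Strict metric: execution succeeds only if the final position
    (and, for single sentences, the facing direction) matches exactly."""
    same_pos = final_state["pos"] == goal_state["pos"]
    same_dir = final_state["dir"] == goal_state["dir"]
    return same_pos and (same_dir or not check_heading)

def paragraph_success(sentence_results):
    """A paragraph succeeds only if every sentence execution succeeds."""
    return all(sentence_results)
```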
End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)              54.40             16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12
63
End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                       58.70             20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.40             23.12
64
End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33
65
Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-score lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, generalizes better
• Better than Borschinger et al. (2011)
  – Overcomes intractability with a complex MRL
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66
Comparison of Grammar Size and EM Training Time

                      Hierarchy Generation PCFG    Unigram Generation PCFG
Data                  |Grammar|    Time (hrs)      |Grammar|    Time (hrs)
English               20451        17.26           16357        8.78
Chinese (Word)        21636        15.99           15459        8.05
Chinese (Character)   19792        18.64           13514        12.58

67
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68
Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69
Discriminative Reranking
• Generative model
  – The trained model outputs the single best result, the one with maximum probability
[Diagram: Testing Example → Trained Generative Model → 1-best candidate with maximum probability]

70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → best prediction (output)]

71
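The reranking step itself is just an argmax over the n-best list under a linear model. A minimal sketch, assuming a `features` callback that maps a candidate to a sparse feature vector:

```python
def rerank(candidates, features, weights):
    """Pick the candidate whose feature vector scores highest
    under the linear model w · f(candidate)."""
    def score(cand):
        return sum(weights.get(feat, 0.0) * val
                   for feat, val in features(cand).items())
    return max(candidates, key=score)
```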
How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
  – There is no single gold-standard reference for each training example
  – Training instead provides weak supervision from the surrounding perceptual context (the landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • The same procedure used to evaluate final end-task plan execution
  – Gives a weak indication of whether a candidate is good or bad
  – Allows multiple candidate parses per parameter update
    • The response signal is weak and distributed over all candidates

72
Reranking Model: Averaged Perceptron (Collins, 2000)
• The parameter weight vector is updated whenever the trained model predicts a wrong candidate
[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates a1 … an with perceptron scores (−0.16, 1.21, −1.09, 1.46, 0.59); the perceptron compares its best prediction a4 against the gold-standard reference a_g and updates the weights by the feature-vector difference a_g − a4]

73
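The averaged-perceptron reranker can be sketched as below. This is a minimal illustration rather than the thesis implementation; `n_best`, `gold`, and `features` are hypothetical callbacks, and the weight averaging follows the standard Collins (2000) recipe.

```python
def averaged_perceptron(train_data, features, n_best, gold, epochs=5):
    """Averaged perceptron reranker (Collins 2000), sketched.
    n_best(ex) -> candidate list; gold(ex) -> reference candidate;
    features(cand) -> dict feature vector."""
    weights, totals, t = {}, {}, 0
    for _ in range(epochs):
        for ex in train_data:
            cands = n_best(ex)
            pred = max(cands, key=lambda c: sum(
                weights.get(f, 0.0) * v for f, v in features(c).items()))
            ref = gold(ex)
            if pred != ref:
                # Update w by the feature-vector difference f(ref) - f(pred).
                for f, v in features(ref).items():
                    weights[f] = weights.get(f, 0.0) + v
                for f, v in features(pred).items():
                    weights[f] = weights.get(f, 0.0) - v
            # Accumulate the weight vector for averaging.
            for f, v in weights.items():
                totals[f] = totals.get(f, 0.0) + v
            t += 1
    return {f: v / t for f, v in totals.items()}
```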
(For our generative models, such a gold-standard reference is not available.)
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74
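The pseudo-gold selection above can be sketched as follows. The `execute` and `derive_mr` callbacks are hypothetical stand-ins for the MARCO executor and the MR-composition step.

```python
def execution_success_rate(mr_plan, execute, trials=10):
    """The MARCO-style executor is nondeterministic, so average
    success over several trials (10 in the talk). `execute` returns
    True when the plan reaches the intended destination."""
    return sum(1.0 for _ in range(trials) if execute(mr_plan)) / trials

def pseudo_gold(candidates, derive_mr, execute):
    """Pick the candidate parse whose derived MR executes best."""
    return max(candidates,
               key=lambda c: execution_success_rate(derive_mr(c), execute))
```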
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: n-best candidates (Candidate 1 … Candidate n) → derived MRs (MR1 … MRn) → MARCO execution module → execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate (0.9) serves as the pseudo-gold reference, and the perceptron (scores: 1.79, 0.21, −1.09, 1.46, 0.59) updates with the feature-vector difference against its best prediction]

75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but they contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
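The multi-parse update rule above can be sketched as below: every candidate that out-executes the current prediction contributes an update scaled by its success-rate margin. A minimal illustration; the `features` callback and dict-based weights are assumptions.

```python
def multi_parse_update(weights, candidates, features, rates, predicted):
    """Update toward every candidate whose execution success rate
    exceeds that of the currently predicted parse, weighting each
    update by the difference in success rates."""
    base_rate = rates[predicted]
    for cand in candidates:
        margin = rates[cand] - base_rate
        if margin > 0:
            for f, v in features(cand).items():
                weights[f] = weights.get(f, 0.0) + margin * v
            for f, v in features(predicted).items():
                weights[f] = weights.get(f, 0.0) - margin * v
    return weights
```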
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram, update (1): n-best candidates → derived MRs (MR1 … MRn) → MARCO execution module → execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); perceptron scores (1.24, 1.83, −1.09, 1.46, 0.59); the first update uses one of the candidates whose rate exceeds that of the predicted parse]

77
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram, update (2): same n-best candidates, derived MRs, execution success rates (0.6, 0.4, 0.0, 0.9, 0.2), and perceptron scores (1.24, 1.83, −1.09, 1.46, 0.59); the second update uses the other candidate whose rate exceeds that of the predicted parse]

78
Features
• Binary indicators of whether a certain composition of nonterminals / terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Example: "Turn left and find the sofa, then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)        L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)        L5: Travel(), Verify(at: SOFA)        L6: Turn()

Example feature values: f(L1 → L3) = 1, f(L3 → L5 | L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1

79
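Extracting such binary rule indicators from a parse tree can be sketched as below. This is a simplified illustration of the feature scheme, assuming trees are encoded as nested `(label, children...)` tuples; it emits parent → children rules and terminal attachments only.

```python
def rule_features(tree):
    """Binary indicator features over nonterminal compositions in a
    parse tree given as (label, children...) tuples, e.g.
    ("L1", ("L2", "left"), ("L3", "sofa")). Emits parent -> child
    rules and terminal attachments."""
    feats = {}
    def walk(node):
        if isinstance(node, tuple):
            label, children = node[0], node[1:]
            child_labels = [c[0] if isinstance(c, tuple) else c
                            for c in children]
            feats["%s -> %s" % (label, " ".join(child_labels))] = 1.0
            for c in children:
                walk(c)
    walk(tree)
    return feats
```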
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50-best distinct composed MR plans and their corresponding parses out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse trees from the baseline model

80
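Collecting distinct MR plans from a large ranked parse list can be sketched as a simple deduplication scan. A minimal illustration; `derive_mr` is a hypothetical stand-in for the MR-composition step.

```python
def distinct_mr_parses(ranked_parses, derive_mr, k=50):
    """Scan a large n-best list (1,000,000-best in the talk) and keep
    the top-ranked parses with k distinct derived MR plans, since many
    parse trees differ insignificantly and yield the same MR."""
    seen, kept = set(), []
    for parse in ranked_parses:
        mr = derive_mr(parse)
        if mr not in seen:
            seen.add(mr)
            kept.append(parse)
            if len(kept) == k:
                break
    return kept
```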
Response-based Update vs. Baseline (English)

                   Hierarchy                    Unigram
                   Baseline   Response-based   Baseline   Response-based
Parse F1           74.81      73.32            76.44      77.24
Single-sentence    57.22      59.65            67.14      68.27
Paragraph          20.17      22.62            28.12      29.20

81
Response-based Update vs. Baseline (Chinese-Word)

                   Hierarchy                    Unigram
                   Baseline   Response-based   Baseline   Response-based
Parse F1           75.53      77.26            76.41      77.74
Single-sentence    61.03      64.12            63.40      65.64
Paragraph          19.08      21.29            23.12      23.74

82
Response-based Update vs. Baseline (Chinese-Character)

                   Hierarchy                    Unigram
                   Baseline   Response-based   Baseline   Response-based
Parse F1           73.05      76.26            77.55      79.76
Single-sentence    55.61      64.08            62.85      65.50
Paragraph          12.74      22.25            23.33      25.35

83
Response-based Update vs. Baseline
• The response-based approach performs better on the final end-task plan execution
  – It directly optimizes the model for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)

                   Hierarchy            Unigram
                   Single     Multi     Single     Multi
Parse F1           73.32      73.43     77.24      77.81
Single-sentence    59.65      62.81     68.27      68.93
Paragraph          22.62      26.57     29.20      29.10

85
Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                   Hierarchy            Unigram
                   Single     Multi     Single     Multi
Parse F1           77.26      78.80     77.74      78.11
Single-sentence    64.12      64.15     65.64      66.27
Paragraph          21.29      21.55     23.74      25.95

86
Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                   Hierarchy            Unigram
                   Single     Multi     Single     Multi
Parse F1           76.26      79.44     79.76      79.94
Single-sentence    64.08      64.08     65.50      66.84
Paragraph          22.25      22.58     25.35      27.16

87
Response-based Update with Multiple vs. Single Parses
• Using multiple parses generally improves performance
  – The single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, yet still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Conclusion
• Conventional language learning is expensive and does not scale, because training data must be annotated
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92
Thank You
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86

Metric           Update   Hierarchy   Unigram
Parse F1         Single   77.26       77.74
Parse F1         Multi    78.8        78.11
Single-sentence  Single   64.12       65.64
Single-sentence  Multi    64.15       66.27
Paragraph        Single   21.29       23.74
Paragraph        Multi    21.55       25.95
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87

Metric           Update   Hierarchy   Unigram
Parse F1         Single   76.26       79.76
Parse F1         Multi    79.44       79.94
Single-sentence  Single   64.08       65.5
Single-sentence  Multi    64.08       66.84
Paragraph        Single   22.25       25.35
Paragraph        Multi    22.58       27.16
Response-based Update with Multiple vs Single Parses
• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, but still capture the gist of the preferred actions
  – A variety of preferable parses helps improve both the amount and the quality of the weak feedback
88
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection, model adaptation to large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
Proposed Solution (Kim and Mooney 2012)
• Learn a probabilistic semantic parser directly from ambiguous training data
  – Disambiguate the input and learn to map NL instructions to formal MR plans
  – Semantic lexicon (Chen and Mooney 2011) as the basic unit for building NL-MR correspondences
  – Transforms into a standard PCFG (Probabilistic Context-Free Grammar) induction problem, with semantic lexemes as nonterminals and NL words as terminals
34
35
System Diagram (Chen and Mooney 2011)
Learning system for parsing navigation instructions
[Diagram: in training, the instruction, world state, and action trace feed the Navigation Plan Constructor; its landmarks plan passes through Plan Refinement (possible information loss) to produce a supervised refined plan for the (Supervised) Semantic Parser Learner. In testing, the learned Semantic Parser maps the instruction and world state to a plan that the Execution Module (MARCO) carries out, producing an action trace.]
36
System Diagram of Proposed Solution
Learning system for parsing navigation instructions
[Diagram: the same pipeline, except the Navigation Plan Constructor's landmarks plan feeds the Probabilistic Semantic Parser Learner (from ambiguous supervision) directly, with no separate plan refinement step; testing is unchanged, with MARCO executing the parser's output plan.]
PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)
37
• PCFG rules describe the generative process from MR components to the corresponding NL words
Hierarchy Generation PCFG Model (Kim and Mooney 2012)
• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL sentence paired with a handful of MRs (order of 10s)
  – Can only output MRs included in the PCFG constructed from the training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL-MR correspondences at the semantic-concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG, by composing the MR parse from semantic lexeme MRs
38
Semantic Lexicon (Chen and Mooney 2011)
• Pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable graph g is, given that phrase w is seen
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()
39
• Score contrasts the co-occurrence of g and w with the general occurrence of g without w
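The co-occurrence contrast above can be sketched in a few lines. This is a minimal illustration of the idea (how often subgraph g appears with phrase w versus without it), not Chen and Mooney's exact scoring function; the data layout and function name are illustrative assumptions.

```python
from collections import Counter

def lexicon_scores(pairs):
    """Score candidate (phrase, subgraph) lexemes by contrasting the
    co-occurrence of g with w against the occurrence of g without w.
    `pairs` is a list of (phrases, subgraphs) training examples; the
    layout and the exact formula are illustrative assumptions."""
    n = len(pairs)
    w_count, g_count, wg_count = Counter(), Counter(), Counter()
    for phrases, graphs in pairs:
        for w in phrases:
            w_count[w] += 1
        for g in graphs:
            g_count[g] += 1
        for w in phrases:
            for g in graphs:
                wg_count[(w, g)] += 1
    scores = {}
    for (w, g), c in wg_count.items():
        p_with = c / w_count[w]                                # p(g | w)
        p_without = (g_count[g] - c) / max(n - w_count[w], 1)  # p(g | not w)
        scores[(w, g)] = p_with - p_without
    return scores
```

Under this sketch, a subgraph that always co-occurs with a phrase scores highly, while one that also appears in examples without the phrase is penalized.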
Lexeme Hierarchy Graph (LHG)
• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts that are further connected to NL word groundings
40
[Figure: an example LHG. The root, Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL), decomposes into smaller subgraph lexemes such as Turn(), Travel(), Verify(at: EASEL); Travel(), Verify(at: EASEL); Verify(at: EASEL); Turn(RIGHT), Verify(side: HATRACK), Travel(); and Turn(), Verify(side: HATRACK).]
PCFG Construction
• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe, and these are finally connected to the NL instruction
    • Each node generates all k-permutations of its children nodes (we do not know which subset is correct)
  – NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM
    • The most probable MR components out of all possible combinations are estimated
41
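The k-permutation blow-up described above is easy to see in code. A sketch (the function name and rule representation are hypothetical) of how one LHG node yields one PCFG rule per ordered subset of its children:

```python
from itertools import permutations

def kperm_rules(parent, children):
    """One PCFG rule per k-permutation of the node's children
    (k = 1 .. len(children)): since we do not know which subset of
    subconcepts a given sentence actually describes, the grammar
    must allow every ordered subset as a right-hand side."""
    rules = []
    for k in range(1, len(children) + 1):
        for rhs in permutations(children, k):
            rules.append((parent, rhs))
    return rules
```

A node with just 3 children already yields 3 + 6 + 6 = 15 rules, one source of the complexity attributed to this model later in the talk.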
PCFG Construction
42
[Figure: example rules of the Hierarchy Generation model. Child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word.]
Parsing New NL Sentences
• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components, which propagate to the top level
  – Able to compose novel MRs never seen in the training data
43
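The CKY step above, finding the most probable tree under learned rule weights, can be sketched for a PCFG in Chomsky normal form. The rule tables and start symbol here are illustrative assumptions; the thesis' grammars would first need binarizing.

```python
import math

def cky_best_parse(words, lexical, binary, start="S"):
    """Viterbi CKY: most probable parse under a CNF PCFG.
    `lexical[(A, word)]` and `binary[(A, B, C)]` hold rule
    probabilities (hypothetical tables). Returns (log-prob, tree)
    with trees as nested tuples, or None if no parse exists."""
    n = len(words)
    best = {}  # (i, j, A) -> (log-prob, tree)
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w and best.get((i, i + 1, A), (-math.inf,))[0] < math.log(p):
                best[(i, i + 1, A)] = (math.log(p), (A, w))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    left, right = best.get((i, k, B)), best.get((k, j, C))
                    if left and right:
                        lp = math.log(p) + left[0] + right[0]
                        if best.get((i, j, A), (-math.inf,))[0] < lp:
                            best[(i, j, A)] = (lp, (A, left[1], right[1]))
    return best.get((0, n, start))
```

In the thesis' setting the nonterminals would be semantic lexemes, so reading the lexeme nodes off the returned tree is what drives the MR composition described above.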
[Figure (slides 44-45): the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner". Lexeme nodes such as Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(RIGHT) generate the NL words; marking the responsible MR components from the bottom up composes the final MR parse Turn(LEFT), Travel(), Verify(at: SOFA), Turn(RIGHT), which need not appear in the training data.]
46
Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train
47
PCFG Construction
• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the orders in which relevant lexemes appear are thereby already covered
48
PCFG Construction
49
[Figure: example rules of the Unigram Generation model. Each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words.]
Parsing New NL Sentences
• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal
50
[Figure (slides 51-54): the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner" under the Unigram Generation model. The context MR at the top nonterminal, Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT), generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(); marking those lexeme components in the context MR composes the final MR parse.]
54
Data
• 3 maps, 6 instructors, 1-15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version
55
Example
  Paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."
    Action trace: Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward
  Single sentence: "Take the wood path towards the easel." (Turn, Forward); "At the easel, go left and then take a right on the the blue path at the corner." (Turn left, Forward, Turn right)
Data Statistics
56

                          Paragraph       Single-Sentence
Instructions              706             3236
Avg. # sentences          5.0 (±2.8)      1.0 (±0)
Avg. # actions            10.4 (±5.7)     2.1 (±2.4)
Avg. # words / sentence
  English                 37.6 (±21.1)    7.8 (±5.1)
  Chinese-Word            31.6 (±18.1)    6.9 (±4.9)
  Chinese-Character       48.9 (±28.3)    10.6 (±7.3)
Vocabulary
  English                 660             629
  Chinese-Word            661             508
  Chinese-Character       448             328
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    • Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – Semantic parser: KRISP (Kate and Mooney 2006), trained on the resulting supervised data
57
Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy
58
Parse Accuracy (English)
59

Model                             Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.1        68.79    76.44
Parse Accuracy (Chinese-Word)
60

Model                             Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41
Parse Accuracy (Chinese-Character)
61

Model                             Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55
End-to-End Execution Evaluations
• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction in the single-sentence setting
  – Paragraph execution is affected by even one failed single-sentence execution
62
End-to-End Execution Evaluations (English)
63

Model                             Single-Sentence   Paragraph
Chen & Mooney (2011)              54.4              16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67

                      Hierarchy Generation       Unigram Generation
Data                  |Grammar|   Time (hrs)     |Grammar|   Time (hrs)
English               20,451      17.26          16,357      8.78
Chinese (Word)        21,636      15.99          15,459      8.05
Chinese (Character)   19,792      18.64          13,514      12.58
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – The trained model outputs the best result with maximum probability
70
[Diagram: a testing example goes into the trained generative model, which outputs the 1-best candidate with maximum probability.]
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
71
[Diagram: a testing example goes through GEN, the trained baseline generative model, producing n-best candidates (Candidate 1 ... Candidate n); the trained secondary discriminative model then outputs the best prediction.]
How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, only weak supervision from the surrounding perceptual context (landmarks plan) is provided
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Also used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good or bad
  – Multiple candidate parses per parameter update
    • The response signal is weak and distributed over all candidates
72
Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
73
[Diagram: a training example goes through GEN (here, our trained baseline generative models) to produce n-best candidates a_1 ... a_n with perceptron scores (e.g., -0.16, 1.21, -1.09, 1.46, 0.59). The best prediction is compared against the gold-standard reference a_g, and the weights are updated by the feature-vector difference a_g - a_4. In our setting, a gold-standard reference is not available.]
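The Collins-style averaged perceptron reranker sketched in the diagram can be written in a few lines. The feature map and example layout below are illustrative assumptions, and in the grounded setting the gold index would be replaced by the pseudo-gold parse introduced on the next slide.

```python
from collections import defaultdict

def perceptron_rerank_train(examples, feats, epochs=5):
    """Averaged perceptron reranker (Collins 2000), sketched.
    `examples` is a list of (candidates, gold_index) n-best lists;
    `feats(c)` maps a candidate to a {feature: value} dict. Both
    interfaces are illustrative, not the thesis' exact code."""
    w, w_sum, t = defaultdict(float), defaultdict(float), 0

    def score(c):
        return sum(w[f] * v for f, v in feats(c).items())

    for _ in range(epochs):
        for candidates, gold in examples:
            pred = max(range(len(candidates)),
                       key=lambda i: score(candidates[i]))
            if pred != gold:
                # standard perceptron step: add gold features, subtract predicted
                for f, v in feats(candidates[gold]).items():
                    w[f] += v
                for f, v in feats(candidates[pred]).items():
                    w[f] -= v
            t += 1
            for f, v in w.items():
                w_sum[f] += v  # accumulate for averaging
    return {f: v / t for f, v in w_sum.items()}
```

Averaging the weight vector over all updates is what makes the perceptron stable enough for reranking.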
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate: whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training
74
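The pseudo-gold selection above can be sketched as follows, with `execute` standing in for a single (nondeterministic) MARCO run; the function name and interface are hypothetical.

```python
def pseudo_gold(candidate_mrs, execute, trials=10):
    """Pick the candidate MR plan with the highest execution success
    rate. `execute(mr)` is a hypothetical stand-in for one MARCO run,
    returning True when the intended destination is reached; because
    MARCO is nondeterministic, each rate averages `trials` runs."""
    rates = [sum(execute(mr) for _ in range(trials)) / trials
             for mr in candidate_mrs]
    best = max(range(len(candidate_mrs)), key=lambda i: rates[i])
    return best, rates
```

The returned rates are reused below, both to pick the pseudo-gold reference and to weight updates from other candidates.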
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
75
[Diagram: the n-best candidates' derived MRs (MR_1 ... MR_n) are run through the MARCO execution module, yielding execution success rates (e.g., 0.6, 0.4, 0.0, 0.9, 0.2) alongside perceptron scores (e.g., 1.79, 0.21, -1.09, 1.46, 0.59); the candidate with the highest success rate becomes the pseudo-gold reference, and the perceptron updates toward it by the feature-vector difference.]
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean correct plans, given the indirect supervision of human follower actions
    • MR plans may be underspecified or have ignorable details attached
    • Sometimes inaccurate, but containing the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
76
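A sketch of this multiple-parse update, assuming dict-valued feature vectors and precomputed success rates (all names are illustrative): every candidate beating the current prediction contributes an update scaled by the rate difference.

```python
def multi_parse_update(w, feats, candidates, rates, pred):
    """Perceptron-style update from every candidate whose execution
    success rate exceeds the predicted candidate's, weighted by the
    difference in rates. A sketch of the multi-parse update; `w` is
    a {feature: weight} dict and `feats(c)` a {feature: value} dict."""
    for i, cand in enumerate(candidates):
        margin = rates[i] - rates[pred]
        if margin > 0:
            for f, v in feats(cand).items():
                w[f] = w.get(f, 0.0) + margin * v
            for f, v in feats(candidates[pred]).items():
                w[f] = w.get(f, 0.0) - margin * v
    return w
```

Scaling by the margin means near-ties nudge the weights only slightly, while clearly better candidates pull harder, which matches the intuition that the response signal is weak and distributed.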
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
77
[Diagram: the n-best candidates' derived MRs (MR_1 ... MR_n) have execution success rates (e.g., 0.6, 0.4, 0.0, 0.9, 0.2) and perceptron scores (e.g., 1.24, 1.83, -1.09, 1.46, 0.59); Update (1) applies the feature-vector difference from the first candidate whose success rate exceeds the predicted parse's.]
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)
79
NL: "Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()
f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1
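Indicator features of this kind can be read off a parse tree represented as nested tuples; a minimal sketch (the tree encoding and function name are illustrative assumptions, not the thesis' exact feature set):

```python
def tree_features(tree):
    """Collect binary indicator features, one per nonterminal/terminal
    composition in the tree, e.g. ("L3", "L5", "L6") for the rule
    L3 -> L5 L6. Trees are nested tuples like ("L5", "find"); this
    encoding is an illustrative assumption."""
    feats = set()

    def walk(node):
        if isinstance(node, tuple):
            head, children = node[0], node[1:]
            # one indicator per (head, child labels...) composition
            feats.add((head,) + tuple(
                c[0] if isinstance(c, tuple) else c for c in children))
            for c in children:
                walk(c)

    walk(tree)
    return feats
```

The resulting feature set feeds directly into the perceptron reranker's dot-product score.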
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
35
Learning system for parsing navigation instructions
Observation
Instruction
World State
Execution Module (MARCO)
Instruction
World State
TrainingTesting
Action TraceNavigation Plan Constructor
(Supervised) Semantic Parser Learner
Plan Refinement
Semantic Parser
Action Trace
System Diagram (Chen and Mooney 2011)
Landmarks Plan
Supervised Refined Plan
LearningInference
Possibleinformationloss
36
Learning system for parsing navigation instructions
Observation
Instruction
World State
Execution Module (MARCO)
Instruction
World State
TrainingTesting
Action TraceNavigation Plan Constructor
Probabilistic Semantic Parser Learner (from
ambiguous supervison)
Semantic Parser
Action Trace
System Diagram of Proposed Solution
Landmarks Plan
PCFG Induction Model for Grounded Language Learning (Borschinger et al 2011)
37
bull PCFG rules to describe generative process from MR components to corresponding NL words
Hierarchy Generation PCFG Model (Kim and Mooney 2012)
bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings
1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training
databull Proposed model
ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept
(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR
parse with semantic lexeme MRs38
bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and
context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w
bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()
Semantic Lexicon (Chen and Mooney 2011)
39
cooccurrenceof g and w
general occurrenceof g without w
Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes
by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic
conceptsndash Lexeme hierarchy = semantic
concept hierarchyndash Shows how complicated
semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings
40
Turn
RIGHT sideHATRACK
frontSOFA
steps3
atEASEL
Verify Travel Verify
Turn
atEASEL
Travel Verify
atEASEL
Verify
Turn
RIGHT sideHATRACK
Verify Travel
Turn
sideHATRACK
Verify
PCFG Construction
bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to
describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes
ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram
Markov process (Borschinger et al 2011)
ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible
combinations are estimated41
PCFG Construction
42
m
Child concepts are generated from parent concepts selec-tively
All semantic concepts gen-erate relevant NL words
Each semantic concept generates at least one NL word
Parsing New NL Sentences
bull PCFG rule weights are optimized by Inside-Outside algorithm with training data
bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating
NL wordsndash From the bottom of the tree mark only responsible MR
components that propagate to the top levelndash Able to compose novel MRs never seen in the training data
43
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify
Turn
LEFT frontSOFA
Verify
Turn
LEFT
Turn
RIGHT
steps2
atSOFA
Travel Verify Turn
RIGHT
atSOFA
Travel Verify Turn
Turn left and find the sofa then turn around
the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify
Turn
LEFT frontSOFA
Verify
Turn
LEFT
Turn
RIGHT
steps2
atSOFA
Travel Verify Turn
RIGHT
atSOFA
Travel Verify Turn
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
46
Unigram Generation PCFG Model
bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph
and k-permutationsndash Tend to over-fit to the training data
bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train
47
PCFG Construction
bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one
by onendash Permutations of the appearing orders of relevant
lexemes are already considered
48
PCFG Construction
49
Each semantic concept is generated by unigram Markov process
All semantic concepts gen-erate relevant NL words
Parsing New NL Sentences
bull Follows the similar scheme as in Hierarchy Generation PCFG model
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for
generating NL wordsndash Mark relevant lexeme MR components in the
context MR appearing in the top nonterminal
50
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT
atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn left and find the sofa then turn around the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Context MR
Context MR
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
54
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
                                  Precision  Recall  F1
Chen & Mooney (2011)              90.16      55.41   68.59
Chen (2012)                       88.36      57.03   69.31
Hierarchy Generation PCFG Model   87.58      65.41   74.81
Unigram Generation PCFG Model     86.1       68.79   76.44
59
Parse Accuracy (Chinese-Word)
                                  Precision  Recall  F1
Chen (2012)                       88.87      58.76   70.74
Hierarchy Generation PCFG Model   80.56      71.14   75.53
Unigram Generation PCFG Model     79.45      73.66   76.41
60
Parse Accuracy (Chinese-Character)
                                  Precision  Recall  F1
Chen (2012)                       92.48      56.47   70.01
Hierarchy Generation PCFG Model   79.77      67.38   73.05
Unigram Generation PCFG Model     79.73      75.52   77.55
61
End-to-End Execution Evaluations
• Test how well the formal plan from the output of the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction in single-sentence evaluation
  – Paragraph execution is affected by even one failed single-sentence execution
62
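The strict success metric described above is easy to state in code. The sketch below is illustrative only; the state representation and function names are hypothetical, not from the thesis:

```python
def execution_successful(final_state, goal_state, single_sentence=True):
    """Strict success: the final position must match the goal exactly.
    For single-sentence evaluation, the facing direction must match too."""
    if final_state["position"] != goal_state["position"]:
        return False
    if single_sentence and final_state["heading"] != goal_state["heading"]:
        return False
    return True

def paragraph_successful(step_states, goal_state):
    """Paragraph-level success depends on the final state reached, so a
    single failed sentence execution sinks the whole paragraph."""
    return execution_successful(step_states[-1], goal_state,
                                single_sentence=False)
```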
End-to-End Execution Evaluations (English)
                                  Single-Sentence  Paragraph
Chen & Mooney (2011)              54.4             16.18
Chen (2012)                       57.28            19.18
Hierarchy Generation PCFG Model   57.22            20.17
Unigram Generation PCFG Model     67.14            28.12
63
End-to-End Execution Evaluations (Chinese-Word)
                                  Single-Sentence  Paragraph
Chen (2012)                       58.7             20.13
Hierarchy Generation PCFG Model   61.03            19.08
Unigram Generation PCFG Model     63.4             23.12
64
End-to-End Execution Evaluations (Chinese-Character)
                                  Single-Sentence  Paragraph
Chen (2012)                       57.27            16.73
Hierarchy Generation PCFG Model   55.61            12.74
Unigram Generation PCFG Model     62.85            23.33
65
Discussion
• Better recall in parse accuracy
  – Our probabilistic model uses useful but low-score lexemes as well → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak in the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity avoids over-fitting; better generalization
• Better than Borschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training
66
Comparison of Grammar Size and EM Training Time
67
                     Hierarchy Generation       Unigram Generation
                     PCFG Model                 PCFG Model
Data                 |Grammar|  Time (hrs)      |Grammar|  Time (hrs)
English              20451      17.26           16357      8.78
Chinese (Word)       21636      15.99           15459      8.05
Chinese (Character)  19792      18.64           13514      12.58
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – The trained model outputs the best result with maximum probability
[Diagram: Testing Example → Trained Generative Model → Candidate 1, the 1-best candidate with maximum probability]
70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]
71
How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, training provides weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    (also used in evaluating the final end-task plan execution)
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update
    (the response signal is weak and distributed over all candidates)
72
Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) with feature vectors a₁, a₂, a₃, a₄, … aₙ and perceptron scores −0.16, 1.21, −1.09, 1.46, 0.59 → Perceptron → Best prediction; Gold-Standard Reference with feature vector a_g; Update: feature vector a_g − a₄]
73
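The update pictured on this slide can be sketched as a single perceptron step. This is a minimal illustration with feature vectors as plain lists and hypothetical names, not the thesis implementation:

```python
def dot(w, a):
    return sum(wi * ai for wi, ai in zip(w, a))

def perceptron_update(w, candidates, ref_idx):
    """One perceptron step: score each candidate's feature vector under
    the current weights; if the best-scoring candidate is not the
    reference, move the weights by the difference a_ref - a_pred."""
    pred_idx = max(range(len(candidates)), key=lambda i: dot(w, candidates[i]))
    if pred_idx != ref_idx:
        w = [wi + r - p for wi, r, p in
             zip(w, candidates[ref_idx], candidates[pred_idx])]
    return w
```

In the averaged perceptron, the weight vectors after every step are additionally accumulated, and their mean is used at test time to reduce overfitting.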
(For our generative models, such a gold-standard reference is not available.)
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
    (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination
    (MARCO is nondeterministic, so average over 10 trials)
  – Prefer the candidate with the best execution success rate during training
74
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: n-best candidates (Candidate 1 … Candidate n) → derived MRs MR₁ … MRₙ → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.79, 0.21, −1.09, 1.46, 0.59; Perceptron → Best prediction; Pseudo-gold Reference; Update: feature vector difference]
75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    (MR plans are underspecified or carry ignorable details; sometimes inaccurate, but they contain the correct MR components to reach the desired goal)
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature vector difference, weighted by the difference between execution success rates
76
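The weighted multi-candidate update described in the bullets above can be sketched as follows. Names are hypothetical, and the thesis' actual implementation may differ in details such as normalization:

```python
def dot(w, a):
    return sum(wi * ai for wi, ai in zip(w, a))

def multi_parse_update(w, candidates, success_rates):
    """Every candidate whose MARCO execution success rate exceeds that of
    the currently best-scoring candidate contributes an update, weighted
    by how much higher its success rate is."""
    pred = max(range(len(candidates)), key=lambda i: dot(w, candidates[i]))
    for a, r in zip(candidates, success_rates):
        if r > success_rates[pred]:
            gain = r - success_rates[pred]   # weight by rate difference
            w = [wi + gain * (ai - pi)
                 for wi, ai, pi in zip(w, a, candidates[pred])]
    return w
```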
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram: n-best candidates (Candidate 1 … Candidate n) → derived MRs MR₁ … MRₙ → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59; Perceptron → Best prediction; Update (1): feature vector difference]
77
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram: n-best candidates (Candidate 1 … Candidate n) → derived MRs MR₁ … MRₙ → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59; Perceptron → Best prediction; Update (2): feature vector difference]
78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)
"Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()
f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5 → "find") = 1
79
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, and the corresponding parses, out of 1,000,000-best parses
    (many parse trees differ insignificantly and lead to the same derived MR plan, so sufficiently large 1,000,000-best parse lists are generated from the baseline model)
80
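Extracting 50 distinct plans from a huge n-best list amounts to a de-duplicating scan. A sketch, where the `derive_mr` callback standing in for MR composition is hypothetical:

```python
def distinct_nbest(parses, derive_mr, k=50):
    """Walk the n-best parses in rank order and keep the first parse for
    each distinct derived MR plan, stopping once k plans are collected.
    Because many parse trees derive the same MR plan, the scan may need
    to go very deep into the n-best list."""
    seen, kept = set(), []
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:
            seen.add(mr)
            kept.append(parse)
            if len(kept) == k:
                break
    return kept
```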
Response-based Update vs. Baseline (English)
81
Parse F1          Hierarchy  Unigram
  Baseline        74.81      76.44
  Response-based  73.32      77.24
Single-sentence
  Baseline        57.22      67.14
  Response-based  59.65      68.27
Paragraph
  Baseline        20.17      28.12
  Response-based  22.62      29.2
Response-based Update vs. Baseline (Chinese-Word)
82
Parse F1          Hierarchy  Unigram
  Baseline        75.53      76.41
  Response-based  77.26      77.74
Single-sentence
  Baseline        61.03      63.4
  Response-based  64.12      65.64
Paragraph
  Baseline        19.08      23.12
  Response-based  21.29      23.74
Response-based Update vs. Baseline (Chinese-Character)
83
Parse F1          Hierarchy  Unigram
  Baseline        73.05      77.55
  Response-based  76.26      79.76
Single-sentence
  Baseline        55.61      62.85
  Response-based  64.08      65.5
Paragraph
  Baseline        12.74      23.33
  Response-based  22.25      25.35
Response-based Update vs. Baseline
• The response-based approach performs better than the baseline in the final end-task plan execution
  – It optimizes the model for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)
85
Parse F1     Hierarchy  Unigram
  Single     73.32      77.24
  Multi      73.43      77.81
Single-sentence
  Single     59.65      68.27
  Multi      62.81      68.93
Paragraph
  Single     22.62      29.2
  Multi      26.57      29.1
Response-based Update with Multiple vs. Single Parses (Chinese-Word)
86
Parse F1     Hierarchy  Unigram
  Single     77.26      77.74
  Multi      78.8       78.11
Single-sentence
  Single     64.12      65.64
  Multi      64.15      66.27
Paragraph
  Single     21.29      23.74
  Multi      21.55      25.95
Response-based Update with Multiple vs. Single Parses (Chinese-Character)
87
Parse F1     Hierarchy  Unigram
  Single     76.26      79.76
  Multi      79.44      79.94
Single-sentence
  Single     64.08      65.5
  Multi      64.08      66.84
Paragraph
  Single     22.25      25.35
  Multi      22.58      27.16
Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves performance in general
  – A single best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, but still capture the gist of the preferred actions
  – A variety of preferable parses helps improve both the amount and the quality of the weak feedback
88
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
36
System Diagram of Proposed Solution
[Diagram: learning system for parsing navigation instructions. Training: Instruction + World State + Action Trace (observation) → Navigation Plan Constructor → Landmarks Plan → Probabilistic Semantic Parser Learner (from ambiguous supervision) → Semantic Parser. Testing: Instruction + World State → Semantic Parser → plan → Execution Module (MARCO) → Action Trace]
PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)
37
• PCFG rules describe the generative process from MR components to the corresponding NL words
Hierarchy Generation PCFG Model (Kim and Mooney 2012)
• Limitations of Borschinger et al. (2011)
  – Only works in low-ambiguity settings: 1 NL sentence paired with a handful of MRs (on the order of 10s)
  – Only outputs MRs included in the PCFG constructed from the training data
• Proposed model
  – Uses semantic lexemes as units of semantic concepts
  – Disambiguates NL–MR correspondences at the semantic concept (lexeme) level
  – Disambiguates a much higher degree of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG by composing the MR parse from semantic lexeme MRs
38
Semantic Lexicon (Chen and Mooney 2011)
39
• A pair of NL phrase w and MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable graph g is given that phrase w is seen
  – Scored from the cooccurrence of g and w versus the general occurrence of g without w
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()
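One plausible reading of the score behind the fraction on this slide is the sketch below; the exact formulation in Chen and Mooney (2011) may differ, and the counts and names here are illustrative:

```python
def lexeme_score(cooccur, occur_without, w, g):
    """Score a candidate lexeme (w, g) by how often subgraph g cooccurs
    with phrase w, relative to how often g occurs without w.  Both
    arguments map (phrase, subgraph) pairs to observed counts."""
    num = cooccur.get((w, g), 0)
    den = num + occur_without.get((w, g), 0)
    return num / den if den else 0.0
```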
Lexeme Hierarchy Graph (LHG)
• Hierarchy of semantic lexemes by subgraph relationship, constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, which are further connected to NL word groundings
40
[Diagram: LHG for the MR Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL), with subgraph lexemes such as Turn(), Verify(at: EASEL); Travel(), Verify(at: EASEL); Turn(RIGHT), Verify(side: HATRACK); and Turn(), Verify(side: HATRACK)]
PCFG Construction
• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe, which are finally connected to the NL instruction
    (each node generates all k-permutations of its children, since we do not know which subset is correct)
  – NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM: the most probable MR components out of all possible combinations are estimated
41
PCFG Construction
42
[Diagram: child concepts are generated selectively from parent concepts; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word]
Parsing New NL Sentences
• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components that propagate to the top level
  – Able to compose novel MRs never seen in the training data
43
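The Viterbi variant of CKY used to find the most probable parse can be sketched for a toy PCFG in Chomsky normal form. The real learned grammars are far larger, and the rule encodings here are hypothetical:

```python
from collections import defaultdict

def viterbi_cky(words, lexical, binary, start="S"):
    """Most probable parse under a CNF PCFG.
    `lexical` maps (A, word) -> P(A -> word); `binary` maps
    (A, B, C) -> P(A -> B C).  Returns the best probability of the start
    symbol spanning the whole sentence, plus backpointers."""
    n = len(words)
    best = defaultdict(float)   # (i, j, A) -> best inside probability
    back = {}
    for i, word in enumerate(words):
        for (A, w), p in lexical.items():
            if w == word and p > best[i, i + 1, A]:
                best[i, i + 1, A] = p
                back[i, i + 1, A] = word
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):        # split point
                for (A, B, C), p in binary.items():
                    q = p * best[i, k, B] * best[k, j, C]
                    if q > best[i, j, A]:
                        best[i, j, A] = q
                        back[i, j, A] = (k, B, C)
    return best[0, n, start], back
```

Following the backpointers from `(0, n, start)` recovers the tree itself, from which the responsible lexeme MRs are then read off.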
[Diagram (slides 44–46): most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner". The context MR is Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT); lexeme MRs such as Turn(LEFT), Verify(front: SOFA), Travel(steps: 2), Verify(at: SOFA), and Turn(RIGHT) appear in the tree, and the responsible components compose the final MR: Turn(LEFT), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)]
46
Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train
47
PCFG Construction
• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are thereby already accounted for
48
PCFG Construction
49
[Diagram: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]
Parsing New NL Sentences
• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal
50
[Diagram (slides 51–54): most probable parse tree for the test instruction "Turn left and find the sofa then turn around the corner". The context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(); the marked components compose the final MR parse]
54
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
• Using multiple parses generally improves performance
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with lower execution success rates may produce underspecified plans, or plans with ignorable extra details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91
Conclusion
• Conventional language learning is expensive and does not scale, because it requires annotated training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework: a full probabilistic model for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible, and effective, with only weak feedback from the perceptual environment

92
Thank You
PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)

37

• PCFG rules describe the generative process from MR components to the corresponding NL words
Hierarchy Generation PCFG Model (Kim and Mooney 2012)
• Limitations of Borschinger et al. 2011
  – Only works in low-ambiguity settings: 1 NL sentence paired with a handful of MRs (on the order of 10s)
  – Can only output MRs included in the PCFG constructed from the training data
• Proposed model
  – Uses semantic lexemes as the units of semantic concepts
  – Disambiguates NL-MR correspondences at the semantic-concept (lexeme) level
  – Disambiguates a much higher level of ambiguous supervision
  – Outputs novel MRs not appearing in the PCFG, by composing the MR parse from semantic lexeme MRs

38
Semantic Lexicon (Chen and Mooney 2011)
• A pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks, plans)
  – How probable is graph g, given that phrase w is seen?
  – The score balances the co-occurrence of g and w against the general occurrence of g without w
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()

39
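The scoring intuition above can be sketched as a simple counting procedure. This is an illustrative simplification, not the exact GILL scoring function of Chen and Mooney (2011); the name `learn_lexicon`, its inputs, and the threshold are assumptions, and subgraphs are abbreviated as strings.

```python
from collections import Counter
from itertools import product

def learn_lexicon(parallel_data, min_score=0.4):
    """Score (phrase w, subgraph g) pairs: reward co-occurrence of g
    with w, penalize the general occurrence of g without w.
    parallel_data: list of (phrases, subgraphs), one per example."""
    cooc, w_count, g_count = Counter(), Counter(), Counter()
    for phrases, subgraphs in parallel_data:
        for w in phrases:
            w_count[w] += 1
        for g in subgraphs:
            g_count[g] += 1
        for w, g in product(phrases, subgraphs):
            cooc[(w, g)] += 1
    lexicon = {}
    for (w, g), c in cooc.items():
        p_with = c / w_count[w]                               # p(g | w)
        rest = len(parallel_data) - w_count[w]
        p_without = (g_count[g] - c) / rest if rest else 0.0  # p(g | not w)
        if p_with - p_without >= min_score:
            lexicon[(w, g)] = p_with - p_without
    return lexicon
```

For instance, if "to the stool" always co-occurs with the subgraph Travel(), Verify(at: BARSTOOL) and that subgraph rarely occurs without the phrase, the pair receives a high score.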
Lexeme Hierarchy Graph (LHG)
• A hierarchy of semantic lexemes, organized by the subgraph relationship and constructed for each training example
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, which are further connected to NL word groundings

40
Example LHG (figure): the top-level context MR
  Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL)
decomposes into successively smaller lexeme MRs, e.g.
  Turn(), Travel(), Verify(at: EASEL)
  Turn(RIGHT), Verify(side: HATRACK), Travel()
  Travel(), Verify(at: EASEL)
  Turn(), Verify(side: HATRACK)
  Verify(at: EASEL)
PCFG Construction
• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe, and these are finally connected to the NL instruction
  – Each node generates all k-permutations of its children nodes, since we do not know which subset is correct
  – NL words are generated from lexeme nodes by a unigram Markov process (Borschinger et al. 2011)
  – PCFG rule weights are optimized by EM; the most probable MR components out of all possible combinations are estimated

41
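The k-permutation blow-up can be made concrete with a small helper (the name `kperm_rules` is hypothetical, and rule probabilities are omitted): each LHG node yields one rule per ordered subset of its children, which is exactly the complexity that later motivates the simpler unigram model.

```python
from itertools import permutations

def kperm_rules(parent, children):
    """Enumerate PCFG rules expanding `parent` into every k-permutation
    of its child concepts (k = 1..len(children)), since we do not know
    which subset of subconcepts a sentence actually describes."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, perm))
    return rules
```

A node with two children already yields 4 rules, and one with five children yields 325, so the rule set grows very quickly with MR complexity.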
PCFG Construction

42

(Figure callouts)
• Child concepts are generated selectively from parent concepts
• All semantic concepts generate relevant NL words
• Each semantic concept generates at least one NL word
Parsing New NL Sentences
• PCFG rule weights are optimized with the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components and propagate them to the top level
  – Able to compose novel MRs never seen in the training data

43
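The CKY step can be sketched as a standard Viterbi search over a PCFG in Chomsky normal form. This is a minimal illustration on a toy grammar, not the thesis implementation; the rule-table layout and the name `cky_viterbi` are assumptions.

```python
import math
from collections import defaultdict

def cky_viterbi(words, lexical, binary, start="S"):
    """Most probable parse of `words` under a CNF PCFG.
    lexical: {(A, word): prob}; binary: {(A, B, C): prob}."""
    n = len(words)
    best = defaultdict(lambda: -math.inf)   # (i, j, A) -> best log prob
    back = {}                               # backpointers
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                best[(i, i + 1, A)] = math.log(p)
                back[(i, i + 1, A)] = w
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    score = math.log(p) + best[(i, k, B)] + best[(k, j, C)]
                    if score > best[(i, j, A)]:
                        best[(i, j, A)] = score
                        back[(i, j, A)] = (k, B, C)
    def build(i, j, A):
        bp = back[(i, j, A)]
        if isinstance(bp, str):             # terminal backpointer
            return (A, bp)
        k, B, C = bp
        return (A, build(i, k, B), build(k, j, C))
    return build(0, n, start)
```

The chart loop is the usual O(n^3 |G|) dynamic program; the learned rule weights would come from Inside-Outside training rather than being fixed as here.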
Example (figure): most probable parse tree for the test NL instruction
  NL: "Turn left and find the sofa then turn around the corner"
The tree's context MR is
  Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
with intermediate lexeme MRs such as
  Turn(LEFT), Verify(front: SOFA)
  Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  Turn(LEFT); Travel(), Verify(at: SOFA); Turn()
The responsible lexeme MRs compose the final MR parse
  Turn(LEFT), Travel(), Verify(at: SOFA), Turn()

46
Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates the relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train

47
PCFG Construction
• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the order in which relevant lexemes appear are thereby already accounted for

48
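The unigram generation scheme can be sketched as a rule generator (the name `unigram_rules` and the END symbol are illustrative): a context MR nonterminal emits one lexeme at a time and then recurses, so any ordering is derivable without permutation rules.

```python
def unigram_rules(context, lexemes, stop="END"):
    """PCFG rules letting `context` emit its relevant lexemes one at a
    time (unigram Markov process) instead of enumerating every
    k-permutation: Context -> Lexeme_i Context, and Context -> END."""
    rules = [(context, (lex, context)) for lex in lexemes]
    rules.append((context, (stop,)))
    return rules
```

The rule set grows linearly in the number of relevant lexemes (len(lexemes) + 1 rules), versus the factorial growth of the k-permutation construction, which is why this grammar is smaller and faster to train.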
PCFG Construction

49

(Figure callouts)
• Each semantic concept is generated by a unigram Markov process
• All semantic concepts generate relevant NL words
Parsing New NL Sentences
• Follows a similar scheme to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing at the top nonterminal

50
Example (figure): most probable parse tree for the test NL instruction
  NL: "Turn left and find the sofa then turn around the corner"
The top nonterminal carries the context MR
  Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
with relevant lexemes
  Turn(LEFT); Travel(), Verify(at: SOFA); Turn()
which are marked in the context MR to compose the final MR parse
  Turn(LEFT), Travel(), Verify(at: SOFA), Turn()

54
Data
• 3 maps, 6 instructors, 1-15 followers per direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55

Example (figure): a paragraph instruction and its single-sentence segmentation
  Paragraph: "Take the wood path towards the easel. At the easel go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."
  Single sentences: "Take the wood path towards the easel." / "At the easel go left and then take a right on the the blue path at the corner." / …
  Observed actions (paragraph): Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward
  Observed actions (per sentence): Forward, Turn left, Forward, Turn right / Turn, …
Data Statistics

56

                            Paragraph        Single-Sentence
Instructions                   706               3236
Avg. # sentences            5.0 (±2.8)         1.0 (±0)
Avg. # actions             10.4 (±5.7)         2.1 (±2.4)
Avg. # words / sentence
  English                  37.6 (±21.1)        7.8 (±5.1)
  Chinese-Word             31.6 (±18.1)        6.9 (±4.9)
  Chinese-Character        48.9 (±28.3)       10.6 (±7.3)
Vocabulary
  English                      660                629
  Chinese-Word                 661                508
  Chinese-Character            448                328
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney 2011 and Chen 2012
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-scoring lexemes, using two different lexicon learning algorithms:
    Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
    Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
  – A semantic parser, KRISP (Kate and Mooney 2006), is trained on the resulting supervised data

57
Parse Accuracy
• Evaluate how well the learned semantic parsers parse novel sentences in the test data
• Metric: partial parse accuracy

58
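The partial-credit metric can be illustrated as precision/recall/F1 over matched MR components. This is a sketch: the exact matching used in the thesis evaluation may differ, and representing MRs as flat component lists is a simplification.

```python
from collections import Counter

def parse_f1(predicted, gold):
    """Partial parse accuracy: precision, recall, and F1 over matched
    MR components (multiset intersection of predicted vs gold)."""
    pred, ref = Counter(predicted), Counter(gold)
    matched = sum((pred & ref).values())   # components shared by both
    p = matched / sum(pred.values()) if pred else 0.0
    r = matched / sum(ref.values()) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, predicting two components of which one matches a three-component gold MR gives precision 0.5, recall 1/3, and F1 0.4.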
Parse Accuracy (English)

                                  Precision   Recall     F1
Chen & Mooney (2011)                90.16      55.41    68.59
Chen (2012)                         88.36      57.03    69.31
Hierarchy Generation PCFG Model     87.58      65.41    74.81
Unigram Generation PCFG Model       86.10      68.79    76.44

59
Parse Accuracy (Chinese-Word)

                                  Precision   Recall     F1
Chen (2012)                         88.87      58.76    70.74
Hierarchy Generation PCFG Model     80.56      71.14    75.53
Unigram Generation PCFG Model       79.45      73.66    76.41

60
Parse Accuracy (Chinese-Character)

                                  Precision   Recall     F1
Chen (2012)                         92.48      56.47    70.01
Hierarchy Generation PCFG Model     79.77      67.38    73.05
Unigram Generation PCFG Model       79.73      75.52    77.55

61
End-to-End Execution Evaluations
• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered in the single-sentence setting
  – Paragraph execution is affected by even one failed single-sentence execution

62
End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)                   54.40          16.18
Chen (2012)                            57.28          19.18
Hierarchy Generation PCFG Model        57.22          20.17
Unigram Generation PCFG Model          67.14          28.12

63
End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                            58.70          20.13
Hierarchy Generation PCFG Model        61.03          19.08
Unigram Generation PCFG Model          63.40          23.12

64
End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                            57.27          16.73
Hierarchy Generation PCFG Model        55.61          12.74
Unigram Generation PCFG Model          62.85          23.33

65
Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-scoring lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: the LHG and the k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity, avoids over-fitting, better generalization
• Better than Borschinger et al. 2011
  – Overcomes intractability in a complex MRL
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66
Comparison of Grammar Size and EM Training Time

67

                         Hierarchy Generation       Unigram Generation
                             PCFG Model                 PCFG Model
Data                    |Grammar|  Time (hrs)     |Grammar|  Time (hrs)
English                   20451      17.26          16357       8.78
Chinese (Word)            21636      15.99          15459       8.05
Chinese (Character)       19792      18.64          13514      12.58
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68
Discriminative Reranking
• An effective approach to improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69
Discriminative Reranking
• Generative model
  – The trained model outputs the best result with maximum probability

(Figure: Testing Example → Trained Generative Model → Candidate 1, the 1-best candidate with maximum probability)

70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

(Figure: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output)

71
How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
  – There is no single gold-standard reference for each training example
  – Instead, only weak supervision from the surrounding perceptual context (landmarks, plan) is provided
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (also used in evaluating the final end-task plan execution)
  – A weak indication of whether a candidate is good or bad
  – Multiple candidate parses are used for the parameter update, since the response signal is weak and distributed over all candidates

72
Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate

(Figure: Training Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) with feature vectors a1 … an and perceptron scores -0.16, 1.21, -1.09, 1.46, 0.59; the best prediction is compared against the gold-standard reference with feature vector ag, and the weights are updated by the feature-vector difference ag − a4)

• But for our generative models, a gold-standard reference is Not Available

73
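The averaged perceptron reranker of Collins (2000) can be sketched as below. This is a minimal illustration: candidates are feature dictionaries, the gold index stands in for whatever reference is available, and the function name is an assumption.

```python
def train_averaged_perceptron(examples, epochs=5):
    """examples: list of (candidates, gold_index), where each candidate
    is a feature dict {feature: count}. Returns averaged weights."""
    w, total, steps = {}, {}, 0
    def score(feats):
        return sum(w.get(f, 0.0) * v for f, v in feats.items())
    def add(feats, scale):
        for f, v in feats.items():
            w[f] = w.get(f, 0.0) + scale * v
    for _ in range(epochs):
        for candidates, gold in examples:
            steps += 1
            # pick the highest-scoring candidate under current weights
            pred = max(range(len(candidates)), key=lambda i: score(candidates[i]))
            if pred != gold:
                add(candidates[gold], 1.0)    # promote the reference
                add(candidates[pred], -1.0)   # demote the wrong prediction
            for f, v in w.items():            # accumulate for averaging
                total[f] = total.get(f, 0.0) + v
    return {f: v / steps for f, v in total.items()}
```

Averaging the weight vector over all update steps is what distinguishes this from the plain perceptron, and it typically generalizes better on reranking tasks.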
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the MR plans composed from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination; MARCO is nondeterministic, so average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74
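The success-rate bookkeeping can be sketched as follows; `execute` is a purely hypothetical stand-in for a MARCO-style nondeterministic execution module that returns True when the plan reaches the intended destination.

```python
def execution_success_rate(plan, execute, world, trials=10):
    """Average success of a nondeterministic execution module over
    several trials (MARCO-style: 10 trials in the experiments)."""
    return sum(execute(plan, world) for _ in range(trials)) / trials

def pick_pseudo_gold(candidate_plans, execute, world, trials=10):
    """Pseudo-gold reference = candidate whose derived MR plan has the
    highest average execution success rate."""
    rates = [execution_success_rate(p, execute, world, trials)
             for p in candidate_plans]
    best = max(range(len(candidate_plans)), key=rates.__getitem__)
    return best, rates
```

During training, the pseudo-gold candidate then plays the role of the missing gold-standard reference in the perceptron update.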
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

(Figure: n-best candidates (Candidate 1 … Candidate n) → derived MRs (MR1 … MRn) → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 and perceptron scores 1.79, 0.21, -1.09, 1.46, 0.59; the candidate with the best success rate becomes the pseudo-gold reference, and the perceptron is updated by the feature-vector difference from the best prediction)

75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates can still mean a correct plan, given the indirect supervision of human follower actions: MR plans may be underspecified or have ignorable details attached, and are sometimes inaccurate but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use the candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
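The multi-parse update can be sketched as a weighted perceptron step. This is an illustrative reading of the slide, assuming the update scale is exactly the difference in execution success rates; the function name and data layout are assumptions.

```python
def multi_parse_update(w, candidates, rates, pred):
    """Perceptron-style update using every candidate whose execution
    success rate beats the currently predicted parse `pred`; the
    feature difference is weighted by the rate difference."""
    for i, feats in enumerate(candidates):
        if rates[i] > rates[pred]:
            scale = rates[i] - rates[pred]
            for f, v in feats.items():                 # promote better candidate
                w[f] = w.get(f, 0.0) + scale * v
            for f, v in candidates[pred].items():      # demote prediction
                w[f] = w.get(f, 0.0) - scale * v
    return w
```

Candidates far better than the prediction thus move the weights more, while near-ties contribute only small corrections, matching the idea that the response signal is weak and distributed over all candidates.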
Weight Update with Multiple Parses
• Update with every candidate whose execution success rate is higher than that of the currently predicted parse

(Figure: n-best candidates → derived MRs (MR1 … MRn) → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 and perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59; update (1) uses the feature-vector difference from the first such better candidate)

77
Weight Update with Multiple Parses
• (Continued) Update (2): the same weighted update is applied for the next candidate with a higher execution success rate

78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: "Turn left and find the sofa then turn around the corner"
  L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

Example features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1

79
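Extracting such indicator features from a parse tree can be sketched as below, with trees as nested tuples of labels and string leaves. This is a simplified feature set (parent → children expansions and word emissions only), not the full set used in the thesis.

```python
def tree_features(tree):
    """Binary indicator features over compositions in a parse tree.
    A tree is (label, child, ...) with string leaves; each feature
    records one parent -> children expansion that occurs."""
    feats = {}
    def walk(node):
        if isinstance(node, str):   # leaf word: no expansion of its own
            return
        label, *children = node
        names = tuple(c if isinstance(c, str) else c[0] for c in children)
        feats[(label, names)] = 1   # binary indicator, not a count
        for c in children:
            walk(c)
    walk(tree)
    return feats
```

On a toy tree where L1 expands to L3, L3 to L5 and L6, and L5 emits the word "find", the extracted features correspond to f(L1 → L3), f(L3 ⇒ L5 L6), and f(L5, "find") from the slide.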
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – We take the 50-best distinct composed MR plans, and the corresponding parses, out of 1,000,000-best parses: many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best parse list is generated from the baseline model

80
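Collecting the 50-best distinct MR plans from a much larger n-best list can be sketched as a simple filtering pass; `derive_mr` is a hypothetical stand-in for the MR composition step.

```python
def distinct_nbest(parses, derive_mr, n=50):
    """Walk a large n-best parse list (highest score first) and keep
    the best-scoring parse for each distinct derived MR plan, until
    `n` distinct plans have been collected."""
    seen, kept = set(), []
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:      # first (highest-scoring) parse for this plan
            seen.add(mr)
            kept.append(parse)
            if len(kept) == n:
                break
    return kept
```

Because the input list is score-ordered, the kept parse for each plan is automatically the most probable one under the baseline model.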
Hierarchy Generation PCFG Model (Kim and Mooney 2012)
bull Limitations of Borschinger et al 2011ndash Only work in low ambiguity settings
1 NL ndash a handful of MRs ( order of 10s)ndash Only output MRs included in the constructed PCFG from training
databull Proposed model
ndash Use semantic lexemes as units of semantic conceptsndash Disambiguate NL-MR correspondences in semantic concept
(lexeme) levelndash Disambiguate much higher level of ambiguous supervisionndash Output novel MRs not appearing in the PCFG by composing MR
parse with semantic lexeme MRs38
bull Pair of NL phrase w and MR subgraph gbull Based on correlations between NL instructions and
context MRs (landmarks plans)ndash How graph g is probable given seeing phrase w
bull Examplesndash ldquoto the stoolrdquo Travel() Verify(at BARSTOOL)ndash ldquoblack easelrdquo Verify(at EASEL)ndash ldquoturn left and walkrdquo Turn() Travel()
Semantic Lexicon (Chen and Mooney 2011)
39
cooccurrenceof g and w
general occurrenceof g without w
Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes
by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic
conceptsndash Lexeme hierarchy = semantic
concept hierarchyndash Shows how complicated
semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings
40
Turn
RIGHT sideHATRACK
frontSOFA
steps3
atEASEL
Verify Travel Verify
Turn
atEASEL
Travel Verify
atEASEL
Verify
Turn
RIGHT sideHATRACK
Verify Travel
Turn
sideHATRACK
Verify
PCFG Construction
bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to
describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes
ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram
Markov process (Borschinger et al 2011)
ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible
combinations are estimated41
PCFG Construction
42
m
Child concepts are generated from parent concepts selec-tively
All semantic concepts gen-erate relevant NL words
Each semantic concept generates at least one NL word
Parsing New NL Sentences
bull PCFG rule weights are optimized by Inside-Outside algorithm with training data
bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating
NL wordsndash From the bottom of the tree mark only responsible MR
components that propagate to the top levelndash Able to compose novel MRs never seen in the training data
43
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify
Turn
LEFT frontSOFA
Verify
Turn
LEFT
Turn
RIGHT
steps2
atSOFA
Travel Verify Turn
RIGHT
atSOFA
Travel Verify Turn
Turn left and find the sofa then turn around
the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify
Turn
LEFT frontSOFA
Verify
Turn
LEFT
Turn
RIGHT
steps2
atSOFA
Travel Verify Turn
RIGHT
atSOFA
Travel Verify Turn
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
46
Unigram Generation PCFG Model
bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph
and k-permutationsndash Tend to over-fit to the training data
bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train
47
PCFG Construction
bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one
by onendash Permutations of the appearing orders of relevant
lexemes are already considered
48
PCFG Construction
49
Each semantic concept is generated by unigram Markov process
All semantic concepts gen-erate relevant NL words
Parsing New NL Sentences
bull Follows the similar scheme as in Hierarchy Generation PCFG model
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for
generating NL wordsndash Mark relevant lexeme MR components in the
context MR appearing in the top nonterminal
50
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT
atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn left and find the sofa then turn around the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Context MR
Context MR
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
54
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – Trained model outputs the best result with max probability
  [Diagram: Testing Example → Trained Generative Model → 1-best candidate with maximum probability (Candidate 1)]
70
Discriminative Reranking
• Can we do better?
  – Secondary discriminative model picks the best out of the n-best candidates from the baseline model
  [Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]
71
How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, weak supervision of the surrounding perceptual context (landmarks plan) is provided
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    (also used in evaluating the final end-task plan execution)
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update
    (the response signal is weak and distributed over all candidates)
72
Reranking Model: Averaged Perceptron (Collins, 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
  [Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) with feature vectors a_1, a_2, a_3, a_4, …, a_n and perceptron scores -0.16, 1.21, -1.09, 1.46, 0.59; the perceptron compares its best prediction (Candidate 4) against the gold-standard reference (feature vector a_g) and updates the weights by a_g - a_4]
73
(For our generative models, such a gold-standard reference is not available.)
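The averaged perceptron above can be sketched as follows. This is a minimal stand-in, assuming candidates arrive as rows of a feature matrix with a known gold index; it is not the thesis implementation.

```python
import numpy as np

def averaged_perceptron(examples, n_features, epochs=5):
    """Averaged-perceptron reranker sketch (after Collins, 2000).

    `examples` is a list of (candidates, gold) pairs, where `candidates`
    is an (n_candidates, n_features) array of feature vectors. Whenever
    the current weights prefer a wrong candidate, the weights move by
    a_gold - a_predicted; the running sum yields the averaged weights.
    """
    w = np.zeros(n_features)
    w_sum = np.zeros(n_features)
    steps = 0
    for _ in range(epochs):
        for candidates, gold in examples:
            pred = int(np.argmax(candidates @ w))
            if pred != gold:
                w += candidates[gold] - candidates[pred]  # e.g. a_g - a_4
            w_sum += w
            steps += 1
    return w_sum / steps  # averaged weights reduce overfitting to late updates
```

Averaging the weight vector over all updates, rather than returning the final vector, is the standard Collins-style stabilization.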
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination
    (MARCO is nondeterministic: average over 10 trials)
  – Prefer the candidate with the best execution success rate during training
74
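The pseudo-gold selection above can be sketched as follows. The `execute` callable is a stand-in for the MARCO execution module (which this sketch does not implement); averaging over trials mirrors the slide's handling of nondeterminism.

```python
def execution_success_rate(candidate_mr, execute, trials=10):
    """Average success over repeated trials, since plan execution (MARCO
    in the thesis) is nondeterministic. `execute` returns True when the
    plan reaches the intended destination on one trial."""
    return sum(bool(execute(candidate_mr)) for _ in range(trials)) / trials

def pick_pseudo_gold(candidate_mrs, execute, trials=10):
    """Pick the candidate whose plan has the best execution success rate."""
    rates = [execution_success_rate(mr, execute, trials) for mr in candidate_mrs]
    best = max(range(len(candidate_mrs)), key=lambda i: rates[i])
    return best, rates
```

During training, the candidate at index `best` plays the role of the missing gold-standard reference.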
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
  [Diagram: n-best candidates (Candidate 1 … Candidate n) with derived MRs MR_1 … MR_n are run through the MARCO execution module, yielding execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; the candidate with rate 0.9 becomes the pseudo-gold reference, and the perceptron (scores 1.79, 0.21, -1.09, 1.46, 0.59) is updated by the feature-vector difference between it and the best prediction]
75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    (MR plans are underspecified or have ignorable details attached; sometimes inaccurate, but containing the correct MR components to reach the desired goal)
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
76
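The multiple-parse update above can be sketched as a single weighted step. This is a sketch of the rule as stated on the slide (difference-weighted feature differences), not the thesis code; the array layout is an assumption.

```python
import numpy as np

def multi_parse_update(w, features, rates, pred):
    """Update with every candidate whose execution success rate exceeds
    that of the currently best-predicted candidate `pred`, weighting each
    feature-vector difference by the gap in success rates."""
    w = w.copy()
    for i, rate in enumerate(rates):
        if rate > rates[pred]:
            w += (rate - rates[pred]) * (features[i] - features[pred])
    return w
```

Candidates far better than the prediction thus pull the weights harder than marginally better ones.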
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
  [Diagram: derived MRs MR_1 … MR_n with MARCO execution success rates 0.6, 0.4, 0.0, 0.9, 0.2 and perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59; the best prediction (score 1.83) has success rate only 0.4, so update (1) adds the feature-vector difference from a candidate with a higher rate]
77
Weight Update with Multiple Parses
  [Same diagram, second step: update (2) uses the other candidate whose execution success rate exceeds the best prediction's]
78
Features
• Binary indicator of whether a certain composition of nonterminals/terminals appears in the parse tree
  (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa, then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)      L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)      L5: Travel(), Verify(at: SOFA)      L6: Turn()
79

f(L1 → L3) = 1,   f(L3 → L5 ∨ L1) = 1,   f(L3 ⇒ L5 L6) = 1,   f(L5, "find") = 1
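Binary rule-indicator features of this kind can be sketched as follows. The nested-tuple tree encoding and the labels are illustrative assumptions; the thesis's feature set also includes other feature types.

```python
def rule_features(tree):
    """Collect binary indicator features, one per local rule, from a parse
    tree given as nested tuples, e.g. ("L1", ("L3", ...)). A feature name
    like 'L1 -> L3 L5' mirrors the slide's f(L1 -> L3 L5) = 1."""
    feats = {}
    def visit(node):
        if isinstance(node, tuple):
            label, *children = node
            child_labels = [c[0] if isinstance(c, tuple) else c for c in children]
            feats["%s -> %s" % (label, " ".join(child_labels))] = 1
            for c in children:
                visit(c)
    visit(tree)
    return feats
```

Leaf rules such as `L5 -> find` double as lexical features pairing a lexeme with the NL word it generates.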
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, and the corresponding parses, out of the 1,000,000-best parses
    (many parse trees differ insignificantly, leading to the same derived MR plans; a sufficiently large 1,000,000-best parse list is generated from the baseline model)
80
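The distinct-plan filtering above can be sketched as a scan over the score-ordered n-best list. The `derive_mr` callable stands in for composing an MR plan from a parse tree, which this sketch does not implement.

```python
def distinct_nbest(parses, derive_mr, k=50):
    """Keep the top-k parses with distinct derived MR plans, scanning a
    much larger n-best list in score order (the slides use up to
    1,000,000-best). `derive_mr` maps a parse to its composed MR plan."""
    seen = set()
    out = []
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:
            seen.add(mr)
            out.append(parse)
            if len(out) == k:
                break
    return out
```

Because many trees collapse to the same MR, the raw n-best list must be far longer than k to yield k distinct plans.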
Response-based Update vs. Baseline (English)
81

                  Parse F1            Single-Sentence      Paragraph
                  Hier.   Unigram     Hier.   Unigram      Hier.   Unigram
Baseline          74.81   76.44       57.22   67.14        20.17   28.12
Response-based    73.32   77.24       59.65   68.27        22.62   29.2
Response-based Update vs. Baseline (Chinese-Word)
82

                  Parse F1            Single-Sentence      Paragraph
                  Hier.   Unigram     Hier.   Unigram      Hier.   Unigram
Baseline          75.53   76.41       61.03   63.4         19.08   23.12
Response-based    77.26   77.74       64.12   65.64        21.29   23.74
Response-based Update vs. Baseline (Chinese-Character)
83

                  Parse F1            Single-Sentence      Paragraph
                  Hier.   Unigram     Hier.   Unigram      Hier.   Unigram
Baseline          73.05   77.55       55.61   62.85        12.74   23.33
Response-based    76.26   79.76       64.08   65.5         22.25   25.35
Response-based Update vs. Baseline
• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)
85

                  Parse F1            Single-Sentence      Paragraph
                  Hier.   Unigram     Hier.   Unigram      Hier.   Unigram
Single            73.32   77.24       59.65   68.27        22.62   29.2
Multi             73.43   77.81       62.81   68.93        26.57   29.1
Response-based Update with Multiple vs. Single Parses (Chinese-Word)
86

                  Parse F1            Single-Sentence      Paragraph
                  Hier.   Unigram     Hier.   Unigram      Hier.   Unigram
Single            77.26   77.74       64.12   65.64        21.29   23.74
Multi             78.8    78.11       64.15   66.27        21.55   25.95
Response-based Update with Multiple vs. Single Parses (Chinese-Character)
87

                  Parse F1            Single-Sentence      Paragraph
                  Hier.   Unigram     Hier.   Unigram      Hier.   Unigram
Single            76.26   79.76       64.08   65.5         22.25   25.35
Multi             79.44   79.94       64.08   66.84        22.58   27.16
Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, but capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback
88
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
Semantic Lexicon (Chen and Mooney, 2011)
• Pair of an NL phrase w and an MR subgraph g
• Based on correlations between NL instructions and context MRs (landmarks plans)
  – How probable graph g is given that phrase w is seen
    (the score contrasts the cooccurrence of g and w with the general occurrence of g without w)
• Examples
  – "to the stool": Travel(), Verify(at: BARSTOOL)
  – "black easel": Verify(at: EASEL)
  – "turn left and walk": Turn(), Travel()
39
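The slide's fraction can be sketched as a scoring function over corpus counts. This is a reconstruction of the fraction shown on the slide, with an assumed smoothing constant; it is not necessarily the exact formula of Chen and Mooney (2011).

```python
def lexeme_score(cooccur_gw, occur_g, smoothing=1.0):
    """Score a (phrase w, subgraph g) pair as the slide's fraction:
    cooccurrence of g and w over the general occurrence of g without w.
    The additive smoothing constant is an assumption, guarding against
    division by zero when g never occurs without w."""
    occur_g_without_w = occur_g - cooccur_gw
    return cooccur_gw / (occur_g_without_w + smoothing)
```

A subgraph that appears mostly alongside the phrase scores high; a subgraph that is common everywhere scores low.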
Lexeme Hierarchy Graph (LHG)
• Hierarchy of semantic lexemes, constructed for each training example by the subgraph relationship
  – Lexeme MRs = semantic concepts
  – Lexeme hierarchy = semantic concept hierarchy
  – Shows how complicated semantic concepts hierarchically generate smaller concepts, which are further connected to NL word groundings
40
  [Diagram: an LHG whose root lexeme Turn(RIGHT), Verify(side: HATRACK, front: SOFA), Travel(steps: 3), Verify(at: EASEL) generates smaller lexemes such as Turn(), Travel(), Verify(at: EASEL); Travel(), Verify(at: EASEL); Verify(at: EASEL); Turn(RIGHT), Verify(side: HATRACK); Verify(side: HATRACK), Travel(); and Turn()]
PCFG Construction
• Add rules for each node in the LHG
  – Each complex concept chooses which subconcepts to describe, which will finally be connected to the NL instruction
    (each node generates all k-permutations of its children nodes, since we do not know which subset is correct)
  – NL words are generated from lexeme nodes by a unigram Markov process (Börschinger et al., 2011)
  – PCFG rule weights are optimized by EM
    (the most probable MR components out of all possible combinations are estimated)
41
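The k-permutation rule generation above can be sketched to show why this grammar grows quickly. This is an illustrative enumeration, assuming one rule per ordered selection of children; the actual rule schema in the thesis is richer.

```python
from itertools import permutations

def k_permutation_rules(parent, children):
    """Enumerate one PCFG rule per k-permutation of a node's children,
    for every k >= 1: since we do not know which subset of subconcepts a
    complex concept actually describes, all ordered subsets are kept.
    The count, sum over k of P(n, k), grows factorially with the
    branching factor n, which is the blow-up discussed on the slide."""
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append((parent, perm))
    return rules
```

For 3 children this already yields 3 + 6 + 6 = 15 rules per node; EM must then distribute probability mass over all of them.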
PCFG Construction
42
  [Diagram annotations: child concepts are generated from parent concepts selectively; all semantic concepts generate relevant NL words; each semantic concept generates at least one NL word]
Parsing New NL Sentences
• PCFG rule weights are optimized by the Inside-Outside algorithm on the training data
• Obtain the most probable parse tree for each test NL sentence from the learned weights, using the CKY algorithm
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – From the bottom of the tree, mark only the responsible MR components, which propagate to the top level
  – Able to compose novel MRs never seen in the training data
43
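The CKY step above can be sketched for a toy PCFG in Chomsky normal form. This is a generic Viterbi CKY, not the thesis's parser; the rule encoding and the tiny grammar in the usage note are assumptions.

```python
def cky_best_parse(words, lexical, binary):
    """Viterbi CKY sketch: most probable parse of `words` under a PCFG in
    CNF. `lexical` maps (nonterminal, word) -> prob; `binary` maps
    (parent, left, right) -> prob. Returns the top chart cell, mapping
    each symbol spanning the sentence to its best (prob, tree)."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                       # fill lexical cells
        for (sym, word), p in lexical.items():
            if word == w:
                chart[i][i + 1][sym] = (p, (sym, w))
    for span in range(2, n + 1):                        # grow spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for m in range(i + 1, j):                   # split point
                for (parent, l, r), p in binary.items():
                    if l in chart[i][m] and r in chart[m][j]:
                        pl, tl = chart[i][m][l]
                        pr, tr = chart[m][j][r]
                        cand = (p * pl * pr, (parent, tl, tr))
                        if parent not in chart[i][j] or cand[0] > chart[i][j][parent][0]:
                            chart[i][j][parent] = cand
    return chart[0][n]
```

In the thesis's setting, the symbols would be the PCFG nonterminals built from lexeme MRs, and the best tree would then be walked to compose the final MR.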
  [Diagram (three builds): the most probable parse tree for the test NL instruction "Turn left and find the sofa, then turn around the corner"; nodes carry lexeme MRs such as Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) at the root, with smaller lexemes like Turn(LEFT), Verify(front: SOFA); Turn(LEFT); Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT); Turn(RIGHT); and Travel(), Verify(at: SOFA), Turn() below; marking the responsible components yields the composed final MR Turn(LEFT), Travel(), Verify(at: SOFA), Turn()]
46
Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG Model
  – Complexities caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generate the relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train
47
PCFG Construction
• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the orders in which relevant lexemes appear are thereby already covered
48
PCFG Construction
49
  [Diagram annotations: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]
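The unigram Markov generation above can be sketched as a rule schema. The nonterminal naming is illustrative; the point is that the rule count stays linear in the number of relevant lexemes, unlike the k-permutation scheme.

```python
def unigram_generation_rules(context, lexemes):
    """Rules for generating relevant lexemes one at a time by a unigram
    Markov process: the context nonterminal emits one lexeme and either
    recurses or stops. Any order of lexemes is derivable without
    enumerating permutations, so only 2 rules per lexeme are needed."""
    rules = []
    for lex in lexemes:
        rules.append((context, (lex, context)))  # emit lexeme, continue
        rules.append((context, (lex,)))          # emit lexeme, stop
    return rules
```

Three lexemes yield 6 rules here, versus 15 under the 3-child k-permutation scheme, which is the grammar-size gap reported on the comparison slide.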
Parsing New NL Sentences
• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal
50
  [Diagram (four builds): the most probable parse tree for the test NL instruction "Turn left and find the sofa, then turn around the corner"; the context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); and Turn(); marking these components in the context MR yields the composed final MR Turn(LEFT), Travel(), Verify(at: SOFA), Turn()]
54
Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version
55
  [Example: the paragraph "Take the wood path towards the easel. At the easel, go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7." is split into single sentences, each paired with its action sequence (e.g. Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward)]
Data Statistics
56

                         Paragraph          Single-Sentence
# Instructions            706                3236
Avg. # sentences          5.0 (±2.8)         1.0 (±0)
Avg. # actions            10.4 (±5.7)        2.1 (±2.4)
Avg. # words / sent.
  English                 37.6 (±21.1)       7.8 (±5.1)
  Chinese-Word            31.6 (±18.1)       6.9 (±4.9)
  Chinese-Character       48.9 (±28.3)       10.6 (±7.3)
Vocabulary
  English                 660                629
  Chinese-Word            661                508
  Chinese-Character       448                328
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
    Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
  – The semantic parser KRISP (Kate and Mooney, 2006) is trained on the resulting supervised data
57
Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy
58
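Partial parse accuracy of this kind can be sketched as precision/recall/F1 over MR components. Treating an MR plan as a bag of components is one plausible reading of "partial" credit and an assumption of this sketch, not the thesis's exact definition.

```python
def partial_parse_accuracy(predicted, gold):
    """Precision/recall/F1 over MR components, treating each plan as a
    bag of components: credit is given for matching components even when
    the whole MR is wrong (a sketch of 'partial parse accuracy')."""
    matched = sum(min(predicted.count(c), gold.count(c)) for c in set(gold))
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if matched else 0.0
    return precision, recall, f1
```

This is the shape of the precision/recall/F1 triples reported on the following slides.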
Parse Accuracy (English)

                                    Precision   Recall   F1
Chen & Mooney (2011)                  90.16      55.41   68.59
Chen (2012)                           88.36      57.03   69.31
Hierarchy Generation PCFG Model       87.58      65.41   74.81
Unigram Generation PCFG Model         86.1       68.79   76.44
59
Parse Accuracy (Chinese-Word)

                                    Precision   Recall   F1
Chen (2012)                           88.87      58.76   70.74
Hierarchy Generation PCFG Model       80.56      71.14   75.53
Unigram Generation PCFG Model         79.45      73.66   76.41
60
Parse Accuracy (Chinese-Character)

                                    Precision   Recall   F1
Chen (2012)                           92.48      56.47   70.01
Hierarchy Generation PCFG Model       79.77      67.38   73.05
Unigram Generation PCFG Model         79.73      75.52   77.55
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
Lexeme Hierarchy Graph (LHG)bull Hierarchy of semantic lexemes
by subgraph relationship constructed for each training examplendash Lexeme MRs = semantic
conceptsndash Lexeme hierarchy = semantic
concept hierarchyndash Shows how complicated
semantic concepts hierarchically generate smaller concepts and further connected to NL word groundings
40
Turn
RIGHT sideHATRACK
frontSOFA
steps3
atEASEL
Verify Travel Verify
Turn
atEASEL
Travel Verify
atEASEL
Verify
Turn
RIGHT sideHATRACK
Verify Travel
Turn
sideHATRACK
Verify
PCFG Construction
bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to
describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes
ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram
Markov process (Borschinger et al 2011)
ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible
combinations are estimated41
PCFG Construction
42
m
Child concepts are generated from parent concepts selec-tively
All semantic concepts gen-erate relevant NL words
Each semantic concept generates at least one NL word
Parsing New NL Sentences
bull PCFG rule weights are optimized by Inside-Outside algorithm with training data
bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating
NL wordsndash From the bottom of the tree mark only responsible MR
components that propagate to the top levelndash Able to compose novel MRs never seen in the training data
43
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify
Turn
LEFT frontSOFA
Verify
Turn
LEFT
Turn
RIGHT
steps2
atSOFA
Travel Verify Turn
RIGHT
atSOFA
Travel Verify Turn
Turn left and find the sofa then turn around
the corner
Most probable parse tree for a test NL instruction
[Figure: the lexeme MRs responsible for generating NL words are marked bottom-up and propagated to the top level, composing the final MR parse, e.g. Turn(LEFT), Travel(), Verify(at: SOFA), Turn(RIGHT)]
46
Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG Model
  – Complexity caused by the Lexeme Hierarchy Graph and k-permutations
  – Tends to over-fit to the training data
• Proposed solution: a simpler model
  – Generates relevant semantic lexemes one by one
  – No extra PCFG rules for k-permutations
  – Maintains a simpler PCFG rule set; faster to train
47
PCFG Construction
• Unigram Markov generation of relevant lexemes
  – Each context MR generates its relevant lexemes one by one
  – Permutations of the order in which relevant lexemes appear are thereby already covered
48
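One plausible way to encode this unigram Markov generation as PCFG rules, sketched in Python; the rule encoding and symbol names are illustrative, not the thesis's exact schema:

```python
def unigram_lexeme_rules(context, lexemes):
    """PCFG rules letting `context` emit its relevant lexemes one by one
    as a unigram Markov chain: C -> L C (emit and continue) and
    C -> L (emit and stop). Any ordering of lexemes is then derivable
    without enumerating k-permutations."""
    rules = []
    for lex in lexemes:
        rules.append((context, [lex, context]))  # emit lexeme, keep going
        rules.append((context, [lex]))           # emit lexeme, stop
    return rules

# Grammar size is linear in the number of relevant lexemes,
# instead of factorial as with k-permutation rules.
rules = unigram_lexeme_rules("CTX", ["Turn_LEFT", "Verify_at_SOFA"])
```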
PCFG Construction
49
[Figure: example PCFG rules. Each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]
Parsing New NL Sentences
• Follows a similar scheme to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal
50
[Figure: most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner". The context MR Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) generates the relevant lexemes Turn(LEFT); Travel(), Verify(at: SOFA); Turn(). Their components are marked in the context MR at the top nonterminal to compose the final MR parse]
54
Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version
55
Paragraph:
Take the wood path towards the easel. At the easel go left and then take a right on the the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7.

Single sentences and action sequences:
Take the wood path towards the easel.
At the easel go left and then take a right on the the blue path at the corner.
Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward
Forward, Turn left, Forward, Turn right
Turn
Data Statistics
56

                                   Paragraph        Single-Sentence
# Instructions                     706              3236
Avg. # sentences                   5.0 (±2.8)       1.0 (±0)
Avg. # actions                     10.4 (±5.7)      2.1 (±2.4)
Avg. # words/sent.
  English                          37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word                     31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character                48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                          660              629
  Chinese-Word                     661              508
  Chinese-Character                448              328
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
  – The ambiguous context (landmarks plan) is refined by greedy selection of high-score lexemes, with two different lexicon learning algorithms:
    • Chen and Mooney (2011): Graph Intersection Lexicon Learning (GILL)
    • Chen (2012): Subgraph Generation Online Lexicon Learning (SGOLL)
  – A semantic parser, KRISP (Kate and Mooney 2006), is trained on the resulting supervised data
57
Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy
58
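A simplified sketch of how such a metric can be scored: the actual metric gives partial credit over MR structure, while this toy version compares flat sets of MR components:

```python
def partial_parse_scores(predicted, gold):
    """Simplified partial parse accuracy: precision, recall, and F1 over
    the sets of MR components in the predicted and gold parses (the real
    metric scores partial credit over MR subtrees; sets keep it small)."""
    pred, ref = set(predicted), set(gold)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = partial_parse_scores(
    ["Turn(LEFT)", "Travel()"],
    ["Turn(LEFT)", "Travel(steps:2)", "Verify(at:SOFA)"])
```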
Parse Accuracy (English)
59

                                   Precision   Recall   F1
Chen & Mooney (2011)               90.16       55.41    68.59
Chen (2012)                        88.36       57.03    69.31
Hierarchy Generation PCFG Model    87.58       65.41    74.81
Unigram Generation PCFG Model      86.1        68.79    76.44
Parse Accuracy (Chinese-Word)
60

                                   Precision   Recall   F1
Chen (2012)                        88.87       58.76    70.74
Hierarchy Generation PCFG Model    80.56       71.14    75.53
Unigram Generation PCFG Model      79.45       73.66    76.41
Parse Accuracy (Chinese-Character)
61

                                   Precision   Recall   F1
Chen (2012)                        92.48       56.47    70.01
Hierarchy Generation PCFG Model    79.77       67.38    73.05
Unigram Generation PCFG Model      79.73       75.52    77.55
End-to-End Execution Evaluations
• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Also considers facing direction in single-sentence evaluation
  – Paragraph execution is affected by even one failed single-sentence execution
62
End-to-End Execution Evaluations (English)
63

                                   Single-Sentence   Paragraph
Chen & Mooney (2011)               54.4              16.18
Chen (2012)                        57.28             19.18
Hierarchy Generation PCFG Model    57.22             20.17
Unigram Generation PCFG Model      67.14             28.12
End-to-End Execution Evaluations (Chinese-Word)
64

                                   Single-Sentence   Paragraph
Chen (2012)                        58.7              20.13
Hierarchy Generation PCFG Model    61.03             19.08
Unigram Generation PCFG Model      63.4              23.12
End-to-End Execution Evaluations (Chinese-Character)
65

                                   Single-Sentence   Paragraph
Chen (2012)                        57.27             16.73
Hierarchy Generation PCFG Model    55.61             12.74
Unigram Generation PCFG Model      62.85             23.33
Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-score lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Börschinger et al. (2011)
  – Overcomes intractability of a complex MRL
  – Learns from more general, complex ambiguity
  – Composes novel MR parses never seen during training
66
Comparison of Grammar Size and EM Training Time
67

                       Hierarchy Generation       Unigram Generation
                       PCFG Model                 PCFG Model
Data                   |Grammar|   Time (hrs)     |Grammar|   Time (hrs)
English                20,451      17.26          16,357      8.78
Chinese (Word)         21,636      15.99          15,459      8.05
Chinese (Character)    19,792      18.64          13,514      12.58
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – The trained model outputs the best result with maximum probability
[Diagram: Testing Example → Trained Generative Model → Candidate 1, the 1-best candidate with maximum probability]
70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]
71
How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, weak supervision comes from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Also used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good or bad
  – Multiple candidate parses for the parameter update
    • The response signal is weak and distributed over all candidates
72
Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) with feature vectors a₁ … aₙ and perceptron scores -0.16, 1.21, -1.09, 1.46, 0.59 → Perceptron → Best prediction. The weights are updated by the feature vector difference a_g − a₄ against the Gold Standard Reference (feature vector a_g), which is Not Available for our generative models]
73
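The averaged perceptron update itself can be sketched as follows, with feature vectors as Python dicts and a toy example where a gold index is available (the names and the tiny training set are invented for illustration):

```python
def averaged_perceptron(examples, n_epochs=5):
    """Averaged perceptron reranker in the style of Collins (2000).
    Each example is (candidate_feature_vectors, gold_index); the model
    learns weights so the gold candidate scores highest, and returns
    the averaged weight vector (averaging reduces over-fitting)."""
    w, w_sum, n_steps = {}, {}, 0
    def score(fv, weights):
        return sum(weights.get(f, 0.0) * v for f, v in fv.items())
    for _ in range(n_epochs):
        for candidates, gold in examples:
            pred = max(range(len(candidates)),
                       key=lambda i: score(candidates[i], w))
            if pred != gold:  # update on mistakes only
                for f, v in candidates[gold].items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in candidates[pred].items():
                    w[f] = w.get(f, 0.0) - v
            for f, v in w.items():  # accumulate for averaging
                w_sum[f] = w_sum.get(f, 0.0) + v
            n_steps += 1
    return {f: v / n_steps for f, v in w_sum.items()}

# One training example whose gold candidate is index 1.
examples = [([{"bad": 1.0}, {"good": 1.0}], 1)]
w = averaged_perceptron(examples)
```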
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training
74
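A sketch of the pseudo-gold selection, with a deterministic stand-in executor in place of the MARCO module (the executor and plan names are invented for the example):

```python
def pick_pseudo_gold(candidates, execute, n_trials=10):
    """Select a pseudo-gold candidate by execution response: run each
    candidate MR plan n_trials times in the (possibly nondeterministic)
    simulator and keep the one with the highest average success rate.
    `execute(mr)` returns whether one run reached the destination."""
    def success_rate(mr):
        return sum(execute(mr) for _ in range(n_trials)) / n_trials
    rates = [success_rate(mr) for mr in candidates]
    best = max(range(len(candidates)), key=lambda i: rates[i])
    return best, rates

# Stand-in executor: plan "A" always reaches the goal, "B" never does.
best, rates = pick_pseudo_gold(["A", "B"], execute=lambda mr: mr == "A")
```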
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: the n-best candidates (Candidate 1 … Candidate n) yield derived MRs (MR₁ … MRₙ); the MARCO Execution Module assigns execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; the candidate with rate 0.9 serves as the pseudo-gold reference. The perceptron (scores 1.79, 0.21, -1.09, 1.46, 0.59) makes the best prediction and is updated by the feature vector difference]
75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    • MR plans may be underspecified or have ignorable details attached
    • Sometimes inaccurate, but containing the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature vector difference, weighted by the difference between execution success rates
76
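The weighted multi-parse update can be sketched as below; this is a simplified reading of the update rule, and the exact form in the thesis may differ:

```python
def multi_parse_update(weights, candidates, rates, predicted):
    """Sketch of the multi-parse update: every candidate whose execution
    success rate beats the currently predicted one pulls the weights
    toward its features, scaled by the success-rate difference.
    `candidates` are dict feature vectors; `rates` their success rates."""
    for i, (features, rate) in enumerate(zip(candidates, rates)):
        gap = rate - rates[predicted]
        if i == predicted or gap <= 0:
            continue
        for f, v in features.items():               # toward better candidate
            weights[f] = weights.get(f, 0.0) + gap * v
        for f, v in candidates[predicted].items():  # away from prediction
            weights[f] = weights.get(f, 0.0) - gap * v
    return weights

w = multi_parse_update(
    {}, [{"x": 1.0}, {"y": 1.0}, {"z": 1.0}],
    rates=[0.2, 0.9, 0.6], predicted=0)
```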
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram: n-best candidates with derived MRs (MR₁ … MRₙ), MARCO execution success rates 0.6, 0.4, 0.0, 0.9, 0.2, and perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59; Update (1) uses the feature vector difference against one candidate with a higher success rate]
77
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram: n-best candidates with derived MRs (MR₁ … MRₙ), MARCO execution success rates 0.6, 0.4, 0.0, 0.9, 0.2, and perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59; Update (2) uses the feature vector difference against a further candidate with a higher success rate]
78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa then turn around the corner
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5 → "find") = 1
79
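A sketch of extracting such binary composition features from a parse tree; the tree encoding and the feature shape are illustrative choices, not the thesis's exact feature templates:

```python
def tree_features(tree):
    """Binary indicator features over a parse tree, one per observed
    parent-children composition. A tree node is (label, [children]);
    NL words are plain string leaves."""
    feats = set()
    def walk(node):
        if isinstance(node, str):  # NL word leaf
            return
        label, children = node
        child_labels = [c if isinstance(c, str) else c[0] for c in children]
        feats.add((label, tuple(child_labels)))  # e.g. f(L3 => L5 L6)
        for c in children:
            walk(c)
    walk(tree)
    return feats

# Toy tree: L1 expands to L3, which expands to L5 (emitting "find") and L6.
tree = ("L1", [("L3", [("L5", ["find"]), ("L6", [])])])
feats = tree_features(tree)
```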
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to obtain 50 distinct composed MR plans, with their corresponding parses, out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate sufficiently large 1,000,000-best parse lists from the baseline model
80
Response-based Update vs. Baseline (English)
81

Parse F1                Hierarchy   Unigram
  Baseline              74.81       76.44
  Response-based        73.32       77.24

Single-sentence         Hierarchy   Unigram
  Baseline              57.22       67.14
  Response-based        59.65       68.27

Paragraph               Hierarchy   Unigram
  Baseline              20.17       28.12
  Response-based        22.62       29.2
Response-based Update vs. Baseline (Chinese-Word)
82

Parse F1                Hierarchy   Unigram
  Baseline              75.53       76.41
  Response-based        77.26       77.74

Single-sentence         Hierarchy   Unigram
  Baseline              61.03       63.4
  Response-based        64.12       65.64

Paragraph               Hierarchy   Unigram
  Baseline              19.08       23.12
  Response-based        21.29       23.74
Response-based Update vs. Baseline (Chinese-Character)
83

Parse F1                Hierarchy   Unigram
  Baseline              73.05       77.55
  Response-based        76.26       79.76

Single-sentence         Hierarchy   Unigram
  Baseline              55.61       62.85
  Response-based        64.08       65.5

Paragraph               Hierarchy   Unigram
  Baseline              12.74       23.33
  Response-based        22.25       25.35
Response-based Update vs. Baseline
• vs. baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)
85

Parse F1                Hierarchy   Unigram
  Single                73.32       77.24
  Multi                 73.43       77.81

Single-sentence         Hierarchy   Unigram
  Single                59.65       68.27
  Multi                 62.81       68.93

Paragraph               Hierarchy   Unigram
  Single                22.62       29.2
  Multi                 26.57       29.1
Response-based Update with Multiple vs. Single Parses (Chinese-Word)
86

Parse F1                Hierarchy   Unigram
  Single                77.26       77.74
  Multi                 78.8        78.11

Single-sentence         Hierarchy   Unigram
  Single                64.12       65.64
  Multi                 64.15       66.27

Paragraph               Hierarchy   Unigram
  Single                21.29       23.74
  Multi                 21.55       25.95
Response-based Update with Multiple vs. Single Parses (Chinese-Character)
87

Parse F1                Hierarchy   Unigram
  Single                76.26       79.76
  Multi                 79.44       79.94

Single-sentence         Hierarchy   Unigram
  Single                64.08       65.5
  Multi                 64.08       66.84

Paragraph               Hierarchy   Unigram
  Single                22.25       25.35
  Multi                 22.58       27.16
Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves the performance in general
  – A single best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates may produce underspecified plans, or plans with ignorable details attached, that still capture the gist of the preferred actions
  – A variety of preferable parses helps improve both the amount and the quality of the weak feedback
88
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of a full probabilistic model for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
PCFG Construction
bull Add rules per each node in LHGndash Each complex concept chooses which subconcepts to
describe that will finally be connected to NL instructionEach node generates all k-permutations of children nodes
ndash we do not know which subset is correctndash NL words are generated by lexeme nodes by unigram
Markov process (Borschinger et al 2011)
ndash PCFG rule weights are optimized by EMMost probable MR components out of all possible
combinations are estimated41
PCFG Construction
42
m
Child concepts are generated from parent concepts selec-tively
All semantic concepts gen-erate relevant NL words
Each semantic concept generates at least one NL word
Parsing New NL Sentences
bull PCFG rule weights are optimized by Inside-Outside algorithm with training data
bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating
NL wordsndash From the bottom of the tree mark only responsible MR
components that propagate to the top levelndash Able to compose novel MRs never seen in the training data
43
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify
Turn
LEFT frontSOFA
Verify
Turn
LEFT
Turn
RIGHT
steps2
atSOFA
Travel Verify Turn
RIGHT
atSOFA
Travel Verify Turn
Turn left and find the sofa then turn around
the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify
Turn
LEFT frontSOFA
Verify
Turn
LEFT
Turn
RIGHT
steps2
atSOFA
Travel Verify Turn
RIGHT
atSOFA
Travel Verify Turn
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
46
Unigram Generation PCFG Model
bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph
and k-permutationsndash Tend to over-fit to the training data
bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train
47
PCFG Construction
bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one
by onendash Permutations of the appearing orders of relevant
lexemes are already considered
48
PCFG Construction
49
Each semantic concept is generated by unigram Markov process
All semantic concepts gen-erate relevant NL words
Parsing New NL Sentences
bull Follows the similar scheme as in Hierarchy Generation PCFG model
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for
generating NL wordsndash Mark relevant lexeme MR components in the
context MR appearing in the top nonterminal
50
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT
atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn left and find the sofa then turn around the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Context MR
Context MR
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
54
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Extract the 50 best distinct composed MR plans, and their corresponding parses, from the 1,000,000-best parse list
    • Many parse trees differ insignificantly, leading to the same derived MR plan
    • Generate a sufficiently large (1,000,000-best) parse list from the baseline model

80
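The 50-best filtering step described above can be sketched as follows (a hypothetical helper, assuming the parses arrive already sorted by model score, best first):

```python
def distinct_nbest(parses, derive_mr, k=50):
    """Keep the highest-scoring parse for each distinct derived MR plan,
    stopping after k distinct plans (sketch of the 50-best filtering).

    parses    -- candidate parses sorted by model score, best first
    derive_mr -- function mapping a parse to its composed MR plan
                 (must return a hashable value)
    """
    seen, kept = set(), []
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:      # first (= best-scoring) parse for this MR
            seen.add(mr)
            kept.append(parse)
            if len(kept) == k:
                break
    return kept
```

Because many parse trees collapse to the same MR plan, the huge raw n-best list shrinks to a short list of genuinely distinct candidate plans for reranking.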
Response-based Update vs. Baseline (English)

                  Parse F1             Single-sentence      Paragraph
                  Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Baseline            74.81     76.44      57.22     67.14      20.17     28.12
Response-based      73.32     77.24      59.65     68.27      22.62     29.2

81
Response-based Update vs. Baseline (Chinese-Word)

                  Parse F1             Single-sentence      Paragraph
                  Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Baseline            75.53     76.41      61.03     63.4       19.08     23.12
Response-based      77.26     77.74      64.12     65.64      21.29     23.74

82
Response-based Update vs. Baseline (Chinese-Character)

                  Parse F1             Single-sentence      Paragraph
                  Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Baseline            73.05     77.55      55.61     62.85      12.74     23.33
Response-based      76.26     79.76      64.08     65.5       22.25     25.35

83
Response-based Update vs. Baseline
• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model directly for plan execution

84
Response-based Update with Multiple vs. Single Parses (English)

                  Parse F1             Single-sentence      Paragraph
                  Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Single              73.32     77.24      59.65     68.27      22.62     29.2
Multi               73.43     77.81      62.81     68.93      26.57     29.1

85
Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                  Parse F1             Single-sentence      Paragraph
                  Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Single              77.26     77.74      64.12     65.64      21.29     23.74
Multi               78.8      78.11      64.15     66.27      21.55     25.95

86
Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                  Parse F1             Single-sentence      Paragraph
                  Hierarchy  Unigram   Hierarchy  Unigram   Hierarchy  Unigram
Single              76.26     79.76      64.08     65.5       22.25     25.35
Multi               79.44     79.94      64.08     66.84      22.58     27.16

87
Response-based Update with Multiple vs. Single Parses
• Using multiple parses generally improves performance
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates may still produce underspecified plans, or plans with ignorable extra details, that capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learn to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91
Conclusion
• Conventional language learning is expensive and not scalable because it requires annotated training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92
Thank You
Parsing New NL Sentences
bull PCFG rule weights are optimized by Inside-Outside algorithm with training data
bull Obtain the most probable parse tree for each test NL sentence from the learned weights using CKY algorithm
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for generating
NL wordsndash From the bottom of the tree mark only responsible MR
components that propagate to the top levelndash Able to compose novel MRs never seen in the training data
43
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify
Turn
LEFT frontSOFA
Verify
Turn
LEFT
Turn
RIGHT
steps2
atSOFA
Travel Verify Turn
RIGHT
atSOFA
Travel Verify Turn
Turn left and find the sofa then turn around
the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify
Turn
LEFT frontSOFA
Verify
Turn
LEFT
Turn
RIGHT
steps2
atSOFA
Travel Verify Turn
RIGHT
atSOFA
Travel Verify Turn
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
46
Unigram Generation PCFG Model
bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph
and k-permutationsndash Tend to over-fit to the training data
bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train
47
PCFG Construction
bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one
by onendash Permutations of the appearing orders of relevant
lexemes are already considered
48
PCFG Construction
49
Each semantic concept is generated by unigram Markov process
All semantic concepts gen-erate relevant NL words
Parsing New NL Sentences
bull Follows the similar scheme as in Hierarchy Generation PCFG model
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for
generating NL wordsndash Mark relevant lexeme MR components in the
context MR appearing in the top nonterminal
50
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT
atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn left and find the sofa then turn around the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Context MR
Context MR
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
54
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: training example → trained baseline generative model → GEN → n-best candidates (Candidate 1 … Candidate n) with feature vectors a₁ … aₙ and perceptron scores −0.16, 1.21, −1.09, 1.46, 0.59; the gold-standard reference has feature vector a_g, and the update adds the feature-vector difference a_g − a₄, since Candidate 4 is the (wrong) best prediction]
73
(A gold-standard reference is not available for our generative models.)
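The averaged-perceptron update described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: candidates are represented as sparse feature dicts, and `train_reranker`/`score` are hypothetical names.

```python
# Sketch of averaged-perceptron reranking (Collins-style).
# Each candidate is a sparse feature dict {feature_name: value}.
def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def train_reranker(examples, epochs=5):
    """examples: list of (candidate_feats_list, gold_index) pairs."""
    w, w_sum, t = {}, {}, 0
    for _ in range(epochs):
        for candidates, gold in examples:
            pred = max(range(len(candidates)), key=lambda i: score(w, candidates[i]))
            if pred != gold:
                # Update with the feature-vector difference: gold minus prediction.
                for f, v in candidates[gold].items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in candidates[pred].items():
                    w[f] = w.get(f, 0.0) - v
            # Accumulate weights after every example for averaging.
            for f, v in w.items():
                w_sum[f] = w_sum.get(f, 0.0) + v
            t += 1
    # Averaging reduces sensitivity to late updates.
    return {f: v / t for f, v in w_sum.items()}
```

The averaging step is what makes this the *averaged* perceptron rather than the vanilla one; it is the standard stabilization used in reranking work.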
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
– The one most preferred in terms of plan execution
– Evaluate the composed MR plans from the candidate parses
– The MARCO (MacMahon et al. AAAI 2006) execution module runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
– Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic; average over 10 trials)
– Prefer the candidate with the best execution success rate during training
74
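The pseudo-gold selection above can be sketched as follows. `execute` is a hypothetical stand-in for the MARCO execution module; trials are averaged because the executor is nondeterministic.

```python
# Pick a pseudo-gold candidate by execution success rate, averaging over
# several trials because the execution module is nondeterministic.
def success_rate(mr, execute, trials=10):
    return sum(1.0 for _ in range(trials) if execute(mr)) / trials

def pick_pseudo_gold(candidate_mrs, execute, trials=10):
    rates = [success_rate(mr, execute, trials) for mr in candidate_mrs]
    best = max(range(len(candidate_mrs)), key=lambda i: rates[i])
    return best, rates
```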
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: n-best candidates (Candidate 1 … Candidate n) → derived MRs MR₁ … MRₙ → MARCO execution module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.79, 0.21, −1.09, 1.46, 0.59; the candidate with the best success rate (0.9) becomes the pseudo-gold reference, and the perceptron updates with the feature-vector difference between the pseudo-gold reference and the current best prediction]
75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
– Multiple parses may have the same maximum execution success rate
– "Lower" execution success rates could still indicate a correct plan, given the indirect supervision from human follower actions
  MR plans may be underspecified or carry ignorable details
  Sometimes inaccurate, but they contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
– Use candidates with higher execution success rates than the currently best-predicted candidate
– Update with the feature-vector difference, weighted by the difference between execution success rates
76
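The multi-parse update described above might look like this sketch: every candidate whose execution success rate beats the current best prediction contributes a feature-difference update, weighted by the gap in success rates. All names are illustrative; the slide does not fix an exact implementation.

```python
def multi_parse_update(w, candidates, rates, pred):
    """candidates: list of sparse feature dicts; rates: execution success
    rates per candidate; pred: index of the model's current best prediction."""
    for i, feats in enumerate(candidates):
        gap = rates[i] - rates[pred]
        if gap > 0:  # only candidates that execute better than the prediction
            # Weighted feature-vector difference: candidate minus prediction.
            for f, v in feats.items():
                w[f] = w.get(f, 0.0) + gap * v
            for f, v in candidates[pred].items():
                w[f] = w.get(f, 0.0) - gap * v
    return w
```

Weighting by the success-rate gap means candidates that are only slightly better nudge the weights slightly, while clearly better plans move them more.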
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram, shown in two steps (Update (1) and Update (2)): n-best candidates (Candidate 1 … Candidate n) → derived MRs MR₁ … MRₙ → MARCO execution module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59; each candidate whose success rate exceeds that of the current best prediction contributes a weighted feature-vector-difference update]
77–78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins EMNLP 2002; Lu et al. EMNLP 2008; Ge & Mooney ACL 2006)
NL: "Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)   L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)   L5: Travel(), Verify(at: SOFA)   L6: Turn()
f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1
79
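Binary composition features of this kind could be extracted with a sketch like the following, walking a parse tree given as `(label, children)` tuples. The tree encoding and feature names are illustrative, not the thesis's actual feature templates.

```python
# Extract binary composition features from a parse tree represented as
# (label, [children]) tuples, where a child is either a subtree tuple
# or a terminal word string.
def tree_features(tree, feats=None):
    if feats is None:
        feats = set()
    label, children = tree
    for child in children:
        if isinstance(child, tuple):
            feats.add((label, "->", child[0]))  # nonterminal composition
            tree_features(child, feats)
        else:
            feats.add((label, "word", child))   # terminal (word) emission
    return feats

# Each extracted feature fires with value 1 if present, matching the
# binary indicators on the slide.
```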
Evaluations
• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy
– Plan execution accuracy (end goal)
• Compared with two baseline models
– Hierarchy and Unigram Generation PCFG models
– All reranking results use 50-best parses
– We extract 50 distinct composed MR plans (and their parses) from the 1,000,000-best parses, since many parse trees differ insignificantly and lead to the same derived MR plan; a sufficiently large 1,000,000-best list is generated from the baseline model
80
Response-based Update vs. Baseline (English)
81
                  Parse F1              Single-sentence       Paragraph
                  Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline          74.81      76.44      57.22      67.14      20.17      28.12
Response-based    73.32      77.24      59.65      68.27      22.62      29.20
Response-based Update vs. Baseline (Chinese-Word)
82
                  Parse F1              Single-sentence       Paragraph
                  Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline          75.53      76.41      61.03      63.40      19.08      23.12
Response-based    77.26      77.74      64.12      65.64      21.29      23.74
Response-based Update vs. Baseline (Chinese-Character)
83
                  Parse F1              Single-sentence       Paragraph
                  Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline          73.05      77.55      55.61      62.85      12.74      23.33
Response-based    76.26      79.76      64.08      65.50      22.25      25.35
Response-based Update vs Baseline
• vs. baseline
– The response-based approach performs better in the final end-task plan execution
– It optimizes the model for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)
85
                  Parse F1              Single-sentence       Paragraph
                  Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single            73.32      77.24      59.65      68.27      22.62      29.20
Multi             73.43      77.81      62.81      68.93      26.57      29.10
Response-based Update with Multiple vs. Single Parses (Chinese-Word)
86
                  Parse F1              Single-sentence       Paragraph
                  Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single            77.26      77.74      64.12      65.64      21.29      23.74
Multi             78.80      78.11      64.15      66.27      21.55      25.95
Response-based Update with Multiple vs. Single Parses (Chinese-Character)
87
                  Parse F1              Single-sentence       Paragraph
                  Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single            76.26      79.76      64.08      65.50      22.25      25.35
Multi             79.44      79.94      64.08      66.84      22.58      27.16
Response-based Update with Multiple vs Single Parses
• Using multiple parses improves performance in general
– The single-best pseudo-gold parse provides only weak feedback
– Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, yet capture the gist of the preferred actions
– A variety of preferable parses helps improve the amount and the quality of the weak feedback
88
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
– Learn a joint model of syntactic and semantic structure
• Large-scale data
– Data collection and model adaptation at large scale
• Machine translation
– Application to summarized translation
• Real perceptual data
– Learn with raw features (sensory and vision data)
90
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
[Tree diagrams: the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner", built over context-MR components Turn(LEFT), Verify(front: BLUE HALL / front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)]
46
Unigram Generation PCFG Model
• Limitations of the Hierarchy Generation PCFG model
– Complexities caused by the Lexeme Hierarchy Graph and k-permutations
– Tends to over-fit to the training data
• Proposed solution: a simpler model
– Generate relevant semantic lexemes one by one
– No extra PCFG rules for k-permutations
– Maintains a simpler PCFG rule set; faster to train
47
PCFG Construction
• Unigram Markov generation of relevant lexemes
– Each context MR generates relevant lexemes one by one
– Permutations of the appearing orders of relevant lexemes are thereby already accounted for
48
PCFG Construction
49
[Diagram: each semantic concept is generated by a unigram Markov process; all semantic concepts generate relevant NL words]
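One plausible way to encode this unigram Markov generation as PCFG rules is sketched below (an illustration only; the thesis's actual rule schema may differ): a context-MR nonterminal repeatedly emits one relevant lexeme and then either continues or stops, so any ordering of lexemes is derivable without permutation-specific rules.

```python
# Sketch: build unigram-Markov PCFG rules for one context MR. Because the
# context nonterminal re-expands after each lexeme, every ordering of the
# lexemes is derivable without enumerating k-permutations.
def unigram_rules(context, lexemes, p_stop=0.2):
    p_lex = (1.0 - p_stop) / len(lexemes)  # uniform initialization; EM re-estimates
    rules = [(context, ("STOP",), p_stop)]
    for lex in lexemes:
        # context -> lexeme context  (emit one lexeme, then continue)
        rules.append((context, (lex, context), p_lex))
    return rules
```

This keeps the grammar linear in the number of lexemes per context MR, which matches the slide's claim of a smaller rule set and faster EM training.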
Parsing New NL Sentences
• Follows a scheme similar to the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
– Consider only the lexeme MRs responsible for generating NL words
– Mark the relevant lexeme-MR components in the context MR appearing in the top nonterminal
50
[Tree diagrams: the context MR Turn(LEFT), Verify(front: BLUE HALL / front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) and its relevant lexemes (e.g. Turn(LEFT); Travel(), Verify(at: SOFA); Turn()) are combined with the most probable parse tree for the test NL instruction "Turn left and find the sofa then turn around the corner" to compose the final MR parse]
54
Data
• 3 maps, 6 instructors, 1–15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
– Word-segmented version by the Stanford Chinese Word Segmenter
– Character-segmented version
55
[Example: the paragraph instruction "Take the wood path towards the easel. At the easel go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7." is segmented into single sentences, each paired with its segment of the action trace (Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward, …)]
Data Statistics
56
                         Paragraph        Single-Sentence
# Instructions           706              3236
Avg. # sentences         5.0 (±2.8)       1.0 (±0)
Avg. # actions           10.4 (±5.7)      2.1 (±2.4)
Avg. # words/sentence
  English                37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word           31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character      48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English                660              629
  Chinese-Word           661              508
  Chinese-Character      448              328
Evaluations
• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy & plan execution accuracy
• Compared with Chen and Mooney (2011) and Chen (2012)
– The ambiguous context (landmarks plan) is refined by greedy selection of high-scoring lexemes, with two different lexicon learning algorithms:
  Chen and Mooney 2011: Graph Intersection Lexicon Learning (GILL)
  Chen 2012: Subgraph Generation Online Lexicon Learning (SGOLL)
– The semantic parser KRISP (Kate and Mooney 2006) is then trained on the resulting supervised data
57
Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in the test data
• Metric: partial parse accuracy
58
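The slide does not spell out the partial-credit computation; a common way to score partial parse accuracy is component-level precision/recall/F1 over the MR, as in this sketch (the component representation is an assumption, not the thesis's exact metric).

```python
# Partial parse accuracy sketched as precision/recall/F1 over MR components
# (e.g. individual actions), rather than exact-match of whole plans.
def prf1(gold_components, predicted_components):
    gold, pred = set(gold_components), set(predicted_components)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```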
Parse Accuracy (English)
59
                                   Precision   Recall   F1
Chen & Mooney (2011)               90.16       55.41    68.59
Chen (2012)                        88.36       57.03    69.31
Hierarchy Generation PCFG Model    87.58       65.41    74.81
Unigram Generation PCFG Model      86.10       68.79    76.44
Parse Accuracy (Chinese-Word)
60
                                   Precision   Recall   F1
Chen (2012)                        88.87       58.76    70.74
Hierarchy Generation PCFG Model    80.56       71.14    75.53
Unigram Generation PCFG Model      79.45       73.66    76.41
Parse Accuracy (Chinese-Character)
61
                                   Precision   Recall   F1
Chen (2012)                        92.48       56.47    70.01
Hierarchy Generation PCFG Model    79.77       67.38    73.05
Unigram Generation PCFG Model      79.73       75.52    77.55
End-to-End Execution Evaluations
• Test how well the formal plan output by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
– Facing direction is also considered in the single-sentence setting
– Paragraph execution is affected by even one failed single-sentence execution
62
End-to-End Execution Evaluations (English)
63
                                   Single-Sentence   Paragraph
Chen & Mooney (2011)               54.40             16.18
Chen (2012)                        57.28             19.18
Hierarchy Generation PCFG Model    57.22             20.17
Unigram Generation PCFG Model      67.14             28.12
End-to-End Execution Evaluations (Chinese-Word)
64
                                   Single-Sentence   Paragraph
Chen (2012)                        58.70             20.13
Hierarchy Generation PCFG Model    61.03             19.08
Unigram Generation PCFG Model      63.40             23.12
End-to-End Execution Evaluations (Chinese-Character)
65
                                   Single-Sentence   Paragraph
Chen (2012)                        57.27             16.73
Hierarchy Generation PCFG Model    55.61             12.74
Unigram Generation PCFG Model      62.85             23.33
Discussion
• Better recall in parse accuracy
– Our probabilistic model also uses useful but low-scoring lexemes → more coverage
– Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
– Complexities: LHG and k-permutation rules
– Particularly weak on the Chinese-character corpus: the longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
– Less complexity: avoids over-fitting and generalizes better
• Better than Börschinger et al. 2011
– Overcomes intractability with a complex MRL
– Learns from more general, complex ambiguity
– Produces novel MR parses never seen during training
66
Comparison of Grammar Size and EM Training Time
67
                      Hierarchy Generation PCFG   Unigram Generation PCFG
Data                  |Grammar|    Time (hrs)     |Grammar|    Time (hrs)
English               20451        17.26          16357        8.78
Chinese (Word)        21636        15.99          15459        8.05
Chinese (Character)   19792        18.64          13514        12.58
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify
Turn
LEFT frontSOFA
Verify
Turn
LEFT
Turn
RIGHT
steps2
atSOFA
Travel Verify Turn
RIGHT
atSOFA
Travel Verify Turn
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
46
Unigram Generation PCFG Model
bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph
and k-permutationsndash Tend to over-fit to the training data
bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train
47
PCFG Construction
bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one
by onendash Permutations of the appearing orders of relevant
lexemes are already considered
48
PCFG Construction
49
Each semantic concept is generated by unigram Markov process
All semantic concepts gen-erate relevant NL words
Parsing New NL Sentences
bull Follows the similar scheme as in Hierarchy Generation PCFG model
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for
generating NL wordsndash Mark relevant lexeme MR components in the
context MR appearing in the top nonterminal
50
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT
atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn left and find the sofa then turn around the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Context MR
Context MR
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
54
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• An effective approach for improving the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – The trained model outputs the single best result with maximum probability
[Diagram: a testing example goes through the trained generative model, producing the 1-best candidate (Candidate 1) with maximum probability.]
70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: a testing example goes through the trained baseline generative model (GEN), producing n-best candidates (Candidate 1 … Candidate n); a trained secondary discriminative model then selects the best prediction as output.]
71
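The selection step above can be sketched as a linear scoring function over candidate features. This is an illustrative stand-in, not the thesis implementation: `rerank`, the toy feature dictionaries, and the weight values are all assumptions.

```python
# Sketch of n-best discriminative reranking: score each candidate with a
# learned weight vector over sparse features, return the highest scorer.

def rerank(candidates, features, weights):
    """Return the candidate whose feature vector scores highest under `weights`."""
    def score(cand):
        return sum(weights.get(f, 0.0) * v for f, v in features(cand).items())
    return max(candidates, key=score)

# Toy usage: two candidates described by indicator features (hypothetical names).
feats = {"c1": {"rule:A->B": 1.0}, "c2": {"rule:A->C": 1.0}}
w = {"rule:A->B": -0.5, "rule:A->C": 1.2}
best = rerank(["c1", "c2"], lambda c: feats[c], w)
# best == "c2"
```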
How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
  – There is no single gold-standard reference for each training example
  – Instead, only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (also used in evaluating the final end-task plan execution)
  – Gives a weak indication of whether a candidate is good or bad
  – Use multiple candidate parses for each parameter update, since the response signal is weak and distributed over all candidates
72
Reranking Model: Averaged Perceptron (Collins, 2000)
• The parameter weight vector is updated whenever the trained model predicts a wrong candidate
[Diagram: a training example goes through the trained baseline generative model (GEN) to produce n-best candidates (Candidate 1 … Candidate n) with feature vectors a_1, a_2, a_3, a_4, …, a_n and perceptron scores −0.16, 1.21, −1.09, 1.46, 0.59; the best prediction (Candidate 4, score 1.46) is compared against the gold-standard reference (feature vector a_g), and the weights are updated by the feature-vector difference a_g − a_4. A gold-standard reference is Not Available for our generative models.]
73
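The averaged perceptron for reranking can be sketched as below, a minimal illustration in the style of Collins (2000/2002). The function name, sparse-dict feature representation, and toy example are assumptions, not the thesis code.

```python
# Minimal averaged-perceptron reranker: when the model's top-scoring
# candidate differs from the gold one, add the gold feature vector and
# subtract the predicted one; return the average of all weight vectors.

def perceptron_train(examples, n_epochs=5):
    """examples: list of (candidate_feature_vectors, gold_index)."""
    w, w_sum, t = {}, {}, 0
    for _ in range(n_epochs):
        for cands, gold in examples:
            pred = max(range(len(cands)),
                       key=lambda i: sum(w.get(f, 0.0) * v
                                         for f, v in cands[i].items()))
            if pred != gold:
                for f, v in cands[gold].items():   # pull toward the gold parse
                    w[f] = w.get(f, 0.0) + v
                for f, v in cands[pred].items():   # push away from the mistake
                    w[f] = w.get(f, 0.0) - v
            t += 1
            for f, v in w.items():                 # accumulate for averaging
                w_sum[f] = w_sum.get(f, 0.0) + v
    # Averaging damps oscillation and reduces overfitting to late updates.
    return {f: v / t for f, v in w_sum.items()}

# Toy usage: one example whose gold candidate is index 1.
avg = perceptron_train([([{"a": 1.0}, {"b": 1.0}], 1)])
```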
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic, so average over 10 trials)
  – Prefer the candidate with the best execution success rate during training
74
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: the n-best candidates (Candidate 1 … Candidate n) yield derived MRs (MR_1 … MR_n), which the MARCO execution module runs to obtain execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores are 1.79, 0.21, −1.09, 1.46, 0.59. The candidate with the highest success rate (0.9) serves as the pseudo-gold reference, and the weights are updated by the feature-vector difference between it and the best prediction.]
75
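Pseudo-gold selection can be sketched as below. `execute` is a hypothetical stand-in for the MARCO execution module; since the slides note MARCO is nondeterministic, each MR is run several times and the successes averaged.

```python
# Sketch: pick a pseudo-gold candidate by simulated execution success rate.

def success_rate(mr, execute, trials=10):
    """Average success (0/1 per run) of executing `mr` over several trials."""
    return sum(execute(mr) for _ in range(trials)) / trials

def pick_pseudo_gold(candidate_mrs, execute, trials=10):
    """Return the index of the candidate MR with the best success rate,
    plus all rates (useful later for multi-parse updates)."""
    rates = [success_rate(mr, execute, trials) for mr in candidate_mrs]
    best = max(range(len(candidate_mrs)), key=lambda i: rates[i])
    return best, rates
```

With a deterministic toy `execute`, the candidate that reaches the destination is selected as pseudo-gold.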
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates can still indicate a correct plan, given the indirect supervision from human follower actions: MR plans may be underspecified or carry ignorable details, and are sometimes inaccurate yet contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
76
Weight Update with Multiple Parses
• Update the weights with multiple candidates whose execution success rate is higher than that of the currently predicted parse
[Diagram: execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59. Update (1): the first candidate whose success rate (0.6) exceeds the predicted parse's (0.4) contributes its feature-vector difference, weighted by the success-rate difference.]
77
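The multi-parse update can be sketched as follows; the function name and toy numbers are illustrative assumptions. Each candidate that beats the predicted parse's success rate contributes an update weighted by its success-rate margin, as the slides describe.

```python
# Sketch: perceptron update from multiple parses, weighted by how much each
# candidate's execution success rate exceeds the current prediction's.

def multi_parse_update(weights, cand_feats, rates, pred):
    """Update toward every candidate whose rate beats the predicted parse's."""
    for feats, rate in zip(cand_feats, rates):
        margin = rate - rates[pred]
        if margin > 0:
            for f, v in feats.items():             # pull toward better candidate
                weights[f] = weights.get(f, 0.0) + margin * v
            for f, v in cand_feats[pred].items():  # push away from prediction
                weights[f] = weights.get(f, 0.0) - margin * v
    return weights

# Toy usage: prediction is candidate 1 (rate 0.4); candidates 0 (0.6) and
# 2 (0.9) both qualify, with margins 0.2 and 0.5.
w = multi_parse_update({}, [{"a": 1.0}, {"b": 1.0}, {"c": 1.0}],
                       [0.6, 0.4, 0.9], pred=1)
```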
Weight Update with Multiple Parses
[Diagram, continued: Update (2): the next qualifying candidate (success rate 0.9) likewise contributes its feature-vector difference, weighted by the success-rate difference.]
78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner"

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)      L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)      L5: Travel(), Verify(at: SOFA)      L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5 → "find") = 1
79
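Such indicator features can be sketched as an extraction pass over parse trees. The (label, children)-tuple tree encoding and the feature-name format below are assumptions for illustration, not the thesis representation.

```python
# Sketch: binary indicator features over parse-tree rule compositions,
# in the spirit of Collins (2002). Trees are (label, children) tuples;
# leaves are plain word strings.

def tree_features(tree, feats=None):
    """Collect one indicator per parent -> children composition in the tree."""
    if feats is None:
        feats = {}
    label, children = tree
    child_labels = [c if isinstance(c, str) else c[0] for c in children]
    feats["%s->%s" % (label, " ".join(child_labels))] = 1.0
    for c in children:
        if not isinstance(c, str):
            tree_features(c, feats)   # recurse into subtrees
    return feats

# Toy tree: L3 rewrites to L5 and L6; L5 yields the word "find".
t = ("L3", [("L5", ["find"]), ("L6", ["turn"])])
# tree_features(t) -> {"L3->L5 L6": 1.0, "L5->find": 1.0, "L6->turn": 1.0}
```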
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – We extract 50-best distinct composed MR plans (and corresponding parses) from the 1,000,000-best parses: many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best list is generated from the baseline model
80
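The deduplication of a long k-best list into distinct MR plans can be sketched as a single pass over the score-sorted parses; `derive_mr` is a hypothetical stand-in for plan composition from a parse tree.

```python
# Sketch: collect up to n candidates with distinct derived MR plans from a
# long k-best list, since many parse trees collapse to the same plan.

def distinct_nbest(parses, derive_mr, n=50):
    """parses are assumed sorted by model score, best first."""
    seen, out = set(), []
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:        # keep only the first parse per distinct plan
            seen.add(mr)
            out.append((parse, mr))
            if len(out) == n:
                break
    return out
```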
Response-based Update vs Baseline (English)
81

             Parse F1              Single-Sentence       Paragraph
             Baseline  Response    Baseline  Response    Baseline  Response
Hierarchy    74.81     73.32       57.22     59.65       20.17     22.62
Unigram      76.44     77.24       67.14     68.27       28.12     29.20
Response-based Update vs Baseline (Chinese-Word)
82

             Parse F1              Single-Sentence       Paragraph
             Baseline  Response    Baseline  Response    Baseline  Response
Hierarchy    75.53     77.26       61.03     64.12       19.08     21.29
Unigram      76.41     77.74       63.40     65.64       23.12     23.74
Response-based Update vs Baseline (Chinese-Character)
83

             Parse F1              Single-Sentence       Paragraph
             Baseline  Response    Baseline  Response    Baseline  Response
Hierarchy    73.05     76.26       55.61     64.08       12.74     22.25
Unigram      77.55     79.76       62.85     65.50       23.33     25.35
Response-based Update vs Baseline
• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85

             Parse F1          Single-Sentence    Paragraph
             Single   Multi    Single   Multi     Single   Multi
Hierarchy    73.32    73.43    59.65    62.81     22.62    26.57
Unigram      77.24    77.81    68.27    68.93     29.20    29.10
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86

             Parse F1          Single-Sentence    Paragraph
             Single   Multi    Single   Multi     Single   Multi
Hierarchy    77.26    78.80    64.12    64.15     21.29    21.55
Unigram      77.74    78.11    65.64    66.27     23.74    25.95
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87

             Parse F1          Single-Sentence    Paragraph
             Single   Multi    Single   Multi     Single   Multi
Hierarchy    76.26    79.44    64.08    64.08     22.25    22.58
Unigram      79.76    79.94    65.50    66.84     25.35    27.16
Response-based Update with Multiple vs Single Parses
• Using multiple parses improves performance in general
  – The single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates can produce underspecified plans, or plans with ignorable details attached, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback
88
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn from raw features (sensory and vision data)
90
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable because of the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible, and effective, with weak feedback from the perceptual environment
92
Thank You
Unigram Generation PCFG Model
bull Limitations of Hierarchy Generation PCFG Model ndash Complexities caused by Lexeme Hierarchy Graph
and k-permutationsndash Tend to over-fit to the training data
bull Proposed Solution Simpler modelndash Generate relevant semantic lexemes one by onendash No extra PCFG rules for k-permutationsndash Maintains simpler PCFG rule set faster to train
47
PCFG Construction
bull Unigram Markov generation of relevant lexemesndash Each context MR generates relevant lexemes one
by onendash Permutations of the appearing orders of relevant
lexemes are already considered
48
PCFG Construction
49
Each semantic concept is generated by unigram Markov process
All semantic concepts gen-erate relevant NL words
Parsing New NL Sentences
bull Follows the similar scheme as in Hierarchy Generation PCFG model
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for
generating NL wordsndash Mark relevant lexeme MR components in the
context MR appearing in the top nonterminal
50
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT
atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn left and find the sofa then turn around the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Context MR
Context MR
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
54
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
• Test how well the formal plan produced by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered in single-sentence evaluation
  – Paragraph execution fails if even one single-sentence execution fails
62
End-to-End Execution Evaluations (English)

                                   Single-Sentence   Paragraph
Chen & Mooney (2011)               54.4              16.18
Chen (2012)                        57.28             19.18
Hierarchy Generation PCFG Model    57.22             20.17
Unigram Generation PCFG Model      67.14             28.12

63
End-to-End Execution Evaluations (Chinese-Word)

                                   Single-Sentence   Paragraph
Chen (2012)                        58.7              20.13
Hierarchy Generation PCFG Model    61.03             19.08
Unigram Generation PCFG Model      63.4              23.12

64
End-to-End Execution Evaluations (Chinese-Character)

                                   Single-Sentence   Paragraph
Chen (2012)                        57.27             16.73
Hierarchy Generation PCFG Model    55.61             12.74
Unigram Generation PCFG Model      62.85             23.33

65
Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-score lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity, avoids over-fitting, better generalization
• Better than Börschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Handles novel MR parses never seen during training

66
Comparison of Grammar Size and EM Training Time

                      Hierarchy Generation PCFG    Unigram Generation PCFG
Data                  |Grammar|   Time (hrs)       |Grammar|   Time (hrs)
English               20,451      17.26            16,357      8.78
Chinese (Word)        21,636      15.99            15,459      8.05
Chinese (Character)   19,792      18.64            13,514      12.58

67
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – The trained model outputs the best result with maximum probability

[Diagram: Testing Example → Trained Generative Model → 1-best candidate with maximum probability]
70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]
71
How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, each example provides weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    (also used in evaluating the final end-task plan execution)
  – Weak indication of whether a candidate is good or bad
  – Multiple candidate parses per parameter update
    (the response signal is weak and distributed over all candidates)
72
Reranking Model: Averaged Perceptron (Collins, 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate

[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n, with feature vectors a_1 … a_n and perceptron scores -0.16, 1.21, -1.09, 1.46, 0.59) → Perceptron → Best prediction, compared against the Gold Standard Reference a_g; update with the feature vector difference a_g − a_4. A gold-standard reference is not available for our generative models.]

73
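The update rule sketched on this slide is the standard averaged perceptron in the spirit of Collins (2000). A minimal sketch with dict-based sparse feature vectors; candidate generation (GEN) and feature extraction are assumed to happen elsewhere, and all names are illustrative:

```python
# Averaged perceptron for n-best reranking (after Collins, 2000).
# Candidates arrive as sparse feature dicts; the weight vector is averaged
# over all update steps to reduce variance.

def perceptron_score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def train_averaged_perceptron(examples, epochs=5):
    """examples: list of (candidate_feature_vectors, gold_index) pairs."""
    w, total, steps = {}, {}, 0
    for _ in range(epochs):
        for cands, gold in examples:
            pred = max(range(len(cands)),
                       key=lambda i: perceptron_score(w, cands[i]))
            if pred != gold:
                # standard update: w += f(gold) - f(prediction)
                for f, v in cands[gold].items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in cands[pred].items():
                    w[f] = w.get(f, 0.0) - v
            # accumulate the running weights for averaging
            for f, v in w.items():
                total[f] = total.get(f, 0.0) + v
            steps += 1
    return {f: v / steps for f, v in total.items()}
```

In the grounded setting the `gold_index` is exactly what is missing, which motivates the response-based selection of a pseudo-gold parse on the next slides.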
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination
    (MARCO is nondeterministic: average over 10 trials)
  – Prefer the candidate with the best execution success rate during training
74
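The pseudo-gold selection above can be sketched as follows. The `execute` callback stands in for one nondeterministic MARCO trial and is a hypothetical interface, not the real simulator API:

```python
# Sketch of response-based pseudo-gold selection. `execute(mr_plan)` is an
# assumed caller-supplied function returning True/False for a single
# (possibly nondeterministic) simulated execution of one MR plan.

def execution_success_rate(mr_plan, execute, trials=10):
    # MARCO-style execution is nondeterministic, so average over several trials.
    return sum(execute(mr_plan) for _ in range(trials)) / trials

def pick_pseudo_gold(candidate_mrs, execute, trials=10):
    """Return the index of the best-executing candidate plus all rates."""
    rates = [execution_success_rate(mr, execute, trials)
             for mr in candidate_mrs]
    best = max(range(len(candidate_mrs)), key=lambda i: rates[i])
    return best, rates
```

The chosen index then plays the role of `gold_index` in the perceptron update, even though it is only a weak, execution-derived stand-in for a true reference.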
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Diagram: n-best candidates (Candidate 1 … Candidate n) with derived MRs MR_1 … MR_n → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.79, 0.21, -1.09, 1.46, 0.59. The candidate with the highest success rate (0.9) becomes the pseudo-gold reference; the perceptron updates with the feature vector difference between it and the best prediction.]
75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    (MR plans are underspecified or have ignorable details attached; sometimes inaccurate, but they contain the correct MR components needed to reach the desired goal)
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature vector difference, weighted by the difference between execution success rates
76
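The multiple-parse update described above can be sketched as follows, reusing dict-based sparse feature vectors; all names are illustrative assumptions rather than the thesis implementation:

```python
# Sketch of the multiple-parse weight update: every candidate whose execution
# success rate exceeds that of the current best-scoring prediction contributes
# a feature-vector difference, weighted by the rate difference.

def multi_parse_update(w, candidates, rates):
    """candidates: sparse feature dicts; rates: execution success rates."""
    score = lambda feats: sum(w.get(f, 0.0) * v for f, v in feats.items())
    pred = max(range(len(candidates)), key=lambda i: score(candidates[i]))
    for i, (feats, rate) in enumerate(zip(candidates, rates)):
        gain = rate - rates[pred]
        if gain <= 0:
            continue  # only candidates that executed better than the prediction
        # weighted update: w += gain * (f(candidate_i) - f(prediction))
        for f, v in feats.items():
            w[f] = w.get(f, 0.0) + gain * v
        for f, v in candidates[pred].items():
            w[f] = w.get(f, 0.0) - gain * v
    return w
```

Weighting by the rate difference means a candidate that executes only slightly better nudges the weights slightly, while a much more successful candidate moves them strongly, which spreads the weak response signal over several preferable parses.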
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, slides 77-78: n-best candidates (Candidate 1 … Candidate n) with derived MRs MR_1 … MR_n → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59. Updates (1) and (2) use the feature vector differences of the candidates whose success rates exceed that of the best prediction.]

78
Features
• Binary indicator of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa then turn around the corner
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)
L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)
L5: Travel(), Verify(at: SOFA)
L6: Turn()

Example features: f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5, "find") = 1

79
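Extracting such binary indicator features can be sketched as below, assuming a simple tuple encoding `(label, children...)` for parse trees; the encoding and helper name are illustrative, not the system's actual tree representation:

```python
# Sketch of binary indicator features over a parse tree: each observed
# composition of nonterminals/terminals becomes a feature with value 1.
# Trees are tuples (label, child, child, ...); leaves are plain strings.

def tree_features(tree, feats=None):
    if feats is None:
        feats = {}
    label, children = tree[0], tree[1:]
    if children:
        # rule feature: a parent together with its full child sequence
        child_labels = tuple(c[0] if isinstance(c, tuple) else c
                             for c in children)
        feats[(label, child_labels)] = 1
        for c in children:
            if isinstance(c, tuple):
                tree_features(c, feats)
            else:
                # nonterminal-word feature, e.g. (L5, "find")
                feats[(label, c)] = 1
    return feats
```

The resulting sparse dicts plug directly into a perceptron-style reranker as the candidates' feature vectors.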
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get 50 distinct composed MR plans (and the corresponding parses) out of the 1,000,000-best parses
    (many parse trees differ insignificantly, leading to the same derived MR plan; generate a sufficiently large 1,000,000-best parse list from the baseline model)

80
Response-based Update vs. Baseline (English)

Parse F1           Hierarchy   Unigram
Baseline           74.81       76.44
Response-based     73.32       77.24

Single-sentence    Hierarchy   Unigram
Baseline           57.22       67.14
Response-based     59.65       68.27

Paragraph          Hierarchy   Unigram
Baseline           20.17       28.12
Response-based     22.62       29.2

81
Response-based Update vs. Baseline (Chinese-Word)

Parse F1           Hierarchy   Unigram
Baseline           75.53       76.41
Response-based     77.26       77.74

Single-sentence    Hierarchy   Unigram
Baseline           61.03       63.4
Response-based     64.12       65.64

Paragraph          Hierarchy   Unigram
Baseline           19.08       23.12
Response-based     21.29       23.74

82
Response-based Update vs. Baseline (Chinese-Character)

Parse F1           Hierarchy   Unigram
Baseline           73.05       77.55
Response-based     76.26       79.76

Single-sentence    Hierarchy   Unigram
Baseline           55.61       62.85
Response-based     64.08       65.5

Paragraph          Hierarchy   Unigram
Baseline           12.74       23.33
Response-based     22.25       25.35

83
Response-based Update vs. Baseline
• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)

Parse F1           Hierarchy   Unigram
Single             73.32       77.24
Multi              73.43       77.81

Single-sentence    Hierarchy   Unigram
Single             59.65       68.27
Multi              62.81       68.93

Paragraph          Hierarchy   Unigram
Single             22.62       29.2
Multi              26.57       29.1

85
Response-based Update with Multiple vs. Single Parses (Chinese-Word)

Parse F1           Hierarchy   Unigram
Single             77.26       77.74
Multi              78.8        78.11

Single-sentence    Hierarchy   Unigram
Single             64.12       65.64
Multi              64.15       66.27

Paragraph          Hierarchy   Unigram
Single             21.29       23.74
Multi              21.55       25.95

86
Response-based Update with Multiple vs. Single Parses (Chinese-Character)

Parse F1           Hierarchy   Unigram
Single             76.26       79.76
Multi              79.44       79.94

Single-sentence    Hierarchy   Unigram
Single             64.08       65.5
Multi              64.08       66.84

Paragraph          Hierarchy   Unigram
Single             22.25       25.35
Multi              22.58       27.16

87
Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, but capture the gist of the preferred actions
  – A variety of preferable parses helps improve both the amount and the quality of the weak feedback
88
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of a full probabilistic model for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
PCFG Construction
• Unigram Markov generation of relevant lexemes
  – Each context MR generates relevant lexemes one by one
  – Permutations of the appearing orders of relevant lexemes are already accounted for

48

PCFG Construction
• Each semantic concept is generated by a unigram Markov process
• All semantic concepts generate relevant NL words

49
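The unigram Markov generation described above can be illustrated with a toy sampler; the distributions, interface, and symbols are made up for illustration and are not the thesis's actual PCFG parameters:

```python
# Toy sketch of the Unigram Generation process: a context MR emits its
# relevant lexemes one by one (a unigram Markov process with a stop
# probability), and each chosen lexeme emits its NL words.

import random

def generate(context_lexemes, word_dist, stop_prob=0.3, rng=None):
    """context_lexemes: lexemes licensed by the context MR.
    word_dist: mapping lexeme -> list of NL words it emits (toy, deterministic)."""
    rng = rng or random.Random(0)
    words = []
    while True:
        if rng.random() < stop_prob:
            break                              # stop the unigram process
        lex = rng.choice(context_lexemes)      # pick the next relevant lexeme
        words.extend(word_dist[lex])           # the lexeme emits NL words
    return words
```

Because each lexeme is drawn independently, every ordering of the relevant lexemes is reachable, which is why permutations need no dedicated rules in this model.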
Parsing New NL Sentences
• Follows a scheme similar to the Hierarchy Generation PCFG model
• Composes the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal
50
[Slides 51-54: parse-tree diagrams for the test NL instruction "Turn left and find the sofa then turn around the corner".
Context MR: Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
Relevant lexemes: Turn(LEFT), Travel(), Verify(at: SOFA), Turn()
The most probable parse tree for the instruction marks the relevant lexeme MR components within the context MR.]

54
Data
• 3 maps, 6 instructors, 1-15 followers/direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney, 2011)
• Mandarin Chinese translation of each sentence (Chen, 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version

55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
PCFG Construction
49
Each semantic concept is generated by unigram Markov process
All semantic concepts gen-erate relevant NL words
Parsing New NL Sentences
bull Follows the similar scheme as in Hierarchy Generation PCFG model
bull Compose final MR parse from lexeme MRs appeared in the parse treendash Consider only the lexeme MRs responsible for
generating NL wordsndash Mark relevant lexeme MR components in the
context MR appearing in the top nonterminal
50
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT
atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn left and find the sofa then turn around the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Context MR
Context MR
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
54
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
• Evaluate how well the learned semantic parsers parse novel sentences in the test data
• Metric: partial parse accuracy
58
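Reporting partial accuracy as precision/recall/F1 over matched MR components can be sketched as follows; a simplification under the assumption that matches between predicted and gold MRs have already been counted:

```python
# Minimal sketch of precision/recall/F1 over MR components, assuming
# counts of matched components between predicted and gold MRs are given.
# This is an illustrative simplification of the partial-accuracy metric.
def prf1(n_matched, n_predicted, n_gold):
    precision = n_matched / n_predicted if n_predicted else 0.0
    recall = n_matched / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1(n_matched=6, n_predicted=8, n_gold=10)
```

This is why a model that also uses low-scoring lexemes can trade a little precision for substantially higher recall and F1, as in the tables below.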
Parse Accuracy (English)

                                   Precision   Recall   F1
Chen & Mooney (2011)                   90.16    55.41    68.59
Chen (2012)                            88.36    57.03    69.31
Hierarchy Generation PCFG Model        87.58    65.41    74.81
Unigram Generation PCFG Model          86.10    68.79    76.44
59
Parse Accuracy (Chinese-Word)

                                   Precision   Recall   F1
Chen (2012)                            88.87    58.76    70.74
Hierarchy Generation PCFG Model        80.56    71.14    75.53
Unigram Generation PCFG Model          79.45    73.66    76.41
60
Parse Accuracy (Chinese-Character)

                                   Precision   Recall   F1
Chen (2012)                            92.48    56.47    70.01
Hierarchy Generation PCFG Model        79.77    67.38    73.05
Unigram Generation PCFG Model          79.73    75.52    77.55
61
End-to-End Execution Evaluations
• Test how well the formal plan produced by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered for single-sentence instructions
  – Paragraph execution fails if even one single-sentence execution fails
62
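The strict success criterion can be sketched as an exact-match check on the final pose; the (x, y, heading) tuple representation is illustrative, not the talk's actual data structure:

```python
# Sketch of the strict end-to-end metric: success only if the final
# position matches the goal exactly, and (for single-sentence
# instructions) the facing direction too. Poses are illustrative
# (x, y, heading) tuples.
def execution_success(final_pose, goal_pose, check_heading=True):
    if (final_pose[0], final_pose[1]) != (goal_pose[0], goal_pose[1]):
        return False
    return final_pose[2] == goal_pose[2] if check_heading else True

ok = execution_success((3, 4, "N"), (3, 4, "N"))
near_miss = execution_success((3, 4, "E"), (3, 4, "N"))
paragraph_ok = execution_success((3, 4, "E"), (3, 4, "N"), check_heading=False)
```

Under this metric a near miss counts as a complete failure, which is why paragraph-level numbers are so much lower than single-sentence ones.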
End-to-End Execution Evaluations (English)

                                   Single-Sentence   Paragraph
Chen & Mooney (2011)                   54.40            16.18
Chen (2012)                            57.28            19.18
Hierarchy Generation PCFG Model        57.22            20.17
Unigram Generation PCFG Model          67.14            28.12
63
End-to-End Execution Evaluations (Chinese-Word)

                                   Single-Sentence   Paragraph
Chen (2012)                            58.70            20.13
Hierarchy Generation PCFG Model        61.03            19.08
Unigram Generation PCFG Model          63.40            23.12
64
End-to-End Execution Evaluations (Chinese-Character)

                                   Single-Sentence   Paragraph
Chen (2012)                            57.27            16.73
Hierarchy Generation PCFG Model        55.61            12.74
Unigram Generation PCFG Model          62.85            23.33
65
Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-scoring lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Börschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Handles novel MR parses never seen during training
66
Comparison of Grammar Size and EM Training Time

                      Hierarchy Generation PCFG    Unigram Generation PCFG
Data                  |Grammar|     Time (hrs)     |Grammar|     Time (hrs)
English                  20,451         17.26         16,357          8.78
Chinese (Word)           21,636         15.99         15,459          8.05
Chinese (Character)      19,792         18.64         13,514         12.58
67
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – The trained model outputs the best result with maximum probability
[Diagram: Testing Example → Trained Generative Model → Candidate 1, the 1-best candidate with maximum probability]
70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]
71
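The test-time selection step in the diagram above can be sketched as an argmax over perceptron scores; the sparse-dict feature representation and toy weights are illustrative assumptions:

```python
# Sketch of reranking at test time: the trained discriminative model
# re-scores the n-best candidates from the baseline generative model and
# returns the highest-scoring one. Feature vectors and weights are toy
# dicts, not the actual feature set.
def perceptron_score(weights, features):
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

def rerank(weights, n_best):
    """n_best: list of (candidate, feature_dict); return the best candidate."""
    return max(n_best, key=lambda c: perceptron_score(weights, c[1]))[0]

w = {"f(L1->L3)": 1.5, "f(L5,find)": -0.5}
n_best = [("parse_a", {"f(L1->L3)": 1.0}),
          ("parse_b", {"f(L5,find)": 1.0})]
best = rerank(w, n_best)
```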
How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead, training provides only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    Also used in evaluating the final end-task plan execution
  – Gives a weak indication of whether a candidate is good or bad
  – Multiple candidate parses are used for the parameter update
    The response signal is weak and distributed over all candidates
72
Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates with feature vectors a1 … an (perceptron scores: -0.16, 1.21, -1.09, 1.46, 0.59); the perceptron compares its best prediction against the gold-standard reference with feature vector ag and updates with the feature-vector difference ag − a4]
73
(For our generative models, such a gold-standard reference is Not Available.)
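The perceptron update on the previous slide can be sketched in a few lines; feature vectors are toy sparse dicts, and the averaging over intermediate weight vectors that gives the method its name is omitted for brevity:

```python
# Sketch of one averaged-perceptron update (Collins 2000): when the
# model's top-scoring candidate differs from the reference, add the
# feature-vector difference (gold minus predicted) to the weights.
# Feature vectors are toy dicts; weight averaging is omitted.
def update(weights, gold_features, predicted_features):
    for f, v in gold_features.items():
        weights[f] = weights.get(f, 0.0) + v
    for f, v in predicted_features.items():
        weights[f] = weights.get(f, 0.0) - v
    return weights

w = update({}, gold_features={"a": 1.0}, predicted_features={"b": 1.0})
```

The update raises the score of features seen in the reference and lowers those of the wrongly predicted candidate.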
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    Also used for evaluating end-goal plan execution performance
  – Record the execution success rate
    Whether each candidate MR reaches the intended destination
    MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training
74
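Pseudo-gold selection can be sketched as averaging a nondeterministic executor over several trials and taking the argmax; the executor below is a toy stand-in for MARCO, not its actual interface:

```python
# Sketch of pseudo-gold selection: run each candidate MR plan through a
# (possibly nondeterministic) execution module several times and prefer
# the candidate with the highest average success rate. The executor is a
# toy stand-in for MARCO.
def success_rate(execute, plan, trials=10):
    return sum(execute(plan) for _ in range(trials)) / trials

def pick_pseudo_gold(execute, candidate_plans, trials=10):
    return max(candidate_plans, key=lambda p: success_rate(execute, p, trials))

# Toy deterministic executor: succeeds only on the plan labeled "good".
executor = lambda plan: plan == "good"
pseudo_gold = pick_pseudo_gold(executor, ["bad", "good", "worse"])
```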
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: n-best candidates → derived MRs MR1 … MRn → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; the candidate with rate 0.9 becomes the pseudo-gold reference, and the perceptron (scores: 1.79, 0.21, -1.09, 1.46, 0.59) updates with the feature-vector difference against its best prediction]
75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    MR plans may be underspecified or have ignorable details attached
    Sometimes inaccurate, but they contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
76
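The multiple-parse update described above can be sketched as follows; candidates are toy (feature_dict, success_rate) pairs, an illustrative simplification of the actual training loop:

```python
# Sketch of the multiple-parse update: every candidate whose execution
# success rate beats the currently predicted parse contributes a
# feature-vector difference, scaled by the gap in success rates.
# Candidates are toy (feature_dict, success_rate) pairs.
def multi_update(weights, predicted, candidates):
    pred_feats, pred_rate = predicted
    for feats, rate in candidates:
        if rate <= pred_rate:
            continue  # only candidates that execute better than the prediction
        scale = rate - pred_rate
        for f, v in feats.items():
            weights[f] = weights.get(f, 0.0) + scale * v
        for f, v in pred_feats.items():
            weights[f] = weights.get(f, 0.0) - scale * v
    return weights

w = multi_update({}, predicted=({"p": 1.0}, 0.4),
                 candidates=[({"a": 1.0}, 0.9), ({"b": 1.0}, 0.2)])
```

A candidate far better than the prediction moves the weights strongly; one only slightly better moves them a little, which is how the strength of the weak response signal is reflected in the update.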
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram: n-best candidates → derived MRs MR1 … MRn → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59; update (1) uses the feature-vector difference with the first candidate whose success rate exceeds that of the predicted parse]
77
Weight Update with Multiple Parses
[Diagram, continued: update (2) uses the feature-vector difference with the next candidate whose execution success rate exceeds that of the currently predicted parse]
78
Features
• Binary indicator of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

NL: Turn left and find the sofa then turn around the corner
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

f(L1 → L3) = 1    f(L3 → L5 ∨ L1) = 1    f(L3 ⇒ L5 L6) = 1    f(L5, "find") = 1
79
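Extracting the binary rule-composition indicators can be sketched over a toy tree; the nested-tuple tree encoding and feature names are illustrative assumptions, not the actual feature templates:

```python
# Sketch of the binary indicator features: each feature fires (value 1)
# when a particular parent -> children composition appears in the
# candidate parse tree. Trees are toy nested tuples (label, children...).
def rule_features(tree, feats=None):
    """Collect one binary feature per parent->children rule in the tree."""
    if feats is None:
        feats = {}
    label, children = tree[0], tree[1:]
    if children:
        rule = label + " -> " + " ".join(c[0] for c in children)
        feats[rule] = 1
        for c in children:
            rule_features(c, feats)
    return feats

tree = ("L1", ("L3", ("L5",), ("L6",)))
feats = rule_features(tree)
```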
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, and the corresponding parses, out of the 1,000,000-best parses
    Many parse trees differ insignificantly, leading to the same derived MR plans
    Generate sufficiently large 1,000,000-best parse lists from the baseline model
80
Response-based Update vs. Baseline (English)

Parse F1:             Hierarchy   Unigram
  Baseline                74.81     76.44
  Response-based          73.32     77.24
Single-sentence:      Hierarchy   Unigram
  Baseline                57.22     67.14
  Response-based          59.65     68.27
Paragraph:            Hierarchy   Unigram
  Baseline                20.17     28.12
  Response-based          22.62     29.20
81
Response-based Update vs. Baseline (Chinese-Word)

Parse F1:             Hierarchy   Unigram
  Baseline                75.53     76.41
  Response-based          77.26     77.74
Single-sentence:      Hierarchy   Unigram
  Baseline                61.03     63.46
  Response-based          64.12     65.64
Paragraph:            Hierarchy   Unigram
  Baseline                19.08     23.12
  Response-based          21.29     23.74
82
Response-based Update vs. Baseline (Chinese-Character)

Parse F1:             Hierarchy   Unigram
  Baseline                73.05     77.55
  Response-based          76.26     79.76
Single-sentence:      Hierarchy   Unigram
  Baseline                55.61     62.85
  Response-based          64.08     65.50
Paragraph:            Hierarchy   Unigram
  Baseline                12.74     23.33
  Response-based          22.25     25.35
83
Response-based Update vs. Baseline
• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)

Parse F1:             Hierarchy   Unigram
  Single                  73.32     77.24
  Multi                   73.43     77.81
Single-sentence:      Hierarchy   Unigram
  Single                  59.65     68.27
  Multi                   62.81     68.93
Paragraph:            Hierarchy   Unigram
  Single                  22.62     29.20
  Multi                   26.57     29.10
85
Response-based Update with Multiple vs. Single Parses (Chinese-Word)

Parse F1:             Hierarchy   Unigram
  Single                  77.26     77.74
  Multi                   78.80     78.11
Single-sentence:      Hierarchy   Unigram
  Single                  64.12     65.64
  Multi                   64.15     66.27
Paragraph:            Hierarchy   Unigram
  Single                  21.29     23.74
  Multi                   21.55     25.95
86
Response-based Update with Multiple vs. Single Parses (Chinese-Character)

Parse F1:             Hierarchy   Unigram
  Single                  76.26     79.76
  Multi                   79.44     79.94
Single-sentence:      Hierarchy   Unigram
  Single                  64.08     65.50
  Multi                   64.08     66.84
Paragraph:            Hierarchy   Unigram
  Single                  22.25     25.35
  Multi                   22.58     27.16
87
Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, but still capture the gist of the preferred actions
  – A variety of preferable parses helps improve both the amount and the quality of the weak feedback
88
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
Parsing New NL Sentences
• Follows a similar scheme as in the Hierarchy Generation PCFG model
• Compose the final MR parse from the lexeme MRs appearing in the parse tree
  – Consider only the lexeme MRs responsible for generating NL words
  – Mark the relevant lexeme MR components in the context MR appearing in the top nonterminal
50
[Diagram: most probable parse tree for a test NL instruction, "Turn left and find the sofa then turn around the corner"; the context MR — Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT) — appears at the top nonterminal, with the relevant lexemes Turn(LEFT), Travel(), Verify(at: SOFA), Turn() marked]
54
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT
atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn left and find the sofa then turn around the corner
Most probable parse tree for a test NL instruction
NL
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Context MR
Context MR
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
54
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy
58
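Partial parse accuracy is reported as precision, recall, and F1 on the following slides; a minimal sketch of the standard computation (the function and argument names are illustrative, not from the thesis):

```python
def precision_recall_f1(num_correct, num_predicted, num_gold):
    """Standard P/R/F1; num_correct counts matched MR constituents."""
    p = num_correct / num_predicted if num_predicted else 0.0
    r = num_correct / num_gold if num_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

For example, 5 correct constituents out of 10 predicted and 8 gold gives P = 0.5, R = 0.625, F1 ≈ 0.56.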
Parse Accuracy (English)

                                  Precision   Recall   F1
Chen & Mooney (2011)              90.16       55.41    68.59
Chen (2012)                       88.36       57.03    69.31
Hierarchy Generation PCFG Model   87.58       65.41    74.81
Unigram Generation PCFG Model     86.1        68.79    76.44

59
Parse Accuracy (Chinese-Word)

                                  Precision   Recall   F1
Chen (2012)                       88.87       58.76    70.74
Hierarchy Generation PCFG Model   80.56       71.14    75.53
Unigram Generation PCFG Model     79.45       73.66    76.41

60
Parse Accuracy (Chinese-Character)

                                  Precision   Recall   F1
Chen (2012)                       92.48       56.47    70.01
Hierarchy Generation PCFG Model   79.77       67.38    73.05
Unigram Generation PCFG Model     79.73       75.52    77.55

61
End-to-End Execution Evaluations
• Test how well the formal plan produced by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
– Also considers facing direction in single-sentence evaluation
– Paragraph execution is affected by even one failed single-sentence execution
62
End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)              54.4              16.18
Chen (2012)                       57.28             19.18
Hierarchy Generation PCFG Model   57.22             20.17
Unigram Generation PCFG Model     67.14             28.12

63
End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                       58.7              20.13
Hierarchy Generation PCFG Model   61.03             19.08
Unigram Generation PCFG Model     63.4              23.12

64
End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33

65
Discussion
• Better recall in parse accuracy
– Our probabilistic model also uses useful but low-scoring lexemes → more coverage
– Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to training data
– Complexities: LHG and k-permutation rules
– Particularly weak on the Chinese-character corpus: longer average sentence length makes the PCFG weights hard to estimate
• Unigram Generation PCFG model is better
– Less complexity: avoids over-fitting, better generalization
• Better than Börschinger et al. (2011)
– Overcomes intractability with complex MRLs
– Learns from more general, complex ambiguity
– Handles novel MR parses never seen during training
66
Comparison of Grammar Size and EM Training Time
67

                      Hierarchy Generation       Unigram Generation
                      PCFG Model                 PCFG Model
Data                  |Grammar|   Time (hrs)     |Grammar|   Time (hrs)
English               20451       17.26          16357       8.78
Chinese (Word)        21636       15.99          15459       8.05
Chinese (Character)   19792       18.64          13514       12.58
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
– Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
– Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
– Part-of-speech tagging (Collins, EMNLP 2002)
– Semantic role labeling (Toutanova et al., ACL 2005)
– Named entity recognition (Collins, ACL 2002)
– Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
– Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
– Trained model outputs the best result with maximum probability
[Diagram: Testing Example → Trained Generative Model → Candidate 1, the 1-best candidate with maximum probability]
70
Discriminative Reranking
• Can we do better?
– A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]
71
How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
– No single gold-standard reference exists for each training example
– Instead, there is weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
– Evaluate candidate formal MRs by executing them in simulated worlds (also used in evaluating the final end-task plan execution)
– A weak indication of whether a candidate is good or bad
– Multiple candidate parses are used for each parameter update, since the response signal is weak and distributed over all candidates
72
Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) with feature vectors a₁ … aₙ and perceptron scores -0.16, 1.21, -1.09, 1.46, 0.59. The perceptron compares the best prediction against the gold-standard reference a_g and updates by the feature-vector difference a_g - a₄. For our generative models, a gold-standard reference is not available.]
73
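The averaged perceptron sketched on this slide can be illustrated as follows: a simplified, self-contained version that assumes each training example is a list of candidate feature vectors plus a gold index (not the exact thesis implementation):

```python
def averaged_perceptron_rerank(train, n_epochs=10):
    """train: list of (candidate_feature_vectors, gold_index) pairs, where
    candidate_feature_vectors is a list of equal-length float lists.
    Returns the averaged weight vector (Collins, 2000)."""
    dim = len(train[0][0][0])
    w = [0.0] * dim
    w_sum = [0.0] * dim

    def score(v):
        return sum(wi * vi for wi, vi in zip(w, v))

    for _ in range(n_epochs):
        for feats, gold in train:
            # pick the highest-scoring candidate under the current weights
            pred = max(range(len(feats)), key=lambda i: score(feats[i]))
            if pred != gold:  # update only on a wrong prediction
                for d in range(dim):
                    w[d] += feats[gold][d] - feats[pred][d]
            for d in range(dim):  # accumulate for averaging
                w_sum[d] += w[d]
    return [s / (n_epochs * len(train)) for s in w_sum]
```

Averaging the weight vector over all updates reduces overfitting to the last few training examples, which is why the averaged variant is standard for reranking.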
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
– The one most preferred in terms of plan execution
– Evaluate the composed MR plans from the candidate parses
– The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
– Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic, so average over 10 trials)
– Prefer the candidate with the best execution success rate during training
74
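The pseudo-gold selection above can be sketched as below; `execute` stands in for one nondeterministic MARCO run and is an assumed callback, not the real module's API:

```python
def pick_pseudo_gold(candidate_mrs, execute, n_trials=10):
    """Pseudo-gold = candidate MR whose plan reaches the destination most
    often. execute(mr) returns True when one nondeterministic run of the
    (hypothetical) execution module reaches the intended destination."""
    def success_rate(mr):
        # average over n_trials because execution is nondeterministic
        return sum(execute(mr) for _ in range(n_trials)) / n_trials

    rates = [success_rate(mr) for mr in candidate_mrs]
    best = max(range(len(candidate_mrs)), key=lambda i: rates[i])
    return best, rates
```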
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: n-best candidates (Candidate 1 … Candidate n) → derived MRs (MR₁ … MRₙ) → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.79, 0.21, -1.09, 1.46, 0.59. The perceptron updates by the feature-vector difference between the pseudo-gold reference (highest success rate) and the best prediction.]
75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
– Multiple parses may share the same maximum execution success rate
– "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions: MR plans may be underspecified or have ignorable details attached, and are sometimes inaccurate yet contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
– Use candidates with higher execution success rates than the currently best-predicted candidate
– Update with the feature-vector difference, weighted by the difference between execution success rates
76
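The multi-parse update rule described above might be sketched like this; it is one illustrative reading of "weighted by the difference between execution success rates", not the exact thesis update:

```python
def multi_parse_update(w, feats, rates, pred):
    """One response-based update using every candidate whose execution
    success rate beats the currently predicted parse (index pred).
    Each feature-vector difference is scaled by the rate gap."""
    w = list(w)  # do not mutate the caller's weight vector
    for i, rate in enumerate(rates):
        if rate > rates[pred]:
            gap = rate - rates[pred]
            for d in range(len(w)):
                w[d] += gap * (feats[i][d] - feats[pred][d])
    return w
```

Scaling by the rate gap means candidates that clearly out-execute the current prediction push the weights harder than marginal ones.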
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram: n-best candidates → derived MRs (MR₁ … MRₙ) → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59. Update (1): feature-vector difference against the first candidate whose success rate exceeds the predicted parse's.]
77
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram: same setup as the previous slide. Update (2): feature-vector difference against the next candidate whose success rate exceeds the predicted parse's.]
78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)   L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)   L5: Travel(), Verify(at: SOFA)   L6: Turn()

f(L1 → L3) = 1, f(L3 → L5 ∨ L1) = 1, f(L3 ⇒ L5 L6) = 1, f(L5 → "find") = 1
79
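Rule-composition indicator features like these can be read off a parse tree; a small sketch using a hypothetical tuple encoding of trees, where a node is (label, [children...]) and leaves are strings:

```python
def rule_features(tree):
    """Collect binary indicator features, one per parent-to-children rule
    occurring in the tree. Returns a set of feature names; the string
    encoding of features is illustrative, not the thesis's."""
    feats = set()

    def walk(node):
        if isinstance(node, str):  # leaf token, no rule below it
            return
        label, children = node
        rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
        feats.add(f"{label} -> {rhs}")  # e.g. f(L5 -> find) = 1
        for c in children:
            walk(c)

    walk(tree)
    return feats
```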
Evaluations
• Leave-one-map-out approach
– 2 maps for training and 1 map for testing
– Parse accuracy
– Plan execution accuracy (end goal)
• Compared with the two baseline models
– Hierarchy and Unigram Generation PCFG models
– All reranking results use 50-best parses
– Obtain 50-best distinct composed MR plans (and the corresponding parses) out of 1,000,000-best parses: many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best list is generated from the baseline model
80
Response-based Update vs. Baseline (English)

Parse F1          Hierarchy   Unigram
Baseline          74.81       76.44
Response-based    73.32       77.24

Single-sentence   Hierarchy   Unigram
Baseline          57.22       67.14
Response-based    59.65       68.27

Paragraph         Hierarchy   Unigram
Baseline          20.17       28.12
Response-based    22.62       29.2

81
Response-based Update vs. Baseline (Chinese-Word)

Parse F1          Hierarchy   Unigram
Baseline          75.53       76.41
Response-based    77.26       77.74

Single-sentence   Hierarchy   Unigram
Baseline          61.03       63.4
Response-based    64.12       65.64

Paragraph         Hierarchy   Unigram
Baseline          19.08       23.12
Response-based    21.29       23.74

82
Response-based Update vs. Baseline (Chinese-Character)

Parse F1          Hierarchy   Unigram
Baseline          73.05       77.55
Response-based    76.26       79.76

Single-sentence   Hierarchy   Unigram
Baseline          55.61       62.85
Response-based    64.08       65.5

Paragraph         Hierarchy   Unigram
Baseline          12.74       23.33
Response-based    22.25       25.35

83
Response-based Update vs. Baseline
• The response-based approach performs better in the final end-task plan execution
– It optimizes the model for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)

Parse F1          Hierarchy   Unigram
Single            73.32       77.24
Multi             73.43       77.81

Single-sentence   Hierarchy   Unigram
Single            59.65       68.27
Multi             62.81       68.93

Paragraph         Hierarchy   Unigram
Single            22.62       29.2
Multi             26.57       29.1

85
Response-based Update with Multiple vs. Single Parses (Chinese-Word)

Parse F1          Hierarchy   Unigram
Single            77.26       77.74
Multi             78.8        78.11

Single-sentence   Hierarchy   Unigram
Single            64.12       65.64
Multi             64.15       66.27

Paragraph         Hierarchy   Unigram
Single            21.29       23.74
Multi             21.55       25.95

86
Response-based Update with Multiple vs. Single Parses (Chinese-Character)

Parse F1          Hierarchy   Unigram
Single            76.26       79.76
Multi             79.44       79.94

Single-sentence   Hierarchy   Unigram
Single            64.08       65.5
Multi             64.08       66.84

Paragraph         Hierarchy   Unigram
Single            22.25       25.35
Multi             22.58       27.16

87
Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves performance in general
– A single-best pseudo-gold parse provides only weak feedback
– Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
– A variety of preferable parses improves both the amount and the quality of the weak feedback
88
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
– Learn a joint model of syntactic and semantic structure
• Large-scale data
– Data collection; model adaptation to large scale
• Machine translation
– Application to summarized translation
• Real perceptual data
– Learn with raw features (sensory and vision data)
90
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
– Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
– Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
54
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify
Turn
ContextMR
RelevantLexemes
54
Turn
LEFTfrontBLUEHALL
frontSOFA
steps2
atSOFA
Verify Travel Verify Turn
RIGHT
Turn
LEFT atSOFA
Travel Verify Turn
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations (Chinese-Character)

System                            Single-Sentence   Paragraph
Chen (2012)                       57.27             16.73
Hierarchy Generation PCFG Model   55.61             12.74
Unigram Generation PCFG Model     62.85             23.33
65
Discussion
• Better recall in parse accuracy
  – Our probabilistic model also exploits useful but low-scoring lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexity: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: the longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity: it avoids over-fitting and generalizes better
• Better than Börschinger et al. (2011)
  – Overcomes intractability for complex MRLs
  – Learns from more general, complex ambiguity
  – Handles novel MR parses never seen during training
66
Comparison of Grammar Size and EM Training Time
67

Data                   Hierarchy Generation PCFG Model   Unigram Generation PCFG Model
                       |Grammar|    Time (hrs)           |Grammar|    Time (hrs)
English                20451        17.26                16357        8.78
Chinese (Word)         21636        15.99                15459        8.05
Chinese (Character)    19792        18.64                13514        12.58
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – The trained model outputs the single best result with maximum probability
[Figure: a trained generative model maps a testing example to the 1-best candidate with maximum probability]
70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Figure: the trained baseline generative model (GEN) produces n-best candidates (Candidate 1 … Candidate n) for a testing example; a trained secondary discriminative model selects the best prediction as the output]
71
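The reranking step above can be sketched as a linear rescoring of the baseline's n-best list; the feature representation and weights below are hypothetical.

```python
# Rerank n-best candidates with a linear model: the candidate whose
# feature vector has the highest dot product with the learned weight
# vector wins. Feature names are illustrative.

def rerank(candidates, feature_fn, weights):
    """Return the highest-scoring candidate under the weight vector."""
    def score(cand):
        return sum(weights.get(f, 0.0) * v
                   for f, v in feature_fn(cand).items())
    return max(candidates, key=score)
```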
How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
  – There is no single gold-standard reference for each training example
  – Instead, training provides only weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (also used in evaluating final end-task plan execution)
  – This gives a weak indication of whether a candidate is good or bad
  – Use multiple candidate parses for each parameter update, since the response signal is weak and distributed over all candidates
72
Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated whenever the trained model predicts a wrong candidate
[Figure: the trained baseline generative model (GEN) produces n-best candidates (feature vectors a₁ … aₙ) for a training example; with perceptron scores of −0.16, 1.21, −1.09, 1.46, 0.59, candidate 4 is the best prediction, and since it differs from the gold-standard reference (feature vector a_g) the weights are updated with the feature-vector difference a_g − a₄]
73
(For our generative models, no such gold-standard reference is available.)
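The averaged perceptron described above can be sketched as follows; the data layout (candidate lists with a gold reference) and feature representation are illustrative, not the talk's exact implementation.

```python
# Sketch of the averaged perceptron for reranking (after Collins 2000):
# predict the best candidate under the current weights; on a mistake,
# add the gold candidate's features and subtract the prediction's.
# The returned weights are averaged over all update steps.
from collections import defaultdict

def averaged_perceptron(train, feature_fn, epochs=5):
    """train: list of (candidates, gold) pairs; returns averaged weights."""
    w = defaultdict(float)        # current weight vector
    total = defaultdict(float)    # running sum of weights for averaging
    steps = 0
    for _ in range(epochs):
        for candidates, gold in train:
            pred = max(candidates,
                       key=lambda c: sum(w[f] * v
                                         for f, v in feature_fn(c).items()))
            if pred != gold:
                for f, v in feature_fn(gold).items():
                    w[f] += v
                for f, v in feature_fn(pred).items():
                    w[f] -= v
            steps += 1
            for f, v in w.items():
                total[f] += v
    return {f: v / steps for f, v in total.items()}
```

Averaging the weight vector over all steps, rather than keeping only the final one, is the standard trick that makes the perceptron far less sensitive to the order of training examples.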
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the MR plans composed from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination (MARCO is nondeterministic, so average over 10 trials)
  – Prefer the candidate with the best execution success rate during training
74
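The pseudo-gold selection above can be sketched as follows; `execute` stands in for the MARCO execution module, and the candidate objects are hypothetical.

```python
# Sketch: select the pseudo-gold parse by executing each candidate's
# derived MR plan in the simulated world. Since MARCO is
# nondeterministic, success is averaged over 10 trials.

def execution_success_rate(mr_plan, execute, trials=10):
    """Fraction of trials in which executing the plan succeeds."""
    return sum(bool(execute(mr_plan)) for _ in range(trials)) / trials

def pseudo_gold(candidates, execute):
    """The candidate whose derived MR plan executes best."""
    return max(candidates,
               key=lambda c: execution_success_rate(c.mr, execute))
```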
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Figure: the MRs derived from the n-best candidates (MR₁ … MRₙ) are run through the MARCO execution module, giving execution success rates of 0.6, 0.4, 0.0, 0.9, 0.2; the candidate with the highest rate (0.9) serves as the pseudo-gold reference, and the perceptron (scores 1.79, 0.21, −1.09, 1.46, 0.59) is updated with the feature-vector difference toward it]
75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold can also be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates can still indicate correct plans, given the indirect supervision from human follower actions: MR plans may be underspecified or carry ignorable details, and even somewhat inaccurate plans can contain the correct MR components needed to reach the desired goal
• Weight update with multiple candidate parses
  – Use all candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
76
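The multiple-parse update can be sketched as below; variable names and the feature representation are illustrative, not the talk's exact implementation.

```python
# Sketch of the multiple-parse update: every candidate whose execution
# success rate exceeds that of the currently predicted parse contributes
# a feature-vector difference, weighted by how much better its rate is.

def multi_parse_update(weights, candidates, feature_fn, rates, predicted):
    """candidates: n-best list; rates: execution success rate per
    candidate; predicted: index of the model's current best parse."""
    for i, cand in enumerate(candidates):
        gain = rates[i] - rates[predicted]
        if gain <= 0:
            continue  # only candidates that execute better than the prediction
        for f, v in feature_fn(cand).items():
            weights[f] = weights.get(f, 0.0) + gain * v
        for f, v in feature_fn(candidates[predicted]).items():
            weights[f] = weights.get(f, 0.0) - gain * v
    return weights
```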
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Figure: among the n-best candidates (execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59), a candidate whose success rate exceeds the predicted parse's triggers update (1) with the weighted feature-vector difference]
77
Weight Update with Multiple Parses
[Figure, continued: the next qualifying candidate triggers update (2) in the same way]
78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner."
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

f(L1 → L3) = 1    f(L3 → L5 ∨ L1) = 1    f(L3 ⇒ L5 L6) = 1    f(L5, "find") = 1
79
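Feature extraction of this kind can be sketched as follows; the nested-tuple tree representation and feature naming are illustrative, not the talk's exact encoding.

```python
# Sketch: binary indicator features over rule compositions in a parse
# tree. Trees are nested tuples (label, child, ...); leaves are words.

def rule_features(tree, feats=None):
    """Collect binary indicators like 'L1 -> L3' for every local rule."""
    if feats is None:
        feats = {}
    if isinstance(tree, str):          # terminal word: no rule rooted here
        return feats
    label, children = tree[0], tree[1:]
    rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
    feats[f"{label} -> {rhs}"] = 1     # indicator, not a count
    for c in children:
        rule_features(c, feats)
    return feats
```

A tree that attaches the word "find" under L5, for instance, yields the indicator "L5 -> find", analogous to f(L5, "find") = 1 on the slide.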
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – We take the 50-best parses with distinct composed MR plans out of the 1,000,000-best parses: many parse trees differ insignificantly and derive the same MR plan, so a sufficiently large 1,000,000-best list is generated from each baseline model
80
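The deduplication step above can be sketched as follows; `derive_mr` is a stand-in for plan composition from a parse tree.

```python
# Sketch: keep the k best parses whose *derived MR plans* are distinct,
# scanning a much larger ranked n-best list (the talk draws from up to
# 1,000,000 parses, since many trees derive the same plan).

def distinct_nbest(ranked_parses, derive_mr, k=50):
    """ranked_parses: best-first order; returns up to k parses with
    mutually distinct derived MR plans."""
    seen, kept = set(), []
    for parse in ranked_parses:
        mr = derive_mr(parse)
        if mr in seen:
            continue                   # same plan as a better-ranked parse
        seen.add(mr)
        kept.append(parse)
        if len(kept) == k:
            break
    return kept
```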
Response-based Update vs. Baseline (English)
81

                                  Parse F1               Single-sentence        Paragraph
                                  Baseline  Response     Baseline  Response     Baseline  Response
Hierarchy Generation PCFG Model   74.81     73.32        57.22     59.65        20.17     22.62
Unigram Generation PCFG Model     76.44     77.24        67.14     68.27        28.12     29.2
Response-based Update vs. Baseline (Chinese-Word)
82

                                  Parse F1               Single-sentence        Paragraph
                                  Baseline  Response     Baseline  Response     Baseline  Response
Hierarchy Generation PCFG Model   75.53     77.26        61.03     64.12        19.08     21.29
Unigram Generation PCFG Model     76.41     77.74        63.4      65.64        23.12     23.74
Response-based Update vs. Baseline (Chinese-Character)
83

                                  Parse F1               Single-sentence        Paragraph
                                  Baseline  Response     Baseline  Response     Baseline  Response
Hierarchy Generation PCFG Model   73.05     76.26        55.61     64.08        12.74     22.25
Unigram Generation PCFG Model     77.55     79.76        62.85     65.5         23.33     25.35
Response-based Update vs. Baseline
• The response-based approach performs better on the final end-task plan execution
  – It optimizes the model directly for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)
85

                                  Parse F1           Single-sentence    Paragraph
                                  Single   Multi     Single   Multi     Single   Multi
Hierarchy Generation PCFG Model   73.32    73.43     59.65    62.81     22.62    26.57
Unigram Generation PCFG Model     77.24    77.81     68.27    68.93     29.2     29.1
Response-based Update with Multiple vs. Single Parses (Chinese-Word)
86

                                  Parse F1           Single-sentence    Paragraph
                                  Single   Multi     Single   Multi     Single   Multi
Hierarchy Generation PCFG Model   77.26    78.8      64.12    64.15     21.29    21.55
Unigram Generation PCFG Model     77.74    78.11     65.64    66.27     23.74    25.95
Response-based Update with Multiple vs. Single Parses (Chinese-Character)
87

                                  Parse F1           Single-sentence    Paragraph
                                  Single   Multi     Single   Multi     Single   Multi
Hierarchy Generation PCFG Model   76.26    79.44     64.08    64.08     22.25    22.58
Unigram Generation PCFG Model     79.76    79.94     65.5     66.84     25.35    27.16
Response-based Update with Multiple vs. Single Parses
• Using multiple parses generally improves performance
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates still produce underspecified plans, or plans with ignorable details, that capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback
88
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation for large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn from raw features (sensory and vision data)
90
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and does not scale because training data must be annotated
• Grounded language learning from relevant perceptual contexts is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework, a full probabilistic model, for learning NL-MR correspondences from ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
54
[Figure: two tree-structured MR plans. Landmarks plan: Turn(LEFT), Verify(front: BLUE HALL, front: SOFA), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT). Refined plan: Turn(LEFT), Travel(), Verify(at: SOFA), Turn()]
Data
• 3 maps, 6 instructors, 1–15 followers per direction
• Hand-segmented into single-sentence steps to make the learning easier (Chen & Mooney 2011)
• Mandarin Chinese translation of each sentence (Chen 2012)
  – Word-segmented version by the Stanford Chinese Word Segmenter
  – Character-segmented version
55
Paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair take a right towards the stool. When you reach the stool you are at 7."

Single sentences: "Take the wood path towards the easel." / "At the easel, go left and then take a right on the blue path at the corner." / …

[Figure: the paragraph is paired with the full observed action sequence (Turn, Forward, Turn left, Forward, Turn right, Forward ×3, Turn right, Forward, …), which is segmented into per-sentence action subsequences such as "Forward, Turn left, Forward, Turn right" and "Turn"]
Data Statistics
56

                        Paragraph        Single-Sentence
Instructions            706              3236
Avg. # sentences        5.0 (±2.8)       1.0 (±0)
Avg. # actions          10.4 (±5.7)      2.1 (±2.4)
Avg. # words
  English               37.6 (±21.1)     7.8 (±5.1)
  Chinese-Word          31.6 (±18.1)     6.9 (±4.9)
  Chinese-Character     48.9 (±28.3)     10.6 (±7.3)
Vocabulary
  English               660              629
  Chinese-Word          661              508
  Chinese-Character     448              328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
Databull 3 maps 6 instructors 1-15 followersdirectionbull Hand-segmented into single sentence steps to make the learning easier
(Chen amp Mooney 2011)bull Mandarin Chinese translation of each sentence (Chen 2012)
bull Word-segmented version by Stanford Chinese Word Segmenterbull Character-segmented version
55
Take the wood path towards the easel At the easel go left and then take a right on the the blue path at the corner Follow the blue path towards the chair and at the chair take a right towards the stool When you reach the stool you are at 7
Paragraph Single sentenceTake the wood path towards the easel
At the easel go left and then take a right on the the blue path at the corner
Turn Forward Turn left Forward Turn right Forward x 3 Turn right Forward
Forward Turn left Forward Turn right
Turn
Data Statistics
56
Paragraph Single-Sentence
Instructions 706 3236
Avg sentences 50 (plusmn28) 10 (plusmn0)
Avg actions 104 (plusmn57) 21 (plusmn24)
Avg words sent
English 376 (plusmn211) 78 (plusmn51)
Chinese-Word 316 (plusmn181) 69 (plusmn49)
Chinese-Character 489 (plusmn283) 106 (plusmn73)
Vo-cabu-lary
English 660 629
Chinese-Word 661 508
Chinese-Character 448 328
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – Trained model outputs the best result with max probability
[Diagram: Testing Example → Trained Generative Model → Candidate 1 (the 1-best candidate with maximum probability)]
70
Discriminative Reranking
• Can we do better?
  – Secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1, Candidate 2, …, Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]
71
How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Training instead provides weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    (also used in evaluating the final end-task plan execution)
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for the parameter update
    (the response signal is weak and distributed over all candidates)
72
Reranking Model: Averaged Perceptron (Collins, 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1, …, Candidate n) with feature vectors a_1, a_2, a_3, a_4, …, a_n and perceptron scores -0.16, 1.21, -1.09, 1.46, 0.59 → Perceptron → Candidate 4 is the best prediction; the weight vector is updated with the feature-vector difference a_g − a_4 against the gold-standard reference a_g]
(A gold-standard reference is not available for our generative models.)
73
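The averaged-perceptron update above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis code: `examples` pairs each training instance's candidate feature matrix (rows a_1 … a_n) with the index of its (pseudo-)gold candidate.

```python
import numpy as np

def perceptron_rerank_train(examples, n_feats, epochs=5):
    """Averaged-perceptron training for n-best reranking (Collins, 2000).

    examples: list of (feature_matrix, gold_index) pairs, where row i of
    the matrix is the feature vector a_i of candidate i.
    Returns the averaged weight vector.
    """
    w = np.zeros(n_feats)        # current weight vector
    w_sum = np.zeros(n_feats)    # running sum for averaging
    t = 0
    for _ in range(epochs):
        for feats, gold in examples:
            pred = int(np.argmax(feats @ w))    # best-scoring candidate
            if pred != gold:                    # wrong prediction: update
                w += feats[gold] - feats[pred]  # a_g - a_pred
            w_sum += w
            t += 1
    return w_sum / t             # averaging reduces overfitting to late updates
```

Averaging the weight vector over all updates, rather than keeping only the final one, is what distinguishes the averaged perceptron from the vanilla version.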
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO (MacMahon et al., AAAI 2006) execution module runs and evaluates each candidate MR in the world
    (also used for evaluating end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination
    (MARCO is nondeterministic → average over 10 trials)
  – Prefer the candidate with the best execution success rate during training
74
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: n-best candidates (Candidate 1, …, Candidate n) → derived MRs (MR_1, …, MR_n) → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.79, 0.21, -1.09, 1.46, 0.59 → Candidate 1 is the best prediction, Candidate 4 (success rate 0.9) becomes the pseudo-gold reference; update with the feature-vector difference]
75
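In code, the pseudo-gold selection above amounts to averaging a nondeterministic execution over 10 trials and taking the argmax. A sketch under stated assumptions: `execute_trial` is a hypothetical stand-in for one MARCO run, returning True iff the plan reaches the intended destination.

```python
def execution_success_rate(mr_plan, execute_trial, n_trials=10):
    # MARCO is nondeterministic, so average success over repeated trials.
    return sum(bool(execute_trial(mr_plan)) for _ in range(n_trials)) / n_trials

def pick_pseudo_gold(candidate_mrs, execute_trial, n_trials=10):
    # Return (index, rates): the candidate whose derived MR executes best.
    rates = [execution_success_rate(mr, execute_trial, n_trials)
             for mr in candidate_mrs]
    return max(range(len(rates)), key=rates.__getitem__), rates
```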
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    (MR plans are underspecified or have ignorable details attached; sometimes inaccurate, but they contain correct MR components for reaching the desired goal)
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
76
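The multi-parse update just described can be sketched as follows; this is an illustrative reading of the slide, not the thesis implementation. Every candidate that executes better than the model's current best prediction contributes a feature-difference update scaled by the success-rate gap.

```python
import numpy as np

def multi_parse_update(w, feats, success_rates):
    """One response-based weight update with multiple parses.

    feats: (n_candidates, n_features) matrix of candidate feature vectors;
    success_rates: MARCO execution success rate of each candidate's MR plan.
    """
    pred = int(np.argmax(feats @ w))          # currently best-predicted parse
    for i, rate in enumerate(success_rates):
        if rate > success_rates[pred]:        # strictly better execution
            # perceptron-style update toward candidate i, weighted by the gap
            w = w + (rate - success_rates[pred]) * (feats[i] - feats[pred])
    return w
```

With a single pseudo-gold this reduces to one unweighted update; the weighting lets strongly executing candidates pull the weights harder than marginal ones.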
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram, two update steps: n-best candidates (Candidate 1, …, Candidate n) → derived MRs (MR_1, …, MR_n) → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59 → Candidate 2 is the best prediction; Update (1) uses Candidate 1 (rate 0.6) and Update (2) uses Candidate 4 (rate 0.9), each weighted by its success-rate difference from Candidate 2 (rate 0.4)]
77, 78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree
  (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)
"Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)      L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)      L5: Travel(), Verify(at: SOFA)      L6: Turn()
f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1
79
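Such indicator features can be collected in one walk over the parse tree. A minimal sketch, assuming trees are encoded as nested `(label, children)` tuples (a hypothetical encoding chosen for illustration): it emits parent→child pairs and full expansions like f(L3 ⇒ L5 L6).

```python
def rule_features(tree):
    # tree is a (label, [child_trees]) tuple; leaves have an empty child list.
    label, children = tree
    feats = set()
    if children:
        # full expansion feature, e.g. ('L3', ('L5', 'L6'))
        feats.add((label, tuple(c[0] for c in children)))
        for child in children:
            feats.add((label, child[0]))   # single parent -> child feature
            feats |= rule_features(child)  # recurse into the subtree
    return feats
```

A reranker then scores a parse as the dot product between its weight vector and the binary vector these indicators induce.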
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, with their parses, out of the 1,000,000-best parses
    (many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best parse list is generated from the baseline model)
80
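Filtering the huge n-best list down to 50 distinct plans can be sketched as below; `parse_to_mr`, which composes the MR plan a parse derives, is a stand-in for the actual plan-composition step.

```python
def distinct_mr_candidates(nbest_parses, parse_to_mr, k=50):
    # Scan parses in rank order; keep the first (highest-ranked) parse
    # for each distinct derived MR plan, stopping at k distinct plans.
    seen, kept = set(), []
    for parse in nbest_parses:
        mr = parse_to_mr(parse)
        if mr not in seen:
            seen.add(mr)
            kept.append((parse, mr))
            if len(kept) == k:
                break
    return kept
```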
Response-based Update vs. Baseline (English)

Parse F1               Hierarchy   Unigram
  Baseline             74.81       76.44
  Response-based       73.32       77.24

Plan execution (single-sentence)
  Baseline             57.22       67.14
  Response-based       59.65       68.27

Plan execution (paragraph)
  Baseline             20.17       28.12
  Response-based       22.62       29.2
81
Response-based Update vs. Baseline (Chinese-Word)

Parse F1               Hierarchy   Unigram
  Baseline             75.53       76.41
  Response-based       77.26       77.74

Plan execution (single-sentence)
  Baseline             61.03       63.4
  Response-based       64.12       65.64

Plan execution (paragraph)
  Baseline             19.08       23.12
  Response-based       21.29       23.74
82
Response-based Update vs. Baseline (Chinese-Character)

Parse F1               Hierarchy   Unigram
  Baseline             73.05       77.55
  Response-based       76.26       79.76

Plan execution (single-sentence)
  Baseline             55.61       62.85
  Response-based       64.08       65.5

Plan execution (paragraph)
  Baseline             12.74       23.33
  Response-based       22.25       25.35
83
Response-based Update vs. Baseline
• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)

Parse F1               Hierarchy   Unigram
  Single               73.32       77.24
  Multi                73.43       77.81

Plan execution (single-sentence)
  Single               59.65       68.27
  Multi                62.81       68.93

Plan execution (paragraph)
  Single               22.62       29.2
  Multi                26.57       29.1
85
Response-based Update with Multiple vs. Single Parses (Chinese-Word)

Parse F1               Hierarchy   Unigram
  Single               77.26       77.74
  Multi                78.8        78.11

Plan execution (single-sentence)
  Single               64.12       65.64
  Multi                64.15       66.27

Plan execution (paragraph)
  Single               21.29       23.74
  Multi                21.55       25.95
86
Response-based Update with Multiple vs. Single Parses (Chinese-Character)

Parse F1               Hierarchy   Unigram
  Single               76.26       79.76
  Multi                79.44       79.94

Plan execution (single-sentence)
  Single               64.08       65.5
  Multi                64.08       66.84

Plan execution (paragraph)
  Single               22.25       25.35
  Multi                22.58       27.16
87
Response-based Update with Multiple vs. Single Parses
• Using multiple parses generally improves performance
  – The single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback
88
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework: a full probabilistic model for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
Evaluationsbull Leave-one-map-out approach
ndash 2 maps for training and 1 map for testingndash Parse accuracy amp Plan execution accuracy
bull Compared with Chen and Mooney 2011 and Chen 2012ndash Ambiguous context (landmarks plan) is refined by greedy
selection of high-score lexemes with two different lexicon learning algorithmsChen and Mooney 2011 Graph Intersection Lexicon Learning (GILL)Chen 2012 Subgraph Generation Online Lexicon Learning (SGOLL)
ndash Semantic parser KRISP (Kate and Mooney 2006) trained on the resulting supervised data
57
Parse Accuracy
bull Evaluate how well the learned semantic parsers can parse novel sentences in test data
bull Metric partial parse accuracy
58
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram, update (1): n-best candidates (Candidate 1 … Candidate n) → derived MRs (MR_1 … MR_n) → MARCO execution module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59; the perceptron is updated with the feature vector difference for the first candidate that outperforms the best prediction]
77
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram, update (2): the same n-best candidates, derived MRs, execution success rates, and perceptron scores as above; the perceptron is now updated with the feature vector difference for the second candidate that outperforms the best prediction]
78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)
Turn left and find the sofa then turn around the corner
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()
f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1
79
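Such rule-composition indicators can be extracted with a small recursive pass over the tree (a sketch using a hypothetical `(label, children)` tree encoding; the actual thesis feature set is richer than bare rules):

```python
def rule_features(tree):
    # Binary indicator for each parent -> children composition in a parse
    # tree, in the spirit of Collins-style reranking features.
    label, children = tree
    feats = {}
    if children:
        feats[label + " -> " + " ".join(c[0] for c in children)] = 1
        for child in children:
            feats.update(rule_features(child))
    return feats
```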
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, and the corresponding parses, out of the 1,000,000-best parses
    Many parse trees differ insignificantly, leading to the same derived MR plans
    Generate sufficiently large 1,000,000-best parse trees from the baseline model
80
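Collecting the 50 best distinct plans from a very deep n-best list amounts to deduplicating parses by their derived plan (a sketch; `derive_plan` is an assumed parse-to-plan function returning something hashable):

```python
def distinct_plan_parses(nbest_parses, derive_plan, k=50):
    # Many parse trees differ insignificantly and compose the same MR
    # plan, so keep only the first (highest-ranked) parse per plan.
    seen, kept = set(), []
    for parse in nbest_parses:
        plan = derive_plan(parse)
        if plan not in seen:
            seen.add(plan)
            kept.append((parse, plan))
            if len(kept) == k:
                break
    return kept
```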
Response-based Update vs. Baseline (English)
81
                 Parse F1            Single-sentence     Paragraph
                 Hier.    Unigram    Hier.    Unigram    Hier.    Unigram
Baseline         74.81    76.44      57.22    67.14      20.17    28.12
Response-based   73.32    77.24      59.65    68.27      22.62    29.20
Response-based Update vs. Baseline (Chinese-Word)
82
                 Parse F1            Single-sentence     Paragraph
                 Hier.    Unigram    Hier.    Unigram    Hier.    Unigram
Baseline         75.53    76.41      61.03    63.40      19.08    23.12
Response-based   77.26    77.74      64.12    65.64      21.29    23.74
Response-based Update vs. Baseline (Chinese-Character)
83
                 Parse F1            Single-sentence     Paragraph
                 Hier.    Unigram    Hier.    Unigram    Hier.    Unigram
Baseline         73.05    77.55      55.61    62.85      12.74    23.33
Response-based   76.26    79.76      64.08    65.50      22.25    25.35
Response-based Update vs. Baseline
• vs. baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)
85
                 Parse F1            Single-sentence     Paragraph
                 Hier.    Unigram    Hier.    Unigram    Hier.    Unigram
Single           73.32    77.24      59.65    68.27      22.62    29.20
Multi            73.43    77.81      62.81    68.93      26.57    29.10
Response-based Update with Multiple vs. Single Parses (Chinese-Word)
86
                 Parse F1            Single-sentence     Paragraph
                 Hier.    Unigram    Hier.    Unigram    Hier.    Unigram
Single           77.26    77.74      64.12    65.64      21.29    23.74
Multi            78.80    78.11      64.15    66.27      21.55    25.95
Response-based Update with Multiple vs. Single Parses (Chinese-Character)
87
                 Parse F1            Single-sentence     Paragraph
                 Hier.    Unigram    Hier.    Unigram    Hier.    Unigram
Single           76.26    79.76      64.08    65.50      22.25    25.35
Multi            79.44    79.94      64.08    66.84      22.58    27.16
Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, but still capture the gist of the preferred actions
  – A variety of preferable parses improves the amount and the quality of the weak feedback
88
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpus is easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
Parse Accuracy
• Evaluate how well the learned semantic parsers can parse novel sentences in test data
• Metric: partial parse accuracy
58
Parse Accuracy (English)
                                 Precision    Recall    F1
Chen & Mooney (2011)               90.16      55.41     68.59
Chen (2012)                        88.36      57.03     69.31
Hierarchy Generation PCFG Model    87.58      65.41     74.81
Unigram Generation PCFG Model      86.10      68.79     76.44
59
Parse Accuracy (Chinese-Word)
                                 Precision    Recall    F1
Chen (2012)                        88.87      58.76     70.74
Hierarchy Generation PCFG Model    80.56      71.14     75.53
Unigram Generation PCFG Model      79.45      73.66     76.41
60
Parse Accuracy (Chinese-Character)
                                 Precision    Recall    F1
Chen (2012)                        92.48      56.47     70.01
Hierarchy Generation PCFG Model    79.77      67.38     73.05
Unigram Generation PCFG Model      79.73      75.52     77.55
61
End-to-End Execution Evaluations
• Test how well the formal plan from the output of the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – Facing direction is also considered in single-sentence evaluation
  – Paragraph execution is affected by even one failed single-sentence execution
62
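The strict metric can be written down directly (a sketch; the pose representation is an assumption):

```python
def strict_success(final_position, final_heading, goal_position,
                   goal_heading=None):
    # Successful only if the final position matches exactly; for
    # single-sentence evaluation the facing direction is checked too
    # (pass goal_heading=None to skip it, as in paragraph evaluation).
    if final_position != goal_position:
        return False
    return goal_heading is None or final_heading == goal_heading
```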
End-to-End Execution Evaluations (English)
                                 Single-Sentence    Paragraph
Chen & Mooney (2011)                 54.40            16.18
Chen (2012)                          57.28            19.18
Hierarchy Generation PCFG Model      57.22            20.17
Unigram Generation PCFG Model        67.14            28.12
63
End-to-End Execution Evaluations (Chinese-Word)
                                 Single-Sentence    Paragraph
Chen (2012)                          58.70            20.13
Hierarchy Generation PCFG Model      61.03            19.08
Unigram Generation PCFG Model        63.40            23.12
64
End-to-End Execution Evaluations (Chinese-Character)
                                 Single-Sentence    Paragraph
Chen (2012)                          57.27            16.73
Hierarchy Generation PCFG Model      55.61            12.74
Unigram Generation PCFG Model        62.85            23.33
65
Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-scoring lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
    Particularly weak in the Chinese-character corpus: longer average sentence length makes the PCFG weights hard to estimate
• Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, better generalization
• Better than Börschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Handles novel MR parses never seen during training
66
Comparison of Grammar Size and EM Training Time
67
                       Hierarchy Generation       Unigram Generation
                       PCFG Model                 PCFG Model
Data                   |Grammar|    Time (hrs)    |Grammar|    Time (hrs)
English                  20,451       17.26         16,357        8.78
Chinese (Word)           21,636       15.99         15,459        8.05
Chinese (Character)      19,792       18.64         13,514       12.58
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• Effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal
  – Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – Trained model outputs the best result with max probability
[Diagram: Testing Example → Trained Generative Model → Candidate 1, the 1-best candidate with maximum probability]
70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]
71
How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Instead provides weak supervision of the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    Used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for parameter update
    Response signal is weak and distributed over all candidates
72
Reranking Model: Averaged Perceptron (Collins, 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) with feature vectors a_1 … a_n and perceptron scores −0.16, 1.21, −1.09, 1.46, 0.59; the weight vector is updated by the feature vector difference a_g − a_4 between the Gold Standard Reference (a_g) and the best prediction (a_4)]
73
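An averaged-perceptron reranker in this style can be sketched as below (hypothetical `n_best`, `gold`, and `feats` interfaces; in grounded learning the `gold` reference is replaced by an execution-based pseudo-gold):

```python
def averaged_perceptron(train, n_best, gold, feats, epochs=5):
    # Collins-style averaged perceptron: update the weight vector when the
    # current best candidate differs from the reference, and return the
    # average of the weight vector over all training steps.
    w, total, steps = {}, {}, 0
    for _ in range(epochs):
        for x in train:
            score = lambda c: sum(w.get(f, 0.0) * v
                                  for f, v in feats(x, c).items())
            pred = max(n_best(x), key=score)
            ref = gold(x)
            if pred != ref:
                for f, v in feats(x, ref).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in feats(x, pred).items():
                    w[f] = w.get(f, 0.0) - v
            steps += 1
            for f, v in w.items():
                total[f] = total.get(f, 0.0) + v
    return {f: v / steps for f, v in total.items()}
```

Averaging the weight vector over steps, rather than keeping only the final one, is what makes the reranker robust to the order of training examples.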
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
Parse Accuracy (English)
Precision Recall F1
9016
5541
6859
8836
5703
6931
8758
6541
7481
861
6879
7644
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
59
Parse Accuracy (Chinese-Word)
Precision Recall F1
8887
5876
7074
8056
7114
7553
7945
73667641
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
60
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Parse Accuracy (Chinese-Word)

                                  Precision   Recall     F1
Chen (2012)                         88.87     58.76    70.74
Hierarchy Generation PCFG Model     80.56     71.14    75.53
Unigram Generation PCFG Model       79.45     73.66    76.41

60
Parse Accuracy (Chinese-Character)

                                  Precision   Recall     F1
Chen (2012)                         92.48     56.47    70.01
Hierarchy Generation PCFG Model     79.77     67.38    73.05
Unigram Generation PCFG Model       79.73     75.52    77.55

61
End-to-End Execution Evaluations
• Test how well the formal plan produced by the semantic parser reaches the destination
• Strict metric: only successful if the final position matches exactly
  – The facing direction is also considered in the single-sentence setting
  – Paragraph execution is derailed by even one failed single-sentence execution

62
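The strict metric above can be sketched as follows; the state encoding and function names are illustrative assumptions, not the actual evaluation code.

```python
# Hypothetical sketch of the strict end-to-end metric (state encoding and
# names are assumptions, not the actual evaluation code).

def single_sentence_success(final_state, goal_state):
    """Strict: final position AND facing direction must match exactly."""
    return final_state == goal_state          # state = (row, col, heading)

def paragraph_success(final_pos, goal_pos):
    """Paragraph level: the final position must match exactly, so one
    failed single-sentence execution derails the whole paragraph."""
    return final_pos == goal_pos

def success_rate(examples, execute):
    """Fraction of (plan, goal) pairs whose execution reaches the goal."""
    wins = sum(1 for plan, goal in examples if execute(plan) == goal)
    return wins / len(examples)
```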
End-to-End Execution Evaluations (English)

                                  Single-Sentence   Paragraph
Chen & Mooney (2011)                   54.40          16.18
Chen (2012)                            57.28          19.18
Hierarchy Generation PCFG Model        57.22          20.17
Unigram Generation PCFG Model          67.14          28.12

63
End-to-End Execution Evaluations (Chinese-Word)

                                  Single-Sentence   Paragraph
Chen (2012)                            58.70          20.13
Hierarchy Generation PCFG Model        61.03          19.08
Unigram Generation PCFG Model          63.40          23.12

64
End-to-End Execution Evaluations (Chinese-Character)

                                  Single-Sentence   Paragraph
Chen (2012)                            57.27          16.73
Hierarchy Generation PCFG Model        55.61          12.74
Unigram Generation PCFG Model          62.85          23.33

65
Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-scoring lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, generalizes better
• Better than Borschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training

66
Comparison of Grammar Size and EM Training Time

                       Hierarchy Generation      Unigram Generation
                       PCFG Model                PCFG Model
Data                   |Grammar|   Time (hrs)    |Grammar|   Time (hrs)
English                  20,451      17.26         16,357       8.78
Chinese (Word)           21,636      15.99         15,459       8.05
Chinese (Character)      19,792      18.64         13,514      12.58

67
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

68
Discriminative Reranking
• An effective approach to improving generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning

69
Discriminative Reranking
• Generative model
  – The trained model outputs the 1-best result: the candidate with maximum probability

[Diagram: Testing Example → Trained Generative Model → Candidate 1, the 1-best candidate with maximum probability]

70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model

[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]

71
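The reranking step in the diagram reduces to an argmax of a linear score over the n-best list. A minimal sketch, with illustrative feature names and weights (all assumptions):

```python
# Minimal reranking sketch: score each n-best candidate with a linear
# model w . f(candidate) and return the argmax. Feature names and weights
# below are illustrative assumptions, not the thesis's actual feature set.

def rerank(candidates, features, weights):
    """Pick the candidate with the highest discriminative score."""
    def score(cand):
        f = features(cand)                       # sparse feature dict
        return sum(weights.get(k, 0.0) * v for k, v in f.items())
    return max(candidates, key=score)

# Usage: two candidate parses with sparse binary features.
feats = {"parse_a": {"L1 -> L3": 1.0}, "parse_b": {"L1 -> L4": 1.0}}
w = {"L1 -> L3": 0.5, "L1 -> L4": -0.2}
best = rerank(["parse_a", "parse_b"], feats.get, w)   # -> "parse_a"
```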
How can we apply discriminative reranking?
• Standard discriminative reranking cannot be applied directly to grounded language learning
  – There is no single gold-standard reference for each training example
  – Instead, training provides weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds (also used to evaluate the final end-task plan execution)
  – Gives a weak indication of whether a candidate is good or bad
  – Allows multiple candidate parses per parameter update: the response signal is weak and distributed over all candidates

72
Reranking Model: Averaged Perceptron (Collins 2000)
• The parameter weight vector is updated whenever the trained model predicts a wrong candidate

[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates a₁ … aₙ with perceptron scores −0.16, 1.21, −1.09, 1.46, 0.59. The perceptron compares the best prediction (a₄, score 1.46) with the gold-standard reference a_g and updates the weights by the feature-vector difference a_g − a₄. For our generative models, such a gold-standard reference is Not Available.]

73
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The one most preferred in terms of plan execution
  – Evaluate the composed MR plans from the candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world (also used to evaluate end-goal plan execution performance)
  – Record the execution success rate: whether each candidate MR reaches the intended destination; MARCO is nondeterministic, so average over 10 trials
  – Prefer the candidate with the best execution success rate during training

74
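The pseudo-gold selection above might look like the following sketch; `execute_once` is a stand-in for one MARCO-style nondeterministic execution, and all names are assumptions.

```python
# Sketch of pseudo-gold selection (all names are assumptions): run each
# candidate's MR plan several times in a nondeterministic simulator and
# keep the candidate with the highest average execution success rate.

def execution_success_rate(plan, execute_once, trials=10):
    """Execution is nondeterministic, so average success over trials."""
    wins = sum(1 for _ in range(trials) if execute_once(plan))
    return wins / trials

def pick_pseudo_gold(candidates, execute_once, trials=10):
    """Return (pseudo-gold candidate, success rate per candidate)."""
    rates = {c: execution_success_rate(c, execute_once, trials)
             for c in candidates}
    return max(candidates, key=rates.get), rates

# Usage with a toy deterministic executor: only "plan B" ever succeeds.
best, rates = pick_pseudo_gold(["plan A", "plan B"], lambda p: p == "plan B")
# best == "plan B"
```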
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results

[Diagram: n-best candidates (Candidate 1 … Candidate n) → derived MRs MR₁ … MRₙ → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.79, 0.21, −1.09, 1.46, 0.59. The candidate with the highest execution success rate (0.9) becomes the pseudo-gold reference; the perceptron updates by the feature-vector difference against its best prediction.]

75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may share the same maximum execution success rate
  – "Lower" execution success rates can still indicate correct plans, given the indirect supervision of human follower actions
    MR plans may be underspecified or have ignorable details attached
    Sometimes inaccurate, but containing the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use all candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates

76
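Under the assumptions above (sparse feature dicts and a per-candidate success-rate table), the multiple-parse update might be sketched as:

```python
# Sketch of the multiple-parse perceptron update (variable names are
# assumptions): every candidate whose execution success rate exceeds the
# currently predicted parse contributes a feature-difference update,
# weighted by how much better its success rate is.

def multi_parse_update(weights, candidates, features, rates, predicted):
    base = rates[predicted]
    f_pred = features(predicted)
    for cand in candidates:
        if rates[cand] > base:                  # better than the prediction
            margin = rates[cand] - base         # weight of this update
            f_cand = features(cand)
            for name in set(f_cand) | set(f_pred):
                diff = f_cand.get(name, 0.0) - f_pred.get(name, 0.0)
                weights[name] = weights.get(name, 0.0) + margin * diff
    return weights

# Usage: "q" executes better than the predicted parse "p", so the update
# moves weight toward q's features and away from p's, scaled by the gap.
feats = {"p": {"a": 1.0}, "q": {"b": 1.0}}
w = multi_parse_update({}, ["p", "q"], feats.get, {"p": 0.5, "q": 1.0}, "p")
# w == {"a": -0.5, "b": 0.5}
```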
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (1): n-best candidates → derived MRs MR₁ … MRₙ → MARCO Execution Module → execution success rates 0.6, 0.4, 0.0, 0.9, 0.2; perceptron scores 1.24, 1.83, −1.09, 1.46, 0.59. A candidate whose success rate exceeds the predicted parse's contributes its feature-vector difference, weighted by the success-rate gap.]

77
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (2): same setup as the previous slide; the next candidate whose success rate exceeds the predicted parse's contributes its weighted feature-vector difference.]

78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

Turn left and find the sofa then turn around the corner

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1

79
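To make the indicator-feature idea concrete, the sketch below extracts one binary feature per observed parent → children rule from a toy parse tree; the nested-tuple tree encoding is an assumption for illustration.

```python
# Sketch of binary rule-indicator features over a parse tree encoded as
# nested (label, children) tuples; the encoding is an assumption.

def rule_features(tree):
    """One binary feature per parent -> children rule in the tree."""
    label, children = tree
    feats = {}
    if children:
        feats[label + " -> " + " ".join(c[0] for c in children)] = 1
        for child in children:
            feats.update(rule_features(child))
    return feats

# Toy tree loosely mirroring the L1..L6 hierarchy above:
tree = ("L1", [("L2", []), ("L3", [("L5", []), ("L6", [])])])
# rule_features(tree) -> {"L1 -> L2 L3": 1, "L3 -> L5 L6": 1}
```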
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with the two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – We take the 50-best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses
    Many parse trees differ insignificantly and lead to the same derived MR plan
    The baseline model generates a sufficiently large 1,000,000-best parse list

80
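The deduplication step above (keeping the 50-best distinct MR plans from a much larger ranked parse list) can be sketched as follows; `derive_plan` is a hypothetical stand-in for composing the MR plan from a parse.

```python
# Sketch of n-best deduplication (derive_plan is a hypothetical stand-in
# for composing the MR plan from a parse): walk the ranked parse list and
# keep the first k parses whose derived plans are distinct.

def distinct_k_best(ranked_parses, derive_plan, k=50):
    seen, kept = set(), []
    for parse in ranked_parses:                 # best-first order
        plan = derive_plan(parse)
        if plan not in seen:                    # many parses share one plan
            seen.add(plan)
            kept.append(parse)
            if len(kept) == k:
                break
    return kept

# Usage: four parses collapse to three distinct plans; keep the top two.
plans = {"p1": "A", "p2": "A", "p3": "B", "p4": "C"}
top2 = distinct_k_best(["p1", "p2", "p3", "p4"], plans.get, k=2)
# top2 == ["p1", "p3"]
```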
Response-based Update vs. Baseline (English)

                  Parse F1             Single-sentence        Paragraph
            Baseline  Response      Baseline  Response    Baseline  Response
Hierarchy     74.81     73.32         57.22     59.65       20.17     22.62
Unigram       76.44     77.24         67.14     68.27       28.12     29.20

81
Response-based Update vs. Baseline (Chinese-Word)

                  Parse F1             Single-sentence        Paragraph
            Baseline  Response      Baseline  Response    Baseline  Response
Hierarchy     75.53     77.26         61.03     64.12       19.08     21.29
Unigram       76.41     77.74         63.40     65.64       23.12     23.74

82
Response-based Update vs. Baseline (Chinese-Character)

                  Parse F1             Single-sentence        Paragraph
            Baseline  Response      Baseline  Response    Baseline  Response
Hierarchy     73.05     76.26         55.61     64.08       12.74     22.25
Unigram       77.55     79.76         62.85     65.50       23.33     25.35

83
Response-based Update vs. Baseline
• The response-based approach performs better on the final end-task plan execution
  – It optimizes the model for plan execution

84
Response-based Update with Multiple vs. Single Parses (English)

                  Parse F1             Single-sentence        Paragraph
             Single    Multi        Single    Multi       Single    Multi
Hierarchy     73.32     73.43         59.65     62.81       22.62     26.57
Unigram       77.24     77.81         68.27     68.93       29.20     29.10

85
Response-based Update with Multiple vs. Single Parses (Chinese-Word)

                  Parse F1             Single-sentence        Paragraph
             Single    Multi        Single    Multi       Single    Multi
Hierarchy     77.26     78.80         64.12     64.15       21.29     21.55
Unigram       77.74     78.11         65.64     66.27       23.74     25.95

86
Response-based Update with Multiple vs. Single Parses (Chinese-Character)

                  Parse F1             Single-sentence        Paragraph
             Single    Multi        Single    Multi       Single    Multi
Hierarchy     76.26     79.44         64.08     64.08       22.25     22.58
Unigram       79.76     79.94         65.50     66.84       25.35     27.16

87
Response-based Update with Multiple vs. Single Parses
• Using multiple parses generally improves performance
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates often produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback

88
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)

90
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion

91
Conclusion
• Conventional language learning is expensive and does not scale, because it requires annotated training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment

92
Thank You
Parse Accuracy (Chinese-Character)
Precision Recall F1
9248
5647
7001
7977
6738
7305
7973
75527755
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
61
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
End-to-End Execution Evaluations
bull Test how well the formal plan from the output of semantic parser reaches the destination
bull Strict metric Only successful if the final position matches exactlyndash Also consider facing direction in single-sentencendash Paragraph execution is affected by even one
single-sentence execution
62
End-to-End Execution Evaluations(English)
Single-Sentence Paragraph
544
1618
5728
1918
5722
2017
6714
2812
Chen amp Mooney (2011) Chen (2012)Hierarchy Generation PCFG Model Unigram Generation PCFG Model
63
End-to-End Execution Evaluations(Chinese-Word)
Single-Sentence Paragraph
587
2013
6103
1908
634
2312
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
64
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions (MR plans are underspecified or have ignorable details attached; sometimes inaccurate, but they contain the correct MR components to reach the desired goal)
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
76
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Figure, shown in two animation steps (slides 77–78): the derived MRs (MR₁ … MRₙ) are scored by the MARCO execution module (execution success rates e.g. 0.6, 0.4, 0.0, 0.9, 0.2); each candidate whose rate exceeds that of the perceptron's current best prediction (perceptron scores e.g. 1.24, 1.83, -1.09, 1.46, 0.59) contributes a feature-vector-difference update]
78
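The multi-parse update can be sketched roughly as follows; the learning rate and the exact weighting scheme are illustrative assumptions, following the ACL 2013 formulation only loosely:

```python
import numpy as np

def multi_parse_update(w, feats, rates, lr=1.0):
    """One response-based perceptron update using multiple parses.

    `feats[i]` is the feature vector of candidate i and `rates[i]` its
    execution success rate. Every candidate whose rate exceeds that of
    the current best-scoring prediction contributes an update weighted
    by the difference in success rates (a sketch, not the exact method).
    """
    pred = int(np.argmax([w @ a for a in feats]))  # current best prediction
    for a, r in zip(feats, rates):
        if r > rates[pred]:
            # feature-vector difference, weighted by the rate difference
            w = w + lr * (r - rates[pred]) * (a - feats[pred])
    return w

# toy usage: three candidates; candidates 2 and 3 execute better than
# the perceptron's initial pick (candidate 1), so both contribute
w = np.zeros(2)
feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
rates = [0.2, 0.9, 0.6]
w = multi_parse_update(w, feats, rates)
```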
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner"
L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()

f(L₁ → L₃) = 1    f(L₃ → L₅ ∨ L₁) = 1    f(L₃ ⇒ L₅ L₆) = 1    f(L₅, "find") = 1
79
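A toy sketch of extracting such binary rule-composition features from a parse tree; the nested-tuple tree representation and the feature-naming scheme are invented for illustration, not the thesis' actual data structures:

```python
def rule_features(tree):
    """Binary indicator features over rule compositions in a parse tree.

    `tree` is a nested (label, children) tuple. Emits parent->children
    rule features and parent=>grandchildren (two-level) features, in the
    spirit of Collins (EMNLP 2002).
    """
    feats = set()
    label, children = tree
    if children:
        kid_labels = [c[0] for c in children]
        feats.add(f"{label}->" + ",".join(kid_labels))
        grandkids = [g[0] for c in children for g in c[1]]
        if grandkids:  # two-level composition feature
            feats.add(f"{label}=>" + ",".join(grandkids))
        for c in children:
            feats |= rule_features(c)
    return feats

# toy parse: L1 expands to L2 and L3; L3 expands to L5 and L6
toy = ("L1", [("L2", []), ("L3", [("L5", []), ("L6", [])])])
f = rule_features(toy)
```

The feature set is binary: a feature is 1 if its rule composition occurs anywhere in the tree, which is what the set representation captures.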
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, and the corresponding parses, out of the 1,000,000-best parses (many parse trees differ insignificantly and lead to the same derived MR plan, so a sufficiently large 1,000,000-best parse list is generated from the baseline model)
80
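The deduplication step above can be sketched as follows; `derive_mr` is a hypothetical stand-in for composing an MR plan from a parse:

```python
def distinct_nbest(parses, derive_mr, k=50):
    """Keep the top-k parses whose derived MR plans are distinct.

    `parses` is assumed sorted best-first; `derive_mr(parse)` composes
    the MR plan (both names are illustrative). Many parse trees differ
    only insignificantly and yield the same plan, so a very large n-best
    list is scanned until k distinct plans are found.
    """
    seen, kept = set(), []
    for p in parses:
        mr = derive_mr(p)
        if mr not in seen:
            seen.add(mr)
            kept.append(p)
            if len(kept) == k:
                break
    return kept

# toy usage: parses are ints and the "plan" is the value mod 3,
# so only the first parse of each equivalence class is kept
top = distinct_nbest(range(100), lambda p: p % 3, k=3)
```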
Response-based Update vs. Baseline (English)

                  Parse F1              Single-sentence       Paragraph
                  Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline            74.81     76.44       57.22     67.14       20.17     28.12
Response-based      73.32     77.24       59.65     68.27       22.62     29.20

81
Response-based Update vs. Baseline (Chinese-Word)

                  Parse F1              Single-sentence       Paragraph
                  Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline            75.53     76.41       61.03     63.40       19.08     23.12
Response-based      77.26     77.74       64.12     65.64       21.29     23.74

82
Response-based Update vs. Baseline (Chinese-Character)

                  Parse F1              Single-sentence       Paragraph
                  Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Baseline            73.05     77.55       55.61     62.85       12.74     23.33
Response-based      76.26     79.76       64.08     65.50       22.25     25.35

83
Response-based Update vs. Baseline
• vs. Baseline
  – The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)

          Parse F1              Single-sentence       Paragraph
          Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single      73.32     77.24       59.65     68.27       22.62     29.20
Multi       73.43     77.81       62.81     68.93       26.57     29.10

85
Response-based Update with Multiple vs. Single Parses (Chinese-Word)

          Parse F1              Single-sentence       Paragraph
          Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single      77.26     77.74       64.12     65.64       21.29     23.74
Multi       78.80     78.11       64.15     66.27       21.55     25.95

86
Response-based Update with Multiple vs. Single Parses (Chinese-Character)

          Parse F1              Single-sentence       Paragraph
          Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
Single      76.26     79.76       64.08     65.50       22.25     25.35
Multi       79.44     79.94       64.08     66.84       22.58     27.16

87
Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves performance in general
  – The single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details, but capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback
88
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible, and effective, with weak feedback from the perceptual environment
92
Thank You
End-to-End Execution Evaluations (English)

                                   Single-Sentence    Paragraph
Chen & Mooney (2011)                    54.40           16.18
Chen (2012)                             57.28           19.18
Hierarchy Generation PCFG Model         57.22           20.17
Unigram Generation PCFG Model           67.14           28.12

63
End-to-End Execution Evaluations (Chinese-Word)

                                   Single-Sentence    Paragraph
Chen (2012)                             58.70           20.13
Hierarchy Generation PCFG Model         61.03           19.08
Unigram Generation PCFG Model           63.40           23.12

64
End-to-End Execution Evaluations (Chinese-Character)

                                   Single-Sentence    Paragraph
Chen (2012)                             57.27           16.73
Hierarchy Generation PCFG Model         55.61           12.74
Unigram Generation PCFG Model           62.85           23.33

65
Discussion
• Better recall in parse accuracy
  – Our probabilistic model also uses useful but low-scoring lexemes → more coverage
  – Unified models are not vulnerable to intermediate information loss
• The Hierarchy Generation PCFG model over-fits to the training data
  – Complexities: LHG and k-permutation rules
  – Particularly weak on the Chinese-character corpus: longer average sentence length makes the PCFG weights hard to estimate
• The Unigram Generation PCFG model is better
  – Less complexity: avoids over-fitting, generalizes better
• Better than Börschinger et al. (2011)
  – Overcomes intractability in complex MRLs
  – Learns from more general, complex ambiguity
  – Produces novel MR parses never seen during training
66
Comparison of Grammar Size and EM Training Time

                      Hierarchy Generation PCFG    Unigram Generation PCFG
Data                  |Grammar|    Time (hrs)      |Grammar|    Time (hrs)
English                 20,451       17.26           16,357        8.78
Chinese (Word)          21,636       15.99           15,459        8.05
Chinese (Character)     19,792       18.64           13,514       12.58

67
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• An effective approach to improve the performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – The trained model outputs the best result with maximum probability
[Figure: a trained generative model maps a testing example to the single 1-best candidate with maximum probability]
70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Figure: for a testing example, the trained baseline generative model generates n-best candidates (GEN); a trained secondary discriminative model selects the best prediction as the output]
71
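The rerank-at-test step can be sketched generically as follows; the linear score combination and the `alpha` parameter are illustrative choices, not the talk's exact formulation:

```python
def rerank(nbest, score_baseline, score_disc, alpha=1.0):
    """Pick the best of the n-best candidates with a secondary model.

    Combines the baseline generative score with a discriminative score;
    both scoring functions are stand-ins supplied by the caller.
    """
    return max(nbest, key=lambda c: score_baseline(c) + alpha * score_disc(c))

# toy usage: the baseline prefers candidate "a", but the discriminative
# model's evidence flips the final choice to "b"
base = {"a": 0.9, "b": 0.8}.get
disc = {"a": -1.0, "b": 1.0}.get
best = rerank(["a", "b"], base, disc)
```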
End-to-End Execution Evaluations(Chinese-Character)
Single-Sentence Paragraph
5727
1673
5561
1274
6285
2333
Chen (2012) Hierarchy Generation PCFG ModelUnigram Generation PCFG Model
65
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
68
Discriminative Reranking
• Effective approach to improve performance of generative models with a secondary discriminative model
• Applied to various NLP tasks
  – Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  – Semantic parsing (Lu et al., EMNLP 2008; Ge and Mooney, ACL 2006)
  – Part-of-speech tagging (Collins, EMNLP 2002)
  – Semantic role labeling (Toutanova et al., ACL 2005)
  – Named entity recognition (Collins, ACL 2002)
  – Machine translation (Shen et al., NAACL 2004; Fraser and Marcu, ACL 2006)
  – Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
• Goal: Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
• Generative model
  – The trained model outputs the single best result with maximum probability
[Diagram: Testing Example → Trained Generative Model → 1-best candidate with maximum probability (Candidate 1)]
70
Discriminative Reranking
• Can we do better?
  – A secondary discriminative model picks the best out of the n-best candidates from the baseline model
[Diagram: Testing Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) → Trained Secondary Discriminative Model → Best prediction → Output]
71
How can we apply discriminative reranking?
• Impossible to apply standard discriminative reranking to grounded language learning
  – Lack of a single gold-standard reference for each training example
  – Training instead provides weak supervision from the surrounding perceptual context (landmarks plan)
• Use response feedback from the perceptual world
  – Evaluate candidate formal MRs by executing them in simulated worlds
    • Also used in evaluating the final end-task plan execution
  – Weak indication of whether a candidate is good/bad
  – Multiple candidate parses for each parameter update
    • The response signal is weak and distributed over all candidates
72
Reranking Model: Averaged Perceptron (Collins, 2000)
• The parameter weight vector is updated when the trained model predicts a wrong candidate
[Diagram: Training Example → Trained Baseline Generative Model → GEN → n-best candidates (Candidate 1 … Candidate n) with feature vectors a_1, a_2, a_3, a_4, …, a_n and perceptron scores (−0.16, 1.21, −1.09, 1.46, 0.59); the perceptron compares the best prediction (a_4, score 1.46) against the gold-standard reference (a_g) and updates the weights by the feature-vector difference a_g − a_4]
73
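The perceptron update on this slide can be sketched in a few lines. This is a minimal illustration, with feature vectors as plain dicts and a given gold reference; it is not the thesis implementation.

```python
def perceptron_update(weights, candidates, gold):
    """One perceptron step: if the highest-scoring candidate is not the
    gold reference, move the weights toward the gold feature vector and
    away from the predicted one. Vectors are dicts: feature -> value."""
    score = lambda feats: sum(weights.get(f, 0.0) * v for f, v in feats.items())
    pred = max(candidates, key=score)          # current best prediction
    if pred is not gold:
        for f, v in gold.items():              # add a_g
            weights[f] = weights.get(f, 0.0) + v
        for f, v in pred.items():              # subtract the predicted vector
            weights[f] = weights.get(f, 0.0) - v
    return weights

# Toy step: with zero weights the first candidate is predicted, so the
# update moves the weights toward the gold candidate.
w = {}
gold = {"rule:L1->L3": 1.0, "word:find": 1.0}
bad = {"rule:L1->L4": 1.0}
perceptron_update(w, [bad, gold], gold)
```

The "averaged" variant additionally keeps a running sum of the weight vector across updates and uses its average at test time, which reduces variance.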
(For our generative models, a gold-standard reference is Not Available.)
Response-based Weight Update
• Pick a pseudo-gold parse out of all candidates
  – The most preferred one in terms of plan execution
  – Evaluate the composed MR plans from candidate parses
  – The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world
    • Also used for evaluating end-goal plan execution performance
  – Record the Execution Success Rate
    • Whether each candidate MR reaches the intended destination
    • MARCO is nondeterministic: average over 10 trials
  – Prefer the candidate with the best execution success rate during training
74
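The pseudo-gold selection above can be sketched as follows. Here `execute` stands in for the MARCO execution module, and the fixed outcome patterns are illustrative assumptions, not real execution results.

```python
from itertools import cycle

def success_rate(mr_plan, execute, trials=10):
    """Average execution success of one MR plan over repeated trials of a
    nondeterministic execution module (a stand-in for MARCO)."""
    return sum(1.0 if execute(mr_plan) else 0.0 for _ in range(trials)) / trials

def pick_pseudo_gold(candidate_mrs, execute, trials=10):
    """Return the candidate MR whose plan reaches the destination most often."""
    return max(candidate_mrs, key=lambda mr: success_rate(mr, execute, trials))

# Toy execution module with fixed outcome patterns: plan "A" succeeds
# 9 times out of 10, plan "B" only 2 times out of 10.
outcomes = {"A": cycle([True] * 9 + [False]), "B": cycle([True] * 2 + [False] * 8)}
execute = lambda mr: next(outcomes[mr])
best = pick_pseudo_gold(["B", "A"], execute, trials=10)   # -> "A"
```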
Response-based Update
• Select the pseudo-gold reference based on MARCO execution results
[Diagram: n-best candidates (Candidate 1 … Candidate n) → derived MRs (MR_1 … MR_n) → MARCO Execution Module → execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); the candidate with the highest rate (0.9) serves as the pseudo-gold reference, and the perceptron updates the best prediction (score 1.79; scores: 1.79, 0.21, −1.09, 1.46, 0.59) toward it by the feature-vector difference]
75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean correct plans, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature-vector difference, weighted by the difference between execution success rates
76
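The weighted multi-parse update can be sketched as follows, assuming feature vectors as dicts and precomputed execution success rates; the names are illustrative.

```python
def multi_parse_update(weights, candidates):
    """Response-based update with multiple parses: every candidate whose
    execution success rate beats the currently predicted parse pulls the
    weights toward itself, scaled by the success-rate gap.
    `candidates` is a list of (feature_dict, execution_success_rate)."""
    score = lambda feats: sum(weights.get(f, 0.0) * v for f, v in feats.items())
    pred_feats, pred_rate = max(candidates, key=lambda c: score(c[0]))
    for feats, rate in candidates:
        gap = rate - pred_rate
        if gap > 0:                              # candidate executes better
            for f in set(feats) | set(pred_feats):
                diff = feats.get(f, 0.0) - pred_feats.get(f, 0.0)
                weights[f] = weights.get(f, 0.0) + gap * diff
    return weights

# Toy step: the predicted parse (rate 0.4) is beaten only by the rate-0.9
# candidate, which contributes with weight 0.9 - 0.4 = 0.5.
w = multi_parse_update({}, [({"a": 1.0}, 0.4), ({"b": 1.0}, 0.9), ({"c": 1.0}, 0.2)])
```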
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram, update (1): n-best candidates (Candidate 1 … Candidate n) → derived MRs (MR_1 … MR_n) → MARCO Execution Module → execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); perceptron scores (1.24, 1.83, −1.09, 1.46, 0.59); the first update moves the weights from the predicted parse (score 1.83, rate 0.4) toward the candidate with rate 0.6, weighted by the feature-vector difference]
77
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse
[Diagram, update (2): the same n-best candidates, derived MRs, and MARCO execution success rates (0.6, 0.4, 0.0, 0.9, 0.2); the second update moves the weights from the predicted parse (score 1.83, rate 0.4) toward the candidate with rate 0.9, weighted by the feature-vector difference]
78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner"

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)      L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)      L5: Travel(), Verify(at: SOFA)      L6: Turn()

f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1
79
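Such binary rule-composition features can be sketched as indicator extraction over a parse tree; the (label, children) tuple encoding and the feature-name format are assumptions for illustration.

```python
def rule_features(tree):
    """Binary indicator features, one per parent -> children rule in a
    parse tree encoded as (label, [children]) tuples."""
    feats = {}
    label, children = tree
    if children:
        rhs = " ".join(child[0] for child in children)
        feats["rule:%s->%s" % (label, rhs)] = 1.0
        for child in children:
            feats.update(rule_features(child))
    return feats

# Tree in which L1 expands to L2 and L3, and L3 expands to L5 and L6.
tree = ("L1", [("L2", []), ("L3", [("L5", []), ("L6", [])])])
feats = rule_features(tree)   # {"rule:L1->L2 L3": 1.0, "rule:L3->L5 L6": 1.0}
```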
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses
    • Many parse trees differ insignificantly, leading to the same derived MR plans
    • Generate a sufficiently large 1,000,000-best parse list from the baseline model
80
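Collecting the 50 distinct plans can be sketched as a single ordered scan with deduplication on the derived MR; `derive_mr` is a hypothetical stand-in for composing the plan from a parse.

```python
def distinct_nbest(parses, derive_mr, k=50):
    """Scan an n-best parse list in order and keep the first parse for
    each distinct derived MR plan, stopping after k plans."""
    seen, kept = set(), []
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:          # new plan: keep this parse for it
            seen.add(mr)
            kept.append((parse, mr))
            if len(kept) == k:
                break
    return kept

# Toy n-best list in which several parses collapse to the same plan.
plans = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "C"}
top = distinct_nbest(["p1", "p2", "p3", "p4", "p5"], plans.get, k=2)
```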
Response-based Update vs. Baseline (English)

81

                     Parse F1              Single-sentence       Paragraph
                     Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
  Baseline             74.81      76.44      57.22      67.14      20.17      28.12
  Response-based       73.32      77.24      59.65      68.27      22.62      29.20
Response-based Update vs. Baseline (Chinese-Word)

82

                     Parse F1              Single-sentence       Paragraph
                     Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
  Baseline             75.53      76.41      61.03      63.46      19.08      23.12
  Response-based       77.26      77.74      64.12      65.64      21.29      23.74
Response-based Update vs. Baseline (Chinese-Character)

83

                     Parse F1              Single-sentence       Paragraph
                     Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
  Baseline             73.05      77.55      55.61      62.85      12.74      23.33
  Response-based       76.26      79.76      64.08      65.50      22.25      25.35
Response-based Update vs. Baseline
• The response-based approach performs better in the final end-task plan execution
  – It optimizes the model for plan execution
84
Response-based Update with Multiple vs. Single Parses (English)

85

                     Parse F1              Single-sentence       Paragraph
                     Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
  Single               73.32      77.24      59.65      68.27      22.62      29.20
  Multi                73.43      77.81      62.81      68.93      26.57      29.10
Response-based Update with Multiple vs. Single Parses (Chinese-Word)

86

                     Parse F1              Single-sentence       Paragraph
                     Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
  Single               77.26      77.74      64.12      65.64      21.29      23.74
  Multi                78.80      78.11      64.15      66.27      21.55      25.95
Response-based Update with Multiple vs. Single Parses (Chinese-Character)

87

                     Parse F1              Single-sentence       Paragraph
                     Hierarchy  Unigram    Hierarchy  Unigram    Hierarchy  Unigram
  Single               76.26      79.76      64.08      65.50      22.25      25.35
  Multi                79.44      79.94      64.08      66.84      22.58      27.16
Response-based Update with Multiple vs. Single Parses
• Using multiple parses improves performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, but capture the gist of the preferred actions
  – A variety of preferable parses helps improve the amount and the quality of the weak feedback
88
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection; model adaptation to large-scale settings
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
Discussionbull Better recall in parse accuracy
ndash Our probabilistic model uses useful but low score lexemes as well rarr more coverage
ndash Unified models are not vulnerable to intermediate information loss bull Hierarchy Generation PCFG model over-fits to training data
ndash Complexities LHG and k-permutation rulesParticularly weak in Chinese-character corpus Longer avg sentence length hard to estimate PCFG weights
bull Unigram Generation PCFG model is betterndash Less complexity avoid over-fitting better generalization
bull Better than Borschinger et al 2011ndash Overcome intractability in complex MRLndash Learn from more general complex ambiguityndash Novel MR parses never seen during training 66
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
Comparison of Grammar Size and EM Training Time
67
Data
Hierarchy GenerationPCFG Model
Unigram GenerationPCFG Model
|Grammar| Time (hrs) |Grammar| Time (hrs)
English 20451 1726 16357 878
Chinese (Word) 21636 1599 15459 805
Chinese (Character) 19792 1864 13514 1258
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
68
Discriminative Rerankingbull Effective approach to improve performance of generative
models with secondary discriminative modelbull Applied to various NLP tasks
ndash Syntactic parsing (Collins ICML 2000 Collins ACL 2002 Charniak amp Johnson ACL 2005)
ndash Semantic parsing (Lu et al EMNLP 2008 Ge and Mooney ACL 2006)
ndash Part-of-speech tagging (Collins EMNLP 2002)
ndash Semantic role labeling (Toutanova et al ACL 2005)
ndash Named entity recognition (Collins ACL 2002)
ndash Machine translation (Shen et al NAACL 2004 Fraser and Marcu ACL 2006)
ndash Surface realization in language generation (White amp Rajkumar EMNLP 2009 Konstas amp Lapata ACL 2012)
bull Goal ndash Adapt discriminative reranking to grounded language learning
69
Discriminative Reranking
bull Generative modelndash Trained model outputs the best result with max probability
TrainedGenerative
Model
1-best candidate with maximum probability
Candidate 1
Testing Example
70
Discriminative Rerankingbull Can we do better
ndash Secondary discriminative model picks the best out of n-best candidates from baseline model
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Testing Example
TrainedSecondary
DiscriminativeModel
Best prediction
Output
71
How can we apply discriminative reranking
bull Impossible to apply standard discriminative reranking to grounded language learningndash Lack of a single gold-standard reference for each training examplendash Instead provides weak supervision of surrounding perceptual
context (landmarks plan)bull Use response feedback from perceptual world ndash Evaluate candidate formal MRs by executing them in simulated
worldsUsed in evaluating the final end-task plan execution
ndash Weak indication of whether a candidate is goodbadndash Multiple candidate parses for parameter update
Response signal is weak and distributed over all candidates
72
Reranking Model Averaged Perceptron (Collins 2000)
bull Parameter weight vector is updated when trained model predicts a wrong candidate
TrainedBaseline
GenerativeModel
GEN
hellip
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate nhellip
Training Example
Perceptron
Gold StandardReference
Best prediction
Updatefeaturevector119938120783
119938120784
119938120785
119938120786
119938119951119938119944
119938119944minus119938120786
perceptronscore
-016
121
-109
146
059
73
Our generative models
NotAvailable
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn from raw features (sensory and vision data)
90
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and does not scale, because the training data must be annotated
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of fully probabilistic models for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible, and effective, with only weak feedback from the perceptual environment
92
Thank You
Response-based Weight Update
bull Pick a pseudo-gold parse out of all candidatesndash Most preferred one in terms of plan executionndash Evaluate composed MR plans from candidate parses ndash MARCO (MacMahon et al AAAI 2006) execution module runs and
evaluates each candidate MR in the worldAlso used for evaluating end-goal plan execution performance
ndash Record Execution Success RateWhether each candidate MR reaches the intended destinationMARCO is nondeterministic average over 10 trials
ndash Prefer the candidate with the best execution success rate during training
74
Response-based Updatebull Select pseudo-gold reference based on MARCO execution
results
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Pseudo-goldReference
Best prediction
UpdateDerived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
179
021
-109
146
059
75
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)

Parse F1:
  Hierarchy – Single 77.26, Multi 78.80
  Unigram   – Single 77.74, Multi 78.11
Plan execution, single-sentence:
  Hierarchy – Single 64.12, Multi 64.15
  Unigram   – Single 65.64, Multi 66.27
Plan execution, paragraph:
  Hierarchy – Single 21.29, Multi 21.55
  Unigram   – Single 23.74, Multi 25.95
86
Response-based Update with Multiple vs Single Parses (Chinese-Character)

Parse F1:
  Hierarchy – Single 76.26, Multi 79.44
  Unigram   – Single 79.76, Multi 79.94
Plan execution, single-sentence:
  Hierarchy – Single 64.08, Multi 64.08
  Unigram   – Single 65.50, Multi 66.84
Plan execution, paragraph:
  Hierarchy – Single 22.25, Multi 22.58
  Unigram   – Single 25.35, Multi 27.16
87
Response-based Update with Multiple vs Single Parses
• Using multiple parses improves the performance in general
  – A single-best pseudo-gold parse provides only weak feedback
  – Candidates with low execution success rates produce underspecified plans, or plans with ignorable details attached, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback
88
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
89
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation for the large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn with raw features (sensory and vision data)
90
Outline
• Introduction/Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
91
Conclusion
• Conventional language learning is expensive and not scalable, due to the annotation of training data
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework of full probabilistic models for learning NL–MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
92
Thank You
Response-based Update
• Select a pseudo-gold reference based on MARCO execution results

[Diagram: the n-best candidates (Candidate 1 … Candidate n) are composed into derived MRs (MR1 … MRn) and passed to the MARCO execution module, which returns execution success rates (0.6, 0.4, 0.0, 0.9, 0.2). The candidate with the highest success rate becomes the pseudo-gold reference. The perceptron's best prediction (perceptron scores 1.79, 0.21, -1.09, 1.46, 0.59) is compared against it, and the weights are updated by the feature vector difference.]
75
Weight Update with Multiple Parses
• Candidates other than the pseudo-gold could be useful
  – Multiple parses may have the same maximum execution success rate
  – "Lower" execution success rates could still mean a correct plan, given the indirect supervision of human follower actions
    • MR plans are underspecified or have ignorable details attached
    • Sometimes inaccurate, but they contain the correct MR components to reach the desired goal
• Weight update with multiple candidate parses
  – Use candidates with higher execution success rates than the currently best-predicted candidate
  – Update with the feature vector difference, weighted by the difference between execution success rates
76
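The multi-parse update above can be sketched as a small perceptron step. This is a hedged, illustrative sketch (not the thesis code): every candidate whose execution success rate exceeds that of the currently best-scoring parse contributes a feature-vector difference, scaled by the success-rate gap. `feats` and `exec_rate` are hypothetical callbacks standing in for feature extraction and the MARCO execution module.

```python
def multi_parse_update(w, candidates, feats, exec_rate, lr=1.0):
    """One response-based perceptron update over an n-best list.

    w: weight vector as a dict feature -> weight (updated in place).
    candidates: n-best candidate parses.
    feats(p): dict feature vector of parse p (assumed interface).
    exec_rate(p): execution success rate of p's plan in [0, 1].
    """
    # Current best prediction under the model's score.
    pred = max(candidates,
               key=lambda p: sum(w.get(f, 0.0) * v
                                 for f, v in feats(p).items()))
    pred_rate, pred_feats = exec_rate(pred), feats(pred)

    for cand in candidates:
        gap = exec_rate(cand) - pred_rate
        if gap > 0:  # only candidates that execute better than the prediction
            # Move weights toward the better candidate, away from the
            # prediction, weighted by the success-rate difference.
            for f, v in feats(cand).items():
                w[f] = w.get(f, 0.0) + lr * gap * v
            for f, v in pred_feats.items():
                w[f] = w.get(f, 0.0) - lr * gap * v
    return w
```

With a single pseudo-gold candidate this reduces to the standard response-based perceptron update; the success-rate weighting simply lets several "good enough" parses share the credit.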
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (1): among the n-best candidates, with derived MRs MR1 … MRn and execution success rates 0.6, 0.4, 0.0, 0.9, 0.2, a candidate whose success rate exceeds that of the perceptron's best prediction (perceptron scores 1.24, 1.83, -1.09, 1.46, 0.59) triggers an update by the feature vector difference.]
77
Weight Update with Multiple Parses
• Weight update with multiple candidates that have a higher execution success rate than the currently predicted parse

[Diagram, update (2): the same n-best list as before; a second candidate with a higher execution success rate than the best prediction triggers another update by its feature vector difference.]
78
Features
• Binary indicators of whether a certain composition of nonterminals/terminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)

"Turn left and find the sofa then turn around the corner"

L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L2: Turn(LEFT), Verify(front: SOFA)    L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
L4: Turn(LEFT)    L5: Travel(), Verify(at: SOFA)    L6: Turn()
79
f(L1 → L3) = 1,  f(L3 → L5 ∨ L1) = 1,  f(L3 ⇒ L5 L6) = 1,  f(L5, "find") = 1
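The rule-composition indicators above can be sketched as a walk over the parse tree that fires a binary feature for each parent→children rule and each grandparent⇒grandchildren (skip-level) composition. This is an illustrative sketch only; the tree encoding `(label, [children])` is an assumption, and lexical features such as f(L5, "find") are omitted for brevity.

```python
def rule_features(tree, feats=None):
    """Collect binary indicator features from a parse tree.

    tree: (label, [children]) tuples, e.g. ("L1", [("L3", [...])]).
    Returns a dict mapping feature names to 1.
    """
    feats = feats if feats is not None else {}
    label, children = tree
    if children:
        # Direct rule composition, e.g. f(L1 -> L3) = 1.
        kids = " ".join(c[0] for c in children)
        feats[f"{label}->{kids}"] = 1
        # Skip-level composition, e.g. f(L1 => L5 L6) = 1.
        grandkids = [g[0] for c in children for g in c[1]]
        if grandkids:
            feats[f"{label}=>{' '.join(grandkids)}"] = 1
        for c in children:
            rule_features(c, feats)
    return feats
```

Applied to the L1/L3/L5/L6 fragment above, this fires exactly the kinds of indicators the slide lists: the direct rule, its skip-level variant, and nothing for leaves.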
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
Weight Update with Multiple Parses
bull Candidates other than pseudo-gold could be useful ndash Multiple parses may have same maximum execution success ratesndash ldquoLowerrdquo execution success rates could mean correct plan given
indirect supervision of human follower actionsMR plans are underspecified or ignorable details attachedSometimes inaccurate but contain correct MR components to reach the
desired goal
bull Weight update with multiple candidate parsesndash Use candidates with higher execution success rates than currently
best-predicted candidatendash Update with feature vector difference weighted by difference
between execution success rates
76
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (1)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
77
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
Weight Update with Multiple Parses
bull Weight update with multiple candidates that have higher execution success rate than currently predicted parse
n-best candidates
Candidate 1
Candidate 2
Candidate 3
Candidate 4
Candidate n
hellip
Perceptron
Best prediction
Update (2)Derived
MRs
119924119929120783
119924119929120784
119924119929120785
119924119929120786
119924119929119951
Feature vector Difference
MARCOExecutionModule
ExecutionSuccess
Rate120782 120788120782 120786120782 120782
120782 120791
120782 120784
PerceptronScore
124
183
-109
146
059
78
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
2535
Paragraph
BaselineResponse-based
Response-based Update vs Baseline
bull vs Baselinendash Response-based approach performs better in the final end-
task plan executionndash Optimize the model for plan execution
84
Response-based Update with Multiple vs Single Parses (English)
85
Hierarchy Unigram
7332
7724
7343
7781
Parse F1
Single Multi
Hierarchy Unigram
5965
6827
6281
6893
Single-sentence
Single Multi
Hierarchy Unigram
2262
292
2657
291
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Word)
86
Hierarchy Unigram
7726
7774
788
7811
Parse F1
Single Multi
Hierarchy Unigram
6412
6564
6415
6627
Single-sentence
Single Multi
Hierarchy Unigram
2129
2374
2155
2595
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses (Chinese-Character)
87
Hierarchy Unigram
7626
79767944
7994
Parse F1
Single Multi
Hierarchy Unigram
6408
655
6408
6684
Single-sentence
Single Multi
Hierarchy Unigram
2225
2535
2258
2716
Paragraph
Single Multi
Response-based Update with Multiple vs Single Parses
bull Using multiple parses improves the performance in generalndash Single-best pseudo-gold parse provides only weak
feedbackndash Candidates with low execution success rates
produce underspecified plans or plans with ignorable details but capturing gist of preferred actions
ndash A variety of preferable parses help improve the amount and the quality of weak feedback
88
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
89
Future Directions
bull Integrating syntactic componentsndash Learn joint model of syntactic and semantic structure
bull Large-scale datandash Data collection model adaptation to large-scale
bull Machine translationndash Application to summarized translation
bull Real perceptual datandash Learn with raw features (sensory and vision data)
90
Outlinebull IntroductionMotivationbull Grounded Language Learning in Limited Ambiguity (Kim and
Mooney COLING 2010)ndash Learning to sportscast
bull Grounded Language Learning in High Ambiguity (Kim and Mooney EMNLP 2012)ndash Learn to follow navigational instructions
bull Discriminative Reranking for Grounded Language Learning (Kim and Mooney ACL 2013)
bull Future Directionsbull Conclusion
91
Conclusion
bull Conventional language learning is expensive and not scalable due to annotation of training data
bull Grounded language learning from relevant perceptual context is promising and training corpus is easy to obtain
bull Our proposed models provide general framework of full probabilistic model for learning NL-MR correspondences with ambiguous supervision
bull Discriminative reranking is possible and effective with weak feedback from perceptual environment
92
Thank You
bull Binary indicator whether a certain composition of nonterminalsterminals appear in parse tree(Collins EMNLP 2002 Lu et al EMNLP 2008 Ge amp Mooney ACL 2006)
L1 Turn(LEFT) Verify(frontSOFA backEASEL) Travel(steps2) Verify(atSOFA) Turn(RIGHT)
Features
Turn left and find the sofa then turn around the corner
L2 Turn(LEFT) Verify(frontSOFA) L3 Travel(steps2) Verify(atSOFA) Turn(RIGHT)
L4 Turn(LEFT) L5 Travel() Verify(atSOFA) L6 Turn()
79
119943 (119923120783rarr119923120785 )=120783119943 (119923120785rarr119923120787or119923120783)=120783119943 (119923120785rArr119923120787119923120788 )=120783119943 (119923120787119839119842119847119837 )=120783
Evaluationsbull Leave-one-map-out approachndash 2 maps for training and 1 map for testingndash Parse accuracy ndash Plan execution accuracy (end goal)
bull Compared with two baseline modelsndash Hierarchy and Unigram Generation PCFG modelsndash All reranking results use 50-best parsesndash Try to get 50-best distinct composed MR plans and according
parses out of 1000000-best parsesMany parse trees differ insignificantly leading to same derived MR
plansGenerate sufficiently large 1000000-best parse trees from baseline
model80
Response-based Update vs Baseline(English)
81
Hierarchy Unigram
7481
7644
7332
7724
Parse F1
BaselineResponse-based
Hierarchy Unigram
5722
6714
5965
6827
Single-sentence
Baseline Single
Hierarchy Unigram
2017
2812
2262
292
Paragraph
BaselineResponse-based
Response-based Update vs Baseline (Chinese-Word)
82
Hierarchy Unigram
7553
7641
7726
7774
Parse F1
BaselineResponse-based
Hierarchy Unigram
6103
6346412
6564
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1908
2312
2129
2374
Paragraph
BaselineResponse-based
Response-based Update vs Baseline(Chinese-Character)
83
Hierarchy Unigram
7305
7755
7626
7976
Parse F1
BaselineResponse-based
Hierarchy Unigram
5561
62856408
655
Single-sentence
BaselineResponse-based
Hierarchy Unigram
1274
23332225
Evaluations
• Leave-one-map-out approach
  – 2 maps for training and 1 map for testing
  – Parse accuracy
  – Plan execution accuracy (end goal)
• Compared with two baseline models
  – Hierarchy and Unigram Generation PCFG models
  – All reranking results use 50-best parses
  – Try to get the 50 best distinct composed MR plans, and their corresponding parses, out of the 1,000,000-best parses
    · Many parse trees differ insignificantly, leading to the same derived MR plans
    · Generate a sufficiently large 1,000,000-best parse list from the baseline model
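The distinct-plan filtering described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation; the function name `distinct_plan_parses` is hypothetical. It assumes parses arrive in descending score order, each paired with its derived MR plan (serialized as a string so duplicates compare equal).

```python
from typing import Iterable, List, Tuple

def distinct_plan_parses(nbest: Iterable[Tuple[object, str]],
                         k: int = 50) -> List[Tuple[object, str]]:
    """Keep the highest-scoring parse for each distinct MR plan.

    `nbest` is assumed to be (parse, plan) pairs in descending score
    order, so the first parse seen for a plan is its best parse.
    """
    seen = set()
    kept = []
    for parse, plan in nbest:
        if plan in seen:
            continue  # many parse trees differ insignificantly -> same plan
        seen.add(plan)
        kept.append((parse, plan))
        if len(kept) == k:
            break  # collected k distinct plans
    return kept
```

Scanning a very deep n-best list (here up to 1,000,000 parses) is what makes it possible to find 50 truly distinct plans despite heavy duplication among top parses.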
Response-based Update vs Baseline (English)

                   Baseline              Response-based
                   Hierarchy  Unigram    Hierarchy  Unigram
Parse F1             74.81     76.44       73.32     77.24
Single-sentence      57.22     67.14       59.65     68.27
Paragraph            20.17     28.12       22.62     29.20
Response-based Update vs Baseline (Chinese-Word)

                   Baseline              Response-based
                   Hierarchy  Unigram    Hierarchy  Unigram
Parse F1             75.53     76.41       77.26     77.74
Single-sentence      61.03     63.40       64.12     65.64
Paragraph            19.08     23.12       21.29     23.74
Response-based Update vs Baseline (Chinese-Character)

                   Baseline              Response-based
                   Hierarchy  Unigram    Hierarchy  Unigram
Parse F1             73.05     77.55       76.26     79.76
Single-sentence      55.61     62.85       64.08     65.50
Paragraph            12.74     23.33       22.25     25.35
Response-based Update vs Baseline
• The response-based approach performs better on the final end task, plan execution
  – It optimizes the model directly for plan execution
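A response-based update of this kind might look like the following perceptron-style sketch. This is an assumed simplification, not the exact published procedure: among the candidate parses, the highest-scoring one whose plan executes successfully serves as a pseudo-gold target, and the weights move toward it and away from the current model-best parse.

```python
def response_based_update(weights, candidates, features, executes):
    """One perceptron-style update driven by execution feedback.

    weights:     sparse feature-weight dict (mutated in place)
    candidates:  list of candidate parses from the n-best list
    features(p): sparse feature dict for parse p
    executes(p): True if p's plan reaches the goal when executed
    """
    def score(p):
        return sum(weights.get(f, 0.0) * v for f, v in features(p).items())

    model_best = max(candidates, key=score)
    successful = [p for p in candidates if executes(p)]
    if not successful:
        return weights  # no usable feedback from the environment
    # Pseudo-gold: highest-scoring parse whose plan actually executes.
    pseudo_gold = max(successful, key=score)
    if pseudo_gold is model_best:
        return weights  # model already prefers an executable parse
    for f, v in features(pseudo_gold).items():
        weights[f] = weights.get(f, 0.0) + v   # move toward pseudo-gold
    for f, v in features(model_best).items():
        weights[f] = weights.get(f, 0.0) - v   # move away from model-best
    return weights
```

The key point is that the supervision signal is only execution success, not a gold parse, which is why it is called weak feedback.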
Response-based Update with Multiple vs Single Parses (English)

                   Single                Multi
                   Hierarchy  Unigram    Hierarchy  Unigram
Parse F1             73.32     77.24       73.43     77.81
Single-sentence      59.65     68.27       62.81     68.93
Paragraph            22.62     29.20       26.57     29.10
Response-based Update with Multiple vs Single Parses (Chinese-Word)

                   Single                Multi
                   Hierarchy  Unigram    Hierarchy  Unigram
Parse F1             77.26     77.74       78.80     78.11
Single-sentence      64.12     65.64       64.15     66.27
Paragraph            21.29     23.74       21.55     25.95
Response-based Update with Multiple vs Single Parses (Chinese-Character)

                   Single                Multi
                   Hierarchy  Unigram    Hierarchy  Unigram
Parse F1             76.26     79.76       79.44     79.94
Single-sentence      64.08     65.50       64.08     66.84
Paragraph            22.25     25.35       22.58     27.16
Response-based Update with Multiple vs Single Parses
• Using multiple parses generally improves performance
  – A single-best pseudo-gold parse provides only weak feedback
  – Even candidates with low execution success rates produce underspecified plans, or plans with ignorable details, that still capture the gist of the preferred actions
  – A variety of preferable parses improves both the amount and the quality of the weak feedback
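One way to exploit multiple preferable parses, sketched under the same assumptions as before (hypothetical helper, not the exact published update), is to move the weights toward the averaged features of all candidates whose plans execute with a sufficiently high success rate, rather than toward a single pseudo-gold parse.

```python
def multi_parse_update(weights, candidates, features, success_rate,
                       threshold=0.5):
    """Update toward all sufficiently successful parses at once.

    success_rate(p) is assumed to be the fraction of executions of
    p's plan that reach the goal; `threshold` decides which parses
    count as preferable. A simplification for illustration.
    """
    def score(p):
        return sum(weights.get(f, 0.0) * v for f, v in features(p).items())

    good = [p for p in candidates if success_rate(p) >= threshold]
    if not good:
        return weights  # no preferable parses, no feedback
    model_best = max(candidates, key=score)
    # Positive update: average feature vector of all preferable parses.
    for p in good:
        for f, v in features(p).items():
            weights[f] = weights.get(f, 0.0) + v / len(good)
    # Negative update: the current model-best parse.
    for f, v in features(model_best).items():
        weights[f] = weights.get(f, 0.0) - v
    return weights
```

Averaging over several preferable parses is one plausible way to realize the slide's claim that a variety of parses increases both the amount and the quality of the feedback.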
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
Future Directions
• Integrating syntactic components
  – Learn a joint model of syntactic and semantic structure
• Large-scale data
  – Data collection and model adaptation at large scale
• Machine translation
  – Application to summarized translation
• Real perceptual data
  – Learn from raw features (sensory and vision data)
Outline
• Introduction / Motivation
• Grounded Language Learning in Limited Ambiguity (Kim and Mooney, COLING 2010)
  – Learning to sportscast
• Grounded Language Learning in High Ambiguity (Kim and Mooney, EMNLP 2012)
  – Learning to follow navigational instructions
• Discriminative Reranking for Grounded Language Learning (Kim and Mooney, ACL 2013)
• Future Directions
• Conclusion
Conclusion
• Conventional language learning is expensive and does not scale, because training data must be annotated
• Grounded language learning from relevant perceptual context is promising, and its training corpora are easy to obtain
• Our proposed models provide a general framework, a full probabilistic model, for learning NL-MR correspondences under ambiguous supervision
• Discriminative reranking is possible and effective with weak feedback from the perceptual environment
Thank You