End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding
Yun-Nung (Vivian) Chen
http://vivianchen.idv.tw
Hakkani-Tur, Tur, Gao, Deng
Outline
• Introduction
  – Spoken Dialogue System
  – Spoken/Natural Language Understanding (SLU/NLU)
• Contextual Spoken Language Understanding
  – Model Architecture
  – End-to-End Training
• Experiments
• Conclusion & Future Work
Spoken Dialogue System (SDS)
• Spoken dialogue systems are intelligent agents that help users finish tasks more efficiently via spoken interactions.
• Spoken dialogue systems are being incorporated into various devices (smartphones, smart TVs, in-car navigation systems, etc.).
Good intelligent assistants help users organize and access information conveniently.
JARVIS – Iron Man’s Personal Assistant Baymax – Personal Healthcare Companion
Dialogue System Pipeline
Speech Signal → ASR → Hypothesis: "are there any action movies to see this weekend"
(Text input: "Are there any action movies to see this weekend?")
→ Language Understanding (LU): User Intent Detection, Slot Filling
→ Semantic Frame (Intents, Slots): request_movie, genre=action, date=this weekend
→ Dialogue Management (DM): Dialogue State Tracking, Policy Decision
→ System Action: request_location
→ Output Generation
→ Text response: "Where are you located?" / Screen display: "location?"
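The pipeline stages above can be sketched as plain functions. Everything here is a hypothetical stand-in (no real ASR decoder, LU model, dialogue manager, or generation module), showing only the data flow for the example utterance:

```python
# Minimal sketch of the dialogue system pipeline. All bodies are
# hypothetical stand-ins that hard-code the slide's running example.

def asr(speech_signal):
    # Speech signal -> text hypothesis (stand-in: assume perfect recognition).
    return "are there any action movies to see this weekend"

def language_understanding(hypothesis):
    # Hypothesis -> semantic frame (intent + slots).
    return {"intent": "request_movie",
            "slots": {"genre": "action", "date": "this weekend"}}

def dialogue_management(frame):
    # Semantic frame -> system action (here: request the missing slot).
    return "request_location"

def output_generation(action):
    # System action -> natural-language response.
    return "Where are you located?"

frame = language_understanding(asr(None))
response = output_generation(dialogue_management(frame))
print(response)  # Where are you located?
```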
LU Importance
[Figure: Learning Curve of System Performance. Success Rate (0 to 1) vs. Simulation Epoch (0 to 495). Curves: Upper Bound, DQN - 0.00, DQN - 0.05, Rule - 0.00, Rule - 0.05. Annotated: RL Agent w/o LU errors; Rule Agent w/o LU errors.]
LU Importance
[Figure: the same learning curves, now also comparing agents with LU errors: RL Agent w/o LU errors vs. RL Agent w/ 5% LU errors, and Rule Agent w/o LU errors vs. Rule Agent w/ 5% LU errors. With 5% LU errors there is a >5% performance drop.]
System performance is sensitive to LU errors for both rule-based and reinforcement-learning agents.
Dialogue System Pipeline
SLU usually focuses on understanding single-turn utterances, but the understanding result is influenced by both 1) local observations and 2) global knowledge. In the pipeline, LU is the current bottleneck: its errors propagate through dialogue management and output generation.
Spoken Language Understanding
Domain Identification → Intent Prediction → Slot Filling

Single-turn example:
U: "just sent email to bob about fishing this weekend"
S: O O O O B-contact_name O B-subject I-subject I-subject
D: communication   I: send_email
→ send_email(contact_name="bob", subject="fishing this weekend")

Multi-turn example:
U1: "send email to bob"
S1: bob → B-contact_name
→ send_email(contact_name="bob")
U2: "are we going to fish this weekend"
S2: B-message I-message I-message I-message I-message I-message I-message
→ send_email(message="are we going to fish this weekend")
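The slot tags above can be collected into a semantic frame. A minimal sketch of that step; the helper `bio_to_frame` is hypothetical, not part of the paper:

```python
def bio_to_frame(words, tags, intent):
    """Collect BIO slot tags into a semantic frame (intent + slot values)."""
    slots, current, value = {}, None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):            # a new slot value begins
            if current:
                slots[current] = " ".join(value)
            current, value = tag[2:], [word]
        elif tag.startswith("I-") and current == tag[2:]:
            value.append(word)              # continue the current slot value
        else:                               # "O": close any open slot
            if current:
                slots[current] = " ".join(value)
            current, value = None, []
    if current:
        slots[current] = " ".join(value)
    return {"intent": intent, "slots": slots}

words = "just sent email to bob about fishing this weekend".split()
tags = ["O", "O", "O", "O", "B-contact_name", "O",
        "B-subject", "I-subject", "I-subject"]
print(bio_to_frame(words, tags, "send_email"))
# {'intent': 'send_email', 'slots': {'contact_name': 'bob', 'subject': 'fishing this weekend'}}
```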
MODEL ARCHITECTURE
Idea: additionally incorporate contextual knowledge during slot tagging.

1. Sentence Encoding: a sentence encoder (RNNin) encodes the current utterance c into a vector u; a contextual sentence encoder (RNNmem) encodes each history utterance x_i into a memory representation m_i.
2. Knowledge Attention: the inner product of u with each memory vector m_i, normalized, gives the knowledge attention distribution p_i.
3. Knowledge Encoding: the weighted sum o = Σ_i p_i m_i is combined with u through the weight matrix Wkg to produce the knowledge-encoding representation h, which is fed to the RNN tagger to predict the slot tagging sequence y.

Chen, et al., "End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding," in Interspeech, 2016.
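A minimal numerical sketch of the knowledge attention and knowledge encoding steps, with random vectors standing in for the sentence encoders. The softmax normalization and the combination h = Wkg(o + u) are assumptions read off the diagram, not guaranteed to match the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 150                     # embedding dimension (as in the experiments)
n_history = 3               # number of history utterances in memory

# Stand-ins for the encoders: RNN_in would produce u from the current
# utterance c; RNN_mem would produce one memory vector m_i per history
# utterance x_i.
u = rng.standard_normal(d)               # current-utterance encoding
M = rng.standard_normal((n_history, d))  # memory vectors m_i, one per row

# 2. Knowledge attention: inner product of u with each m_i,
#    normalized into a distribution p_i.
p = softmax(M @ u)

# 3. Knowledge encoding: weighted sum of memories, then a linear map
#    (the diagram's Wkg) over o + u gives h for the RNN tagger.
W_kg = rng.standard_normal((d, d))
o = p @ M                   # knowledge-encoding representation
h = W_kg @ (o + u)

print(h.shape)              # a single d-dimensional vector for the tagger
```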
A variant of the same architecture replaces the RNN sentence encoders (RNNin and RNNmem) with CNNs.
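A sketch of what such a CNN sentence encoder could look like: convolve window filters over the word embeddings, then max-pool over time into a fixed-size vector. The slides give no CNN details, so the filter width and count here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d_emb, n_filters, width, T = 150, 100, 3, 7   # width/count are illustrative

filters = rng.standard_normal((n_filters, width * d_emb)) * 0.1
sentence = rng.standard_normal((T, d_emb))    # word embeddings (stand-ins)

# Slide a window of `width` words over the sentence and flatten each window.
windows = np.stack([sentence[t:t + width].ravel()
                    for t in range(T - width + 1)])
features = np.tanh(windows @ filters.T)       # (T - width + 1, n_filters)
u = features.max(axis=0)                      # max-over-time pooling

print(u.shape)  # (100,) fixed-size sentence encoding
```

Unlike the RNN encoder, every window is computed independently, which is why the CNN variant trains faster.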
END-TO-END TRAINING
• Tagging Objective
• RNN Tagger
The RNN tagger reads the contextual utterances and the current utterance, with the knowledge-encoding vector o fed in at every step (through weights M) alongside the word inputs (W), the recurrent state (U), and the outputs (V), and emits the slot tag sequence. The whole network is trained end to end from the tagging objective, so the attention distribution is figured out automatically, without explicit supervision.
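The tagger recurrence in the diagram (weights W, U, V, and M for the knowledge vector o) can be sketched as a plain Elman-style RNN. The actual model uses gated units (GRUs), so this is a simplified illustration of where o enters, not the exact model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d_in, d_hid, n_tags, T = 150, 100, 5, 4   # dims follow the experiments; n_tags/T illustrative

# Weight names follow the diagram: W (input), U (recurrent),
# Mk (knowledge vector o), V (output).
W = rng.standard_normal((d_hid, d_in)) * 0.1
U = rng.standard_normal((d_hid, d_hid)) * 0.1
Mk = rng.standard_normal((d_hid, d_hid)) * 0.1
V = rng.standard_normal((n_tags, d_hid)) * 0.1

words = rng.standard_normal((T, d_in))    # word embeddings w_t (stand-ins)
o = rng.standard_normal(d_hid)            # knowledge encoding from the memory network

# The knowledge vector o is injected at every time step through Mk.
h = np.zeros(d_hid)
tags = []
for t in range(T):
    h = np.tanh(W @ words[t] + U @ h + Mk @ o)
    tags.append(int(np.argmax(softmax(V @ h))))

print(tags)  # one predicted tag index per word
```

Because o feeds into every step, the tagging loss back-propagates through the attention weights, which is how the attention distribution is learned without explicit supervision.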
EXPERIMENTS
• Dataset: Cortana communication session data
• Setup:
  – GRU for all RNNs
  – adam optimizer
  – embedding dim = 150
  – hidden units = 100
  – dropout = 0.5
Results (by turn position):

Model           Training Set  Knowledge Encoding        Sentence Encoder  First Turn  Other  Overall
RNN Tagger      single-turn   x                         x                 60.6        16.2   25.5
RNN Tagger      multi-turn    x                         x                 55.9        45.7   47.4
Encoder-Tagger  multi-turn    current utt (c)           RNN               57.6        56.0   56.3
Encoder-Tagger  multi-turn    history + current (x, c)  RNN               69.9        60.8   62.5
Proposed        multi-turn    history + current (x, c)  RNN               73.2        65.7   67.1
Proposed        multi-turn    history + current (x, c)  CNN               73.8        66.5   68.0  (NEW! NOT IN THE PAPER!)

Observations:
• The model trained on single-turn data performs much worse on non-first turns due to mismatched training data.
• Treating multi-turn data as single-turn for training performs reasonably.
• Encoding both current and history utterances improves performance but increases training time.
• Applying memory networks significantly outperforms all other approaches, with much less training time.
• CNN sentence encoders produce comparable results with shorter training time.
Conclusion
• The proposed end-to-end memory networks store contextual knowledge, which an attention model exploits dynamically to carry knowledge over across turns for multi-turn understanding
• The end-to-end model performs the tagging task instead of
classification
• The experiments show the feasibility and robustness of
modeling knowledge carryover through memory networks
Future Work
• Leveraging not only local observation but also global
knowledge for better language understanding
– Syntax or semantics can serve as global knowledge to guide
the understanding model
– “Knowledge as a Teacher: Knowledge-Guided Structural Attention Networks,” arXiv preprint arXiv:1609.03286
Q & A. Thanks for your attention!
The code will be available at https://github.com/yvchen/ContextualSLU