End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding
Yun-Nung (Vivian) Chen
http://vivianchen.idv.tw
Hakkani-Tur, Tur, Gao, Deng
Outline
• Introduction
  – Spoken Dialogue System
  – Spoken/Natural Language Understanding (SLU/NLU)
• Contextual Spoken Language Understanding
  – Model Architecture
  – End-to-End Training
• Experiments
• Conclusion & Future Work
Spoken Dialogue System (SDS)
• Spoken dialogue systems are intelligent agents that help users finish tasks more efficiently via spoken interactions.
• Spoken dialogue systems are being incorporated into various devices (smartphones, smart TVs, in-car navigation systems, etc.).
Good intelligent assistants help users organize and access information conveniently.
JARVIS – Iron Man’s Personal Assistant Baymax – Personal Healthcare Companion
Dialogue System Pipeline
Speech Signal → ASR → Hypothesis: "are there any action movies to see this weekend"
(Text input: "Are there any action movies to see this weekend?")
→ Language Understanding (LU): User Intent Detection, Slot Filling
→ Semantic Frame (Intents, Slots): request_movie, genre=action, date=this weekend
→ Dialogue Management (DM): Dialogue State Tracking, Policy Decision
→ System Action: request_location
→ Output Generation
→ Text response: "Where are you located?" / Screen display: "location?"
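The pipeline stages above can be sketched as plain functions. Everything here is a hypothetical stand-in (no real ASR decoder, LU model, dialogue manager, or generation module), showing only the data flow for the example utterance:

```python
# Minimal sketch of the dialogue system pipeline. All bodies are
# hypothetical stand-ins that hard-code the slide's running example.

def asr(speech_signal):
    # Speech signal -> text hypothesis (stand-in: assume perfect recognition).
    return "are there any action movies to see this weekend"

def language_understanding(hypothesis):
    # Hypothesis -> semantic frame (intent + slots).
    return {"intent": "request_movie",
            "slots": {"genre": "action", "date": "this weekend"}}

def dialogue_management(frame):
    # Semantic frame -> system action (here: request the missing slot).
    return "request_location"

def output_generation(action):
    # System action -> natural-language response.
    return "Where are you located?"

frame = language_understanding(asr(None))
response = output_generation(dialogue_management(frame))
print(response)  # Where are you located?
```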
LU Importance
[Figure: Learning Curve of System Performance. Success Rate (0 to 1) vs. Simulation Epoch (0 to 495). Curves: Upper Bound, DQN - 0.00, DQN - 0.05, Rule - 0.00, Rule - 0.05. Annotated: RL Agent w/o LU errors; Rule Agent w/o LU errors.]
LU Importance
[Figure: the same learning curves, now also comparing agents with LU errors: RL Agent w/o LU errors vs. RL Agent w/ 5% LU errors, and Rule Agent w/o LU errors vs. Rule Agent w/ 5% LU errors. With 5% LU errors there is a >5% performance drop.]
System performance is sensitive to LU errors for both rule-based and reinforcement-learning agents.
Dialogue System Pipeline
SLU usually focuses on understanding single-turn utterances, but the understanding result is influenced by both 1) local observations and 2) global knowledge. In the pipeline, LU is the current bottleneck: its errors propagate through dialogue management and output generation.
Spoken Language Understanding
Domain Identification → Intent Prediction → Slot Filling

Single-turn example:
U: "just sent email to bob about fishing this weekend"
S: O O O O B-contact_name O B-subject I-subject I-subject
D: communication   I: send_email
→ send_email(contact_name="bob", subject="fishing this weekend")

Multi-turn example:
U1: "send email to bob"
S1: bob → B-contact_name
→ send_email(contact_name="bob")
U2: "are we going to fish this weekend"
S2: B-message I-message I-message I-message I-message I-message I-message
→ send_email(message="are we going to fish this weekend")
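The slot tags above can be collected into a semantic frame. A minimal sketch of that step; the helper `bio_to_frame` is hypothetical, not part of the paper:

```python
def bio_to_frame(words, tags, intent):
    """Collect BIO slot tags into a semantic frame (intent + slot values)."""
    slots, current, value = {}, None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):            # a new slot value begins
            if current:
                slots[current] = " ".join(value)
            current, value = tag[2:], [word]
        elif tag.startswith("I-") and current == tag[2:]:
            value.append(word)              # continue the current slot value
        else:                               # "O": close any open slot
            if current:
                slots[current] = " ".join(value)
            current, value = None, []
    if current:
        slots[current] = " ".join(value)
    return {"intent": intent, "slots": slots}

words = "just sent email to bob about fishing this weekend".split()
tags = ["O", "O", "O", "O", "B-contact_name", "O",
        "B-subject", "I-subject", "I-subject"]
print(bio_to_frame(words, tags, "send_email"))
# {'intent': 'send_email', 'slots': {'contact_name': 'bob', 'subject': 'fishing this weekend'}}
```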
MODEL ARCHITECTURE
Idea: additionally incorporate contextual knowledge during slot tagging.

1. Sentence Encoding: a sentence encoder (RNNin) encodes the current utterance c into a vector u; a contextual sentence encoder (RNNmem) encodes each history utterance x_i into a memory representation m_i.
2. Knowledge Attention: the inner product of u with each memory vector m_i, normalized, gives the knowledge attention distribution p_i.
3. Knowledge Encoding: the weighted sum o = Σ_i p_i m_i is combined with u through the weight matrix Wkg to produce the knowledge-encoding representation h, which is fed to the RNN tagger to predict the slot tagging sequence y.

Chen, et al., "End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding," in Interspeech, 2016.
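A minimal numerical sketch of the knowledge attention and knowledge encoding steps, with random vectors standing in for the sentence encoders. The softmax normalization and the combination h = Wkg(o + u) are assumptions read off the diagram, not guaranteed to match the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 150                     # embedding dimension (as in the experiments)
n_history = 3               # number of history utterances in memory

# Stand-ins for the encoders: RNN_in would produce u from the current
# utterance c; RNN_mem would produce one memory vector m_i per history
# utterance x_i.
u = rng.standard_normal(d)               # current-utterance encoding
M = rng.standard_normal((n_history, d))  # memory vectors m_i, one per row

# 2. Knowledge attention: inner product of u with each m_i,
#    normalized into a distribution p_i.
p = softmax(M @ u)

# 3. Knowledge encoding: weighted sum of memories, then a linear map
#    (the diagram's Wkg) over o + u gives h for the RNN tagger.
W_kg = rng.standard_normal((d, d))
o = p @ M                   # knowledge-encoding representation
h = W_kg @ (o + u)

print(h.shape)              # a single d-dimensional vector for the tagger
```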
A variant of the same architecture replaces the RNN sentence encoders (RNNin and RNNmem) with CNNs.
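A sketch of what such a CNN sentence encoder could look like: convolve window filters over the word embeddings, then max-pool over time into a fixed-size vector. The slides give no CNN details, so the filter width and count here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d_emb, n_filters, width, T = 150, 100, 3, 7   # width/count are illustrative

filters = rng.standard_normal((n_filters, width * d_emb)) * 0.1
sentence = rng.standard_normal((T, d_emb))    # word embeddings (stand-ins)

# Slide a window of `width` words over the sentence and flatten each window.
windows = np.stack([sentence[t:t + width].ravel()
                    for t in range(T - width + 1)])
features = np.tanh(windows @ filters.T)       # (T - width + 1, n_filters)
u = features.max(axis=0)                      # max-over-time pooling

print(u.shape)  # (100,) fixed-size sentence encoding
```

Unlike the RNN encoder, every window is computed independently, which is why the CNN variant trains faster.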
END-TO-END TRAINING
• Tagging Objective
• RNN Tagger
The RNN tagger reads the contextual utterances and the current utterance, with the knowledge-encoding vector o fed in at every step (through weights M) alongside the word inputs (W), the recurrent state (U), and the outputs (V), and emits the slot tag sequence. The whole network is trained end to end from the tagging objective, so the attention distribution is figured out automatically, without explicit supervision.
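The tagger recurrence in the diagram (weights W, U, V, and M for the knowledge vector o) can be sketched as a plain Elman-style RNN. The actual model uses gated units (GRUs), so this is a simplified illustration of where o enters, not the exact model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d_in, d_hid, n_tags, T = 150, 100, 5, 4   # dims follow the experiments; n_tags/T illustrative

# Weight names follow the diagram: W (input), U (recurrent),
# Mk (knowledge vector o), V (output).
W = rng.standard_normal((d_hid, d_in)) * 0.1
U = rng.standard_normal((d_hid, d_hid)) * 0.1
Mk = rng.standard_normal((d_hid, d_hid)) * 0.1
V = rng.standard_normal((n_tags, d_hid)) * 0.1

words = rng.standard_normal((T, d_in))    # word embeddings w_t (stand-ins)
o = rng.standard_normal(d_hid)            # knowledge encoding from the memory network

# The knowledge vector o is injected at every time step through Mk.
h = np.zeros(d_hid)
tags = []
for t in range(T):
    h = np.tanh(W @ words[t] + U @ h + Mk @ o)
    tags.append(int(np.argmax(softmax(V @ h))))

print(tags)  # one predicted tag index per word
```

Because o feeds into every step, the tagging loss back-propagates through the attention weights, which is how the attention distribution is learned without explicit supervision.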
EXPERIMENTS
• Dataset: Cortana communication session data
• Setup:
  – GRU for all RNNs
  – adam optimizer
  – embedding dim = 150
  – hidden units = 100
  – dropout = 0.5
Results (by turn position):

Model           Training Set  Knowledge Encoding        Sentence Encoder  First Turn  Other  Overall
RNN Tagger      single-turn   x                         x                 60.6        16.2   25.5
RNN Tagger      multi-turn    x                         x                 55.9        45.7   47.4
Encoder-Tagger  multi-turn    current utt (c)           RNN               57.6        56.0   56.3
Encoder-Tagger  multi-turn    history + current (x, c)  RNN               69.9        60.8   62.5
Proposed        multi-turn    history + current (x, c)  RNN               73.2        65.7   67.1
Proposed        multi-turn    history + current (x, c)  CNN               73.8        66.5   68.0  (NEW! NOT IN THE PAPER!)

Observations:
• The model trained on single-turn data performs much worse on non-first turns due to mismatched training data.
• Treating multi-turn data as single-turn for training performs reasonably.
• Encoding both current and history utterances improves performance but increases training time.
• Applying memory networks significantly outperforms all other approaches, with much less training time.
• CNN sentence encoders produce comparable results with shorter training time.
Conclusion
• The proposed end-to-end memory networks store contextual knowledge, which an attention model exploits dynamically to carry knowledge over across turns for multi-turn understanding
• The end-to-end model performs the tagging task instead of
classification
• The experiments show the feasibility and robustness of
modeling knowledge carryover through memory networks
Future Work
• Leveraging not only local observation but also global
knowledge for better language understanding
– Syntax or semantics can serve as global knowledge to guide
the understanding model
– “Knowledge as a Teacher: Knowledge-Guided Structural Attention Networks,” arXiv preprint arXiv:1609.03286
Q & A. Thanks for your attention!
The code will be available at https://github.com/yvchen/ContextualSLU