
Speaker Role Contextual Modeling for Language Understanding and Dialogue Policy Learning

Ta-Chung Chi⋆  Po-Chun Chen⋆  Shang-Yu Su†  Yun-Nung Chen⋆

⋆Department of Computer Science and Information Engineering
†Graduate Institute of Electrical Engineering
National Taiwan University

{b02902019,r06922028,r05921117}@ntu.edu.tw [email protected]

Abstract

Language understanding (LU) and dialogue policy learning are two essential components in conversational systems. Human-human dialogues are not well-controlled and are often random and unpredictable due to the speakers' own goals and speaking habits. This paper proposes a role-based contextual model that considers different speaker roles independently, based on the various speaking patterns in multi-turn dialogues. The experiments on the benchmark dataset show that the proposed role-based model successfully learns role-specific behavioral patterns for contextual encoding and then significantly improves language understanding and dialogue policy learning tasks.1

1 Introduction

Spoken dialogue systems that can help users solve complex tasks, such as booking a movie ticket, have become an emerging research topic in the artificial intelligence and natural language processing area. With a well-designed dialogue system as an intelligent personal assistant, people can accomplish certain tasks more easily via natural language interactions. Today, there are several virtual intelligent assistants, such as Apple's Siri, Google's Home, Microsoft's Cortana, and Amazon's Echo. Recent advances in deep learning have inspired many applications of neural models to dialogue systems. Wen et al. (2017), Bordes et al. (2017), and Li et al. (2017) introduced network-based end-to-end trainable task-oriented dialogue systems.

1 The source code is available at: https://github.com/MiuLab/Spk-Dialogue.

A key component of the understanding system is a language understanding (LU) module: it parses user utterances into semantic frames that capture the core meaning, where the three main tasks of LU are domain classification, intent determination, and slot filling (Tur and De Mori, 2011). A typical LU pipeline first decides the domain given the input utterance and, based on the domain, predicts the intent and fills the associated slots corresponding to a domain-specific semantic template. Recent advances in deep learning have inspired many applications of neural models to natural language processing tasks, and better approaches to LU are emerging (Hakkani-Tur et al., 2016; Chen et al., 2016b,a; Wang et al., 2016). However, most of the above work focused on single-turn interactions, where each utterance is treated independently.

Contextual information has been shown useful for LU (Bhargava et al., 2013; Xu and Sarikaya, 2014; Chen et al., 2015; Sun et al., 2016). For example, Figure 1 shows conversational utterances where the intent of the highlighted tourist utterance is to ask about location information, but it is difficult to understand without contexts. Hence, it is more likely to estimate the location-related intent given the contextual utterance about location recommendation. Contextual information has been incorporated into recurrent neural networks (RNNs) for improved domain classification, intent prediction, and slot filling (Xu and Sarikaya, 2014; Shi et al., 2015; Weston et al., 2015; Chen et al., 2016c). The LU output is the semantic representation of the user's behaviors, which then flows to the downstream dialogue management component in order to decide which action the system should take next; this is called the dialogue policy. It is intuitive that better understanding could improve dialogue policy learning, so that the dialogue management can be further

[Figure 1: The human-human conversational utterances and their associated semantics from DSTC4. Task 1 (language understanding) labels the user intents; Task 2 (dialogue policy learning) labels the system actions.
Guide: "so you of course %uh you can have dinner there and %uh of course you also can do sentosa , if you want to for the song of the sea , right ?" → FOL_RECOMMEND:FOOD; QST_CONFIRM:LOC; QST_RECOMMEND:LOC
Tourist: "yah ." → RES_CONFIRM
Tourist: "what 's the song in the sea ?" → QST_WHAT:LOC (Task 1)
Guide: "a song of the sea in fact is %uh laser show inside sentosa" → FOL_EXPLAIN:LOC (Task 2)]

boosted through interactions (Li et al., 2017).

Most previous dialogue systems did not take speaker roles into consideration. However, we discover that different speaker roles can cause notable variance in speaking habits and in turn affect system performance differently (Chen et al., 2017). As shown in Figure 1, the benchmark dialogue dataset, Dialogue State Tracking Challenge 4 (DSTC4) (Kim et al., 2016),2 contains two specific roles, a tourist and a guide. Under the scenario of dialogue systems and the communication patterns, we take the tourist as the user and the guide as the dialogue agent (system). During conversations, the user may focus on not only reasoning (user history) but also listening (agent history), so different speaker roles could provide various cues for better understanding and policy learning.

This paper focuses on LU and dialogue policy learning, which target the understanding of the tourist's natural language (language understanding; LU) and the prediction of how the system should respond (system action prediction; SAP), respectively. In order to comprehend what the tourist is talking about and to predict how the guide reacts to the user, this work proposes a role-based contextual model that models role-specific contexts differently for improving system performance.

2 Proposed Approach

The model architecture is illustrated in Figure 2. First, the previous utterances are fed into the contextual model to be encoded into a history summary, and then the summary vector and the current utterance are integrated to help LU and dialogue policy learning. The whole model is trained in an end-to-end fashion, where the history summary vector is automatically learned based on the two

2 http://www.colips.org/workshop/dstc4/

downstream tasks. The objective of the proposed model is to optimize the conditional probability p(y | x), so that the difference between the predicted distribution p(y_k = z | x) and the target distribution q(y_k = z | x) can be minimized:

L = -\sum_{k=1}^{K} \sum_{z=1}^{N} q(y_k = z \mid x) \log p(y_k = z \mid x),  (1)

where the labels y can be either intent tags for understanding or system actions for dialogue policy learning.
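Because each of the K labels is scored independently through a sigmoid (Eqs. (3) and (4) below), z effectively ranges over {0, 1}, and Eq. (1) reduces to a sum of binary cross-entropy terms. A minimal PyTorch sketch under that assumption (the function and variable names are ours, not from the paper):

```python
import torch
import torch.nn.functional as F

def multilabel_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Eq. (1) with binary labels: for z in {0, 1}, the inner sum over z is
    # exactly the binary cross entropy between q (targets) and p (sigmoid(logits)).
    # logits:  (batch, K) raw scores for the K intent/action labels
    # targets: (batch, K) target distribution q, here 0/1 indicators as floats
    return F.binary_cross_entropy_with_logits(logits, targets, reduction="sum")
```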

Language Understanding (LU)  Given the current utterance x = {w_t}_{t=1}^{T}, the goal is to predict the user intents of x, which include the speech acts and associated attributes shown in Figure 1; for example, QST_WHAT is composed of the speech act QST and the associated attribute WHAT. Note that we do not process the slot filling task for extracting LOC. We apply a bidirectional long short-term memory (BLSTM) model (Schuster and Paliwal, 1997) to integrate preceding and following words to learn the probability distribution of the user intents:

v_cur = BLSTM(x, W_his · v_his),  (2)
o = sigmoid(W_LU · v_cur),  (3)

where W_his is a dense matrix, v_his is the history summary vector, v_cur is the context-aware vector of the current utterance encoded by the BLSTM, and o is the intent distribution. Note that this is a multi-label and multi-class classification, so the sigmoid function is employed for modeling the distribution after a dense layer. The user intent labels y are decided based on whether the value is higher than a threshold θ.
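A minimal PyTorch sketch of Eqs. (2)-(3). The paper does not specify how W_his · v_his enters the BLSTM, so conditioning it as the initial hidden states, along with the placeholder n_intents, is our assumption:

```python
import torch
import torch.nn as nn

class ContextualLU(nn.Module):
    # Sketch of Eqs. (2)-(3): a BLSTM encodes the current utterance, conditioned
    # on a projection of the history summary (assumed here to initialize both
    # directions' hidden states). n_intents is a placeholder label count.
    def __init__(self, vocab_size, n_intents, emb_dim=200, hid_dim=128):
        super().__init__()
        self.hid_dim = hid_dim
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.W_his = nn.Linear(hid_dim, 2 * hid_dim)
        self.blstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.W_lu = nn.Linear(2 * hid_dim, n_intents)

    def forward(self, x, v_his):
        # x: (batch, T) word ids; v_his: (batch, hid_dim) history summary
        h0 = self.W_his(v_his).view(-1, 2, self.hid_dim).transpose(0, 1).contiguous()
        c0 = torch.zeros_like(h0)
        _, (h_n, _) = self.blstm(self.embed(x), (h0, c0))
        v_cur = torch.cat([h_n[0], h_n[1]], dim=-1)   # context-aware vector, Eq. (2)
        return torch.sigmoid(self.W_lu(v_cur))        # intent distribution o, Eq. (3)
```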

[Figure 2: Illustration of the proposed role-based contextual model. History utterances (Utter_{h-3}, Utter_{h-2}, Utter_{h-1}) from the user (tourist) and the agent (guide) pass through tied sentence encoders into role-based contextual models that produce history summaries, with the corresponding intents (Intent_{h-3}, Intent_{h-2}, Intent_{h-1}) providing intermediate guidance; the summed history summary and the current utterance w_t ... w_T feed dense layers for language understanding (intent prediction) and dialogue policy learning (system action prediction).]

Dialogue Policy Learning  For system action prediction, we also perform a similar multi-label multi-class classification on the context-aware vector v_cur from (2) using the sigmoid function:

o = sigmoid(W_π · v_cur),  (4)

and then the system actions can be decided based on a threshold θ.
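The threshold decoding step shared by Eqs. (3) and (4) can be sketched as follows; the paper does not report the value of θ, so 0.5 is a placeholder:

```python
import torch

def decode_labels(o: torch.Tensor, theta: float = 0.5):
    # o: (batch, K) sigmoid scores from Eq. (3) or Eq. (4).
    # Keep every label whose score exceeds theta.
    return [(row > theta).nonzero().flatten().tolist() for row in o]
```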

2.1 Contextual Module

In order to leverage the contextual information, we utilize two types of contexts, 1) semantic labels and 2) natural language, to learn the history summary representation v_his in (2). The illustration is shown in the top-right part of Figure 2.

Semantic Label  Given a sequence of annotated intent tags and associated attributes for each history utterance, we employ a BLSTM to model the explicit semantics:

v_his = BLSTM(intent_t),  (5)

where intent_t is the vector after one-hot encoding for representing the annotated intent and attribute features. Note that this model requires the ground truth annotations of the history utterances for both training and testing.
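A sketch of Eq. (5), assuming multi-hot intent/attribute vectors per turn; the final projection is our addition so that v_his matches the dimensionality used in Eq. (2):

```python
import torch
import torch.nn as nn

class SemanticHistoryEncoder(nn.Module):
    # Sketch of Eq. (5): a BLSTM over the (multi-)hot intent/attribute vectors
    # of the history turns. The reduce layer is our assumption to keep v_his
    # at hid_dim, matching the history summary consumed by Eq. (2).
    def __init__(self, n_intents, hid_dim=128):
        super().__init__()
        self.blstm = nn.LSTM(n_intents, hid_dim, bidirectional=True, batch_first=True)
        self.reduce = nn.Linear(2 * hid_dim, hid_dim)

    def forward(self, intent_seq):
        # intent_seq: (batch, turns, n_intents) annotated history semantics
        _, (h_n, _) = self.blstm(intent_seq)
        return self.reduce(torch.cat([h_n[0], h_n[1]], dim=-1))  # v_his
```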

Natural Language (NL)  Given the natural language history, a sentence encoder is applied to learn a vector representation for each prior utterance. After encoding, the feature vectors are fed into a BLSTM to capture temporal information:

v_his = BLSTM(CNN(utt_t)),  (6)

where the CNN is good at extracting the most salient features that can represent the given natural language utterances, as illustrated in Figure 3. Here the sentence encoder can be replaced with different encoders,3 and the weights of all encoders are tied together.

[Figure 3: Illustration of the CNN sentence encoder for the example sentence "what's the song in the sea": word embeddings pass through a convolutional layer with multiple filter widths, max-over-time pooling, and a fully-connected layer.]
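A sketch of this encoder with the hyperparameters from Section 3.1 (filter widths [2, 3, 4], 128 feature maps per width, max-over-time pooling); the output size of the fully-connected layer from Figure 3 is our assumption:

```python
import torch
import torch.nn as nn

class CNNSentenceEncoder(nn.Module):
    # Kim (2014)-style sentence encoder: one Conv1d per filter width,
    # max-over-time pooling, then a fully-connected layer.
    def __init__(self, emb_dim=200, widths=(2, 3, 4), n_filters=128, out_dim=128):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv1d(emb_dim, n_filters, w) for w in widths)
        self.fc = nn.Linear(len(widths) * n_filters, out_dim)

    def forward(self, emb):
        # emb: (batch, seq_len, emb_dim) word embeddings of one utterance
        x = emb.transpose(1, 2)                 # Conv1d wants (batch, channels, len)
        # Max-over-time pooling keeps the strongest feature per filter map,
        # which naturally handles variable sentence lengths.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))
```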

NL with Intermediate Guidance  Considering that the semantic labels may provide rich cues, the middle supervision signal is utilized as intermediate guidance for the sentence encoding module, in order to guide it to project input utterances into a more meaningful feature space. Specifically, for each utterance, we compute the cross entropy loss between the encoder outputs and the corresponding intent-attributes shown in Figure 2. Assuming that l_t is the encoding loss for utt_t in the history, the final objective is to minimize L + Σ_t l_t. This model does not require the ground truth semantics of the history when testing, so it is more practical than the above model using semantic labels.

3 In the experiments, CNN achieved slightly better performance with fewer parameters compared with BLSTM.
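A sketch of the guided objective, assuming each encoding loss l_t is a binary cross entropy between the sentence-encoder output projected to the intent label space (encoder_logits, a projection we introduce) and the multi-hot intent-attribute labels:

```python
import torch.nn.functional as F

def guided_loss(main_loss, encoder_logits, history_intents):
    # Intermediate guidance: minimize L + sum_t l_t, where each l_t supervises
    # one history utterance's encoding with its intent-attribute labels.
    aux = sum(F.binary_cross_entropy_with_logits(logits, gold, reduction="sum")
              for logits, gold in zip(encoder_logits, history_intents))
    return main_loss + aux
```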

2.2 Speaker Role Modeling

In a dialogue, there are at least two roles communicating with each other, and each individual has his/her own goal and speaking habit. For example, the tourists have their own desired touring goals, while the guides try to provide sufficient touring information for suggestions and assistance. Prior work usually ignored the speaker role information or only modeled a single speaker's history for various tasks (Chen et al., 2016c; Yang et al., 2017). The performance may be degraded due to the possibly unstable and noisy input feature space. To address this issue, this work proposes the role-based contextual model: instead of using only a single BLSTM model for the whole history, we construct one individual contextual module for each speaker role. Each role-dependent recurrent unit BLSTM_role_x receives corresponding inputs x_{i,role_x} (i = 1, ..., N) that have been processed by an encoder model, so we can rewrite (5) and (6) into (7) and (8) respectively:

v_his = BLSTM_{role_a}(intent_{t,role_a}) + BLSTM_{role_b}(intent_{t,role_b}),  (7)
v_his = BLSTM_{role_a}(CNN(utt_{t,role_a})) + BLSTM_{role_b}(CNN(utt_{t,role_b})).  (8)

Therefore, each role-based contextual module focuses on modeling the role-dependent goal and speaking style, and v_cur from (2) is able to carry role-based contextual information.
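A sketch of Eq. (8) with one BLSTM per role and a shared (tied) sentence encoder; the analogous Eq. (7) variant would feed the intent vectors instead of encoded utterances:

```python
import torch
import torch.nn as nn

class RoleBasedContext(nn.Module):
    # Sketch of Eq. (8): a separate BLSTM per speaker role over that role's
    # encoded history utterances; the two role summaries are summed into v_his.
    # feat_dim must match the shared sentence encoder's output size.
    def __init__(self, encoder, feat_dim, hid_dim=128):
        super().__init__()
        self.encoder = encoder  # weights tied across both roles
        self.blstm_user = nn.LSTM(feat_dim, hid_dim, bidirectional=True, batch_first=True)
        self.blstm_agent = nn.LSTM(feat_dim, hid_dim, bidirectional=True, batch_first=True)

    def summarize(self, blstm, utts):
        # utts: (batch, turns, seq_len, emb_dim) history utterances of one role
        b, t = utts.shape[:2]
        feats = self.encoder(utts.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = blstm(feats)
        return torch.cat([h_n[0], h_n[1]], dim=-1)

    def forward(self, user_utts, agent_utts):
        # Eq. (8): v_his = BLSTM_role_a(CNN(utt_a)) + BLSTM_role_b(CNN(utt_b))
        return (self.summarize(self.blstm_user, user_utts)
                + self.summarize(self.blstm_agent, agent_utts))
```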

3 Experiments

To evaluate the effectiveness of the proposed model, we conduct LU and dialogue policy learning experiments on human-human conversational data.

3.1 Setup

The experiments are conducted on DSTC4, which consists of 35 dialogue sessions on touristic information for Singapore, collected from Skype calls between 3 tour guides and 35 tourists (Kim et al., 2016). All recorded dialogues, with a total length of 21 hours, have been manually transcribed and annotated with speech acts and semantic labels at each turn level. The speaker labels are also annotated. Human-human dialogues contain rich and complex human behaviors and bring much difficulty to all dialogue-related tasks. Given the fact that different speaker roles behave differently, DSTC4 is a suitable benchmark dataset for evaluation.

We choose mini-batch Adam as the optimizer with a batch size of 128 examples (Kingma and Ba, 2014). The size of each hidden recurrent layer is 128. We use pre-trained 200-dimensional GloVe word embeddings (Pennington et al., 2014). We apply only 30 training epochs without any early stopping. The sentence encoder is implemented as a CNN with filters of sizes [2, 3, 4], 128 filters for each size, and max pooling over time. The idea is to capture the most important feature (the highest value) for each feature map. This pooling scheme naturally deals with variable sentence lengths. Please refer to Kim (2014) for more details.

For both tasks, we focus on predicting multiple labels, including speech acts and attributes, so the evaluation metric is the average F1 score, balancing recall and precision in each utterance. Note that the final prediction may contain multiple labels.
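Under our reading of this metric (per-utterance F1 over the predicted and gold label sets, averaged over the test set; the helper names are ours), a sketch:

```python
def utterance_f1(pred: set, gold: set) -> float:
    # F1 between the predicted and gold label sets of a single utterance.
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def average_f1(preds, golds):
    # Average the per-utterance F1 scores over the whole test set.
    return sum(map(utterance_f1, preds, golds)) / len(golds)
```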

3.2 Results

The experimental results are shown in Table 1, where we report the average numbers over five runs. The first baseline (row (a)) is the best participant of DSTC4 in IWSDS 2016 (Kim et al., 2016); its poor performance is probably because tourist intents are much more difficult than guide intents (most systems achieved higher than 60% F1 for guide intents but lower than 50% for tourist intents). The second baseline (row (b)) models the current utterance without contexts, achieving 62.6% for understanding and 63.4% for policy learning.

3.2.1 Language Understanding Results

With contextual history, using ground truth semantic labels for learning history summary vectors greatly improves the performance to 68.2% (row (c)), while using natural language slightly improves the performance to 64.2% (row (e)). The reason may be that NL utterances contain more noise, so the contextual vectors are more difficult to model for LU. The proposed role-based contextual models applied to semantic labels and NL achieve 69.2% (row (d)) and 65.1% (row (f)) F1 respectively, showing significant improvement over all models without role modeling.

Model                                                       Language Understanding   Policy Learning
Baseline         (a) DSTC4-Best                                      52.1                  -
                 (b) BLSTM                                           62.6                63.4
Contextual-Sem   (c) BLSTM                                           68.2                66.8
                 (d) + Role-Based                                    69.2†               70.1†
Contextual-NL    (e) BLSTM                                           64.2                66.3
                 (f) + Role-Based                                    65.1†               66.9†
                 (g) + Role-Based w/ Intermediate Guidance           65.8†               67.4†

Table 1: Language understanding and dialogue policy learning performance in F-measure on DSTC4 (%). † indicates significant improvement compared to all methods without speaker role modeling.

Furthermore, adding the intermediate guidance acquires additional improvement (65.8% in row (g)). This shows that the semantic labels successfully guide the sentence encoder to obtain better sentence-level representations, and the history summary vector, carrying more accurate semantics, then gives better understanding performance.

3.2.2 Dialogue Policy Learning Results

To predict the guide's next actions, the baseline utilizes the intent tags of the current utterance without contexts (row (b)). Table 1 shows a similar trend to the LU results, where applying either role-based contextual models or intermediate guidance brings advantages for both semantics-encoded and NL-encoded history.

3.3 Discussion

In contrast to NL, semantic labels (intent-attribute pairs) can be seen as more explicit and concise information for modeling the history, which indeed gains more in our experiments for both LU and dialogue policy learning. The results of Contextual-Sem can be treated as the upper-bound performance, because they utilize the ground truth semantics of the contexts. Among the experiments of Contextual-NL, which are more practical because annotated semantics are not required during testing, the proposed approaches achieve 5.1% and 6.3% relative improvement over the baseline for LU and dialogue policy learning respectively.

Between the LU and dialogue policy learning tasks, most LU results are worse than the dialogue policy learning results. The reason is probably that the guide has similar behavioral patterns, such as providing information and confirming questions, while the user can have more diverse interactions. Therefore, understanding the user intents is slightly harder than predicting the guide policy in the DSTC4 dataset.

With the promising improvement in both LU and dialogue policy learning, the idea of modeling speaker role information can be further extended to various research topics in the future.

4 Conclusion

This paper proposes an end-to-end role-based contextual model that automatically learns speaker-specific contextual encoding. Experiments on a benchmark multi-domain human-human dialogue dataset show that our role-based model achieves impressive improvements in language understanding and dialogue policy learning, demonstrating that different speaker roles behave differently and focus on different goals.

Acknowledgements

We would like to thank the reviewers for their insightful comments on the paper. The authors are supported by the Ministry of Science and Technology of Taiwan, Google Research, Microsoft Research, and MediaTek Inc.

References

Anshuman Bhargava, Asli Celikyilmaz, Dilek Hakkani-Tur, and Ruhi Sarikaya. 2013. Easy contextual intent prediction and slot detection. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pages 8337-8341.

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In ICLR.

Po-Chun Chen, Ta-Chung Chi, Shang-Yu Su, and Yun-Nung Chen. 2017. Dynamic time-aware attention to speaker roles and contexts for spoken language understanding. In Proceedings of the 2017 IEEE Workshop on Automatic Speech Recognition and Understanding.

Yun-Nung Chen, Dilek Hakkani-Tur, Gokhan Tur, Asli Celikyilmaz, Jianfeng Gao, and Li Deng. 2016a. Syntax or semantics? Knowledge-guided joint semantic frame parsing. In Proceedings of the 2016 IEEE Spoken Language Technology Workshop. IEEE, pages 348-355.

Yun-Nung Chen, Dilek Hakkani-Tur, Gokhan Tur, Asli Celikyilmaz, Jianfeng Gao, and Li Deng. 2016b. Knowledge as a teacher: Knowledge-guided structural attention networks. arXiv preprint arXiv:1609.03286.

Yun-Nung Chen, Dilek Hakkani-Tur, Gokhan Tur, Jianfeng Gao, and Li Deng. 2016c. End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. In Proceedings of the 17th Annual Meeting of the International Speech Communication Association. pages 3245-3249.

Yun-Nung Chen, Ming Sun, Alexander I. Rudnicky, and Anatole Gershman. 2015. Leveraging behavioral patterns of mobile applications for personalized spoken language understanding. In Proceedings of the 17th ACM International Conference on Multimodal Interaction. ACM, pages 83-86.

Dilek Hakkani-Tur, Gokhan Tur, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Proceedings of the 17th Annual Meeting of the International Speech Communication Association. pages 715-719.

Seokhwan Kim, Luis Fernando D'Haro, Rafael E. Banchs, Jason D. Williams, and Matthew Henderson. 2016. The fourth dialog state tracking challenge. In Proceedings of the 7th International Workshop on Spoken Dialogue Systems.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In Proceedings of the 8th International Joint Conference on Natural Language Processing.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP. volume 14, pages 1532-1543.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673-2681.

Yangyang Shi, Kaisheng Yao, Hu Chen, Yi-Cheng Pan, Mei-Yuh Hwang, and Baolin Peng. 2015. Contextual spoken language understanding using recurrent neural networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pages 5271-5275.

Ming Sun, Yun-Nung Chen, and Alexander I. Rudnicky. 2016. An intelligent assistant for high-level task understanding. In Proceedings of the 21st Annual Meeting of the Intelligent Interfaces Community. pages 169-174.

Gokhan Tur and Renato De Mori. 2011. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. John Wiley & Sons.

Zhangyang Wang, Yingzhen Yang, Shiyu Chang, Qing Ling, and Thomas S. Huang. 2016. Learning a deep ℓ∞ encoder for hashing. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI. pages 2174-2180.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In Proceedings of the International Conference on Learning Representations.

Puyang Xu and Ruhi Sarikaya. 2014. Contextual domain classification in spoken language understanding systems using recurrent neural network. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pages 136-140.

Xuesong Yang, Yun-Nung Chen, Dilek Hakkani-Tur, Paul Crook, Xiujun Li, Jianfeng Gao, and Li Deng. 2017. End-to-end joint learning of natural language understanding and dialogue manager. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pages 5690-5694.