
Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory

Hao Zhou†, Minlie Huang†∗, Tianyang Zhang†, Xiaoyan Zhu†, Bing Liu‡
†State Key Laboratory of Intelligent Technology and Systems, National Laboratory for Information Science and Technology,

Dept. of Computer Science and Technology, Tsinghua University, Beijing 100084, PR China
‡Dept. of Computer Science, University of Illinois at Chicago, Chicago, Illinois, USA

[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Perception and expression of emotion are key factors to the success of dialogue systems or conversational agents. However, this problem has not been studied in large-scale conversation generation so far. In this paper, we propose Emotional Chatting Machine (ECM) that can generate appropriate responses not only in content (relevant and grammatical) but also in emotion (emotionally consistent). To the best of our knowledge, this is the first work that addresses the emotion factor in large-scale conversation generation. ECM addresses the factor using three new mechanisms that respectively (1) model the high-level abstraction of emotion expressions by embedding emotion categories, (2) capture the change of implicit internal emotion states, and (3) use explicit emotion expressions with an external emotion vocabulary. Experiments show that the proposed model can generate responses appropriate not only in content but also in emotion.

Introduction

As a vital part of human intelligence, emotional intelligence is defined as the ability to perceive, integrate, understand, and regulate emotions (Mayer and Salovey 1997). It has been a long-term goal of artificial intelligence to enable a machine to understand affect and emotion (Picard and Picard 1997). To create a chatbot capable of communicating with a user at the human level, it is necessary to equip the machine with the ability to perceive and express emotions.

Existing studies show that addressing affect and emotion in dialogue systems or conversational agents can enhance user satisfaction (Prendinger, Mori, and Ishizuka 2005) and lead to fewer breakdowns in dialogues (Martinovski and Traum 2003). Some initial work has also been done on adjusting dialogue behaviors to suit users' emotional states (Polzin and Waibel 2000), and on generating responses to users' utterances at both the content- and affect-related levels (Skowron 2010).

However, these studies, mostly inspired by psychology findings, are either rule-based or limited to small-scale data. Recently, neural models trained on large-scale data have advanced open-domain conversation generation significantly (Ritter, Cherry, and Dolan 2011; Vinyals and Le 2015; Shang, Lu, and Li 2015; Serban et al. 2016). Most of these models aim to improve the content quality of conversation generation (Gu et al. 2016; Li et al. 2016a; Xing et al. 2017; Mou et al. 2016; Li et al. 2016b). To the best of our knowledge, the emotion factor has not been addressed in existing neural models for conversation generation. Table 1 shows some examples with/without considering emotions. We can see that our ECM model is emotionally involved and more empathetic.

∗Corresponding author: Minlie Huang, [email protected]
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

User: Worst day ever. I arrived late because of the traffic.
Basic Seq2Seq: You were late.
ECM (Like): I am always here to support you.
ECM (Happy): Keep smiling! Things will get better.
ECM (Sad): It's depressing.
ECM (Disgust): Sometimes life just sucks.
ECM (Angry): The traffic is too bad!

Table 1: Conversations with/without considering emotion.

There are several challenges in addressing the emotion factor in large-scale conversation generation. First, high-quality emotion-labeled data are difficult to obtain in a large-scale corpus, as emotion annotation is a fairly subjective task and emotion classification is also challenging. Second, it is difficult to consider emotions in a natural and coherent way because we need to balance grammaticality and expressions of emotions, as argued in (Ghosh et al. 2017). Last, simply embedding emotion information in existing neural models, as shown in our experiments, cannot produce desirable emotional responses but just hard-to-perceive general expressions (which contain only common words that are quite implicit or ambiguous about emotions, and amount to 73.7% of all emotional responses in our dataset).

In this paper, we address the problem of generating emotional responses in open-domain conversational systems and propose an emotional chatting machine (ECM for short). To obtain large-scale emotion-labeled data for ECM, we train a neural classifier on a manually annotated corpus. The classifier is used to annotate large-scale conversation data automatically for the training of ECM. To express emotion naturally and coherently in a sentence, we design a sequence-to-sequence generation model equipped with new mechanisms for emotion expression generation, namely, emotion category embedding for capturing high-level abstraction of emotion expressions, an internal emotion state for balancing grammaticality and emotion dynamically, and an external emotion memory to help generate more explicit and unambiguous emotional expressions.

arXiv:1704.01074v4 [cs.CL] 1 Jun 2018

In summary, this paper makes the following contributions:

• It proposes to address the emotion factor in large-scale conversation generation. To the best of our knowledge, this is the first work on the topic.

• It proposes an end-to-end framework (called ECM) to incorporate the emotion influence in large-scale conversation generation. It has three novel mechanisms: emotion category embedding, an internal emotion memory, and an external memory.

• It shows that ECM can generate responses with higher content and emotion scores than the traditional seq2seq model. We believe that future work such as the empathetic computer agent and the emotion interaction model can be carried out based on ECM.

Related Work

In human-machine interactions, the ability to detect signs of human emotions and to properly react to them can enrich communication. For example, display of empathetic emotional expressions enhanced users' performance (Partala and Surakka 2004), and led to an increase in user satisfaction (Prendinger, Mori, and Ishizuka 2005). Experiments in (Prendinger and Ishizuka 2005) showed that an empathetic computer agent can contribute to a more positive perception of the interaction. In (Martinovski and Traum 2003), the authors showed that many breakdowns could be avoided if the machine was able to recognize the emotional state of the user and responded to it sensitively. The work in (Polzin and Waibel 2000) presented how dialogue behaviors can be adjusted to users' emotional states. Skowron (2010) proposed conversational systems, called affect listeners, that can respond to users' utterances at both the content- and affect-related levels.

These works, mainly inspired by psychological findings, are either rule-based or limited to small data, making them difficult to apply to large-scale conversation generation. Recently, sequence-to-sequence generation models (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014) have been successfully applied to large-scale conversation generation (Vinyals and Le 2015), including the neural responding machine (Shang, Lu, and Li 2015), hierarchical recurrent models (Serban et al. 2015), and many others. These models focus on improving the content quality of the generated responses, including diversity promotion (Li et al. 2016a), considering additional information (Xing et al. 2017; Mou et al. 2016; Li et al. 2016b; Herzig et al. 2017), and handling unknown words (Gu et al. 2016).

However, no work has addressed the emotion factor in large-scale conversation generation. There are several studies that generate text from controllable variables. (Hu et al. 2017) proposed a generative model which can generate sentences conditioned on certain attributes of the language such as sentiment and tenses. The Affect Language Model was proposed in (Ghosh et al. 2017) to generate text conditioned on context words and affect categories. (Cagan, Frank, and Tsarfaty 2017) incorporated grammar information to generate comments for a document using sentiment and topics. Our work is different in two main aspects: 1) prior studies are heavily dependent on linguistic tools or customized parameters in text generation, while our model is fully data-driven without any manual adjustment; 2) prior studies are unable to model multiple emotion interactions between the input post and the response; instead, the generated text simply continues the emotion of the leading context.

Emotional Chatting Machine

Background: Encoder-decoder Framework

Our model is based on the encoder-decoder framework of the general sequence-to-sequence (seq2seq for short) model (Sutskever, Vinyals, and Le 2014). It is implemented with gated recurrent units (GRU) (Cho et al. 2014; Chung et al. 2014). The encoder converts the post sequence X = (x1, x2, · · · , xn) to hidden representations h = (h1, h2, · · · , hn), which is defined as:

ht = GRU(ht−1, xt). (1)

The decoder takes as input a context vector ct and the embedding of a previously decoded word e(yt−1) to update its state st using another GRU:

st = GRU(st−1, [ct; e(yt−1)]), (2)

where [ct; e(yt−1)] is the concatenation of the two vectors, serving as the input to the GRU cell. The context vector ct is designed to dynamically attend on key information of the input post during decoding (Bahdanau, Cho, and Bengio 2014). Once the state vector st is obtained, the decoder generates a token by sampling from the output probability distribution ot computed from the decoder's state st as follows:

yt ∼ ot = P(yt | y1, y2, · · · , yt−1, ct) (3)
        = softmax(Wost). (4)
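The decoding step of Eqs. 1–4 can be sketched in a few lines of NumPy. The dimensions, random weights, and the fixed context vector are illustrative stand-ins (the paper computes ct with attention); this is a minimal sketch of one GRU decoder step, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gru_cell(h_prev, x, W):
    """One GRU step, as in Eqs. 1-2. W holds update/reset/candidate weights."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W["z"] @ hx)            # update gate
    r = sigmoid(W["r"] @ hx)            # reset gate
    h_tilde = np.tanh(W["h"] @ np.concatenate([r * h_prev, x]))
    return (1 - z) * h_prev + z * h_tilde

hidden, emb, vocab = 8, 4, 10           # toy sizes
W = {k: rng.normal(scale=0.1, size=(hidden, hidden + emb)) for k in ("z", "r", "h")}
W_o = rng.normal(scale=0.1, size=(vocab, hidden))

# One decoder step (Eqs. 2-4): update state from [c_t; e(y_{t-1})], then softmax.
c_t = rng.normal(size=emb // 2)         # stand-in for the attention context vector
e_prev = rng.normal(size=emb - emb // 2)  # stand-in for e(y_{t-1})
s_prev = np.zeros(hidden)
s_t = gru_cell(s_prev, np.concatenate([c_t, e_prev]), W)
o_t = softmax(W_o @ s_t)                # P(y_t | y_<t, c_t), Eqs. 3-4
y_t = int(np.argmax(o_t))               # greedy choice instead of sampling
```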

Task Definition and Overview

Our problem is formulated as follows: Given a post X = (x1, x2, · · · , xn) and an emotion category e of the response to be generated (explained below), the goal is to generate a response Y = (y1, y2, · · · , ym) that is coherent with the emotion category e. Essentially, the model estimates the probability:

P(Y | X, e) = ∏_{t=1}^{m} P(yt | y<t, X, e).

The emotion categories are {Angry, Disgust, Happy, Like, Sad, Other}, adopted from a Chinese emotion classification challenge task.¹

¹The taxonomy comes from http://tcci.ccf.org.cn/conference/2014/dldoc/evatask1.pdf

[Figure 1: the ECM pipeline — Encoder, Emotion Embedding, Internal Memory, External Memory, and Decoder. In training, an emotion classifier labels each response in the corpus of post-response pairs to build (post, response, emotion label) training data; at inference, a post is fed to ECM to generate a response for each emotion category.]

Figure 1: Overview of ECM (the grey unit). The pink units are used to model emotion factors in the framework.

In our problem statement, we assume that the emotion category of the to-be-generated response is given, because emotions are highly subjective. Given a post, there may be multiple emotion categories that are suitable for its response, depending on the attitude of the respondent. For example, for a sad story, someone may respond with sympathy (as a friend), someone may feel angry (as an irritable stranger), yet someone else may be happy (as an enemy). Flexible emotion interactions between a post and a response are an important difference from the previous studies (Hu et al. 2017; Ghosh et al. 2017; Cagan, Frank, and Tsarfaty 2017), which use the same emotion or sentiment for the response as that in the input post.

Thus, due to this subjectivity of emotional responses, we choose to focus on solving the core problem: generating an emotional response given a post and an emotion category of the response. Our model thus works regardless of the response emotion category. Note that there can be multiple ways to enable a chatbot to choose an emotion category for a response. One way is to give the chatbot a personality and some background knowledge. Another way is to use the training data to find the most frequent response emotion category for the emotion in the given post and use that as the response emotion. This method is reasonable as it reflects the general emotion of the people. We leave this study to our future work.
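The frequency-based strategy mentioned above can be sketched as follows. The (post emotion, response emotion) pairs are hypothetical stand-ins for the automatically labeled training data; in the real setting they would come from the emotion classifier.

```python
from collections import Counter, defaultdict

# Hypothetical labeled pairs: (post_emotion, response_emotion).
pairs = [
    ("Sad", "Like"), ("Sad", "Sad"), ("Sad", "Like"),
    ("Happy", "Happy"), ("Happy", "Happy"), ("Happy", "Like"),
    ("Angry", "Disgust"),
]

# Count response emotions per post emotion.
by_post = defaultdict(Counter)
for post_e, resp_e in pairs:
    by_post[post_e][resp_e] += 1

# Pick the most frequent response emotion for each post emotion.
preferred = {post_e: cnt.most_common(1)[0][0] for post_e, cnt in by_post.items()}
```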

Building upon the generation framework discussed in the previous section, we propose the Emotional Chatting Machine (ECM) to generate emotion expressions using three mechanisms. First, since the emotion category is a high-level abstraction of an emotion expression, ECM embeds the emotion category and feeds the emotion category embedding to the decoder. Second, we assume that during decoding there is an internal emotion state, and in order to capture the implicit change of the state and to balance the weights between the grammar state and the emotion state dynamically, ECM adopts an internal memory module. Third, an explicit expression of an emotion is modeled through an explicit selection of a generic (non-emotion) or emotion word by an external memory module.

An overview of ECM is given in Figure 1. In the training process, the corpus of post-response pairs is fed to an emotion classifier to generate the emotion label of each response, and then ECM is trained on the data of triples: posts, responses, and emotion labels of responses. In the inference process, a post is fed to ECM to generate emotional responses conditioned on different emotion categories.

Emotion Category Embedding

Since an emotion category (for instance, Angry, Disgust, Happy) provides a high-level abstraction of an emotion expression, the most intuitive approach to modeling emotion in response generation is to take as additional input the emotion category of a response to be generated. Each emotion category is represented by a real-valued, low-dimensional vector. For each emotion category e, we randomly initialize its vector ve, and then learn the vectors of the emotion categories through training. The emotion category embedding ve, along with the word embedding e(yt−1) and the context vector ct, is fed into the decoder to update the decoder's state st:

st = GRU(st−1, [ct; e(yt−1);ve]). (5)

Based on st, the decoding probability distribution can be computed accordingly by Eq. 4 to generate the next token yt.
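A minimal sketch of the emotion category embedding (Eq. 5): each category gets a trainable vector ve that is concatenated with the context vector and the previous word embedding at every decoding step. The emotion embedding size of 100 matches the paper's later experimental setup; the context size is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

emotions = ["Angry", "Disgust", "Happy", "Like", "Sad", "Other"]
emb_dim = 100  # emotion category embedding size used in the paper's experiments
# v_e for each category, randomly initialized (would be learned in training).
V = {e: rng.normal(scale=0.1, size=emb_dim) for e in emotions}

c_t = rng.normal(size=256)      # context vector (size illustrative)
e_prev = rng.normal(size=100)   # embedding of the previously decoded word
# Decoder input at step t (Eq. 5): [c_t; e(y_{t-1}); v_e]
x_t = np.concatenate([c_t, e_prev, V["Happy"]])
```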

Internal Memory

The method presented in the preceding section is rather static: the emotion category embedding will not change during the generation process, which may sacrifice grammatical correctness of sentences, as argued in (Ghosh et al. 2017). Inspired by the psychological findings that emotional responses are relatively short-lived and involve changes (Gross 1998; Hochschild 1979), and the dynamic emotion situation in emotional responses (Alam, Danieli, and Riccardi 2017), we design an internal memory module to capture the emotion dynamics during decoding. We simulate the process of expressing emotions as follows: there is an internal emotion state for each category before the decoding process starts; at each step the emotion state decays by a certain amount; once the decoding process is completed, the emotion state should decay to zero, indicating the emotion is completely expressed.

The detailed process of the internal memory module is illustrated in Figure 2. At each step t, ECM computes a read gate g^r_t with the input of the word embedding of the previously decoded word e(yt−1), the previous state of the decoder st−1, and the current context vector ct. A write gate g^w_t is computed on the decoder's state vector st. The read gate and write gate are defined as follows:

g^r_t = sigmoid(W^r_g [e(yt−1); st−1; ct]), (6)
g^w_t = sigmoid(W^w_g st). (7)

Figure 2: Data flow of the decoder with an internal memory. The internal memory M^I_{e,t} is read with the read gate g^r_t by an amount M^I_{r,t} to update the decoder's state, and the memory is updated to M^I_{e,t+1} with the write gate g^w_t.

The read and write gates are then used to read from and write into the internal memory, respectively. Hence, the emotion state is erased by a certain amount (by g^w_t) at each step. At the last step, the internal emotion state will decay to zero. This process is formally described as below:

M^I_{r,t} = g^r_t ⊗ M^I_{e,t}, (8)
M^I_{e,t+1} = g^w_t ⊗ M^I_{e,t}, (9)

where ⊗ is element-wise multiplication, r/w denotes read/write respectively, and I means Internal. The GRU updates its state st conditioned on the previous target word e(yt−1), the previous state of the decoder st−1, the context vector ct, and the emotion state update M^I_{r,t}, as follows:

st = GRU(st−1, [ct; e(yt−1); M^I_{r,t}]). (10)

Based on the state, the word generation distribution ot can be obtained with Eq. 4, and the next word yt can be sampled. After generating the next word, M^I_{e,t+1} is written back to the internal memory. Note that if Eq. 9 is executed many times, it is equivalent to continuously multiplying by the write gate, resulting in a decay effect since 0 ≤ sigmoid(·) ≤ 1. This is similar to a DELETE operation in memory networks (Miller et al. 2016).
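The decay effect of Eqs. 8–9 can be checked numerically. The decoder inputs below are random stand-ins (no real GRU is run), so this only illustrates that repeated element-wise multiplication by sigmoid gates drives the internal emotion state toward zero.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

dim = 6
M = np.ones(dim)                          # internal emotion state M^I_{e,0}
Wr = rng.normal(scale=0.5, size=(dim, 3 * dim))  # read-gate weights (Eq. 6)
Ww = rng.normal(scale=0.5, size=(dim, dim))      # write-gate weights (Eq. 7)

norms = [np.linalg.norm(M)]
for t in range(10):                       # a 10-step toy decoding
    inp = rng.normal(size=3 * dim)        # stands for [e(y_{t-1}); s_{t-1}; c_t]
    s_t = rng.normal(size=dim)            # stands for the decoder state s_t
    g_r = sigmoid(Wr @ inp)               # read gate (Eq. 6)
    g_w = sigmoid(Ww @ s_t)               # write gate (Eq. 7)
    M_read = g_r * M                      # M^I_{r,t}, fed to the GRU (Eqs. 8, 10)
    M = g_w * M                           # M^I_{e,t+1}, decayed state (Eq. 9)
    norms.append(np.linalg.norm(M))
```

Because every sigmoid value lies strictly between 0 and 1, the norm of the state shrinks at every step, which is the decay behavior the regularizer in the loss function later pushes toward zero.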

External Memory

In the internal memory module, the correlation between the change of the internal emotion state and the selection of a word is implicit and not directly observable. Since emotion expressions are quite distinct when a sentence contains emotion words (Xu et al. 2008), such as lovely and awesome, which carry strong emotions compared to generic (non-emotion) words, such as person and day, we propose an external memory module to model emotion expressions explicitly by assigning different generation probabilities to emotion words and generic words. Thus, the model can choose to generate words from an emotion vocabulary or a generic vocabulary.


Figure 3: Data flow of the decoder with an external memory. The final decoding probability is weighted between the emotion softmax and the generic softmax, where the weight is computed by the type selector.

The decoder with an external memory is illustrated in Figure 3. Given the current state of the decoder st, the emotion softmax Pe(yt = we) and the generic softmax Pg(yt = wg) are computed over the emotion vocabulary, which is read from the external memory, and the generic vocabulary, respectively. The type selector αt controls the weight of generating an emotion or a generic word. Finally, the next word yt is sampled from the next-word probability, the concatenation of the two weighted probabilities. The process can be formulated as follows:

αt = sigmoid(v_u^T st), (11)
Pg(yt = wg) = softmax(W^g_o st), (12)
Pe(yt = we) = softmax(W^e_o st), (13)
yt ∼ ot = P(yt) = [(1 − αt) Pg(yt = wg); αt Pe(yt = we)], (14)

where αt ∈ [0, 1] is a scalar to balance the choice between an emotion word we and a generic word wg, Pg/Pe is the distribution over generic/emotion words respectively, and P(yt) is the final word decoding distribution. Note that the two vocabularies have no intersection, and the final distribution P(yt) is a concatenation of the two distributions.
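A sketch of Eqs. 11–14 with toy vocabulary sizes and random weights (all illustrative): the concatenation of the two scaled softmaxes is itself a valid distribution, because (1 − α) and α split the probability mass between the two vocabularies.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden, n_generic, n_emotion = 8, 7, 3   # toy sizes; disjoint vocabularies
s_t = rng.normal(size=hidden)            # current decoder state
v_u = rng.normal(size=hidden)
W_g = rng.normal(size=(n_generic, hidden))
W_e = rng.normal(size=(n_emotion, hidden))

alpha = sigmoid(v_u @ s_t)               # type selector (Eq. 11)
P_g = softmax(W_g @ s_t)                 # generic softmax (Eq. 12)
P_e = softmax(W_e @ s_t)                 # emotion softmax (Eq. 13)
# Final distribution (Eq. 14): concatenation of the two weighted softmaxes.
P = np.concatenate([(1 - alpha) * P_g, alpha * P_e])
```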

Loss Function

The loss function is the cross entropy error between the predicted token distribution ot and the gold distribution pt in the training corpus. Additionally, we apply two regularization terms: one on the internal memory, enforcing that the internal emotion state should decay to zero at the end of decoding, and the other on the external memory, constraining the selection of an emotion or generic word. The loss on one sample <X, Y> (X = x1, x2, ..., xn, Y = y1, y2, ..., ym) is defined as:

L(θ) = − Σ_{t=1}^{m} pt log(ot) − Σ_{t=1}^{m} qt log(αt) + ‖M^I_{e,m}‖, (15)

where M^I_{e,m} is the internal emotion state at the last step m, αt is the probability of choosing an emotion word or a generic word, and qt ∈ {0, 1} is the true choice of an emotion word or a generic word in Y. The second term is used to supervise the probability of selecting an emotion or generic word, and the third term is used to ensure that the internal emotion state has been expressed completely once the generation is completed.
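Eq. 15 evaluated on one toy sample (all numbers are illustrative stand-ins for model outputs; the second term is computed exactly as Eq. 15 is written, i.e., only steps with qt = 1 contribute):

```python
import numpy as np

# Toy quantities for one training sample of length m = 3.
o = np.array([0.7, 0.6, 0.9])      # predicted probability of each gold token
alpha = np.array([0.2, 0.8, 0.1])  # P(choose an emotion word) at each step
q = np.array([0, 1, 0])            # true word type: 1 = emotion, 0 = generic
M_final = np.array([0.05, 0.02])   # internal emotion state at the last step m

ce = -np.sum(np.log(o))            # cross-entropy term
sel = -np.sum(q * np.log(alpha))   # word-type selection term, as in Eq. 15
reg = np.linalg.norm(M_final)      # ||M^I_{e,m}||, pushes the state to zero
loss = ce + sel + reg
```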

Data Preparation

Since there is no off-the-shelf data to train ECM, we first trained an emotion classifier using the NLPCC emotion classification dataset and then used the classifier to annotate the STC conversation dataset (Shang, Lu, and Li 2015) to construct our own experiment dataset. There are two steps in the data preparation process:

1. Building an Emotion Classifier. We trained several classifiers on the NLPCC dataset and then chose the best classifier for automatic annotation. This dataset was used in the emotion classification challenge tasks of NLPCC2013² and NLPCC2014³, and consists of 23,105 sentences collected from Weibo. It was manually annotated with 8 emotion categories: Angry, Disgust, Fear, Happy, Like, Sad, Surprise, and Other. After removing the infrequent classes (Fear (1.5%) and Surprise (4.4%)), we have six emotion categories, i.e., Angry, Disgust, Happy, Like, Sad, and Other.

We then partitioned the NLPCC dataset into training, validation, and test sets with the ratio of 8:1:1. Several emotion classifiers were trained on the filtered dataset, including a lexicon-based classifier (Liu 2012) (we used the emotion lexicon in (Xu et al. 2008)), RNN (Mikolov et al. 2010), LSTM (Hochreiter and Schmidhuber 1997), and Bidirectional LSTM (Bi-LSTM) (Graves, Fernandez, and Schmidhuber 2005). Results in Table 2 show that all neural classifiers outperform the lexicon-based classifier, and the Bi-LSTM classifier obtains the best accuracy of 0.623.

Method          Accuracy
Lexicon-based   0.432
RNN             0.564
LSTM            0.594
Bi-LSTM         0.623

Table 2: Classification accuracy on the NLPCC dataset.

2. Annotating STC with Emotion. We applied the best classifier, Bi-LSTM, to annotate the STC Dataset with the six emotion categories. After annotation, we obtained an emotion-labeled dataset, which we call the Emotional STC (ESTC) Dataset. The statistics of the ESTC Dataset are shown in Table 3. Although the emotion labels of the ESTC Dataset are noisy due to automatic annotation, the dataset is good enough to train the models in practice. As future work, we will study how the classification errors influence response generation.

²http://tcci.ccf.org.cn/conference/2013/
³http://tcci.ccf.org.cn/conference/2014/

Training    Posts                 217,905
            Responses   Angry     234,635
                        Disgust   689,295
                        Happy     306,364
                        Like      1,226,954
                        Sad       537,028
                        Other     1,365,371
Validation  Posts                 1,000
Test        Posts                 1,000

Table 3: Statistics of the ESTC Dataset.

Experiments

Implementation Details

We used TensorFlow⁴ to implement the proposed model⁵. The encoder and decoder have 2-layer GRU structures with 256 hidden cells for each layer and use different sets of parameters. The word embedding size is set to 100, and the vocabulary size is limited to 40,000. The embedding size of an emotion category is set to 100. The internal memory is a trainable matrix of size 6×256, and the external memory is a list of 40,000 words containing generic words and emotion words (but emotion words have different markers). To generate diverse responses, we adopted beam search in the decoding process, with the beam size set to 20, and then reranked responses by generation probability after removing those containing UNKs (unknown words).

We used the stochastic gradient descent (SGD) algorithm with mini-batches. Batch size and learning rate are set to 128 and 0.5, respectively. To accelerate the training process, we trained a seq2seq model on the STC dataset with pre-trained word embeddings, and we then trained our model on the ESTC Dataset with parameters initialized by those of the pre-trained seq2seq model. We ran 20 epochs, and the training stage of each model took about a week on a Titan X GPU machine.

Baselines

As aforementioned, this paper is the first work to address the emotion factor in large-scale conversation generation. We did not find closely related baselines in the literature. Affect-LM (Ghosh et al. 2017) cannot be our baseline because it is unable to generate responses of different emotions for the same post. Instead, it simply copies and uses the emotion of the input post. Moreover, it depends heavily on linguistic resources and needs manual parameter adjustments.

⁴https://github.com/tensorflow/tensorflow
⁵https://github.com/tuxchow/ecm

Nevertheless, we chose two suitable baselines: a general seq2seq model (Sutskever, Vinyals, and Le 2014), and an emotion category embedding model (Emb) created by us, where the emotion category is embedded into a vector and the vector serves as an input to every decoding position, similar to the idea of user embedding in (Li et al. 2016b). As an emotion category is a high-level abstraction of emotion expressions, this is a proper baseline for our model.

Automatic Evaluation

Metrics: As argued in (Liu et al. 2016), BLEU is not suitable for measuring conversation generation due to its low correlation with human judgment. We adopted perplexity to evaluate the model at the content level (whether the content is relevant and grammatical). To evaluate the model at the emotion level, we adopted emotion accuracy: the agreement between the expected emotion category (as input to the model) and the emotion category of a generated response as predicted by the emotion classifier.
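Emotion accuracy as defined above reduces to an exact-match rate between the intended category and the classifier's prediction on the generated response. The labels below are hypothetical stand-ins.

```python
# Intended emotion categories (model inputs) vs. the classifier's predictions
# on the corresponding generated responses (hypothetical values).
expected = ["Happy", "Sad", "Like", "Angry", "Disgust", "Other"]
predicted = ["Happy", "Sad", "Like", "Other", "Disgust", "Other"]

# Emotion accuracy: fraction of responses whose predicted emotion matches
# the category that was given as input.
matches = sum(e == p for e, p in zip(expected, predicted))
accuracy = matches / len(expected)
```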

Method     Perplexity  Accuracy
Seq2Seq    68.0        0.179
Emb        62.5        0.724
ECM        65.9        0.773
w/o Emb    66.1        0.753
w/o IMem   66.7        0.749
w/o EMem   61.8        0.731

Table 4: Objective evaluation with perplexity and accuracy.

Results: The results are shown in Table 4. As can be seen, ECM obtains the best performance in emotion accuracy, and its perplexity is better than Seq2Seq but worse than Emb. This may be because the loss function of ECM is supervised not only on perplexity, but also on the selection of generic or emotion words (see Eq. 15). In practice, emotion accuracy is more important than perplexity considering that the generated sentences are already fluent and grammatical at a perplexity of 68.0.

In order to investigate the influence of the different modules, we conducted ablation tests where one of the three modules was removed from ECM each time. As we can see, ECM without the external memory achieves the best performance in perplexity: by introducing the internal memory, which balances the weights between grammar and emotion dynamically, our model can generate responses without sacrificing grammaticality. After removing the external memory, the emotion accuracy decreases the most, indicating that the external memory leads to a higher emotion accuracy since it explicitly chooses emotion words. Note that the emotion accuracy of Seq2Seq is extremely low because it generates the same response for different emotion categories.

Manual Evaluation

In order to better understand the quality of the generated responses from the content and emotion perspectives, we performed manual evaluation. Given a post and an emotion category, responses generated from all the models were randomized and presented to three human annotators.

Metrics: Annotators were asked to score a response in terms of Content (rating scale: 0, 1, 2) and Emotion (rating scale: 0, 1), and also to state a preference between any two systems. Content is defined as whether the response is appropriate and natural to a post and could plausibly have been produced by a human, a widely accepted metric adopted by researchers and conversation challenge tasks, as proposed in (Shang, Lu, and Li 2015). Emotion is defined as whether the emotion expression of a response agrees with the given emotion category.

Annotation Statistics: We randomly sampled 200 posts from the test set. For each model we generated 1,200 responses in total: for Seq2Seq, we generated the top 6 responses for each post, and for Emb and ECM, we generated the top responses corresponding to the 6 emotion categories.

We calculated Fleiss' kappa (Fleiss 1971) to measure inter-rater consistency. Fleiss' kappa for Content and Emotion is 0.441 and 0.757, indicating "moderate agreement" and "substantial agreement" respectively.
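For reference, Fleiss' kappa can be computed from a per-item category-count matrix as in the following minimal sketch (an illustration of the statistic itself, not the annotation tooling used in the study):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a count matrix of shape (n_items, n_categories),
    where ratings[i][j] is the number of raters who assigned item i to
    category j; every item must be rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Per-item agreement: proportion of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the marginal category proportions.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)

    return (p_bar - p_e) / (1 - p_e)

# Three raters, binary Emotion score, perfect agreement on every item.
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))  # 1.0
```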

Method (%)  2-1   1-1   0-1   2-0   1-0   0-0
Seq2Seq      9.0   5.1   1.1  37.6  28.0  19.2
Emb         22.8   9.3   4.3  27.1  19.1  17.4
ECM         27.2  10.8   4.4  24.2  15.5  17.9

Table 5: The percentage of responses in manual evaluation with the score of Content-Emotion. For instance, 2-1 means the content score is 2 and the emotion score is 1.

Results: The results are shown in Table 6. ECM with all options outperforms the other methods in both metrics significantly (2-tailed t-test, p < 0.05 for Content, and p < 0.005 for Emotion). After incorporating the internal memory and the external memory modules, the performance of ECM in Emotion is improved compared to Emb, indicating that our model can generate more explicit expressions of emotion. Besides, the performance in Content is improved from 1.256 for Emb to 1.299 for ECM, which shows the ability of ECM to control the weight of emotion and generate responses appropriate in content.

For all emotion categories, ECM outperforms the other methods in Emotion. However, the performance of ECM in Content is worse than the baselines in the Disgust and Angry categories, due to the fact that there are not sufficient training data for these two categories. For instance, the Angry category has 234,635 responses in our ESTC Dataset, much fewer than the other categories.

To evaluate whether ECM can generate responses that are appropriate not only in content but also in emotion, we present results in Table 5 by considering content and emotion scores simultaneously.⁶ As we can see, 27.2% of the responses generated by ECM have a Content score of 2 and an Emotion score of 1, while only 22.8% for Emb and 9.0%

⁶ Note that Content and Emotion are two independent metrics.

Post: A promise is what a cheater says to a fool.
  Seq2Seq:        A promise is what a cheater says to a fool.
  ECM (Like):     Promise is a single-minded fool
  ECM (Happy):    I'm just a happy fool
  ECM (Sad):      I am a sad fool
  ECM (Disgust):  A cheater says to a fool
  ECM (Angry):    Whatever you believe or not, I trust in it!

Post: Knowing the truth, I laughed myself to tears
  Seq2Seq:        After knowing the truth, I burst into tears.
  ECM (Like):     Too lovely, it's too lovely.
  ECM (Happy):    I can't help laughing myself to tears.
  ECM (Sad):      After knowing the truth, I burst into tears.
  ECM (Disgust):  Is this true?
  ECM (Angry):    What do you mean? I don't understand.

Figure 4: Sample responses generated by Seq2Seq and ECM (English translations of the original Chinese; the colored words in the original figure are the emotion words corresponding to the given emotion category). The corresponding posts did not appear in the training set.

Method    Overall        Like           Sad            Disgust        Angry          Happy
          Cont.  Emot.   Cont.  Emot.   Cont.  Emot.   Cont.  Emot.   Cont.  Emot.   Cont.  Emot.
Seq2Seq   1.255  0.152   1.308  0.337   1.270  0.077   1.285  0.038   1.223  0.052   1.223  0.257
Emb       1.256  0.363   1.348  0.663   1.337  0.228   1.272  0.157   1.035  0.162   1.418  0.607
ECM       1.299  0.424   1.460  0.697   1.352  0.313   1.233  0.193   0.980  0.217   1.428  0.700

Table 6: Manual evaluation of the generated responses in terms of Content (Cont.) and Emotion (Emot.).

for Seq2Seq. These indicate that ECM is better at generating high-quality responses in both content and emotion.

Pref. (%)  Seq2Seq  Emb   ECM
Seq2Seq    -        38.8  38.6
Emb        60.2     -     43.1
ECM        61.4     56.9  -

Table 7: Pairwise preference of the three systems.

Preference Test: In addition, the emotion models (Emb and ECM) are much more preferred than Seq2Seq, and ECM is also significantly (2-tailed t-test, p < 0.001) preferred by annotators over the other methods, as shown in Table 7. The diverse emotional responses are more attractive to users than the generic responses generated by the Seq2Seq model. With explicit expressions of emotion as well as appropriateness in content, ECM is much more preferred.

Analysis of Emotion Interaction and Case Study

Figure 5 visualizes the emotion interaction patterns of the posts and responses in the ESTC Dataset. An emotion interaction pattern (EIP) is defined as <ep, er>, the pair of emotion categories of the post and its response. The value of an EIP is the conditional probability P(er|ep) = P(er, ep)/P(ep). An EIP marked with a darker color occurs more frequently than one marked with a lighter color.

Figure 5: Visualization of emotion interaction. [Heatmap over the post and response categories Other, Like, Sad, Disgust, Angry, and Happy; the color scale ranges from 0.05 to 0.35.]

From the figure, we can make a few observations. First, frequent EIPs show that there are some major responding emotions given a post emotion category. For instance, when a post expresses Happy, the responding emotion is typically Like or Happy. Second, the diagonal patterns indicate emotional empathy, a common type of emotion interaction. Third, there are also other EIPs for a post, indicating that emotion interactions in conversation are quite diverse, as mentioned earlier. Note that class Other has much more data than the other classes (see Table 3), indicating that EIPs are biased toward this class (the first column of Figure 5), due to the data bias and the emotion classification errors.
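The EIP value P(er|ep) = P(er, ep)/P(ep) reduces to a simple count ratio over the classified post-response pairs, as in this minimal sketch (the pairs below are toy data, not the ESTC statistics):

```python
from collections import Counter

def eip_probabilities(emotion_pairs):
    """Estimate P(e_r | e_p) for each emotion interaction pattern
    <e_p, e_r> from (post_emotion, response_emotion) pairs."""
    pair_counts = Counter(emotion_pairs)          # counts of <e_p, e_r>
    post_counts = Counter(ep for ep, _ in emotion_pairs)  # counts of e_p
    return {
        (ep, er): count / post_counts[ep]
        for (ep, er), count in pair_counts.items()
    }

# Toy pairs, for illustration only.
pairs = [("Happy", "Like"), ("Happy", "Happy"),
         ("Happy", "Like"), ("Sad", "Sad")]
probs = eip_probabilities(pairs)
print(probs[("Happy", "Like")])  # 2/3
print(probs[("Sad", "Sad")])     # 1.0
```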

We present some examples in Figure 4. As can be seen, for a given post, there are multiple emotion categories that are suitable for its response in conversation. Seq2Seq generates a response with a random emotion, while ECM can generate emotional responses conditioned on every emotion category. All these responses are appropriate to the post, indicating the existence of multiple EIPs and the reason why an emotion category should be specified as an input to our system.

We can see that ECM can generate appropriate responses if the pre-specified emotion category and the emotion of the post belong to one of the frequent EIPs. Colored words show that ECM can explicitly express emotion by applying the external memory, which can choose a generic (non-emotion) or emotion word during decoding. For low-frequency EIPs such as <Happy, Disgust> and <Happy, Angry>, as shown in the last two lines of Figure 4, responses are not appropriate to the emotion category due to the lack of training data and/or errors made by the emotion classifier.

Conclusion and Future Work

In this paper, we proposed the Emotional Chatting Machine (ECM) to model the emotion influence in large-scale conversation generation. Three mechanisms were proposed to model the emotion factor: emotion category embedding, internal emotion memory, and external memory. Objective and manual evaluations show that ECM can generate responses appropriate not only in content but also in emotion.

In our future work, we will explore emotion interactions with ECM: instead of specifying an emotion class, the model should decide the most appropriate emotion category for the response. However, this may be challenging, since such a task depends on the topics, contexts, or the mood of the user.

Acknowledgments

This work was partly supported by the National Science Foundation of China under grant No. 61272227/61332007, and a joint project with Sogou. We would like to thank our collaborators, Jingfang Xu and Haizhou Zhao.

References

[Alam, Danieli, and Riccardi 2017] Alam, F.; Danieli, M.; and Riccardi, G. 2017. Annotating and modeling empathy in spoken conversations. CoRR abs/1705.04839.

[Bahdanau, Cho, and Bengio 2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.

[Cagan, Frank, and Tsarfaty 2017] Cagan, T.; Frank, S. L.; and Tsarfaty, R. 2017. Data-driven broad-coverage grammars for opinionated natural language generation (ONLG). In ACL, volume 1, 1331–1341.

[Cho et al. 2014] Cho, K.; Van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078.

[Chung et al. 2014] Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555.

[Fleiss 1971] Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5):378.

[Ghosh et al. 2017] Ghosh, S.; Chollet, M.; Laksana, E.; Morency, L.; and Scherer, S. 2017. Affect-LM: A neural language model for customizable affective text generation. In ACL, 634–642.

[Graves, Fernandez, and Schmidhuber 2005] Graves, A.; Fernandez, S.; and Schmidhuber, J. 2005. Bidirectional LSTM networks for improved phoneme classification and recognition. In ICANN, 799–804. Springer.

[Gross 1998] Gross, J. J. 1998. The emerging field of emotion regulation: An integrative review. Review of General Psychology 2(3):271.

[Gu et al. 2016] Gu, J.; Lu, Z.; Li, H.; and Li, V. O. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In ACL, 1631–1640.

[Herzig et al. 2017] Herzig, J.; Shmueli-Scheuer, M.; Sandbank, T.; and Konopnicki, D. 2017. Neural response generation for customer service based on personality traits. In Proceedings of the 10th International Conference on Natural Language Generation, 252–256.

[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

[Hochschild 1979] Hochschild, A. R. 1979. Emotion work, feeling rules, and social structure. American Journal of Sociology 551–575.

[Hu et al. 2017] Hu, Z.; Yang, Z.; Liang, X.; Salakhutdinov, R.; and Xing, E. P. 2017. Toward controlled generation of text. In ICML, 1587–1596.

[Li et al. 2016a] Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, B. 2016a. A diversity-promoting objective function for neural conversation models. In NAACL, 110–119.

[Li et al. 2016b] Li, J.; Galley, M.; Brockett, C.; Spithourakis, G.; Gao, J.; and Dolan, W. B. 2016b. A persona-based neural conversation model. In ACL, 994–1003.

[Liu et al. 2016] Liu, C.; Lowe, R.; Serban, I.; Noseworthy, M.; Charlin, L.; and Pineau, J. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP, 2122–2132.

[Liu 2012] Liu, B. 2012. Sentiment analysis and opinion mining. Morgan & Claypool Publishers.

[Martinovski and Traum 2003] Martinovski, B., and Traum, D. 2003. Breakdown in human-machine interaction: the error is the clue. In Proceedings of the ISCA Tutorial and Research Workshop, 11–16.

[Mayer and Salovey 1997] Mayer, J. D., and Salovey, P. 1997. What is emotional intelligence? Emotional Development and Emotional Intelligence 3–31.

[Mikolov et al. 2010] Mikolov, T.; Karafiat, M.; Burget, L.; Cernocky, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In Interspeech, volume 2, 3.

[Miller et al. 2016] Miller, A. H.; Fisch, A.; Dodge, J.; Karimi, A.; Bordes, A.; and Weston, J. 2016. Key-value memory networks for directly reading documents. In EMNLP, 1400–1409.

[Mou et al. 2016] Mou, L.; Song, Y.; Yan, R.; Li, G.; Zhang, L.; and Jin, Z. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. In COLING, 3349–3358.

[Partala and Surakka 2004] Partala, T., and Surakka, V. 2004. The effects of affective interventions in human–computer interaction. Interacting with Computers 16(2):295–309.

[Picard and Picard 1997] Picard, R. W., and Picard, R. 1997. Affective computing, volume 252. MIT Press, Cambridge.

[Polzin and Waibel 2000] Polzin, T. S., and Waibel, A. 2000. Emotion-sensitive human-computer interfaces. In ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, 201–206.

[Prendinger and Ishizuka 2005] Prendinger, H., and Ishizuka, M. 2005. The empathic companion: A character-based interface that addresses users' affective states. Applied Artificial Intelligence 19(3-4):267–285.

[Prendinger, Mori, and Ishizuka 2005] Prendinger, H.; Mori, J.; and Ishizuka, M. 2005. Using human physiology to evaluate subtle expressivity of a virtual quizmaster in a mathematical game. International Journal of Human-Computer Studies 62(2):231–245.

[Ritter, Cherry, and Dolan 2011] Ritter, A.; Cherry, C.; and Dolan, W. B. 2011. Data-driven response generation in social media. In EMNLP, 583–593.

[Serban et al. 2015] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A. C.; and Pineau, J. 2015. Hierarchical neural network generative models for movie dialogues. CoRR abs/1507.04808.

[Serban et al. 2016] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A. C.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, 3776–3784.

[Shang, Lu, and Li 2015] Shang, L.; Lu, Z.; and Li, H. 2015. Neural responding machine for short-text conversation. In ACL, 1577–1586.

[Skowron 2010] Skowron, M. 2010. Affect listeners: Acquisition of affective states by means of conversational systems. In Development of Multimodal Interfaces: Active Listening and Synchrony. Springer. 169–181.

[Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.

[Vinyals and Le 2015] Vinyals, O., and Le, Q. 2015. A neural conversational model. CoRR abs/1506.05869.

[Xing et al. 2017] Xing, C.; Wu, W.; Wu, Y.; Liu, J.; Huang, Y.; Zhou, M.; and Ma, W. 2017. Topic aware neural response generation. In AAAI, 3351–3357.

[Xu et al. 2008] Xu, L.; Lin, H.; Pan, Y.; Ren, H.; and Chen, J. 2008. Affective lexicon ontology. Journal of Information 27(2):180–185.