


Proceedings of the 2019 EMNLP and the 9th IJCNLP (System Demonstrations), pages 145–150, Hong Kong, China, November 3–7, 2019. © 2019 Association for Computational Linguistics


Memory Grounded Conversational Reasoning

Seungwhan Moon, Pararth Shah, Anuj Kumar, Rajen Subba
Facebook Assistant

{shanemoon, pararths, anujk, rasubba}@fb.com

Abstract

We demonstrate a conversational system which engages the user through a multi-modal, multi-turn dialog over the user's memories. The system can perform QA over memories by responding to user queries to recall specific attributes and associated media (e.g. photos) of past episodic memories. The system can also make proactive suggestions to surface related events or facts from past memories to make conversations more engaging and natural. To implement such a system, we collect a new corpus of memory grounded conversations, which comprises human-to-human role-playing dialogs given synthetic memory graphs with simulated attributes. Our proof-of-concept system operates on these synthetic memory graphs; however, it can be trained and applied to real-world user memory data (e.g. photo albums). We present the architecture of the proposed conversational system, and example queries that the system supports.

1 Introduction

In the last few decades, people have been storing an increasing amount of their life's memories in the form of digital multimedia, e.g. photos, videos and textual posts. Retrieving one's memories from these memory banks and reminiscing about events from one's personal and professional life is a prevalent desire among many users. Traditionally, the interfaces to access these memories are either (i) keyword based search systems which demand specific keyword combinations to identify and retrieve the correct memories, or (ii) catalog based browsing systems that allow scrolling through memories across a single dimension, most commonly, time of creation. However, we posit that a more natural way of interacting with one's memories is through a flexible interface that can support fuzzy queries by referencing memories through various attributes, such as events, people, locations or activities associated with them, and that can enable the user to explore other relevant memories connected through one of many dimensions, e.g. same group of people, same location, etc., thereby not being restricted to browsing only temporally adjacent memories.

Figure 1: Memory Grounded Conversational Reasoning between a user and the assistant with a parallel (a) dialog and (b) memory graph pair. Dialog transitions can be captured as walks over a memory graph. (c) Some of the memories can be inferred from photos or other media, which are surfaced to the user when they are available and relevant.

We present a conversational system that provides a natural interface for retrieving and browsing through one's memories. The system supports conversational QA capability for open-ended querying of memories, and also supports the ability to proactively surface related memories that the user would naturally be interested in consuming. An important element of such an open-ended dialog system is its ability to ground conversations with past memories of users, making the interactions more personal and engaging.



Figure 2: Memory Walker Chatbot UI for memory grounded conversations between a user and the assistant.

Figure 1 shows an example interaction supported by the demonstrated system, spanning multiple episodic memories that are represented as a graph composed of memory nodes and related entity nodes connected via relational edges. The example above shows three key novel features of the demonstrated system: 1) the ability to query a personal database to answer various complex user queries (memory recall QA), 2) surfacing the photos most relevant to the dialog, and 3) identifying other memories to surface that are relevant to conversational contexts, resulting in increased engagement and coherent interactions.
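The memory graph described above can be represented with a simple adjacency structure. The following is a minimal sketch; the node names, attribute keys, and relation labels are illustrative assumptions, not the paper's actual schema:

```python
# Minimal sketch of a memory graph: memory nodes and entity nodes
# connected by labeled relational edges. All names are illustrative.
class MemoryGraph:
    def __init__(self):
        self.nodes = {}   # node_id -> {"type": ..., "attrs": {...}}
        self.edges = {}   # node_id -> list of (relation, node_id)

    def add_node(self, node_id, node_type, **attrs):
        self.nodes[node_id] = {"type": node_type, "attrs": attrs}
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, relation, dst):
        # Edges are stored in both directions so that dialog transitions
        # can walk the graph from either endpoint.
        self.edges.setdefault(src, []).append((relation, dst))
        self.edges.setdefault(dst, []).append((relation + "_inv", src))

    def neighbors(self, node_id, relation=None):
        return [dst for rel, dst in self.edges.get(node_id, [])
                if relation is None or rel == relation]

# Example: a skiing memory linked to a participant and a location.
g = MemoryGraph()
g.add_node("m1", "memory", activity="skiing", time="Feb 2017")
g.add_node("p1", "person", name="Emily")
g.add_node("l1", "location", name="Lake Tahoe")
g.add_edge("m1", "participant", "p1")
g.add_edge("m1", "location", "l1")
```

A dialog transition such as "who was there with me?" then corresponds to a one-hop walk along the `participant` edge of the active memory node.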

Our system consists of several components. First, we use the Memory Graph Networks (MGN) model, which learns natural graph paths among episodic memory nodes, conditioned over dialog contexts. MGN can introduce new memory nodes relevant to conversational contexts when memory nodes are activated as a result of graph walks. Second, the QA module takes as input user query utterances and infers correct answers given candidate memory graph nodes activated with the MGN model. Specifically, we utilize multiple sub-module nets such as CHOOSE, COUNT, etc., to support discrete reasoning questions that cannot be handled directly via graph networks. Finally, the photo recommender module uses the MGN graph node embeddings and the generated attention scores to retrieve the most relevant photos for each dialog response.
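At a high level, the three components form a pipeline. The sketch below illustrates only the data flow; the function bodies are stubs standing in for the learned models, and the names and signatures are assumptions for illustration, not the authors' actual API:

```python
# Illustrative data flow over the three components described above:
# MGN graph walk -> QA over activated nodes -> photo recommendation.
# All functions are stand-ins for the trained neural models.

def mgn_walk(query, memory_slots):
    """Stand-in for MGN: expand the initial memory slots by walking
    the graph, returning activated nodes with attention scores."""
    return [(node, 1.0 / (i + 1)) for i, node in enumerate(memory_slots)]

def qa_module(query, activated):
    """Stand-in for the QA module: answer with the top-attended node."""
    return max(activated, key=lambda pair: pair[1])[0]

def photo_recommender(activated, photos_by_node, threshold=0.5):
    """Surface photos of nodes whose attention clears a threshold."""
    return [p for node, score in activated if score >= threshold
            for p in photos_by_node.get(node, [])]

activated = mgn_walk("When did I ski with Emily?", ["m1", "m2"])
answer = qa_module("When did I ski with Emily?", activated)
photos = photo_recommender(activated, {"m1": ["photo_1.jpg"]})
```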

Section 2 provides a detailed description of the system's user interface, method and data collection setup. Section 3 provides an analysis of demonstrations performed via the system, and Section 4 lists related work.

2 Memory Grounded Conversations

2.1 User Interface

The goal of the demonstrated system is to establish a natural user interface (UI) for interacting with memories. Figure 2 shows the UI of the demonstrated system, composed of two main sections: the media section (left) and the chat section (right; highlighted yellow: assistant, blue: user). This is a simple yet powerful interface for interacting with memories, for the following reasons:

Flexible. The UI enables natural language QA queries through text or voice input. This UI can be deployed in a desktop, web or mobile application, or on a connected home device like smart TVs. For each user memory recall query, the system provides a corresponding answer from the memory graph. In Figure 2 part 1, the user retrieves a memory related to an activity (skiing) through a textual question.



Figure 3: Memory Dialog Dataset Collection Interface, with an example. (a) The user-playing agent is provided with partial memory information to query about. (b) The assistant-playing agent has the ground-truth information about the target memory as well as all related memories. (c) The two agents generate dialogs, grounded on the synthetic memories that they are presented with.

Visual. The UI can display visual content connected to the memories, and allow for further queries into the content of the image or video. The authors' personal photos are used in the demonstration, paired with synthetically generated memory graphs.

Contextual. The system keeps track of the conversational context within a user session, allowing the user to refer to entities present in the dialog or media. In Figure 2 part 2, the user refers to the event mentioned in the previous system response and asks a further question regarding who attended the event.

Proactive. The system can insert conversational recommendations for exploring related memories, based on the system's model of which memories are naturally interesting for users to consume in a particular context. In Figure 2 part 2, the system suggests to the user other memory instances that share the same activity and set of people. The system can make the suggestions more personalized by learning, from the user's past sessions, the sequences in which users like to explore memories.

2.2 Dataset: Memory Dialog

Synthetic Memory Graph: For the proposed system, we first bootstrap the large-scale memory collection through a synthetic memory graph generator, which creates multiple artificial episodic memory graph nodes with connections to real entities appearing in common-fact KGs (e.g. locations, events, public entities). We first build a synthetic social graph of users, where each user is assigned a random interest profile as a probability distribution over the 'activity' space. We then iteratively generate a memory node and its associated attributes and entities by sampling activities, participants, locations, entities, time, etc., each from a manually defined ontology. Through this realistic, synthetically generated memory graph, we avoid the need for extracting memory graphs from other structured sources (e.g. photo albums), which are often private or limited in size. Note also that representing the memory database in a graph format allows for the flexible operations required for complex QA and conversational reasoning.

Wizard-of-Oz Setup: We collect the Memory Dialog dataset in a Wizard-of-Oz setting (Shah et al., 2018) by connecting two crowd-workers to engage in a role-playing chat session either as a user or an assistant, with the joint goal of creating natural and engaging dialogs (Figure 3). The user-playing agent is given a memory node from the synthetic memory graph with some of the attributes hidden, and is asked to initiate a conversation about those missing attributes to simulate a memory recall query. The assistant-playing agent is provided with a set of memories and photos relevant to the user's questions, and is instructed to use one or more of these memories to answer the question and frame a free-form conversational response. In addition, the assistant agent is encouraged to proactively bring up any other memories relevant to conversational contexts that would make the interaction more engaging and interesting (e.g. memories with the same group of people, at the same location as the reference memory, etc.). The two agents continue this process to explore the given memory graph, until one of the agents decides to end the conversation.

Figure 4: Overall architecture of the demonstrated system. Candidate memory nodes m = {m^(k)} are provided as input memory slots for each query q. The Memory Graph Network then traverses the memory graph to expand the initial memory slots and activate other relevant entity and memory nodes. The output paths of MGN are then used to trigger proactive memory reference, if relevant. The Answer Module executes the predicted neural programs to decode answers given intermediate network outputs. The photo recommender is then called to retrieve relevant photos (e.g. photos of the reference memory that includes the answer to a query).
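The synthetic memory graph generation described above can be sketched as follows. The ontology values are invented for illustration; the paper's actual ontology is manually defined and not published in this section:

```python
import random

# Sketch of the synthetic memory generator: each user gets an interest
# profile (a probability distribution over activities), and memory
# nodes are sampled attribute-by-attribute from a manually defined
# ontology. All ontology values below are invented examples.
ONTOLOGY = {
    "activity": ["skiing", "hiking", "movie", "baseball"],
    "location": ["Lake Tahoe", "AMC Theatre", "Dodger Stadium"],
    "participants": ["Emma", "Emily", "John", "Mia"],
}

def make_interest_profile(rng):
    # Random weights over the activity space, normalized to sum to 1.
    weights = [rng.random() for _ in ONTOLOGY["activity"]]
    total = sum(weights)
    return [w / total for w in weights]

def sample_memory(rng, profile):
    # Sample an activity according to the user's interest profile,
    # then fill in the remaining attributes from the ontology.
    activity = rng.choices(ONTOLOGY["activity"], weights=profile)[0]
    return {
        "activity": activity,
        "location": rng.choice(ONTOLOGY["location"]),
        "participants": rng.sample(ONTOLOGY["participants"], k=2),
        "time": (rng.randint(2014, 2019), rng.randint(1, 12)),
    }

rng = random.Random(0)
profile = make_interest_profile(rng)
memories = [sample_memory(rng, profile) for _ in range(5)]
```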

2.3 Method

Figure 4 illustrates the overall architecture and the model components of the demonstrated system.

Input Module: For a given query q, its relevant memory nodes m = {m^(k)}_{k=1}^{K} for slot size K are provided as initial memory slots via graph searches. The Query Encoder then encodes the input query with a language model. Specifically, we represent each textual query with a state-of-the-art attention-based Bi-LSTM language model (Conneau et al., 2017). The Memory Encoder then encodes each memory slot based on both its structural features (graph embeddings) and contextual multi-modal features from its neighboring nodes (e.g. attribute values). We construct memory graph embeddings to encode the structural context of each memory node via the graph embedding projection approach (Bordes et al., 2013), in which semantically similar nodes are distributed closer in the embedding space.
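The cited projection approach (Bordes et al., 2013, TransE) trains embeddings so that head + relation ≈ tail for each edge in the graph. A toy numerical sketch of one such update, with random stand-in vectors and made-up entity names:

```python
import numpy as np

# TransE-style graph embedding sketch: learn vectors so that
# entity(head) + relation ≈ entity(tail). Dimensions, entities, and
# the single-triple training loop are illustrative only.
rng = np.random.default_rng(0)
dim = 8
ent = {e: rng.normal(size=dim) for e in ["m1", "Emily", "Tahoe"]}
rel = {r: rng.normal(size=dim) for r in ["participant", "location"]}

def score(h, r, t):
    # Lower distance means the triple (h, r, t) is more plausible.
    return np.linalg.norm(ent[h] + rel[r] - ent[t])

def sgd_step(h, r, t, lr=0.1):
    # Gradient of ||h + r - t||^2 pulls the triple together.
    grad = 2 * (ent[h] + rel[r] - ent[t])
    ent[h] -= lr * grad
    rel[r] -= lr * grad
    ent[t] += lr * grad

before = score("m1", "participant", "Emily")
for _ in range(20):
    sgd_step("m1", "participant", "Emily")
after = score("m1", "participant", "Emily")
```

(The full method also uses negative sampling and a margin loss; this sketch keeps only the translation idea that places semantically related nodes close in the embedding space.)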

Memory Graph Networks: To utilize encoded candidate memory nodes for memory recall QA and proactive memory reference, we use Memory Graph Networks (MGN) (Moon et al., 2019b). MGN stores memory graph nodes as initial memory slots, from which additional contexts and answer candidates can be succinctly expanded and reached via graph traversals. For each (q, m^(k)) pair, MGN predicts the optimal memory slot expansion steps p^(k) = {[p^(k)_{e,t}; p^(k)_{n,t}]}_{t=1}^{T} for edge paths p_e and corresponding node paths p_n. The LSTM-based sequence model is trained to learn the optimal path with ground-truth node and relation paths. The attended memory nodes are then used to answer user memory recall queries (QA Module), and to predict relevant memory nodes to surface as a response to the previous conversational contexts (Recommender Module).

QA Module: An estimated answer a = QA(m, q) is predicted given a query and the MGN graph path output from the initial memory slots.
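The slot-expansion idea can be illustrated with a greedy walk that scores each candidate (edge, node) step against the query. The graph, the query, and the bag-of-words scorer below are toy stand-ins for the learned LSTM path decoder:

```python
# Sketch of MGN-style slot expansion: starting from an initial memory
# slot, follow up to T edges, greedily picking the (edge, node) step
# that best matches the query. Word overlap stands in for the learned
# path-scoring model; the graph is a toy example.
GRAPH = {
    "m1": [("participant", "Emily"), ("location", "Lake Tahoe")],
    "Emily": [("participant_inv", "m2")],
    "m2": [("activity", "soccer")],
}

def step_score(query, edge, node):
    tokens = set(query.lower().split())
    return len(tokens & set((edge + " " + node).lower().split()))

def expand(query, slot, T=2):
    path, node = [], slot
    for _ in range(T):
        candidates = GRAPH.get(node, [])
        if not candidates:
            break
        edge, node = max(candidates,
                         key=lambda c: step_score(query, c[0], c[1]))
        path.append((edge, node))
    return path

path = expand("which memories have Emily in them", "m1")
```

Here the walk first activates the entity node "Emily" and then a second memory node reachable through her, mirroring how MGN surfaces answer candidates beyond the initial slots.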

We then define the memory attention to attenuate or amplify all activated memory nodes based on their compatibility with the query, formulated as follows:

β = MLP(q, {m^(k)}, {p^(k)})        (1)

α = Softmax(W_β^T β) ∈ R^K          (2)
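A numerical sketch of Eqs. (1)–(2): an MLP maps the query, memory, and path features to a score vector β, and a softmax over W_β^T β yields the attention α over the K memory slots. The weights, pooling choice, and dimensions are stand-in assumptions, not the trained parameters:

```python
import numpy as np

# Toy computation of the memory attention in Eqs. (1)-(2).
rng = np.random.default_rng(0)
K, d = 3, 4                        # K memory slots, feature dim d

q = rng.normal(size=d)             # query encoding
m = rng.normal(size=(K, d))        # memory slot encodings m^(k)
p = rng.normal(size=(K, d))        # MGN path encodings p^(k)

W1 = rng.normal(size=(3 * d, d))   # MLP layer for Eq. (1)
W_beta = rng.normal(size=(d, K))   # projection for Eq. (2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Eq. (1): beta = MLP(q, {m^(k)}, {p^(k)}) -- per-slot features are
# concatenated, passed through the MLP, then pooled over slots.
feats = np.concatenate([np.tile(q, (K, 1)), m, p], axis=1)  # (K, 3d)
beta = np.tanh(feats @ W1).mean(axis=0)                     # (d,)

# Eq. (2): alpha = Softmax(W_beta^T beta) in R^K
alpha = softmax(W_beta.T @ beta)
```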



Q: Where did Emma and I go after we watched the new Star Wars movie? // A: Neptune Oyster
Top-k answers: Neptune Oyster, AMC Theatre
Other relevant memory nodes: {Star Wars IV, John, Mar 2014, ...}, {Emma, hiking, July 2017, ...}

Q: When did I last go skiing with Emily? // A: Feb 2017
Top-k answers: Feb 2017, Jan 2016
Other relevant memory nodes: {skiing, Emily, Dec 2016, ...}, {Emily, soccer, ...}

Q: Show me the photos of when I went to the Dodgers game with Mia this year. // A: [photo 1]
Top-k answers: [photo 1], [photo 2]
Other relevant memory nodes: {baseball, Mia, Dodgers, May 2016, ...}, {Mia, movie, April 2018, ...}

Table 1: Example output: model predictions (the top-k answers and the attended memory nodes, shown partially) for each question and ground-truth answer pair.

Next, the model outputs a module program {u^(k)} for several sub-module networks (e.g. CHOOSE, COUNT, ...) via the multi-layer perceptron module selector network, which outputs the module label probability {u^(k)} for each memory node:

{u^(k)} = Softmax(MLP(q, {m^(k)}))        (3)

Each module network produces an answer vector, the aggregated result of which determines the top-k answers to be returned.

Photo Memory Recommender: The memory attention values (α) and the path outputs are then used to proactively recommend relevant photos associated with each activated memory node m^(k):

{i^(k)} = Softmax(MLP(α^(k), p^(k))) ∀k        (4)

where the output score is used to rank the candidate photos. Photos with the top score (above a threshold) are then surfaced along with their memory nodes. We leave this threshold as a tunable hyper-parameter that determines how proactively the system introduces other relevant memories given previous conversational contexts.
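The neural-module step of Eq. (3) can be illustrated with a toy router: a selector assigns each query to a discrete sub-module (e.g. CHOOSE, COUNT), and the chosen module computes the answer over the activated memory nodes. Keyword routing and the node dictionaries below stand in for the learned selector MLP and memory encodings:

```python
# Toy sketch of the module-program step: select a sub-module for the
# query, then execute it over activated memory nodes. The real system
# uses a learned MLP selector (Eq. 3); keyword matching stands in here.
MODULES = {
    "CHOOSE": lambda nodes: max(nodes, key=lambda n: n["score"])["value"],
    "COUNT": lambda nodes: len(nodes),
}

def select_module(query):
    # Stand-in for Softmax(MLP(q, {m^(k)})).
    return "COUNT" if "how many" in query.lower() else "CHOOSE"

nodes = [
    {"value": "Feb 2017", "score": 0.9},
    {"value": "Jan 2016", "score": 0.4},
]

ans1 = MODULES[select_module("When did I last go skiing?")](nodes)
ans2 = MODULES[select_module("How many times did I go skiing?")](nodes)
```

This mirrors Table 1: a "when" question routes to CHOOSE and returns the top-attended node, while a counting question routes to COUNT.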

3 Demonstration

Table 1 shows example outputs from the demonstrated system given the input query and memory graph nodes. It can be seen that the model is able to predict answers by combining answer contexts from multiple components (walk path, node attention, neural modules, etc.). In general, the model successfully explores the respective single-hop or multi-hop relations within the memory graph. The activated nodes obtained via graph traversals are then used as input for each neural module, the aggregated results of which form the final top-k answer predictions. The model also attends on other relevant memory nodes, which often share some of the key attributes with the target reference memory (e.g. same activity, people, location, etc.). At test time, we can proactively present these relevant memories to users along with their associated media contents for more engaging memory-grounded conversations.

4 Related Work

End-to-end dialog systems: There have been a number of studies on end-to-end dialog systems, often focused on task- or goal-oriented dialog systems such as conversational recommendation (Bordes et al., 2017; Sun and Zhang, 2018), information querying (Williams et al., 2017; de Vries et al., 2018; Reddy et al., 2018), etc. Many of the public datasets are collected via bootstrapped simulations (Bordes et al., 2017), Wizard-of-Oz setups (Zhang et al., 2018; Wei et al., 2018; Moon et al., 2019a), or online corpora (Li et al., 2016). In our work, we propose a unique setup for dialog systems called memory-grounded conversations, where the focus is on grounding human conversations with past user memories for both a goal-oriented task (memory recall QA) and more open-ended dialogs (proactive memory reference). Our Memory Dialog dataset uses the popular Wizard-of-Oz setup between role-playing human annotators, where the reference memories are bootstrapped through a memory graph generator.

QA Systems: Structured QA systems have been very popular due to the popularity of fact-retrieval assistant products, which solve fact-retrieval QA queries with large-scale common-fact knowledge graphs (Bordes et al., 2015; Xu et al., 2016; Dubey et al., 2018). Most of this work utilizes an entity linking system and a QA model for predicting graph operations, e.g. through template matching approaches. For QA systems with unstructured knowledge sources (e.g. machine reading comprehension), approaches that utilize Memory Networks with explicit memory slots (Weston et al., 2014; Sukhbaatar et al., 2016) are widely used for their transitive reasoning capability. In our work, we utilize Memory Graph Networks (MGN) (Moon et al., 2019b) to store graph nodes as memory slots and expand slots via graph traversals, to effectively handle complex memory recall queries and to identify relevant memories to surface next.

Visual QA systems answer queries based on the contexts from provided images (Antol et al., 2015; Wang et al., 2018; Wu et al., 2018). Jiang et al. (2018) propose the visual memex QA task, which addresses similar domains given a dataset composed of multiple photo albums. We extend the problem domain to conversational settings where the focus is increased engagement with users through natural multi-modal interactions. Our work also extends the QA capability by utilizing semantic and structural contexts from memory and knowledge graphs, instead of relying solely on meta information and multi-modal content available in photo albums.

References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In ICCV.

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In ICLR.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In NIPS.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.

Mohnish Dubey, Debayan Banerjee, Debanjan Chaudhuri, and Jens Lehmann. 2018. EARL: Joint entity and relation linking for question answering over knowledge graphs. In ESWC.

Lu Jiang, Junwei Liang, Liangliang Cao, Yannis Kalantidis, Sachin Farfade, and Alexander Hauptmann. 2018. MemexQA: Visual memex question answering. arXiv preprint.

Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In ACL.

Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019a. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In ACL.

Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019b. Walk the memory: Memory graph networks for explainable memory-grounded question answering. In CoNLL.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. arXiv preprint arXiv:1808.07042.

Pararth Shah, Dilek Hakkani-Tur, Bing Liu, and Gokhan Tur. 2018. Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In NAACL.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2016. End-to-end memory networks. In NIPS.

Yueming Sun and Yi Zhang. 2018. Conversational recommender system. In SIGIR.

Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, and Douwe Kiela. 2018. Talk the walk: Navigating New York City through grounded dialogue. In ECCV.

Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton van den Hengel. 2018. FVQA: Fact-based visual question answering. In PAMI.

Wei Wei, Quoc Le, Andrew Dai, and Jia Li. 2018. AirDialogue: An environment for goal-oriented dialogue research. In EMNLP.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.

Jason D. Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: Practical and efficient end-to-end dialog control with supervised and reinforcement learning. In ACL.

Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton van den Hengel. 2018. Image captioning and visual question answering based on attributes and external knowledge. In PAMI.

Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on Freebase via relation extraction and textual evidence. In ACL.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In ACL.