
EMNLP 2016

Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods

Workshop Proceedings

November 5, 2016
Austin, Texas, USA


©2016 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Curran Associates
57 Morehouse Lane
Red Hook, New York 12571
USA
Tel: +1-845-758-0400
Fax: [email protected]

ISBN 978-1-945626-30-2


Introduction

Welcome to the EMNLP 2016 Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods.

Early researchers in Natural Language Processing had lofty goals, including getting computers to understand stories, engage in natural, cooperative dialogues with people, and translate text and speech fluently and accurately from one human language to another. While there were significant early achievements (including systems such as SHRDLU, LUNAR and COOP), the knowledge they were based on and the techniques they employed could not be scaled up for practical use.

While much of what early researchers set out to achieve has been either forgotten or sidelined in favor of what can be done by exploiting large data sets and processing power, its potential value has not gone away: There is much to be gained from recognizing not just what was said, but why; from identifying conclusions naturally drawn from what has been said and what hasn't; and from representing domains in a sufficiently rich way to reduce reliance on only what a text makes explicit. As such, we believe there can be a broad and positive impact of reviving early aspirations in the current context of large data sets and "deep" and probabilistic methods.

The workshop program is split into four panel sessions and a poster session. Each panel leads a discussion on a different area of natural language processing: document understanding, natural language generation, dialogue and speech, and language grounding. Each panel session consists of four short (10 minute) presentations, two by established researchers who carried out early work in the area, and two by more junior researchers who are known for their work on specific problems in the area. Following the presentations, workshop participants are invited to discuss challenges and potential approaches for challenges in that field. In addition, the program includes twelve research abstracts that were selected out of 16 submissions. These abstracts are presented as poster boasters at the workshop, as well as in a poster session.

Our program committee consisted of 25 researchers who provided constructive and thoughtful reviews. This workshop would not have been possible without their hard work. Many thanks to you all. We also thank the U.S. National Science Foundation for financial support. Finally, a huge thank you to all the authors who submitted abstracts to this workshop and made it a big success.

Annie, Michael, Bonnie, Mike and Luke


Organizers:

Annie Louis, University of Essex
Michael Roth, University of Illinois Urbana-Champaign / Saarland University
Bonnie Webber, University of Edinburgh
Michael White, The Ohio State University
Luke Zettlemoyer, University of Washington

Program Committee:

Omri Abend, The Hebrew University of Jerusalem
Timothy Baldwin, University of Melbourne
Nate Chambers, United States Naval Academy
Ann Copestake, University of Cambridge
Vera Demberg, Saarland University
Anette Frank, Heidelberg University
Aurelie Herbelot, University of Trento
Graeme Hirst, University of Toronto
Eduard Hovy, Carnegie Mellon University
Beata Beigman Klebanov, Educational Testing Service
Anna Korhonen, University of Cambridge
Daniel Marcu, Information Sciences Institute, USC
Katja Markert, Heidelberg University
Rada Mihalcea, University of Michigan
Hwee Tou Ng, National University of Singapore
Alexis Palmer, Heidelberg University
Manfred Pinkal, Saarland University
Sameer Pradhan, Boulder Learning
Sabine Schulte im Walde, University of Stuttgart
Jan Snajder, University of Zagreb
Swapna Somasundaran, Educational Testing Service
Manfred Stede, University of Potsdam
Joel Tetreault, Yahoo! Labs
Simone Teufel, University of Cambridge
Lucy Vanderwende, Microsoft Research

Invited Speakers:

James Allen, University of Rochester / IHMC
Joyce Chai, Michigan State University
Yejin Choi, University of Washington
Hal Daumé III, University of Maryland, College Park
Marie-Catherine de Marneffe, Ohio State University
David DeVault, University of Southern California


Andrew Kehler, University of California, San Diego
Ioannis Konstas, University of Washington
Mark Liberman, University of Pennsylvania
Diane Litman, University of Pittsburgh
Chris Manning, Stanford University
Kathleen McKeown, Columbia University
Margaret Mitchell, Microsoft Research
Donia Scott, University of Sussex
Mark Steedman, University of Edinburgh
Amanda Stent, Bloomberg


Table of Contents

An Analysis of Prerequisite Skills for Reading Comprehension
    Saku Sugawara and Akiko Aizawa . . . . . . . . . . . . 1

Bridging the gap between computable and expressive event representations in Social Media
    Darina Benikova and Torsten Zesch . . . . . . . . . . . . 6

Statistical Script Learning with Recurrent Neural Networks
    Karl Pichotta and Raymond Mooney . . . . . . . . . . . . 11

Moving away from semantic overfitting in disambiguation datasets
    Marten Postma, Filip Ilievski, Piek Vossen and Marieke van Erp . . . . . . . . . . . . 17

Unsupervised Event Coreference for Abstract Words
    Dheeraj Rajagopal, Eduard Hovy and Teruko Mitamura . . . . . . . . . . . . 22

Towards Broad-coverage Meaning Representation: The Case of Comparison Structures
    Omid Bakhshandeh and James Allen . . . . . . . . . . . . 27

DialPort: A General Framework for Aggregating Dialog Systems
    Tiancheng Zhao, Kyusong Lee and Maxine Eskenazi . . . . . . . . . . . . 32

C2D2E2: Using Call Centers to Motivate the Use of Dialog and Diarization in Entity Extraction
    Ken Church, Weizhong Zhu and Jason Pelecanos . . . . . . . . . . . . 35

Visualizing the Content of a Children's Story in a Virtual World: Lessons Learned
    Quynh Ngoc Thi Do, Steven Bethard and Marie-Francine Moens . . . . . . . . . . . . 39

Stylistic Transfer in Natural Language Generation Systems Using Recurrent Neural Networks
    Jad Kabbara and Jackie Chi Kit Cheung . . . . . . . . . . . . 43

Using Language Groundings for Context-Sensitive Text Prediction
    Timothy Lewis, Cynthia Matuszek, Amy Hurst and Matthew Taylor . . . . . . . . . . . . 48

Towards a continuous modeling of natural language domains
    Sebastian Ruder, Parsa Ghaffari and John G. Breslin . . . . . . . . . . . . 53


Workshop Program

Saturday, November 5, 2016

09:00–10:20 Session S1: Text Understanding

09:00–10:20 Invited talks, followed by discussion
Hal Daume III, Andrew Kehler, Chris Manning, Marie-Catherine de Marneffe

10:20–10:30 Session S2: Poster Boasters

An Analysis of Prerequisite Skills for Reading Comprehension
Saku Sugawara and Akiko Aizawa

Bridging the gap between computable and expressive event representations in Social Media
Darina Benikova and Torsten Zesch

Statistical Script Learning with Recurrent Neural Networks
Karl Pichotta and Raymond Mooney

Moving away from semantic overfitting in disambiguation datasets
Marten Postma, Filip Ilievski, Piek Vossen and Marieke van Erp

Unsupervised Event Coreference for Abstract Words
Dheeraj Rajagopal, Eduard Hovy and Teruko Mitamura

Towards Broad-coverage Meaning Representation: The Case of Comparison Structures
Omid Bakhshandeh and James Allen

10:30–11:00 Coffee break


Saturday, November 5, 2016 (continued)

11:00–12:20 Session S3: Natural Language Generation

11:00–12:20 Invited talks, followed by discussion
Ioannis Konstas, Kathleen McKeown, Margaret Mitchell, Donia Scott

12:20–12:30 Session S4: Poster Boasters

DialPort: A General Framework for Aggregating Dialog Systems
Tiancheng Zhao, Kyusong Lee and Maxine Eskenazi

C2D2E2: Using Call Centers to Motivate the Use of Dialog and Diarization in Entity Extraction
Ken Church, Weizhong Zhu and Jason Pelecanos

Visualizing the Content of a Children's Story in a Virtual World: Lessons Learned
Quynh Ngoc Thi Do, Steven Bethard and Marie-Francine Moens

Stylistic Transfer in Natural Language Generation Systems Using Recurrent Neural Networks
Jad Kabbara and Jackie Chi Kit Cheung

Using Language Groundings for Context-Sensitive Text Prediction
Timothy Lewis, Cynthia Matuszek, Amy Hurst and Matthew Taylor

Towards a continuous modeling of natural language domains
Sebastian Ruder, Parsa Ghaffari and John G. Breslin

12:30–14:00 Lunch break


Saturday, November 5, 2016 (continued)

14:00–15:20 Session S5: Dialogue and Speech

14:00–15:20 Invited talks, followed by discussion
David DeVault, Mark Liberman, Diane Litman, Amanda Stent

15:20–16:00 Coffee break + poster session

16:00–16:30 Session S6: Poster session (continued)

16:30–17:50 Session S7: Grounded Language

16:30–17:50 Invited talks, followed by discussion
James Allen, Joyce Chai, Yejin Choi, Mark Steedman


Proceedings of EMNLP 2016 Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods, pages 1–5, Austin, TX, November 5, 2016. ©2016 Association for Computational Linguistics

An Analysis of Prerequisite Skills for Reading Comprehension

Saku Sugawara
The University of Tokyo
7-3-1 Hongo, Bunkyo-ku
Tokyo, [email protected]

Akiko Aizawa
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku
Tokyo, [email protected]

Abstract

In this paper, we focus on the synthetic understanding of documents, specifically reading comprehension (RC). A current problem with RC is the need for a method of analyzing the RC system performance to realize further development. We propose a methodology for examining RC systems from multiple viewpoints. Our methodology consists of three steps: define a set of basic skills used for RC, manually annotate questions of an existing RC task, and show the performances for each skill of existing systems that have been proposed for the task. We demonstrated the proposed methodology by annotating MCTest, a freely available dataset for testing RC. The results of the annotation showed that answering RC questions requires combinations of multiple skills. In addition, our defined RC skills were found to be useful and promising for decomposing and analyzing the RC process. Finally, we discuss ways to improve our approach based on the results of two extra annotations.

1 Introduction

Reading comprehension (RC) tasks require machines to understand passages and respond to questions about them. For the development of RC systems, precisely identifying what systems can and cannot understand is important. However, a critical problem is that the RC process is so complicated that it is not easy to examine the performances of RC systems.

Our present goal is to construct a general evaluation methodology that decomposes the RC process and elucidates the fine-grained performance from multiple points of view rather than based only on accuracy, which is the approach used to date. Our methodology has three steps:

1. Define a set of prerequisite skills that are required for understanding documents (Section 2.1)

2. Annotate questions of an RC task with the skills (Section 2.2)

3. Analyze the performances of existing RC systems for the annotated questions to grasp the differences and limitations of their individual performances (Section 2.3)

In Section 2, we present an example of our methodology, where we annotated MCTest (MC160 development set) (Richardson et al., 2013) [1] for Step 2 and analyzed systems by Smith et al. (2015) for Step 3. In Section 3, we present two additional annotations in order to show the outlook for the development of our methodology in terms of the classification of skills and finer categories for each skill. In Section 4, we discuss our conclusions.

2 Approach

2.1 Reading Comprehension Skills

We investigated existing tasks for RC and defined a set of basic prerequisite skills, which we refer to as RC skills. These are presented in Table 1.

The RC skills were defined to understand the relations between multiple clauses.

[1] http://research.microsoft.com/en-us/um/redmond/projects/mctest/


RC skill                     | Freq. | Description or example                                            | Smith (no RTE) | Smith (RTE)
List/Enumeration             | 11.7% | Tracking, retaining, and list/enumeration of entities or states   | 78.6% | 71.4%
Mathematical operations      | 4.2%  | Four basic operations and geometric comprehension                 | 20.0% | 20.0%
Coreference resolution       | 57.5% | Detection and resolution of coreferences                          | 65.2% | 69.6%
Logical reasoning            | 0.0%  | Induction, deduction, conditional statement, and quantifier       | -     | -
Analogy                      | 0.0%  | Trope in figures of speech, e.g., metaphor                        | -     | -
Spatiotemporal relations∗    | 28.3% | Spatial and/or temporal relations of events                       | 70.6% | 76.5%
Causal relations∗            | 18.3% | Why, because, the reason, etc.                                    | 63.6% | 68.2%
Commonsense reasoning        | 49.2% | Taxonomic/qualitative knowledge, action and event change          | 59.3% | 64.4%
Complex sentences∗           | 15.8% | Coordination or subordination of clauses                          | 52.6% | 68.4%
Special sentence structure∗  | 10.0% | Scheme in figures of speech, constructions, and punctuation marks | 50.0% | 50.0%
(Accuracy in all 120 questions) | -  | -                                                                  | 67.5% | 70.0%

Table 1: Reading comprehension skills, their frequencies (in percentage) in MCTest (MC160 development set, 120 questions), their descriptions or examples, and the accuracies of the two systems (Smith et al., 2015) for each skill. The asterisks (∗) with items represent "understanding of."

ID: MC160.dev.29 (1) multiple:
C1: The princess climbed out the window of the high tower and climbed down the south wall when her mother was sleeping.
C2: She wandered out a good ways.
C3: Finally she went into the forest where there are no electric poles but where there are some caves.
Q: Where did the princess wander to after escaping?
A: Forest

Coreference resolution:
· She in C2 = the princess in C1
· She in C3 = the princess in C1
Temporal relations:
· the actions in C1 → wandered out ... in C2 → went into ... in C3
Complex sentence and special sentence structure:
· C1 = the princess climbed out... and [the princess] climbed down... (ellipsis)
Commonsense reasoning:
· escaping in Q ⇒ the actions in C1
· wandered out in C2 and went into the forest in C3 ⇒ wander to the forest in Q and A

Figure 1: Example of task sentences in MCTest and annotations with comments for verification (itemized).

Here, we assumed that, when an RC system uses an RC skill, it must already recognize individual facts described in those clauses to which the skill relates.

There are two exceptional RC skills:

Complex sentences target the understanding of relations between clauses in one sentence (except those having spatiotemporal or causal meanings). Such relations have schematic or rhetorical meanings. For example, the words "and" and "or" introduce coordinating clauses (we regard them as having schematic relations). In addition, the word "although" introduces a subordinate clause that represents concession (i.e., it modifies the rhetorical meaning).

Special sentence structure is defined as recognizing linguistic symbols or structures in a sentence and introducing their interpretations as new facts. For example, if we take "scheme" in figures of speech, this skill deals with apposition, ellipsis, transposition, and so on. This skill also targets linguistic constructions and punctuation marks.

These two skills target a single sentence, while the other skills target multiple clauses and sentences. We did not list the skill of recognizing textual entailment (TE) because we assumed that TE involves a broad range of knowledge and inferences and is therefore a generic task itself (Dagan et al., 2006).

2.2 Annotation of RC Questions

We manually annotated the questions of the MC160 development set (120 questions) with the RC skills that are required to answer each question. In the annotation, we allow multiple labeling.

Because the RC skills are intended for understanding relations between multiple clauses, we excluded sentences that had no relations with others and required only simple rules for answering (e.g., mc160.dev.2 (3) Context: Todd lived in a town outside the city. Q: Where does Todd live in? A: in a town). These questions were considered to require no skills.

An example of the annotations is shown in Figure 1. The percentages of the questions in which RC skills appear are in the second column of Table 1. Some of the questions are annotated with multiple labels. The number of skills required in each question is 0 for 9.2% of the questions, 1 for 27.5%, 2 for 30.0%, 3 for 26.7%, 4 for 5.8%, and 5 for 0.8%.

2.3 Analysis of Existing Systems

The accuracies of the system by Smith et al. (2015) and its extension with RTE (Stern and Dagan, 2011) are represented in the last two columns of Table 1.
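The per-skill accuracies in those columns are simple conditional accuracies over the multi-label annotations: for each skill, accuracy is computed over just the questions annotated with that skill. A minimal sketch of that aggregation, assuming an illustrative record layout (not the authors' actual data format):

```python
from collections import defaultdict

# Illustrative layout: one record per question, with its set of annotated
# RC skills and a flag for whether the system answered it correctly.
annotations = [
    {"id": "mc160.dev.29.1",
     "skills": {"Coreference resolution", "Commonsense reasoning"},
     "correct": True},
    {"id": "mc160.dev.2.3", "skills": set(), "correct": True},
    # ... one entry per question in the MC160 development set
]

def per_skill_accuracy(records):
    """Accuracy restricted to the questions annotated with each skill."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        for skill in r["skills"]:
            totals[skill] += 1
            hits[skill] += int(r["correct"])
    return {s: hits[s] / totals[s] for s in totals}

print(per_skill_accuracy(annotations))
```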

The results showed that adding RTE to Smith et al. (2015)'s original system provided the most effective contribution to the skill of complex sentences; however, it did not affect the skills of math operations and special sentence structure. Adding RTE had a relatively small contribution to the skill of causal relations. This did not exactly meet our expectation, because we still do not have a sufficient number of annotations to determine the differences between combinations of skills.

3 Additional Annotations

In order to improve our methodology, we considered two questions: (i) What is the difference between distributions of RC skills in two RC tasks? (ii) Can RC skills be broken up into finer categories?

To answer these questions, here we present two additional annotations. The first treated SQuAD (Rajpurkar et al., 2016). We counted the frequencies of RC skills required in that task and compared their distribution with that of MCTest. This gave clues for establishing the ideal categorization of RC skills.

For the second, we divided the skill of commonsense reasoning into three subcategories and used them to annotate MCTest. This should help toward a sharper definition of common sense.

3.1 SQuAD with RC Skills

SQuAD [2] is an RC task based on a set of Wikipedia articles. The questions are written by crowdworkers, and their answers are guaranteed to appear in the context as a word sequence. We chose 80 questions over seven articles from the development set (v1.1) and annotated them with RC skills. Figure 2 shows an example of the annotations.

[2] http://stanford-qa.com

RC skill                   | Frequency (SQuAD) | Frequency (MCTest)
List/Enumeration           | 5.0%  | 11.7%
Mathematical operations    | 0.0%  | 4.2%
Coreference resolution     | 6.2%  | 57.5%
Logical reasoning          | 1.2%  | 0.0%
Analogy                    | 0.0%  | 0.0%
Spatiotemporal relations   | 2.5%  | 28.3%
Causal relations           | 6.2%  | 18.3%
Commonsense reasoning      | 86.2% | 49.2%
Complex sentences          | 20.0% | 15.8%
Special sentence structure | 25.0% | 10.0%

Table 2: Reading comprehension skills and their frequencies (in percentage) in SQuAD and MCTest (MC160 development set).

The annotation results are presented in Table 2. Most questions require commonsense reasoning. This is because the crowdworkers were asked to avoid copying words from their context as much as possible. That is, most questions require understanding of paraphrases. Compared with MCTest, the frequencies were generally low except for a few skills. This was due to the task formulation of SQuAD. For example, because SQuAD does not involve multiple choice (a candidate answer can contain multiple entities), the skill of list/enumeration is not required. Additionally, except for articles on a particular person or historical event, there are fewer descriptions that require spatiotemporal relations than in MCTest, whose datasets mainly describe tales about characters and events for young children. On the other hand, complex sentences and special sentence structure appear more frequently in SQuAD than in MCTest because the documents of SQuAD are written for adults. In this way, by annotating RC tasks and comparing the results, we can see the difference in characteristics among those tasks.

3.2 MCTest with Commonsense Types

By referring to Davis and Marcus (2015), we defined the following three types of common sense, as given in Table 3, and annotated the MC160 development set while allowing multiple labeling. We found three questions that required multiple types.

Lexical knowledge focuses on relations of words or phrases, e.g., synonyms and antonyms, as in WordNet. This includes hierarchical relations of content words. Therefore, this knowledge is taxonomic and categorical.

ID: Civil disobedience, paragraph 1, question 1
C1: One of its earliest massive implementations was brought about by Egyptians against the British occupation in the 1919 Revolution.
C2: Civil disobedience is one of the many ways people have rebelled against what they deem to be unfair laws.
Q: What is it called when people in society rebel against laws they think are unfair?
A: Civil disobedience

Coreference resolution:
· they in C2 = people in C2 (different clauses)
· they in Q = people in Q (different clauses)
Temporal relation:
· people have rebelled... in C2 → when people in society rebel... in Q
Complex sentences:
· C2 = one of the many ways people have (relative clause)
· C2 = Civil disobedience is... against [the object] and [it is] what they deem to... (relative clause)
· Q = What is it called... laws and they think [the laws] unfair?
Commonsense reasoning:
· What is it called in Q ⇒ Civil disobedience is
· laws they think... in C2 = what they deem to in Q

Figure 2: Example of task sentences (excerpted) in the development set of SQuAD and their annotations with comments for verification (itemized).

Qualitative knowledge targets various relations of events, including "about the direction of change in interrelated quantities" (Davis and Marcus, 2015). In addition, this knowledge deals with implicit causal relations such as physical law and theory of mind. Note that these relations are semantic, so this type of knowledge ignores the understanding of syntactic relations, i.e., the skills of spatiotemporal relations and causal relations.

The skill of known facts targets named entities such as proper nouns, locations, and dates. Davis and Marcus (2015) did not mention this type of knowledge. However, we added this just in case, because we considered the first two types as unable to treat facts such as a proper noun indicating the name of a character in a story.

Table 3 presents the frequencies of these types and the accuracies of the RTE system of Smith et al. (2015). Because MCTest was designed to test the capability of young children's reading, known facts were hardly required. Although not reported here, we found that understanding them was required more often in MC500 (e.g., days of a week). While the frequencies of the first two types were relatively high, their accuracies were comparable. Unfortunately, this meant that they were inadequate for revealing the weakness of the system on this matter. We concluded that finer classification is needed. However, the distribution of the frequencies showed that even these commonsense types can characterize a dataset in terms of the knowledge types required in that task.

Commonsense type      | Frequency | Accuracy (Smith RTE)
Lexical knowledge     | 19.2%     | 67.2%
Qualitative knowledge | 30.8%     | 67.6%
Known facts           | 2.5%      | 33.3%

Table 3: Commonsense types, their frequencies (in percentage) in MCTest (MC160 development set), and accuracies by Smith et al. (2015)'s RTE system.

4 Discussion and Conclusion

As discussed in Section 2.3, our methodology has the potential to reveal differences in system performances in terms of multiple aspects. We believe that it is necessary to separately test and analyze new and existing RC systems on each RC skill in order to make each system more robust. We will continue to annotate other datasets of MCTest and RC tasks and analyze the performances of other existing systems.

From the observations presented in this paper, we may be able to make a stronger claim that researchers of RC tasks (more generally, natural language understanding) should also provide the frequencies of RC skills. This will help in developing a standard approach to error analysis so that systems can be investigated for their strengths and weaknesses in specific skill categories. We can determine the importance of each skill by weighting them according to their frequencies in the test set.

Acknowledgments

We are grateful to the anonymous reviewers for their thoughtful comments and the workshop organizers. This work was supported by JSPS KAKENHI Grant Number 15H0275.


References

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177–190. Springer.

Ernest Davis and Gary Marcus. 2015. Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 58(9):92–103.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.

Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 193–203.

Lenhart K. Schubert. 2015. What kinds of knowledge are needed for genuine understanding? In IJCAI 2015 Workshop on Cognitive Knowledge Acquisition and Applications (Cognitum 2015).

Ellery Smith, Nicola Greco, Matko Bosnjak, and Andreas Vlachos. 2015. A strong lexical matching method for the machine comprehension test. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1693–1698. Association for Computational Linguistics.

Asher Stern and Ido Dagan. 2011. A confidence model for syntactically-motivated entailment proofs. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pages 455–462, Hissar, Bulgaria, September. RANLP 2011 Organising Committee.

Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2015. Towards AI-complete question answering: a set of prerequisite toy tasks. In Proceedings of the International Conference on Learning Representations (ICLR).


Proceedings of EMNLP 2016 Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods, pages 6–10, Austin, TX, November 5, 2016. ©2016 Association for Computational Linguistics

Bridging the gap between computable and expressive event representations in Social Media

Darina Benikova
Language Technology Lab
University of Duisburg-Essen
Duisburg, Germany
[email protected]

Torsten Zesch
Language Technology Lab
University of Duisburg-Essen
Duisburg, Germany
[email protected]

Abstract

An important goal in text understanding is making sense of events. However, there is a gap between computable representations on the one hand and expressive representations on the other hand. We aim to bridge this gap by inducing distributional semantic clusters as labels in a frame structural representation.

1 Introduction

We experience events in our everyday life, through witnessing, reading, hearing, and seeing what is happening. However, representing events computationally is still an unsolved challenge even when dealing with well-edited text. As soon as non-standard text varieties such as Social Media are targeted, representations need to be robust against alternative spellings, information compression, and neologisms.

The aim of our envisioned project is bridging the gap between argument-level representations, which are robustly computable but less expressive, and frame-level representations, which are highly expressive but not robustly computable. The distinction and the gap between the two main representation types is presented in Figure 1. On the argument level, the event give and all its arguments are identified, whereas on the frame level additional semantic role labels are assigned.

We envision a representation that enables operations such as equivalence, entailment, and contradiction. In this paper, we will focus on the equivalence operation due to space constraints. These operations are not only necessary to compress the amount of information, which is especially important in high-volume, high-redundancy Social Media posts, but also for other tasks such as analyzing and understanding events efficiently.

We plan to build a robustly computable and expressive representation that is suited to perform the discussed operations by using Social Media domain-specific clusters and topic labeling methods for the frame labeling. We intend to evaluate the validity of our representation and approach extrinsically and in an application-based way.

2 Types of representations

We distinguish between representations on two levels: (i) the argument level, which can be robustly implemented more easily, and (ii) the frame level, which is highly expressive.

Argument-level As shown in Figure 1, most argument representations consist of an event trigger, which is mostly a verb, and its corresponding arguments (Banarescu et al., 2013; Kingsbury and Palmer, 2003). Argument-level representations based on Social Media posts are used in applications such as creating event calendars for concerts and festivals (Becker et al., 2012) or creating overviews of important events on Twitter (Ritter et al., 2012).

Frame-level On this level, events are represented as frame structures such as those proposed by Fillmore (1976) that build upon the argument level, i.e., the arguments are labeled with semantic roles.

A well-known frame semantic tagger is SEMAFOR (Das, 2014).


Figure 1: Representations of an exemplary event on argument and frame-level

3 Challenges

The main challenge is to bridge the gap between argument- and frame-level representations.

3.1 Performance of operations

Our goal is to develop a representation that is both computable even on noisy Social Media text and expressive enough to support all required operations like equivalence. A semantically equivalent sentence for our exemplary sentence "The witch gave an apple to Snow White" would be "Snow White received an apple from the witch.", as receive is the antonym of give and the roles of Giver and Receiver are inverted.

On the argument level, it remains a hard problem to establish the equivalence between the two sentences, while that would be easy on the frame level. However, getting to the frame level is an equally hard problem, and frame representations suffer from low coverage (Palmer and Sporleder, 2010).

3.2 Coverage

Palmer and Sporleder (2010) categorized and evaluated the coverage gaps in FrameNet (Baker et al., 2003). Coverage, whether of undefined units, lemmas, or senses, is of special importance when dealing with non-standard text that contains spelling variations and neologisms that need to be dealt with.

In our opinion, the issue of undefined units is especially problematic in Social Media texts. Furthermore, such text may contain innovative, informal, or incomplete use of frames, due to space restrictions such as those imposed by Twitter. Also because of space restrictions, which lead to a lack of context, and considering the variety of topics that is addressed in Social Media, it is more challenging to find a fitting frame in an existing frame repository (Ritter et al., 2012; Li and Ji, 2016).

Giuglea and Moschitti (2006) and Mujdricza-Maydt et al. (2016) tried to bridge the gap by combining repositories on the frame and argument level and representing them based on Intersective Levin Classes (ILC) (Kipper et al., 2006). ILC, which are used in VerbNet (Kipper et al., 2006), are more fine-grained than classic Levin verb classes, formed according to alternations of the grammatical expression of their arguments (Levin, 1993). Classic Levin verb classes were used for measuring semantic evidence between verbs (Baker and Ruppenhofer, 2002).

However, these approaches also have to deal with coverage problems due to their reliance on manually crafted frame repositories.

4 Approach

According to Modi et al. (2012), frame-semantic parsing conceptually consists of four stages:

1. Identification of frame-evoking elements
2. Identification of their arguments
3. Labeling of frames
4. Labeling of roles

We summarize these tasks in groups of two, namely identification and labeling, and discuss our approach towards them in the following subsections.

4.1 Identification of frame-evoking elements and their arguments

We regard the first two tasks as tasks of the argument level, which we plan to solve with part-of-speech tagging and dependency parsing, by extracting all main verbs to solve the first task and considering all its noun dependencies as arguments in the second task. This is similar to the approach of Modi et al. (2012).
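As a rough illustration of this first stage, the sketch below extracts verbs and their nominal dependents from a parsed sentence. The use of spaCy and its small English model is only an assumption for the sketch; the abstract commits to part-of-speech tagging and dependency parsing, not to a particular toolkit.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any POS tagger + dependency parser would do

def extract_predicates(text):
    """Return (verb lemma, [noun dependents]) pairs as a crude argument-level representation."""
    doc = nlp(text)
    events = []
    for token in doc:
        if token.pos_ == "VERB":
            # Treat the verb's direct nominal dependents as its arguments.
            args = [child.text for child in token.children
                    if child.pos_ in ("NOUN", "PROPN", "PRON")]
            events.append((token.lemma_, args))
    return events

print(extract_predicates("The witch gave an apple to Snow White."))
# e.g. [('give', ['witch', 'apple'])]; 'Snow White' hangs off the preposition 'to'
```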

4.2 Labeling of predicates and their arguments

Like Modi et al. (2012), we focus on the last two tasks, which we regard as tasks of the frame level. We approach this task with a view to supporting the operations discussed earlier. As we only regard predicate frames and their arguments for the role labeling, we will use predicate as a term for the unlabeled form of frame and argument as the unlabeled form of role.

Pre-defined frame labels There have been attempts to bridge the gap on Social Media texts by projecting ontological information in the form of computed event types on the event trigger on the argument level (Ritter et al., 2012; Li et al., 2010; Li and Ji, 2016) in order to solve the task of frame labeling. However, according to Ritter et al. (2012) the automatic mapping of pre-defined event types is insufficient for providing semantically meaningful information on the event.

We aim to augment those approaches by inducing frame-like structures based on distributional semantics. Moreover, we want to use similarity clusters for the labeling of arguments in frames. We seek to compute the argument labels by the use of supersense tagging, similarly to the approach presented by Coppola et al. (2009). They successfully used the WordNet supersense labels (Miller, 1995) for verbs and nouns as a pre-processing step for the automatic labeling of frames and their arguments.

Approaches using Levin classes, ILC, or WordNet supersenses tackle the same tasks, namely labeling frames and their corresponding roles. However, all of these suffer from the discussed coverage problem.

Clusters as labels To circumvent the coverage issue, there have been approaches using clusters similarly to frame labels. Directly labeling predicates and their arguments has been performed by Modi et al. (2012), who iteratively clustered verbal frames with their arguments.

As our main goal is to perform operations on event representations, we do not need human-readable frames as proposed by FrameNet, but a level that is semantically equivalent to it; thus our first goal is to compute domain-specific clusters for the labeling.

Figure 2: Representations of our approach to bridge the gap

In contrast to Modi et al. (2012), we plan to cluster the verbal predicates and the arguments separately. Although this might seem less intuitive, we believe that due to the difficulties with Social Media data, the structures of full frames are less repetitive and are more difficult to cluster. Thus, by dividing the two tasks of predicate and argument clustering, we hope to achieve better results in our setting.

Furthermore, in order to deal with the issues of the previously discussed peculiarities of the Social Media domain, we plan to train clusters on large amounts of Tweets.
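To make the idea of clustering predicates on their own concrete, the sketch below groups verbs by generic word embeddings with k-means. This is purely illustrative: the actual clusters would come from JoBimText-style distributional models trained on Tweets (cf. the JoBimViz model mentioned below), and the embedding file name is a placeholder.

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Placeholder path: in practice the vectors would be induced from a large Tweet corpus.
vectors = KeyedVectors.load_word2vec_format("tweet_vectors.bin", binary=True)

predicates = ["give", "hand", "receive", "get", "walk", "run"]
known = [p for p in predicates if p in vectors]
X = np.vstack([vectors[p] for p in known])

# Cluster the verbal predicates on their own, separately from their arguments.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
for pred, label in zip(known, kmeans.labels_):
    print(pred, "-> cluster", label)
```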

An example of our envisioned representation is shown in Figure 2, which was produced using the Twitter Bigram model of JoBimViz (Ruppert et al., 2015). Figure 2 shows the clustering for finding the correct sense in the labeling task, for both the predicate and its arguments. As the example shows, this representation has some flaws that need to be dealt with. It should be mentioned that the model used for this computation is pruned due to performance reasons, which is a cause for some of the flaws.

For example, Snow White is not recognized as a Named Entity or a multi-word expression. To deal with the issue of the false Named Entity representation of Snow White presented in the exemplary representation, we plan to experiment with multi-word or Named Entity recognizers. Thus, we plan to train a similar model on a larger set of Tweets, without pruning due to performance reasons.

The main flaw is that the wrong sense cluster of give is selected. To address the issues described above, we plan to use soft clustering for the step of finding the correct sense cluster. Allowing soft classes not only facilitates disambiguation (Riedl, 2016), but may also be helpful when identifying the argument role in the frame and thus allow for the previously described operations of equality. By providing only one cluster per predicate, Modi et al. (2012) put the task of disambiguation aside, which we want to tackle as mentioned above.

Furthermore, aiming at representations that are suited for operations such as equality, the known problem of antonyms falling into the same cluster needs to be solved. Similarly to Lobanova et al. (2010), who automatically extracted antonyms in text, we plan to solve this issue with a pattern-based approach.
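Such a pattern-based step could start from simple lexico-syntactic templates of the kind used by Lobanova et al. (2010). The templates below are our own illustrative examples, not the final inventory:

```python
import re

# A few illustrative antonymy-indicating templates (the two captured words are candidates).
PATTERNS = [
    re.compile(r"\beither (\w+) or (\w+)\b"),
    re.compile(r"\b(\w+) rather than (\w+)\b"),
    re.compile(r"\bneither (\w+) nor (\w+)\b"),
]

def antonym_candidates(text):
    """Collect word pairs matched by the templates as antonym candidates."""
    pairs = set()
    for pattern in PATTERNS:
        for x, y in pattern.findall(text.lower()):
            pairs.add((x, y))
    return pairs

print(antonym_candidates("You can either give or receive presents, give rather than take."))
# {('give', 'receive'), ('give', 'take')}
```

Candidate pairs found this way would then be kept out of (or split across) the sense clusters so that give and receive do not end up labeled as equivalent.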

Topic-clustered labels After succeeding in the clustering task, we plan to experiment with human-readable frame clusters. In contrast to using pre-defined WordNet supersenses and mapping these to frames, we want to solve the task of finding labels for the clusters by using supersenses computed from domain-specific clusters to directly label the frames and their arguments.

Our hypothesis is that by using more clusters, and soft clusters, for the supersense tagging, the role labels of the event arguments become semantically richer, because more specific semantic information on the arguments and their context in the event is encoded.

Thus, we plan to perform the supersense tagging using an LDA extension in which a combination of context and language features is used, as described by Riedl (2016).

5 Evaluation plan

We plan to evaluate our approach in an extrinsic, application-based way on a manual gold standard containing event paraphrases. In order to test how well our approach performs in comparison to state-of-the-art approaches of both argument and frame representations, such as Das (2014) or Li and Ji (2016), in the task of equivalence computation, we will compare the results of all approaches.

For this purpose, we plan to develop a dataset that is similar to that of Roth and Frank (2012), but tailored to the Social Media domain. They produced a corpus of alignments between semantically similar predicates and their arguments from news texts on the same event.

6 Summary

In this paper we present our vision of a new event representation that enables the use of operations such as equivalence. We plan to use pre-processing to get the predicates and their arguments. The main focus of the work will be using sense clustering methods on domain-specific text and applying these clusters to text. We plan to evaluate this application in an extrinsic, application-based way.

Further on, we plan to tackle tasks such as: topic-model-based frame labeling on the computed clusters; pattern-based antonym detection in the clusters for enabling the operation of contradiction and improving the task of equivalence; and experimenting with Named Entity and multiword recognizers in order to improve the results in argument recognition.

7 Acknowledgements

This work is supported by the German Research Foundation (DFG) under grant No. GRK 2167, Research Training Group "User-Centred Social Media".

We also kindly thank EMNLP for granting us travel support.

References

Collin F. Baker and Josef Ruppenhofer. 2002. FrameNet's frames vs. Levin's verb classes. In Proceedings of the 28th Annual Meeting of the Berkeley Linguistics Society, pages 27–38.

Collin F. Baker, Charles J. Fillmore, and Beau Cronin. 2003. The structure of the FrameNet database. International Journal of Lexicography, 16(3):281–296.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for Sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria.

Hila Becker, Dan Iter, Mor Naaman, and Luis Gravano. 2012. Identifying content for planned events across social media sites. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pages 533–542, Seattle, WA, USA.

Bonaventura Coppola, Aldo Gangemi, Alfio Gliozzo, Davide Picca, and Valentina Presutti. 2009. Frame detection over the Semantic Web. In European Semantic Web Conference, pages 126–142. Springer.

Dipanjan Das. 2014. Statistical models for frame-semantic parsing. In Proceedings of Frame Semantics in NLP: A Workshop in Honor of Chuck Fillmore, pages 26–29, Baltimore, MD, USA.

Charles J. Fillmore. 1976. Frame Semantics and the Nature of Language. Annals of the New York Academy of Sciences, pages 20–33.

Ana-Maria Giuglea and Alessandro Moschitti. 2006. Semantic Role Labeling via FrameNet, VerbNet and PropBank. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 929–936, Sydney, Australia.

Paul Kingsbury and Martha Palmer. 2003. PropBank: the Next Level of TreeBank. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories, pages 105–116, Vaxjo, Sweden.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2006. Extending VerbNet with novel verb classes. In Proceedings of Language Resources and Evaluation, pages 1027–1032, Genoa, Italy.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Hao Li and Heng Ji. 2016. Cross-genre Event Extraction with Knowledge Enrichment. In The 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1158–1162, San Diego, CA, USA.

Hao Li, Xiang Li, Heng Ji, and Yuval Marton. 2010. Domain-Independent Novel Event Discovery and Semi-Automatic Event Annotation. In Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, pages 233–242, Sendai, Japan.

Anna Lobanova, Tom van der Kleij, and Jennifer Spenader. 2010. Defining antonymy: A corpus-based study of opposites by lexico-syntactic patterns. International Journal of Lexicography, 23(1):19–53.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Ashutosh Modi, Ivan Titov, and Alexandre Klementiev. 2012. Unsupervised induction of frame-semantic representations. In Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure, pages 1–7, Montreal, Canada. Association for Computational Linguistics.

Eva Mujdricza-Maydt, Silvana Hartmann, Iryna Gurevych, and Anette Frank. 2016. Combining semantic annotation of word sense & semantic roles: A novel annotation scheme for VerbNet roles on German language data. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 3031–3038, Portoroz, Slovenia, May.

Alexis Palmer and Caroline Sporleder. 2010. Evaluating FrameNet-style semantic parsing: the role of coverage gaps in FrameNet. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 928–936, Uppsala, Sweden.

Martin Riedl. 2016. Unsupervised Methods for Learning and Using Semantics of Natural Language. Ph.D. thesis, TU Darmstadt.

Alan Ritter, Oren Etzioni, and Sam Clark. 2012. Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1104–1112, Beijing, China.

Michael Roth and Anette Frank. 2012. Aligning predicate argument structures in monolingual comparable texts: A new corpus for a new task. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 218–227, Montreal, Canada.

Eugen Ruppert, Manuel Kaufmann, Martin Riedl, and Chris Biemann. 2015. JoBimViz: A web-based visualization for graph-based distributional semantic models. In The Annual Meeting of the Association for Computational Linguistics (ACL) System Demonstrations, pages 103–108.


Proceedings of EMNLP 2016 Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods, pages 11–16, Austin, TX, November 5, 2016. ©2016 Association for Computational Linguistics

Statistical Script Learning with Recurrent Neural Networks

Karl Pichotta and Raymond J. Mooney
Department of Computer Science
The University of Texas at Austin

{pichotta,mooney}@cs.utexas.edu

Abstract

We describe some of our recent efforts in learning statistical models of co-occurring events from large text corpora using Recurrent Neural Networks.

1 Introduction

Natural language scripts are structured models of stereotypical sequences of events used for document understanding. For example, a script model may encode the information that from Smith landed in Beijing, one may presumably infer Smith flew in an airplane to Beijing, Smith got off the plane at the Beijing airport, etc. The world knowledge encoded in such event co-occurrence models is intuitively useful for a number of semantic tasks, including Question Answering, Coreference Resolution, Discourse Parsing, and Semantic Role Labeling.

Script learning and inference date back to AI research from the 1970s, in particular the seminal work of Schank and Abelson (1977). In this work, events are formalized as quite complex hand-encoded structures, and the structures encoding event co-occurrence are non-statistical and hand-crafted based on appeals to the intuitions of the knowledge engineer. Mooney and DeJong (1985) give an early method of automatically inducing models of co-occurring events from documents, but their methods are likewise non-statistical.

There is a growing body of more recent work investigating methods of learning statistical models of event sequences from large corpora of raw text. These methods admit scaling models up to be much larger than hand-engineered ones, while being more robust to noise than automatically learned non-statistical models. Chambers and Jurafsky (2008) describe a statistical co-occurrence model of (verb, dependency) pair events that is trained on a large corpus of documents and can be used to infer implicit events from text. A number of other systems following a similar paradigm have also been proposed (Chambers and Jurafsky, 2009; Jans et al., 2012; Rudinger et al., 2015). These approaches achieve generalizability and computational tractability on large corpora, but do so at the expense of decreased representational complexity: in place of the rich event structures found in Schank and Abelson (1977), these systems model and infer structurally simpler events.

In this extended abstract, we will briefly summarize a number of statistical script-related systems we have described in previous publications (Pichotta and Mooney, 2016a; Pichotta and Mooney, 2016b), place them within the broader context of related research, and remark on future directions for research.

2 Methods and results

In Pichotta and Mooney (2016a), we present a system that uses Long Short-Term Memory (LSTM) Recurrent Neural Nets (RNNs) (Hochreiter and Schmidhuber, 1997) to model sequences of events. In this work, events are defined to be verbs with information about their syntactic arguments (either the noun identity of the head of an NP phrase relating to the verb, the entity identity according to a coreference resolution engine, or both). For example, the sentence Smith got off the plane at the Beijing airport would be represented as (get off, smith, plane, (at, airport)). This event representation was investigated in Pichotta and Mooney (2014) in the context of count-based co-occurrence models. Balasubramanian et al. (2013), Modi and Titov (2014), and Granroth-Wilding and Clark (2016) describe systems for related tasks with similar event formulations.
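For concreteness, such an event is just a small tuple of components. The sketch below uses our own illustrative names and shows how a sequence of events can be flattened into the component-per-timestep sequence described next; it is not the released implementation.

```python
from collections import namedtuple

# (verb, subject, object, prepositional argument); None marks an empty slot.
Event = namedtuple("Event", ["verb", "subj", "obj", "prep"])

events = [
    Event("fly", "smith", None, ("to", "beijing")),
    Event("get_off", "smith", "plane", ("at", "airport")),
]

def to_component_sequence(events):
    """Flatten events into the token sequence fed to the RNN, one component per timestep."""
    seq = []
    for e in events:
        prep = "_".join(e.prep) if e.prep else "NONE"
        seq += [e.verb, e.subj or "NONE", e.obj or "NONE", prep]
    return seq

print(to_component_sequence(events))
# ['fly', 'smith', 'NONE', 'to_beijing', 'get_off', 'smith', 'plane', 'at_airport']
```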

In Pichotta and Mooney (2016a), we train an RNN sequence model by inputting one component of an event tuple at each timestep, representing sequences of events as sequences of event components. Standard methods for learning RNN sequence models are applied to learning statistical models of sequences of event components. To infer probable unobserved events from documents, we input observed document events in sequence, one event component per timestep, and then search over the components of a next event to be inferred using a beam search. That is, the structured prediction problem of event inference is reduced to searching over probable RNN output sequences. This is similar in spirit to a number of recent systems using RNN models for structured prediction (Vinyals et al., 2015; Luong et al., 2016; Dong and Lapata, 2016).
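A schematic version of that inference step is sketched below: given any model exposing next-component log-probabilities, a beam search extends candidate events one component at a time. The toy next_log_probs stands in for the trained LSTM and is our own illustration.

```python
import math

def next_log_probs(prefix):
    """Stand-in for the trained LSTM: log P(next component | prefix).
    A uniform toy distribution; in practice this would query the RNN state."""
    vocab = ["get_off", "smith", "plane", "at_airport", "NONE"]
    return {w: math.log(1.0 / len(vocab)) for w in vocab}

def beam_search(observed, n_components=4, beam_width=5):
    """Search for a probable next event, built one component per step."""
    beams = [(0.0, [])]  # (cumulative log prob, components inferred so far)
    for _ in range(n_components):
        candidates = []
        for score, comps in beams:
            for comp, logp in next_log_probs(observed + comps).items():
                candidates.append((score + logp, comps + [comp]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

observed = ["fly", "smith", "NONE", "to_beijing"]  # components of an observed event
print(beam_search(observed)[0])  # highest-scoring inferred event tuple
```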

While the count-based event co-occurrence system we investigated in Pichotta and Mooney (2014) treats events as atomic (for example, the plane flew and the plane flew over land are unrelated events with completely independent statistics), this method decomposes events into components, and the two occurrences of the verb flew in the above examples have the same representation. Further, a low-dimensional embedding is learned for every event component, so flew and soared can get similar representations, allowing for generalization beyond the lexical level. Given the combinatorial number of event types, [1] decomposing structured events into components, rather than treating them as atomic, is crucial to scaling up the number of events a script system is capable of inferring. In fact, the system presented in Pichotta and Mooney (2014) does not use noun information about event arguments for this reason, instead using only coreference-based entity information.

[1] With a vocabulary of V verb types, N noun types, P preposition types, and event tuples of arity k, there are about V · P · N^(k-1) event types. For V = N = 10000, P = 50, and k = 4, this is 5 × 10^17.
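The footnote's estimate is easy to check directly:

```python
V = N = 10_000   # verb and noun vocabulary sizes
P = 50           # preposition types
k = 4            # event tuple arity
print(V * P * N ** (k - 1))   # 500000000000000000, i.e. 5 * 10**17 event types
```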

System  | Recall at 25 | Human
Unigram | 0.101        | -
Bigram  | 0.124        | 2.21
LSTM    | 0.152        | 3.67

Table 1: Next event prediction results in Pichotta and Mooney (2016a). Partial credit is out of 1, and human evaluations are out of 5 (higher is better for both). More results can be found in the paper.

Table 1 gives results comparing a naive baseline ("Unigram," which always deterministically guesses the most common events), a co-occurrence-based baseline ("Bigram," similar to the system of Pichotta and Mooney (2014)), and the LSTM system. The metric "Recall at 25" holds an event out from a test document and judges a system by its recall of the gold-standard event in its list of top 25 inferences. The "Human" metric is average crowdsourced judgments of inferences on a scale from 0 to 5, with some post hoc quality-control filtering applied. The LSTM system outperforms the other systems. More results and details can be found in Pichotta and Mooney (2016a).
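A minimal, exact-match sketch of the "Recall at 25" computation is shown below (the published metric additionally awards partial credit for partially correct events; all names here are illustrative):

```python
def recall_at_k(gold_events, ranked_inferences, k=25):
    """Fraction of held-out gold events appearing in the system's top-k inference list."""
    hits = sum(1 for gold, inferred in zip(gold_events, ranked_inferences)
               if gold in inferred[:k])
    return hits / len(gold_events)

# Toy usage: one held-out event per document, plus each system's ranked guesses.
gold = [("get_off", "smith", "plane", ("at", "airport"))]
guesses = [[("board", "smith", "plane", None),
            ("get_off", "smith", "plane", ("at", "airport"))]]
print(recall_at_k(gold, guesses))  # 1.0
```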

These results indicate that RNN sequence models can be fruitfully applied to the task of predicting held-out events from text, by modeling and inferring events comprising a subset of the document's syntactic dependency structure. This naturally raises the question of to what extent, within the current regime of event-inferring systems trained on documents, explicit syntactic dependencies are necessary as a mediating representation. In Pichotta and Mooney (2016b), we compare event RNN models, of the sort described above, with RNN models that operate at the raw text level. In particular, we investigate the performance of a text-level sentence encoder/decoder similar to the skip-thought system of Kiros et al. (2015) on the task. In this setup, during inference, instead of encoding events and decoding events, we encode raw text, decode raw text, and then parse inferred text to get its dependency structure.2 This system does not obviously encode event co-occurrence structure in the way that the previous one does, but can still in principle infer implicit events from text, and does not require a parser (and can therefore be used for low-resource languages).

2 We use the Stanford dependency parser (Socher et al., 2013).

System       Accuracy    BLEU    1G P
Unigram      0.002       -       -
Copy/paste   -           1.88    22.6
Event LSTM   0.023       0.34    19.9
Text LSTM    0.020       5.20    30.9

Table 2: Prediction results in Pichotta and Mooney (2016b). More results can be found in the paper.

Table 2 gives a subset of results from Pichotta and Mooney (2016b), comparing an event LSTM with a text LSTM. The "Copy/paste" baseline deterministically predicts a sentence as its own successor. The "Accuracy" metric measures what percentage of argmax inferences were equal to the gold-standard held-out event. The "BLEU" column gives BLEU scores (Papineni et al., 2002) for raw text inferred by systems (either directly, or via an intermediate text-generation step in the case of the Event LSTM output). The "1G P" column gives unigram precision against the gold standard, which is one of the components of BLEU. Figure 1, reproduced from Pichotta and Mooney (2016b), gives some example next-sentence predictions. Despite the fact that it is very difficult to predict the next sentence in natural text, the text-level encoder/decoder system is capable of learning some aspects of event co-occurrence structure in documents.
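For readers unfamiliar with the "1G P" column, clipped unigram precision can be computed as follows. This is a generic illustration of the metric, not the scoring script used in the paper.

from collections import Counter

def unigram_precision(hypothesis_tokens, reference_tokens):
    """Clipped unigram precision: matched hypothesis tokens / hypothesis length.
    This is the 1-gram precision component of BLEU (without the brevity penalty)."""
    hyp_counts = Counter(hypothesis_tokens)
    ref_counts = Counter(reference_tokens)
    matched = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    return matched / max(len(hypothesis_tokens), 1)

hyp = "the company was renamed".split()
ref = "the company was renamed panasonic corporation".split()
print(unigram_precision(hyp, ref))  # 1.0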

These results indicate that modeling text directly does not appear to appreciably harm the ability to infer held-out events, and greatly helps in inferring held-out text describing those events.

3 Related Work

There are a number of related lines of research investigating different approaches to statistically modeling event co-occurrence. There is, first of all, a body of work investigating systems which infer events from text (including the above work). Chambers and Jurafsky (2008) give a method of modeling and inferring simple (verb, dependency) pair-events. Jans et al. (2012) describe a model of the same sorts of events which gives superior performance on the task of held-out event prediction; Rudinger et al. (2015) follow this line of inquiry, concluding that the task of inferring held-out (verb, dependency) pairs from documents is best handled as a language modeling task.

Second, there is a body of work focusing on automatically inducing structured collections of events (Chambers, 2013; Cheung et al., 2013; Nguyen et al., 2015; Ferraro and Van Durme, 2016), typically motivated by Information Extraction tasks.

Third, there is a body of work investigating high-precision models of situations as they occur in the world (as opposed to how they are described in text) from smaller corpora of event sequences (Regneri et al., 2010; Li et al., 2012; Frermann et al., 2014; Orr et al., 2014).

Fourth, there is a recent body of work investigating the automatic induction of event structure in different modalities. Kim and Xing (2014) give a method of modeling sequences of images from ordered photo collections on the web, allowing them to perform, among other things, sequential image prediction. Huang et al. (2016) describe a new dataset of photos in temporal sequence scraped from web albums, along with crowdsourced story-like descriptions of the sequences (and methods for automatically generating the latter from the former). Bosselut et al. (2016) describe a system which learns a model of prototypical event co-occurrence from online photo albums with their natural language captions. Incorporating learned event co-occurrence structure from large-scale natural datasets of different modalities could be an exciting line of future research.

Finally, there are a number of alternative ways of evaluating learned script models that have been proposed. Motivated by the shortcomings of evaluation via held-out event inference, Mostafazadeh et al. (2016) recently introduced a corpus of crowdsourced short stories with plausible "impostor" endings alongside the real endings; script systems can be evaluated on this corpus by their ability to discriminate the real ending from the impostor one. This corpus is not large enough to train a script system, but can be used to evaluate a pre-trained one. Hard coreference resolution problems (so-called "Winograd schema challenge" problems (Rahman and Ng, 2012)) provide another possible alternative evaluation for script systems.

Input: As of October 1, 2008, 〈OOV〉 changed its company name to Panasonic Corporation.
Gold: 〈OOV〉 products that were branded "National" in Japan are currently marketed under the "Panasonic" brand.
Predicted: The company's name is now 〈OOV〉.

Input: White died two days after Curly Bill shot him.
Gold: Before dying, White testified that he thought the pistol had accidentally discharged and that he did not believe that Curly Bill shot him on purpose.
Predicted: He was buried at 〈OOV〉 Cemetery.

Input: The foundation stone was laid in 1867.
Gold: The members of the predominantly Irish working class parish managed to save £700 towards construction, a large sum at the time.
Predicted: The 〈OOV〉 was founded in the early 20th century.

Input: Soldiers arrive to tell him that 〈OOV〉 has been seen in camp and they call for his capture and death.
Gold: 〈OOV〉 agrees.
Predicted: 〈OOV〉 is killed by the 〈OOV〉.

Figure 1: Examples of next-sentence text predictions, reproduced from Pichotta and Mooney (2016b). 〈OOV〉 is the out-of-vocabulary pseudo-token, which frequently replaces proper names.

4 Future Work and Conclusion

The methods described above were motivated by the utility of event inferences based on world knowledge, but, in order to leverage large text corpora, actually model documents rather than scenarios in the world per se. That is, this work operates under the assumption that modeling event sequences in documents is a useful proxy for modeling event sequences in the world. As mentioned in Section 3, incorporating information from multiple modalities is one possible approach to bridging this gap. Incorporating learned script systems into other useful extrinsic evaluations, for example coreference resolution or question answering, is another.

For the task of inferring verbs and arguments explicitly present in documents, as presented above, we have described some evidence that, in the context of standard RNN training setups, modeling raw text yields fairly comparable performance to explicitly modeling syntactically mediated events. The extent to which this is true for other extrinsic tasks is an empirical issue that we are currently exploring. Further, the extent to which representations of more complex event properties (such as those hand-encoded in Schank and Abelson (1977)) can be learned automatically (or happen to be encoded in the learned embeddings and dynamics of neural script models) is an open question.

Acknowledgments

This research was supported in part by the DARPA DEFT program under AFRL grant FA8750-13-2-0026.

References

Niranjan Balasubramanian, Stephen Soderland, Mausam, and Oren Etzioni. 2013. Generating coherent event schemas at scale. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP-2013).

Antoine Bosselut, Jianfu Chen, David Warren, Hannaneh Hajishirzi, and Yejin Choi. 2016. Learning prototypical event structure from photo albums. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-16).

Nathanael Chambers and Daniel Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL-08), pages 789–797.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-09), pages 602–610.

Nathanael Chambers. 2013. Event schema induction with a probabilistic entity-driven model. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP-2013).

Jackie Chi Kit Cheung, Hoifung Poon, and Lucy Vanderwende. 2013. Probabilistic frame induction. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-13).

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-16).

Francis Ferraro and Benjamin Van Durme. 2016. A unified Bayesian model of scripts, frames and language. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16).

Lea Frermann, Ivan Titov, and Manfred Pinkal. 2014. A hierarchical Bayesian model for unsupervised induction of script knowledge. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL-14), pages 49–57.

Mark Granroth-Wilding and Stephen Clark. 2016. What happens next? Event prediction using a compositional neural network model. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, Larry Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016. Visual storytelling. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-16).

Bram Jans, Steven Bethard, Ivan Vulic, and Marie-Francine Moens. 2012. Skip n-grams and ranking functions for predicting script events. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL-12), pages 336–344.

Gunhee Kim and Eric P. Xing. 2014. Reconstructing storyline graphs for image recommendation from web community photos. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR-14).

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS-15).

Boyang Li, Stephen Lee-Urban, Darren Scott Appling, and Mark O. Riedl. 2012. Crowdsourcing narrative intelligence. Advances in Cognitive Systems, 2:25–42.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In Proceedings of the 4th International Conference on Learning Representations (ICLR-16).

Ashutosh Modi and Ivan Titov. 2014. Inducing neural models of script knowledge. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL-2014), Baltimore, MD, USA.

Raymond J. Mooney and Gerald F. DeJong. 1985. Learning schemata for natural language processing. In Proceedings of the Ninth International Joint Conference on Artificial Intelligence (IJCAI-85), pages 681–687.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-16).

Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret, and Romaric Besancon. 2015. Generative event schema induction with entity disambiguation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL-15).

J. Walker Orr, Prasad Tadepalli, Janardhan Rao Doppa, Xiaoli Fern, and Thomas G. Dietterich. 2014. Learning scripts as Hidden Markov Models. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI-14).

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), pages 311–318.

Karl Pichotta and Raymond J. Mooney. 2014. Statistical script learning with multi-argument events. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), pages 220–229.

Karl Pichotta and Raymond J. Mooney. 2016a. Learning statistical scripts with LSTM recurrent neural networks. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16).

Karl Pichotta and Raymond J. Mooney. 2016b. Using sentence-level LSTM language models for script inference. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-16), Berlin, Germany.

Altaf Rahman and Vincent Ng. 2012. Resolving complex cases of definite pronouns: the Winograd schema challenge. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL-12), pages 777–789.

Michaela Regneri, Alexander Koller, and Manfred Pinkal. 2010. Learning script knowledge with web experiments. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10), Uppsala, Sweden, July.

Rachel Rudinger, Pushpendre Rastogi, Francis Ferraro, and Benjamin Van Durme. 2015. Script induction as language modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP-15).

Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum and Associates.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL-13).

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS-15), pages 2755–2763.


Moving away from semantic overfitting in disambiguation datasets

Marten Postma and Filip Ilievski and Piek Vossen and Marieke van Erp
Vrije Universiteit Amsterdam

m.c.postma,f.ilievski,piek.vossen,[email protected]

Abstract

Entities and events in the world have no frequency, but our communication about them and the expressions we use to refer to them do have a strong frequency profile. Language expressions and their meanings follow a Zipfian distribution, featuring a small number of very frequent observations and a very long tail of low-frequency observations. Since our NLP datasets sample texts but do not sample the world, they are no exception to Zipf's law. This causes a lack of representativeness in our NLP tasks, leading to models that can capture the head phenomena in language, but fail when dealing with the long tail. We therefore propose a referential challenge for semantic NLP that reflects a higher degree of ambiguity and variance and captures a large range of small real-world phenomena. To perform well, systems would have to show deep understanding of the linguistic tail.

1 Introduction

Semantic processing addresses the relation between natural language and a representation of a world, to which language makes reference. A challenging property of this relation is the context-bound complex interaction between lexical expressions and world meanings.1 Like many natural phenomena, the distribution of expressions and their meanings follows a power law such as Zipf's law (Newman, 2005), with a few very frequent observations and a very long tail of low-frequency observations.2 Still, the world itself has no frequency. All entities and events in the world appear to us with a frequency of 1. Nevertheless, we dominantly talk about only a few instances in the world and refer to them with a small set of expressions, which can only be explained by the contextual constraints within a language community, a topic, a location, and a period of time. Without taking these into account, it is impossible to fully determine meaning.

1 We use meaning as an umbrella term for both concepts or lexical meanings and instances or entities, and lexical expression as a common term for both lemmas and surface forms.

Given that instances in the world do not have frequencies, language and our writing about the world are heavily skewed, selective, and biased with respect to that world. A name such as Ronaldo can have an infinite number of referents, and in any world (real or imaginary) each Ronaldo is equally present. Our datasets, however, usually make reference to only one Ronaldo. The problem, as we see it, is that our NLP datasets sample texts but do not sample the world. This causes a lack of representativeness in our NLP tasks, which has big consequences for language models: they tend to capture the head phenomena in text without considering the context constraints and thus fail when dealing with less dominant world phenomena. As a result, there is little awareness of the full complexity of the task in relation to the contextual realities, given language as a system of expressions and the possible interpretations within contexts of time, location, community, and topic. People, however, have no problem handling local real-world situations that are referenced in text.

We believe it is time to create a task that encourages systems to model the full complexity of disambiguation by enriched context awareness. We hence propose a semantic referential challenge, event-based Question Answering (QA), that reflects a high degree of ambiguity and variance and captures a wide range of small real-world phenomena. This task requires a deeper semantic understanding of the linguistic tail of several disambiguation challenges.

2 We acknowledge that there also exist many long tail phenomena in syntactic processing, e.g. syntactic parsing.

2 Related work

We present related work on the representativeness of disambiguation datasets (Section 2.1), as well as the representativeness of the QA task (Section 2.2).

2.1 The Long Tail in disambiguation tasks

Past work has tried to increase the complexity of disambiguation datasets. Vossen et al. (2013) created a balanced corpus, DutchSemCor, for the Word Sense Disambiguation (WSD) task in which each sense gets an equal number of examples. Guha et al. (2015) created the QuizBowl dataset for Entity Coreference (EnC), while Cybulska and Vossen (2014) extended the existing Event Coreference (EvC) dataset ECB to ECB+, both efforts resulting in notably greater ambiguity and temporal diversity. Although all these datasets increase the complexity of disambiguation, they still contain a limited amount of data, which is far from approximating realistic tasks.

Properties of existing disambiguation datasets have been examined for individual tasks. For WSD, the correct sense of a lemma is shown to often coincide with the most frequent sense (Preiss, 2006). Van Erp et al. (2016) conclude that Entity Linking (EL) datasets contain very little referential ambiguity and focus on well-known entities, i.e. entities with high PageRank (Page et al., 1999) values. Moreover, the authors note a considerable overlap of entities across datasets. Cybulska and Vossen (2014) and Guha et al. (2015) both stress the low ambiguity in the current datasets for the tasks of EvC and EnC.

In Ilievski et al. (2016), we measure the properties of existing disambiguation datasets for the tasks of EL, WSD, EvC, EnC, and Semantic Role Labeling (SRL), through a set of generic representation metrics applicable across tasks. The analyzed datasets show a notable bias with respect to aspects of ambiguity, variance, dominance, and time, thus exposing a strong semantic overfitting to a very limited, and within that, popular part of the world.

The problem of overfitting to a limited set of test data has been addressed by the field of domain adaptation (Daume III, 2007; Carpuat et al., 2013; Jiang and Zhai, 2007). In addition, unsupervised domain-adversarial approaches attempt to build systems that generalize beyond the specifics of a given dataset, e.g. by favoring features that apply to both the source and target domains (Ganin et al., 2016). By evaluating on another domain than the training one, these efforts have provided valuable insights into system performance. Nevertheless, this research has not addressed the aspects of time and location. Moreover, to our knowledge, no approach has been proposed to generalize the problem of reference to unseen domains, which may be due to the enormous number of references that exist in the world, leading to an almost infinite number of possible classes.

2.2 The Long Tail in QA tasks

The sentence selection datasets WikiQA (Yang et al., 2015) and QASent (Wang et al., 2007) consist of questions that are collected from validated user query logs, while the answers are annotated manually from automatically selected Wikipedia pages. WIKIREADING (Hewlett et al., 2016) is a recent large-scale dataset that is based on the structured information from Wikidata (Vrandecic and Krotzsch, 2014) and the unstructured information from Wikipedia. Following a smart, fully-automated data acquisition strategy, this dataset contains questions about 884 properties of 4.7 million instances. While these datasets require semantic text processing of the questions and the candidate answers, there is a finite set of answers, many of which represent popular interpretations from the world, as a direct consequence of using Wikipedia. To our knowledge, no QA task has been created to deliberately address the problem of (co)reference to long tail instances, where the list of potential interpretations is enormous, largely ambiguous, and only relevant within a specific context. The long tail aspect could be emphasized by an event-driven QA task, since the referential ambiguity of events in the world is much higher than the ambiguity of entities. No event-driven QA task has been proposed in past work. As Wikipedia only represents a tiny and popular subset of all world events, the Wikipedia-based approaches could not be applied to create such a task, thus signaling the need for a novel data acquisition approach to create an event-driven QA task for the long tail.

Weston et al. (2015) propose 20 skill sets for a comprehensive QA system. This work presents reasoning categories (e.g. spatial and temporal reasoning) and requires within-document coreference, which are very relevant skills for understanding linguistic phenomena of the long tail. However, the answer in these tasks is usually mentioned in the text, thus not addressing the referential complexity of the long tail phenomena in the world.

3 Moving away from semantic overfitting

Current datasets only cover a small portion of the full complexity of the disambiguation task, focusing mostly on the head. This has encouraged systems to overfit on the head and largely ignore the linguistic tail. Due to this lack of representativeness, we are not able to determine to what degree systems achieve language understanding of the long tail.

As described in the previous section, the challenge of semantic overfitting has been recognized by past work. QA datasets, such as WIKIREADING, have increased the complexity of interpretation by using a large number of entities and questions, also allowing subsets to be sampled to tackle specific tasks. The skill sets presented by Weston et al. (2015) include long tail skills that are crucial in order to interpret language in various micro-contexts. None of these approaches has yet created a task that addresses the long tail explicitly and recognizes the full referential complexity of disambiguation. Considering the competitiveness of the field, such a task is necessary to motivate systems that can deal with the long tail and adapt to new contexts.

We therefore advocate a task that requires deep semantic processing linked to both the head and the long tail. It is time to create a high-level referential challenge for semantic NLP that reflects a higher degree of ambiguity and variation and captures a wide range of small real-world phenomena. This task cannot be solved by only capturing the head phenomena of the disambiguation tasks in any sample text collection. For maximum complexity, we propose an event-driven QA task that also represents local events, thus capturing phenomena from both the head and the long tail. Also, these events should be described across multiple documents that exhibit a natural topical spread over time, providing information bit-by-bit as it becomes available.

4 Task requirements

We define five requirements that should be satisfied by an event-driven QA task in order to maximize confusability, to challenge systems to deal with the tail of the Zipfian distribution, and to adapt to new contexts. These requirements apply to a single event topic, e.g. murder. Each event topic should contain:

R1 Multiple event instances per event topic, e.g. the murder of John Doe and the murder of Jane Roe.
R2 Multiple event mentions per event instance within the same document.
R3 Multiple documents with varying document creation times in which the same event instances are described, to capture topical information over time.
R4 Event confusability, by combining one or multiple confusion factors:
  a) ambiguity of event surface forms, e.g. John Smith fires a gun, and John Smith fires an employee.
  b) variance of event surface forms, e.g. John Smith kills John Doe, and John Smith murders John Doe.
  c) time, e.g. murder A that happened in January 1993, and murder B in October 2014.
  d) participants, e.g. murder A committed by John Doe, and murder B committed by the Roe couple.
  e) location, e.g. murder A that happened in Oklahoma, and murder B in Zaire.
R5 Representation of non-dominant events and entities, i.e. instances that receive little media coverage. Hence, the entities would not be restricted to celebrities, and the events not to general elections.

A toy encoding of these requirements as a single data record is sketched below.
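The following sketch shows one way the requirements R1-R5 might be encoded as a data structure; the field names and example values are our own illustration, not a format proposed by the authors.

# A hypothetical record for a single event topic, mirroring requirements R1-R5.
event_topic = {
    "topic": "murder",
    "instances": [                                  # R1: multiple event instances
        {
            "id": "murder_john_doe",
            "participants": ["John Doe"],           # R4d: participant confusability
            "time": "1993-01",                      # R4c: time confusability
            "location": "Oklahoma",                 # R4e: location confusability
            "dominance": "low",                     # R5: little media coverage
            "documents": [                          # R3: varying creation times
                {
                    "doc_id": "local-news-001",
                    "creation_time": "1993-01-15",
                    "mentions": ["killed", "slain"],  # R2: several mentions per document
                },
                {
                    "doc_id": "local-news-014",
                    "creation_time": "1993-02-02",
                    "mentions": ["the murder"],
                },
            ],
        },
    ],
    "confusability_factors": ["ambiguity", "variance", "time", "participants", "location"],  # R4
}
print(len(event_topic["instances"]))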

5 Proposal

We propose a semantic task that represents the linguistic long tail. The task will consist of one high-level challenge (QA), for which an understanding of the long tail of several disambiguation tasks (EL, WSD, EvC, EnC) is needed in order to perform well on the high-level challenge. The QA task would feature two levels of event-oriented questions: instance-level questions (e.g. Who was killed last summer in Vienna?) and aggregation-level questions (e.g. How many white people have been poisoned in the last 2 years?). The setup would be such that the QA challenge could in theory be addressed without performing any disambiguation (e.g. using enhanced Information Retrieval), but deeper processing, especially on the disambiguation tasks, would be almost necessary in practice to be able to come up with the correct answers.

To some extent, the requirements in Section 4 are satisfied by an existing corpus, ECB+ (Cybulska and Vossen, 2014), which contains 43 event topics. For each event topic in the corpus, there are at least 2 different seminal events (R1). Since the corpus contains 7,671 intra-document coreference links, on average 7.8 per document, we can assume that requirement R2 is satisfied to a large extent. Although there are multiple news articles per event instance, they are not spread over time, which means that R3 is not satisfied. Furthermore, the event confusability factors (R4) are not fully represented, since the ambiguity and variance of the event surface forms and the participants are still very low, whereas the dominance is quite standard (R5), which is not surprising given that these aspects were not considered during the corpus assembly period. Additionally, only 1.8 sentences per document were annotated on average. Potential references in the remaining sentences need to be validated as well.

We will start with the ECB+ corpus and expand it by following an event topic-based strategy:3

1) Pick a subset of ECB+ topics, by favoring:
   a) seminal events (e.g. murder) whose surface forms have a low lexical ambiguity, but can be referred to by many different surface forms (execute, slay, kill);
   b) combinations of two or more seminal events that can be referred to by the same polysemous form (e.g. firing).
2) Select one or more confusability factors from R4, e.g. by choosing participants and variance. This step can be repeated for different combinations of the confusability factors.
3) Increase the number of events for an event topic (to satisfy R1). We add new events based on the confusability factors chosen in step 2 and from local news sources to ensure low dominance (R5). These events can come from different documents or the same document.
4) Retrieve multiple event mentions for each event based on the decision from the confusability factors (R4). We use local news sources to ensure low dominance (R5). They originate from the same document (R2) and from different documents with (slightly) different creation times (R3).

3 The same procedure can be followed for an entity-centric expansion approach.

In order to facilitate the expansion described in steps 3 and 4, we will add documents to ECB+ from The Signal Media One-Million News Articles Dataset (Signal1M) (Corney et al., 2016). This will assist in satisfying requirement R5, since the Signal1M Dataset is a collection of mostly local news. For the expansion, active learning will be applied on the Signal1M Dataset, guided by the decisions in step 2, to decide which event mentions are coreferential and which are not. By following our four-step acquisition strategy and by using the active learning method, we expect to obtain a high accuracy on EvC. As we do not expect perfect accuracy of EvC even within this smart acquisition, we will validate the active learning output. The validation will lead to a reliable set of events on a semantic level for which we would be able to pose both instance-level and aggregation-level questions, as anticipated earlier in this section. As the task of QA does not require full annotation of all disambiguation tasks, we would be able to avoid excessive annotation work.
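As a rough illustration of the four-step acquisition strategy combined with active learning, the sketch below shows the overall loop. The helper functions (pick_topics, candidate_mentions, model_uncertainty, human_label) are hypothetical placeholders with toy implementations, since the concrete tooling is not specified in the abstract.

import random

def pick_topics(ecb_plus_topics, confusability_factors):
    return [t for t in ecb_plus_topics if t["ambiguity"] == "low"]          # step 1

def candidate_mentions(topic, signal1m):
    return [d for d in signal1m if topic["name"] in d["text"].lower()]      # steps 3-4

def model_uncertainty(mention_pair):
    return random.random()            # stand-in for a coreference model's uncertainty

def human_label(mention_pair):
    return True                       # stand-in for manual validation

def expand_topic(topic, signal1m, budget=10):
    """Active-learning expansion: validate the pairs the model is least sure about."""
    mentions = candidate_mentions(topic, signal1m)
    pairs = [(a, b) for i, a in enumerate(mentions) for b in mentions[i + 1:]]
    pairs.sort(key=model_uncertainty, reverse=True)   # most uncertain first
    return [(a, b, human_label((a, b))) for a, b in pairs[:budget]]

topics = [{"name": "murder", "ambiguity": "low"}]
signal1m = [{"text": "A murder was reported in a small Oklahoma town."},
            {"text": "Police investigate the murder of a local farmer."}]
for topic in pick_topics(topics, ["participants", "variance"]):
    print(expand_topic(topic, signal1m))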

6 Conclusions

This paper addressed the issue of semantic overfitting to disambiguation datasets. Existing disambiguation datasets show a lack of representativeness and a bias towards the head interpretation, while largely ignoring the rich set of long tail phenomena. Systems are discouraged from considering the full complexity of the disambiguation task, since the main incentive lies in modelling the head phenomena. To address this issue, we defined a set of requirements that should be satisfied by a semantic task in order to inspire systems that can deal with the linguistic tail and adapt to new contexts. Based on these requirements, we proposed a high-level task, QA, that requires a deep understanding of each disambiguation task in order to perform well.


References

[Carpuat et al. 2013] Marine Carpuat, Hal Daume III, Katharine Henry, Ann Irvine, Jagadeesh Jagarlamudi, and Rachel Rudinger. 2013. Sensespotting: Never let your parallel data tie you to an old domain. In Proceedings of the Association for Computational Linguistics (ACL). Citeseer.

[Corney et al. 2016] David Corney, Dyaa Albakour, Miguel Martinez, and Samir Moussa. 2016. What do a million news articles look like? In Proceedings of the First International Workshop on Recent Trends in News Information Retrieval co-located with the 38th European Conference on Information Retrieval (ECIR 2016), Padua, Italy, March 20, 2016, pages 42–47.

[Cybulska and Vossen 2014] Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 4545–4552.

[Daume III 2007] Hal Daume III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007), pages 256–263, Prague, Czech Republic. Association for Computational Linguistics.

[Ganin et al. 2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35.

[Guha et al. 2015] Anupam Guha, Mohit Iyyer, Danny Bouman, Jordan Boyd-Graber, and Jordan Boyd. 2015. Removing the training wheels: A coreference dataset that entertains humans and challenges computers. In North American Association for Computational Linguistics (NAACL).

[Hewlett et al. 2016] Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. WikiReading: A novel large-scale language understanding task over Wikipedia. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1545. Association for Computational Linguistics.

[Ilievski et al. 2016] Filip Ilievski, Marten Postma, and Piek Vossen. 2016. Semantic overfitting: what 'world' do we consider when evaluating disambiguation of text? In Proceedings of COLING.

[Jiang and Zhai 2007] Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007), volume 7, pages 264–271.

[Newman 2005] Mark E. J. Newman. 2005. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323–351.

[Page et al. 1999] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: bringing order to the web.

[Preiss 2006] Judita Preiss. 2006. A detailed comparison of WSD systems: an analysis of the system answers for the Senseval-2 English all words task. Natural Language Engineering, 12(03):209–228.

[van Erp et al. 2016] Marieke van Erp, Pablo Mendes, Heiko Paulheim, Filip Ilievski, Julien Plu, Giuseppe Rizzo, and Jorg Waiterlonis. 2016. Evaluating entity linking: An analysis of current benchmark datasets and a roadmap for doing a better job. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

[Vossen et al. 2013] Piek Vossen, Ruben Izquierdo, and Atilla Gorog. 2013. DutchSemCor: in quest of the ideal sense-tagged corpus. In Proceedings of Recent Advances in Natural Language Processing (RANLP), pages 710–718. INCOMA Ltd. Shoumen, Bulgaria.

[Vrandecic and Krotzsch 2014] Denny Vrandecic and Markus Krotzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.

[Wang et al. 2007] Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In EMNLP-CoNLL, volume 7, pages 22–32.

[Weston et al. 2015] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merrienboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698.

[Yang et al. 2015] Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of EMNLP, pages 2013–2018. Citeseer.


Unsupervised Event Coreference for Abstract Words

Dheeraj Rajagopal and Eduard Hovy and Teruko Mitamura
Language Technologies Institute

Carnegie Mellon University
[email protected], [email protected], [email protected]

Abstract

We introduce a novel approach for resolving coreference when the trigger word refers to multiple (sometimes non-contiguous) clauses. Our approach is completely unsupervised, and our experiments show that Neural Network models perform much better (about 20% more accurate) than traditional feature-rich baseline models. We also present a new dataset for Biomedical Language Processing which, with only about 25% of the original corpus vocabulary, still captures the essential distributional semantics of the corpus.

1 Introduction

Event coreference is a key module in many NLP applications, especially those that involve multisentence discourse. Current event coreference systems restrict the problem to finding a correspondence between trigger words or phrases and their fully coreferent event (word or phrase). This approach is rather limited since it does not handle the case when the trigger refers to several events as a group, as in

We worked hard all our lives. But one year we went on vacation. There was boating, crazy adventure sports, and pro-golfing. We also spent time in the evenings strolling around the park. But eventually we had to go home. There couldn't have been a better vacation.

In this paper we generalize the idea of coreference to 3 levels based on the degree of abstraction of the coreference trigger:

1. Level 1 – Direct Mention: The trigger phrase is specific and usually matches the referring event(s) word-for-word or phrase-for-phrase.

2. Level 2 – Single Clause: While there is a similar word-to-phrase or word-to-word relationship as in level 1, the trigger is a more generic event compared to level 1.

3. Level 3 – Multiple Clauses: The trigger is quite generic and refers to a particular instance of an event that is described over multiple clauses or sentences (either contiguous or non-contiguous). Typically, the abstract event refers to a set of [sub]events, each of them with its own participants or arguments.

See Table 1 for examples. We use PubMed1 as our primary corpus.

Almost all work on event coreference (for example, (Liu et al., 2014), (Lee et al., 2012)) applies to levels 1 or 2. In this paper, we propose a generalized coreference classification scheme and address the challenges related to resolving level-3 coreferences.

1 http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/

level 1: In turn the activated ERK phosphorylates Stim1 at serine 575, and this phosphorylation enhances complex formation of Stim1
level 2: (a) BMI1 enhances hTERT activity. (b) This effect was attenuated by PTEN, PTEN (CS), PTEN (GE), and C-PTEN.
level 3: (a) To determine whether these clumps were also associated with the cell cortex, we used confocal microscopy. (b) The actin clumps were found associated with the cell cortex in only a minority of cases (Fig. 4). (c) Immuno-EM using anti-actin antibodies has verified this observation

Table 1: Examples of various levels of coreference (triggers are underlined and referents indicated in bold)

Creating gold-standard training and evaluation materials for such coreferences is an uphill challenge. First, there is a significant annotation overhead and, depending on the nature of the corpus, the annotator might require significant domain knowledge. Each annotation instance might require multiple labels depending on the number of abstract events mentioned in the corpus. Second, the vocabulary of the corpus is rather large due to domain-related named entities like proteins, cell types, DNA and RNA names. The large vocabulary size necessitates longer and sparser vectors for representing the documents, resulting in significant data sparsity. Last, evaluating such a system in an unsupervised setting usually leads to debatable justifications for evaluating the models. We address these challenges in the following ways:

1. We construct a new dataset, derived from the PubMed corpus, by replacing all named entities with their respective NE label. We normalize Proteins, Cell lines, Cell types, RNA and DNA names using the tagger described in (Tsuruoka et al., 2005). Also, we normalize all figure and table references to "internal link", and citations to other studies as "external link". This significantly reduces the vocabulary of the dataset.

2. We present an unsupervised model to represent abstract coreferences in text. We present multiple baseline systems using the traditional Bag-of-Words model and a Neural Network architecture that outperforms the baseline models.

3. We define a cloze-test evaluation method that requires no annotation. Our procedure stems from the following insight. Instead of starting with the coreference trigger word/phrase and asking "which clauses can refer to this?", we train an algorithm to predict, for a given clause, which trigger word/phrase it would 'prefer to' link to, and then apply this algorithm to [sequences of] clauses within the likely scope of reference of a trigger. An example is shown in Table 2. A similar idea was mentioned in (Hermann et al., 2015).

Passage: BAF57 has been shown to directly interact with the androgen and estrogen receptors. We used co-immunoprecipitation experiments to test whether BAF57 forms a complex with PR in cultured cells. In the absence of hormone, a certain proportion of BAF57 already coprecipitated with PR, probably due to the large proportion of PR molecules already present in the nucleus in the uninduced state; however 30 minutes after hormone addition the extent of coprecipitation was increased. In contrast, no complex of PR with the PBAF specific subunit, BAF180, was observed independently of the addition of the hormone. As a positive control for ABSTRACT COREF EVENT we used BAF250, a known BAF specific subunit.
Task: Predict ABSTRACT COREF EVENT from the list of all abstract events of interest
Answer: this experiment

Table 2: A sample cloze-test evaluation task
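To illustrate how such cloze items can be generated without annotation, the sketch below masks a known trigger phrase in a passage. The trigger list comes from the six labels used later in the paper; the masking function itself is our own illustration, not the authors' data-construction code.

TRIGGERS = ["these changes", "these responses", "this analysis",
            "this context", "this finding", "this observation"]

def make_cloze_item(passage, triggers=TRIGGERS, mask="ABSTRACT_COREF_EVENT"):
    """Replace the first trigger phrase found in the passage with a mask token."""
    lowered = passage.lower()
    for trigger in triggers:
        idx = lowered.find(trigger)
        if idx != -1:
            masked = passage[:idx] + mask + passage[idx + len(trigger):]
            return {"passage": masked, "answer": trigger}
    return None  # no trigger present, no cloze item

example = ("Immuno-EM using anti-actin antibodies has verified this observation, "
           "suggesting the clumps are cortical.")
print(make_cloze_item(example))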

2 Related Work

Entity coreference has been studied quite extensively. There are primarily two complementary approaches. The first focuses mainly on identifying entity mention clusters (see (Haghighi and Klein, 2009), (Raghunathan et al., 2010), (Ponzetto and Strube, 2006), (Rahman and Ng, 2011)). These models employ feature-rich approaches to improve the clustering models and are limited to noun pairs. The second focuses on jointly modeling mentions across all the entries in the document (see (Denis et al., 2007), (Poon and Domingos, 2008), (Wick et al., 2008) and (Lee et al., 2011)). Some more recent work uses event argument information to assist entity coreference; this includes (Rahman and Ng, 2011) and (Haghighi and Klein, 2010).

The distinct problem of Event Coreference has been relatively underexplored. Some earlier work in this area includes (Humphreys et al., 1997), but that work was very specific to selected events. More recently, there have been approaches to model event coreferences separately (Liu et al., 2014) as well as jointly with entities (Lee et al., 2012). All this work makes the limiting assumption of word/phrase-to-word/phrase coreference (levels 1 and 2 described earlier). Our work aligns with the event coreference literature but assumes longer spans of text and tackles the more challenging problem of abstract multi-event/clause coreference.

3 Model

Figure 1: Architecture Diagram for the Coreference Model

Let βi be the coreference word/phrase generated from a distribution parameterized by θ. Each βi generates antecedents A1..n (sentences that lead towards the coreference) that contain the coreferent span. These antecedents also obey a dependency relationship between two adjacent sentences in discourse. Since multi-clause coreference shows a distinct effect of recency, we also define a decay function D parameterized by α. The decay function D dictates how the level of association of each antecedent varies over increasing sentence distance.

3.1 Distributed Representation of Sentences

To simplify modeling complexity, we first ensure that all the antecedents are represented by vectors of the same dimension. We use the sentence2vec representation from (Le and Mikolov, 2014) to generate a 300-dimensional continuous distributed representation for each sentence in the PubMed corpus. These vectors are trained using gradient descent, with gradients obtained through back-propagation. This allows us to reduce the parameters that would have been necessary to model the number of words in each sentence. Table 3 shows some example events and their preferred coreference triggers.
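One common way to obtain such 300-dimensional sentence vectors is gensim's Doc2Vec, which implements the Le and Mikolov (2014) paragraph-vector model. This is a generic sketch, not the configuration used in the paper, and parameter names may differ slightly across gensim versions.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [
    "the activated erk phosphorylates stim1 at serine 575".split(),
    "this phosphorylation enhances complex formation of stim1".split(),
]
tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(sentences)]

# 300-dimensional vectors as in the paper; the other settings are assumptions.
model = Doc2Vec(tagged, vector_size=300, window=5, min_count=1, epochs=20)

vec = model.infer_vector("the actin clumps were associated with the cell cortex".split())
print(vec.shape)  # (300,)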

phosphorylation: phophorylation, phosporylation, phoshorylation, dephosphorylation, phosphorylations, Phosphorylation, autophosphorylation, phosphorilation, auto-phosphorylation, phosphorylated

ubiquitination: ubiquitylation, ubiquitinylation, polyubiquitination, poly-ubiquitination, SUMOylation, polyubiquitylation, deubiquitination, sumoylation, autoubiquitination, mono-ubiquitination

concluded: speculated, hypothesized, hypothesised, argued, surmised, conclude, postulated, noticed, noted, postulate

Table 3: Top trigger words (left) under Word2Vec similarity for sample events (right)

3.2 Multilayer Perceptron Model

The MultiLayer Perceptron (MLP) model is given by the function f : R^D → R^L, where D is the size of the input vector x and L is the size of the output vector f(x):

    f(x) = G( b^(2) + W^(2) s( b^(1) + W^(1) x ) )        (1)

For our model, the input dimensions are 300-dimensional sentence vectors. We define 6 classes (6distinct trigger words) for output. The antecedentsare represented using a single vector, composedfrom the N chosen input clauses, where we vary Nfrom 1 ato 5. For composition we currently use sim-ple average. We assume no decay currently. Thearchitecture diagram is shown in Figure 2.


Figure 2: Architecture Diagram for MLP

4 Experiments

The Cloze-test evaluation is inspired by the reading comprehension evaluation from Question-Answering research. In this evaluation, the system first reads a passage and then attempts to predict missing words from sentences that contain information from the passage. For our evaluation, we use a slightly modified version of the Cloze-test, in which the model is trained for each coreference with sentences that appear before and after the coreference. Currently, we arbitrarily limit the number of sentences in the antecedent and precedent span for coreference to 5. Also, we consider only 6 labels for now, namely these changes, these responses, this analysis, this context, this finding, this observation.

4.1 Experimental Setup

In order to maintain an even distribution of coreference candidates, we derived our dataset from the PubMed corpus by selecting 1000 samples of each of the 6 coreferent labels, for a total of 6000 training samples, each sample containing the coreference trigger. We pick antecedent sentences based on the following criterion: if the coreference occurs in the same paragraph, the antecedent sentences are limited to sentences from the start of the paragraph, or up to five antecedent sentence candidates otherwise. For the MLP model, we use a 70-30 train-test split and apply an early stopping criterion based on accuracy drop on the validation dataset.
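For orientation, the baselines in Table 4 correspond to standard scikit-learn classifiers. The sketch below is a generic version of such a comparison under a 70-30 split; the synthetic features and default hyperparameters are placeholders (the paper's baselines use bag-of-words features, and its exact settings are not given).

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-ins for the real features and the 6 trigger-class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 300))
y = rng.integers(0, 6, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

baselines = {
    "Linear SVM": LinearSVC(),
    "SGD Classifier": SGDClassifier(),
    "BernoulliNB": BernoulliNB(),
    "Random Forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "DecisionTree": DecisionTreeClassifier(),
}
for name, clf in baselines.items():
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))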

4.2 Results

Our results show that our MLP model outperforms all the other feature-rich baselines built on traditional classifiers. Although there is general skepticism around sentence vectors, our experiments show that RNN and LSTM models are suitable for the generalized coreference task.

Classifier        Accuracy
Linear SVM        0.436
SGD Classifier    0.39
BernoulliNB       0.349
Random Forest     0.34
AdaBoost          0.359
DecisionTree      0.286
MLP               0.62

Table 4: Results for various baselines and our work

Although we train using a window of N clauses together, during run-time we obtain the prediction for individual sentences rather than taking the average over a window. The label of each sentence or clause depends on the preference of its immediate neighbours, and how these sentences form a 'span', to arrive at a general 'consensus' label. This testing criterion can be further improved by using advanced similarity and coherence detection methods. For now, if the predicted class for that particular sentence is the same as the true label, then that sentence is labeled as part of the coreference.

5 Conclusion and Future Work

We presented a classification taxonomy that generalizes types of event coreference. We presented a model for unsupervised abstract coreference. We described a new dataset for biomedical text that is suitable for any generalized biomedical NLP task. Our Cloze-test evaluation method makes annotation unnecessary.

Since this is one of the first works to explore abstract event coreference, there is an uphill task of developing more principled approaches towards modeling and evaluation. We also plan to explore more sophisticated models for our architecture and to gain more insight into sentence vectors. Also, we plan to extend this idea of coreference to other data domains such as news corpora, and possibly to extend it to entity-coreference work as well.

Acknowledgments

This research was supported in part by U.S. Army Research Office (ARO) grant W911NF-14-1-0436 under DARPA's Big Mechanisms Program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, ARO, or the U.S. government.

References

Pascal Denis, Jason Baldridge, et al. 2007. Joint determination of anaphoricity and coreference resolution using integer programming. In HLT-NAACL, pages 236–243. Citeseer.

Aria Haghighi and Dan Klein. 2009. Simple coreference resolution with rich syntactic and semantic features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, pages 1152–1161. Association for Computational Linguistics.

Aria Haghighi and Dan Klein. 2010. Coreference resolution in a modular, entity-centered model. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 385–393. Association for Computational Linguistics.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.

Kevin Humphreys, Robert Gaizauskas, and Saliha Azzam. 1997. Event coreference for information extraction. In Proceedings of a Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts, pages 75–81. Association for Computational Linguistics.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196.

Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 28–34. Association for Computational Linguistics.

Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu, and Dan Jurafsky. 2012. Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 489–500. Association for Computational Linguistics.

Zhengzhong Liu, Jun Araki, Eduard H. Hovy, and Teruko Mitamura. 2014. Supervised within-document event coreference using information propagation. In LREC, pages 4539–4544.

Simone Paolo Ponzetto and Michael Strube. 2006. Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 192–199. Association for Computational Linguistics.

Hoifung Poon and Pedro Domingos. 2008. Joint unsupervised coreference resolution with Markov logic. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 650–659. Association for Computational Linguistics.

Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher Manning. 2010. A multi-pass sieve for coreference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 492–501. Association for Computational Linguistics.

Altaf Rahman and Vincent Ng. 2011. Coreference resolution with world knowledge. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 814–824. Association for Computational Linguistics.

Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Junichi Tsujii. 2005. Developing a robust part-of-speech tagger for biomedical text. In Panhellenic Conference on Informatics, pages 382–392. Springer.

Michael L. Wick, Khashayar Rohanimanesh, Karl Schultz, and Andrew McCallum. 2008. A unified approach for schema matching, coreference and canonicalization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 722–730. ACM.

26

Page 39: Scaling Early Achievements to Robust Methods Workshop ...List/Enumeration 11.7% Tracking, retaining, and list/enumeration of entities or states 78.6% 71.4% Mathematical operations

Proceedings of EMNLP 2016 Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods, pages 27–31, Austin, TX, November 5, 2016. ©2016 Association for Computational Linguistics

Towards Broad-coverage Meaning Representation: The Case of Comparison Structures

Omid Bakhshandeh
University of Rochester
[email protected]

James F. Allen
University of Rochester / [email protected]

1 Introduction

Representing the underlying meaning of text has been a long-standing topic of interest in computational linguistics. Recently there has been a renewed interest in computational modeling of meaning for various tasks such as semantic parsing (Zelle and Mooney, 1996; Berant and Liang, 2014). Open-domain and broad-coverage semantic representation (Banarescu et al., 2013; Bos, 2008; Allen et al., 2008) is essential for many language understanding tasks such as reading comprehension tests and question answering.

One of the most common ways of expressing evaluative sentiment towards different entities is to use comparison. Comparison can occur in very simple structures such as 'John is taller than Susan', or in more complicated constructions such as 'The table is longer than the sofa is wide'. So far, the computational semantics of comparatives, and how they affect the meaning of the surrounding text, has not been studied effectively. That is, the existing semantic and syntactic representations of comparatives have not been distinctive enough to enable deeper understanding of a sentence. For instance, the general logical form representation of the sentence 'John is taller than Susan' produced by the Boxer system (Bos, 2008) is the following:

named(x0, john, per) & named(x1, susan, nam) & than(taller(x0), x1)    (1)

Figure 1: The frame-semantic parsing of the sentence 'My Mazda drove faster than his Hyundai' (Self_motion frame, with Self_mover and Manner roles).

Clearly, the above meaning representation does not fully capture the underlying semantics of the adjective 'tall' and what it means to be 'taller'. A human reader can easily infer that the 'height' attribute of John is greater than Susan's. Capturing the underlying meaning of comparison structures, as opposed to their surface wording, is crucial for accurate evaluation of qualities and quantities. Consider a more complex comparison example, 'The pizza was great, but it was still worse than the sandwich'. The state-of-the-art sentiment analysis system (Manning et al., 2014) assigns an overall 'negative' sentiment value to this sentence, which clearly shows that it does not understand the comparison happening in the sentence.

As another example, consider the generic meaning representation of the sentence 'My Mazda drove faster than his Hyundai' according to frame-semantic parsing with the Semafor tool [1] (Das et al., 2014), as depicted in Figure 1. It is evident that this meaning representation does not fully capture how the semantics of the adjective fast relates to the driving event, and what it actually means for a car to drive faster than another car. More importantly, there is an ellipsis in this sentence, whose resolution yields the complete reading 'My Mazda drove faster than his Hyundai drove fast', which is in no way captured in Figure 1 [2].

[1] http://demo.ark.cs.cmu.edu/parse
[2] The same shortcomings are shared by other generic meaning representations such as the LinGO English Resource Grammar (ERG) (Flickinger, 2011), Boxer (Bos, 2008), or AMR (Banarescu et al., 2013), among others.


Although the syntax and semantics of comparison in language have been studied in linguistics for a long time (Bresnan, 1973; Cresswell, 1976; Von Stechow, 1984), computational modeling of the semantics of the comparison components of natural language has so far not been developed in a fundamental way. The lack of such a computational framework has left deeper understanding of comparison structures out of reach for existing NLP systems. In this paper we summarize our efforts on defining a joint framework for comprehensive semantic representation of comparison and ellipsis constructions. We jointly model comparison and ellipsis as inter-connected predicate-argument structures, which enables automatic ellipsis resolution. In the upcoming sections we summarize our main contributions to this topic.

2 A Comprehensive Semantic Framework for Comparison and Ellipsis

We introduce a novel framework for modeling the semantics of comparison and ellipsis as inter-connected predicate-argument structures. According to this framework, comparison and ellipsis operators are the predicates, where each predicate has a set of arguments called its semantic frame. For example, in the sentence '[Sam] is the tallest [student] [in the gym]', the morpheme -est is the comparison operator (hence, the comparison predicate) and the entities in brackets are the arguments.

2.1 Comparison Structures

2.1.1 Predicates

We consider two main categories of comparison predicates (Bakhshandeh and Allen, 2015; Bakhshandeh et al., 2016), Ordering and Extreme, each of which can grade any of four parts of speech: adjectives, adverbs, nouns, and verbs.

• Ordering: Shows the ordering of two or more entities on a scale, with the following subtypes:

– Comparatives, expressed by the morphemes more/-er and less, with '>' and '<' indicating that one degree is greater or lesser than the other.

(1) The steak is tastier than the potatoes.


Figure 2: An example predicate-argument structure consisting of the superlative predicate type and its corresponding semantic frame of arguments, for the sentence 'Joe is the most eager boy ever' (arguments: Figure, Scale, Domain, Domain-specifier).

– Equatives, expressed by as in constructions such as as tall or as much, with '≥' indicating that one degree equals or is greater than another.

(2) The Mazda drives as fast as the Nissan.

– Superlatives, expressed by most/-est and least, indicating that an entity or event has the 'highest' or 'lowest' degree on a scale.

(3) That chef made the best soup.

The details of the Extreme type can be found in the earlier work (Bakhshandeh and Allen, 2015; Bakhshandeh et al., 2016).

2.1.2 Arguments

Each predicate takes a set of arguments that we refer to as the predicate's 'semantic frame'. The following are the main arguments included in our framework:

– Figure (Fig): The main role which is being compared.
– Ground: The main role that Figure is compared to.
– Scale: The scale for the comparison, such as length, depth, or speed. For a more detailed study on scales, please refer to the work on learning adjective scales (Bakhshandeh and Allen, 2015).

Our framework also includes 'Standard', 'Differential', 'Domain', and 'Domain Specifier' argument types. Figure 2 shows an example meaning representation based on our framework.

2.2 Ellipsis Structures

As mentioned earlier in Section 1, resolving ellipsis in comparison structures is crucial for language understanding, and failure to do so delivers an incorrect meaning representation. In linguistics, various subtypes of elliptical constructions have been studied (Kennedy, 2003; Merchant, 2013; Yoshida et al., 2016). In our framework we mainly include six types which are seen in comparison structures (Bakhshandeh et al., 2016): 'VP-deletion', 'Stripping' [3], 'Pseudo-gapping', 'Gapping', 'Sluicing', and 'Subdeletion'. Ellipsis occurs most often in comparative and equative comparison constructions. A few examples of ellipsis in comparative constructions are as follows:

• Comparatives: The ellipsis site is the dependent clause headed by than. Four ellipsis possibilities for clauses resuming (4) are shown below. The elided material (set in subscript in the original) is shown here in square brackets.

(4) Mary drank more tea ...

– VP-deletion (aka 'Comparative Deletion'): ... than John did [drink coffee].

– Stripping (aka 'Phrasal Comparative'): ... than John [drank coffee].

– Gapping: ... than John, [drank how-much] coffee.

– Pseudogapping: ... than John did [drank] coffee.

Furthermore, we define three argument types for ellipsis, which help thoroughly construct the antecedent of the elided material by taking into account the existing words of the context sentence: Reference, Exclude, and How-much.

3 Data Collection Methodology

Given the new semantic representation, we aim at annotating corpora which then enable developing and testing models. The diversity and comprehensiveness of the comparison structures represented in our dataset depend on the genre of the sentences comprising it. Earlier, we had experimented with annotating semantic structures on the OntoNotes dataset (Bakhshandeh and Allen, 2015). Recently (Bakhshandeh et al., 2016), we have shifted our focus to actual product and restaurant reviews, which include many natural comparison instances. For this purpose we mainly use the Google English Web Treebank [4], which comes with gold constituency parse trees. We augment this dataset with the Movie Reviews dataset (Pang and Lee, 2005), where we use the Berkeley parser (Petrov et al., 2006) to obtain parse trees.

We trained linguists by asking them to read the semantic framework annotation manual summarized in Section 2.

[3] VP-deletion and stripping are the more frequent types.
[4] https://catalog.ldc.upenn.edu/LDC2012T13

Figure 3: The number of various predicate types across different resources.

             P           R           F1
ILP Model
  Average    0.72/0.78   0.91/0.97   0.76/0.80
Baseline
  Average    0.62/0.64   0.87/0.97   0.66/0.69

Table 1: Predicate prediction Precision (P), Recall (R) and F1 scores on the test set, averaged across all predicate types. Each cell contains scores according to the Exact/Head measurement.

The annotations were done via our interactive two-stage tree-based annotation tool. For this task, the annotations were done on top of constituency parse trees. This process yielded a total of 2,800 annotated sentences. Figure 3 visualizes the distribution of predicate types from the various resources. As this figure shows, reviews are indeed a very rich resource for comparisons, having more comparison instances than any other resource, even larger ones. There are a total of 5,564 comparison arguments in our dataset, with 'scale' and 'figure' being the majority types. The total number of ellipsis predicates is 240, with 197 Stripping, 31 VP-deletion and 12 Pseudo-gapping.

4 Predicting Semantic Structures

We model the prediction problem as a joint predicate-argument prediction of comparison and ellipsis structures. In a nutshell, we define a globally normalized model for the probability distribution of comparison and ellipsis labels over all parse tree nodes as follows:

p_C(c | v, T, θ_C) ∝ exp(f_C(c, T)^T θ_C)    (2)

p_{A_c}(a_c | c, e, v, T, θ_{a_c}) ∝ exp(f_{A_C}(c, e, T)^T θ_{a_c})    (3)

where T is the underlying constituency tree, p_C is the probability of assigning predicate type c, and p_{A_c} is the probability of assigning argument type a_c. In each of the above equations, f is the corresponding feature function. For predicates and arguments the main features are lexical and bigram features, among many others. θ_C, θ_E, and θ_{a_c} are the parameters of the log-linear model. We estimate these parameters using stochastic gradient descent.

          ILP Model (Exact/Head)            ILP No Constraints (Exact/Head)     Baseline (Exact/Head)
          P          R          F1          P          R          F1            P          R          F1
Average   0.37/0.61  0.54/0.87  0.43/0.71   0.01/0.01  0.86/1.00  0.10/0.10     0.20/0.42  0.36/0.73  0.25/0.52

Table 2: Results of argument prediction on the test set, averaged across various argument types.
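Returning to Equations (2) and (3): these are log-linear (softmax-style) scores over candidate labels. The following minimal sketch computes Equation (2) for a single tree node with invented feature vectors and weights; it is a per-node simplification of the globally normalized model, not the authors' code.

```python
import numpy as np

predicate_types = ["None", "Comparative", "Equative", "Superlative"]

def predicate_distribution(features: dict, theta: dict) -> dict:
    """p_C(c | node) proportional to exp(f_C(c, T) . theta_C), normalized over labels."""
    scores = np.array([features[c] @ theta[c] for c in predicate_types])
    exp_scores = np.exp(scores - scores.max())   # subtract max for numerical stability
    probs = exp_scores / exp_scores.sum()
    return dict(zip(predicate_types, probs))

# Invented 3-dimensional feature vectors (e.g. lexical/bigram indicators) and weights.
rng = np.random.default_rng(0)
features = {c: rng.random(3) for c in predicate_types}
theta = {c: rng.random(3) for c in predicate_types}
dist = predicate_distribution(features, theta)
print(max(dist, key=dist.get), dist)
```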

For inference, we model the problem as a structured prediction task. Given the syntactic tree of a sentence, for each node we first select the predicate type with the highest p_C. Then for each selected comparison predicate, we find the corresponding ellipsis predicate that has the highest p_E probability. We tackle the problem of argument assignment by Integer Linear Programming (ILP), where we pose domain-specific linguistic knowledge as constraints. Any specific comparison label calls for a unique set of constraints in the ILP formulation, which ensures the validity of predictions [5]. The details of this modeling can be found in earlier work (Bakhshandeh et al., 2016).

[5] For instance, the Superlative predicate type never takes any Ground arguments, and the argument Standard is only applicable to the excessive predicate type.
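To illustrate how a linguistically motivated constraint like the one in footnote [5] can be posed in an ILP, here is a minimal sketch using the PuLP solver (our own choice of solver, not necessarily the authors'); the candidate nodes and scores are invented for the example.

```python
import pulp

# Candidate constituency-tree nodes and invented scores for a Superlative predicate.
nodes = ["NP_Joe", "ADJP_eager", "NP_boy"]
arg_types = ["Figure", "Scale", "Ground"]
score = {
    ("NP_Joe", "Figure"): 2.1, ("NP_Joe", "Scale"): 0.1, ("NP_Joe", "Ground"): 1.5,
    ("ADJP_eager", "Figure"): 0.2, ("ADJP_eager", "Scale"): 2.4, ("ADJP_eager", "Ground"): 0.3,
    ("NP_boy", "Figure"): 0.4, ("NP_boy", "Scale"): 0.2, ("NP_boy", "Ground"): 1.0,
}

prob = pulp.LpProblem("argument_assignment", pulp.LpMaximize)
x = {(n, a): pulp.LpVariable(f"x_{n}_{a}", cat="Binary") for n in nodes for a in arg_types}

# Objective: total score of the selected (node, argument-type) assignments.
prob += pulp.lpSum(score[n, a] * x[n, a] for n in nodes for a in arg_types)

# Each node receives at most one argument label.
for n in nodes:
    prob += pulp.lpSum(x[n, a] for a in arg_types) <= 1
# Each argument type is assigned to at most one node.
for a in arg_types:
    prob += pulp.lpSum(x[n, a] for n in nodes) <= 1
# Predicate-specific linguistic constraint: Superlatives never take a Ground argument.
prob += pulp.lpSum(x[n, "Ground"] for n in nodes) == 0

prob.solve(pulp.PULP_CBC_CMD(msg=False))
assignment = {n: a for (n, a), var in x.items() if var.value() == 1}
print(assignment)   # e.g. {'NP_Joe': 'Figure', 'ADJP_eager': 'Scale'}
```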

5 Experimental Results

We trained our ILP model on the train-dev part of the dataset (70%) and tested on the test set (30%). Evaluation is done against the reference gold annotation, with Exact and partial (Head) credit for annotating the constituency nodes. We mainly report on two models: our comprehensive ILP model (detailed in Section 4) and a rule-based baseline. In short, the baseline encodes the same linguistically motivated ILP constraints via rules and uses a few pattern extraction methods for finding comparison morphemes.

The average results on predicate prediction (across all types) are shown in Table 1. As the results show, the scores for predicting the predicates are overall high; what is not shown here is that ellipsis predicates are the most challenging. The baseline is competitive, which shows that the linguistic patterns can capture many of the predicate types. Our model performs the poorest on Equatives, achieving 71%/73% F1 score; the equative is a complex morpheme used in various linguistic constructions. Our analysis shows that the errors are often due to inaccuracies in automatically generated parse trees (for example, on challenging long review sentences with informal language). As Table 2 shows, predicting arguments is a more demanding task. The baseline performs very poorly at predicting the arguments. Our comprehensive ILP model consistently outperforms the No Constraints model, showing the effectiveness of our linguistically motivated ILP constraints.

6 Conclusion

In this work we have summarized our efforts on an aspect of language with very rich semantics: comparison and ellipsis. The current tools and methodologies in the research community are not able to go beyond surface-level, shallow representations of comparison and ellipsis structures. We have developed a widely usable, comprehensive semantic theory of the linguistic content of comparison structures. Our representation is broad-coverage and domain-independent, and hence can be incorporated as part of any broad-coverage semantic parser (Banarescu et al., 2013; Allen et al., 2008; Bos, 2008) to augment its meaning representation.

Acknowledgment

We thank the anonymous reviewers for their comments. This work was supported in part by Grant W911NF-15-1-0542 with the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO).

References

James F. Allen, Mary Swift, and Will de Beaumont. 2008. Deep semantic analysis of text. In Proceedings of the 2008 Conference on Semantics in Text Processing, STEP '08, pages 343–354, Stroudsburg, PA, USA. Association for Computational Linguistics.

Omid Bakhshandeh and James Allen. 2015. Semantic framework for comparison structures in natural language. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 993–1002, Lisbon, Portugal, September. Association for Computational Linguistics.

Omid Bakhshandeh, Alexis Cornelia Wellwood, and James Allen. 2016. Learning to jointly predict ellipsis and comparison structures. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 62–74, Berlin, Germany, August. Association for Computational Linguistics.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria, August. Association for Computational Linguistics.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Association for Computational Linguistics (ACL).

Johan Bos. 2008. Wide-coverage semantic analysis with Boxer. In Johan Bos and Rodolfo Delmonte, editors, Semantics in Text Processing. STEP 2008 Conference Proceedings, Research in Computational Semantics, pages 277–286. College Publications.

Joan Bresnan. 1973. Syntax of the comparative clause construction in English. Linguistic Inquiry, 4(3):275–343.

Max Cresswell. 1976. The semantics of degree. In Barbara Hall Partee (ed.), pages 261–292.

Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguistics, 40(1):9–56.

Dan Flickinger. 2011. Accuracy vs. robustness in grammar engineering. In Emily M. Bender and Jennifer E. Arnold, editors, Language from a Cognitive Perspective: Grammar, Usage and Processing, number 201, pages 31–50. CSLI Publications, Stanford.

Christopher Kennedy. 2003. Ellipsis and syntactic representation. In Kerstin Schwabe and Susanne Winkler, editors, The Interfaces: Deriving and Interpreting Omitted Structures, number 61 in Linguistik Aktuell, pages 29–54. John Benjamins.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Jason Merchant. 2013. Voice and ellipsis. Linguistic Inquiry, 44(1):77–108.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL, pages 115–124.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 433–440, Stroudsburg, PA, USA. Association for Computational Linguistics.

Arnim Von Stechow. 1984. Comparing semantic theories of comparison. Journal of Semantics, 3(1):1–77.

Masaya Yoshida, Chizuru Nakao, and Iván Ortega-Santos. 2016. Ellipsis. Routledge Handbook of Syntax, London, UK.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2, AAAI'96, pages 1050–1055. AAAI Press.


Proceedings of EMNLP 2016 Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods, pages 32–34, Austin, TX, November 5, 2016. ©2016 Association for Computational Linguistics

DialPort: A General Framework for Aggregating Dialog Systems

Tiancheng Zhao and Maxine Eskenazi
Language Technologies Institute
Carnegie Mellon University
{tianchez,max+}@cs.cmu.edu

Kyusong Lee
Pohang University of Science and Technology
[email protected]

Abstract

This paper describes a new spoken dialog portal that connects systems produced by the spoken dialog research community and gives them access to real users. We introduce a prototype dialog framework that affords easy integration with various remote dialog agents as well as external knowledge resources. To date, the DialPort portal has successfully connected to two dialog systems and several public knowledge APIs. We present current progress and envision our future plan.

1 Introduction

Much fundamental research in the spoken dialog domain remains to be done, including adaptation for user modeling and management of complex dialogs. In recent years, there has been increasing interest in applying deep learning to modeling the process of human-computer conversation (Vinyals and Le, 2015; Serban et al., 2015; Wen et al., 2016; Williams and Zweig, 2016; Zhao and Eskenazi, 2016). One of the prerequisites for the success of these methods is having a large conversation corpus to train on. In order to advance research in these uphill areas of study with state-of-the-art data-driven methods, large corpora of multi-type real user dialogs are needed. At present, few existing large corpora cover a wide set of research domains. It is also extremely difficult for any one group to devote time to collecting and curating a significant amount of real user data. The users must be found and kept interested, and the interface must be created and maintained.

Our proposed solution is DialPort, a data gathering portal that groups various types of dialog systems, gives potential users a variety of interesting applications, and shares the collected data amongst all participating research groups. The connected dialog systems are not simply listed on a website. They are fully integrated into a single virtual agent. From the user's perspective, DialPort is a dialog system that can provide information in many domains, and it becomes increasingly more attractive as new research groups join and more functionalities become available to discover.

2 Challenges

Besides creating new corpora for advanced dialog research, DialPort encounters new research challenges.

• Advanced Dialog State Representation Learning: Traditional dialog states are represented as sets of symbolic variables that are related to a domain-specific ontology and are tracked by statistical methods (Williams et al., 2013). Such an approach soon becomes intractable if we want to capture all the essential dialog state features within nested multi-domain conversations, such as modeling user preferences and tracking discourse features. DialPort must address this challenge if it is to effectively serve as a portal to many systems.

• Dialog Policy that Combines Various Types of Agents: DialPort is powered by multiple dialog agents from research labs around the world. It differs from the traditional single dialog agent and requires new decision-making algorithms to judiciously switch amongst various systems while creating a homogeneous user experience.

• Dialog System Evaluation with Real Users: Evaluation has always been challenging for dialog systems because inexpensive methods (e.g., user simulators or recruited users) are often not accurate. The best evaluation, with real users, is costly. DialPort will create streams of real user data, which opens the possibility of developing a principled evaluation framework for dialog systems.

3 Proposed Approach

The prototype DialPort system includes the user interface, remote agents/resources, and the master agent.

3.1 User Interface

The user interface is the public front-end [1]. The audio interface uses web-based ASR/TTS to recognize the user's speech and generate DialPort's speech output. The visual representation is a virtual agent that has animated embodiments powered by the Unity 3D Engine [2].

3.2 Remote Agents and Resources

A remote agent is a turn-based dialog system which takes the ASR text output of the latest turn as input and returns the next system response. Every external dialog system connecting to DialPort is treated as a remote agent. DialPort also deals with remote resources, which can be any external knowledge resource, such as a database of bus schedules. DialPort is in charge of all of the dialog processing and uses the remote resources as knowledge backends in the same way as a traditional goal-oriented SDS (Raux et al., 2005).

3.3 The Master Agent

The master agent operates on a set of remote agents U and a set of remote resources R. In order to serve information in R, the master agent has a set of primitive actions P, such as request or inform.

[1] https://skylar.speech.cs.cmu.edu
[2] unity3d.com/

Together, P ∪ U composes the available action set A for the master agent. The dialog state S is made up of the entire history of system output and user input and distributions over possible slot values. Given new inputs from the user interface, the master agent updates its dialog state and generates the next system response based on its policy, π : S → A, which chooses the most appropriate action a. One key note is that for a ∈ U it takes more than one turn to finish a session, i.e., a remote agent will usually span several turns with the user, while a ∈ P is a primitive action that spans only one turn. We therefore formulate the problem as a sequential decision-making problem in a Semi-Markov Decision Process (SMDP) (Sutton et al., 1999), so that a ∈ U is equivalent to a macro action. Thus, when DialPort hands over control to a remote agent, the user input is directly forwarded to the remote system until the session is finished by the remote side. Core research on DialPort is about how to construct an efficient representation for S and how to learn a good policy π.
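The hand-over behavior described above can be pictured as a simple dispatch loop. The sketch below is our own simplified illustration, not DialPort's implementation: the RemoteAgent class, the keyword policy, and the session-ending rule are invented stand-ins.

```python
from typing import Callable, Dict, List, Optional, Tuple

class RemoteAgent:
    """Hypothetical wrapper for an external dialog system (a macro action in U)."""
    def __init__(self, name: str):
        self.name = name
        self.in_session = False

    def respond(self, user_utterance: str) -> str:
        # A real remote agent would be called over the network; here we fake a reply
        # and end the session when the user says goodbye.
        self.in_session = "bye" not in user_utterance.lower()
        return f"[{self.name}] reply to: {user_utterance!r}"

def master_turn(state: List[str],
                user_utterance: str,
                remote_agents: Dict[str, RemoteAgent],
                policy: Callable[[List[str]], str],
                active: Optional[str]) -> Tuple[str, Optional[str]]:
    """One turn of the master agent: update the state, then either keep forwarding
    to the remote agent that holds the floor, or pick a new action a in P or U."""
    state.append(user_utterance)
    if active is not None and remote_agents[active].in_session:
        reply = remote_agents[active].respond(user_utterance)
        return reply, (active if remote_agents[active].in_session else None)
    action = policy(state)
    if action in remote_agents:                       # macro action: hand over control
        reply = remote_agents[action].respond(user_utterance)
        return reply, (action if remote_agents[action].in_session else None)
    return f"<primitive action: {action}>", None      # e.g. request(slot) or inform(slot)

# Toy run with a trivial keyword policy standing in for the learned policy.
agents = {"weather": RemoteAgent("weather")}
policy = lambda s: "weather" if "weather" in s[-1].lower() else "inform"
print(master_turn([], "what's the weather like?", agents, policy, None))
```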

3.4 Current Status

To date, DialPort has connected to two remote agents, the dialog system at Cambridge University (Gasic et al., 2015) and a chatbot, and to two remote resources: the Yelp food API and the NOAA (National Oceanic and Atmospheric Administration) Weather API.

4 Evaluation

Given the data collected by DialPort, assessment has several aspects. In order to create labeled data for the first two challenges mentioned in Section 2, we developed an annotation toolkit to label the correct system responses and state variables of the dialogs. The labeled data can then be used to build new models for advanced dialog state tracking and multi-agent dialog policy learning. We will also solicit subjective feedback from users after a session with the system.

5 Travel Funding

Two authors are, respectively, a PhD student and a postdoctoral researcher who need travel funding.


References

M. Gasic, Dongho Kim, Pirros Tsiakoulis, and Steve Young. 2015. Distributed dialogue policies for multi-domain statistical dialogue management. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5371–5375. IEEE.

Antoine Raux, Brian Langner, Dan Bohus, Alan W. Black, and Maxine Eskenazi. 2005. Let's go public! Taking a spoken dialog system to the real world. In Proc. of Interspeech 2005.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015. Building end-to-end dialogue systems using generative hierarchical neural network models. arXiv preprint arXiv:1507.04808.

Richard S. Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562, April.

Jason D. Williams and Geoffrey Zweig. 2016. End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning. arXiv preprint arXiv:1606.01269.

Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 404–413.

Tiancheng Zhao and Maxine Eskenazi. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. arXiv preprint arXiv:1606.02560.


Proceedings of EMNLP 2016 Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods, pages 35–38, Austin, TX, November 5, 2016. ©2016 Association for Computational Linguistics

C2D2E2: Using Call Centers to Motivate the Use of Dialog and Diarization in Entity Extraction

Kenneth Church, Weizhong Zhu and Jason Pelecanos
IBM, Yorktown Heights, NY, USA

{kwchurch, zhuwe, jwpeleca}@us.ibm.com

Abstract

This paper introduces a deceptively simple entity extraction task intended to encourage more interdisciplinary collaboration between fields that don't normally work together: diarization, dialog and entity extraction. Given a corpus of 1.4M call center calls, extract mentions of trouble ticket numbers. The task is challenging because first mentions need to be distinguished from confirmations to avoid undesirable repetitions. It is common for agents to say part of the ticket number, and customers confirm with a repetition. There are opportunities for dialog (given/new) and diarization (who said what) to help remove repetitions. New information is spoken slowly by one side of a conversation; confirmations are spoken more quickly by the other side of the conversation.

1 Extracting Ticket Numbers

Much has been written on extracting entities from text (Etzioni et al., 2005), and even speech (Kubala et al., 1998), but less has been written in the context of dialog (Clark and Haviland, 1977) and diarization (Tranter and Reynolds, 2006; Anguera et al., 2012; Shum, 2011). This paper describes a ticket extraction task illustrated in Table 1. The challenge is to extract a 7-byte ticket number, "902MDYK," from the dialog. Confirmations ought to improve communication, but steps need to be taken to avoid undesirable repetition in extracted entities. Dialog theory suggests it should be possible to distinguish first mentions (bold) from confirmations (italics) based on prosodic cues such as pitch, energy and duration.

t0       t1       utterance
278.16   281.07   I do have the new hardware case number for you when you're ready
282.60   282.85   okay
284.19   284.80   nine
285.03   285.86   zero
286.22   286.74   two
290.82   291.30   nine
292.87   293.95   zero two
297.87   298.24   okay
299.30   300.49   M. as in Mike
301.97   303.56   D. as in delta
304.89   306.31   Y. as in Yankee
307.50   308.81   K. as in kilo
310.14   310.57   okay
310.77   311.70   nine zero two
311.73   312.49   M. D.
312.53   313.18   Y. T.
313.75   314.21   correct
314.21   317.28   and thank you for calling IBM is there anything else I can assist you with

Table 1: A ticket dialog: 7 bytes (902MDYK) at 1.4 bps. First mentions (set in bold in the original, and placed in separate columns per speaker S1/S2) are slower than confirmations (italics).

phone matches   calls   ticket matches (edit dist)
66%             238     0
59%             82      1
55%             40      2
4.1%            4033    3+

Table 2: Phone numbers are used to confirm ticket matches. Good ticket matches (top row) are confirmed more often than poor matches (bottom row). Poor matches are more common because ticket numbers are relatively rare, and most calls don't mention them.


In Table 1, "zero two" was 55% slower the first time than the second (1.7 vs. 1.1 seconds).

Much of Table 1 was produced by machine, using tools that are currently available for public use, or will be available soon. Words came from ASR (automatic speech recognition) and speaker labels (S1 and S2) from diarization [1]. We plan to label bold and italics automatically, but for now, that was done by hand.

It is remarkable how hard it is to transmit ticket numbers. In this case, it takes 39 seconds to transmit 7 bytes, "902MDYK," a mere 1.4 bps (bits per second) [2]. Agents are well aware of the difficulty of the task. In Table 1, the agent says the first three digits slowly in citation form (more like isolated digits than continuous speech) (Moon, 1991). Citation form should be helpful, though in practice, ASR is trained on continuous speech, and consequently struggles with citation form.
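As a rough check of that figure (assuming 8 bits per character and no compression, cf. footnote [2]):

```latex
\frac{7\ \text{characters} \times 8\ \text{bits/character}}{39\ \text{s}}
  = \frac{56\ \text{bits}}{39\ \text{s}} \approx 1.4\ \text{bits per second}
```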

After a few back-and-forth confirmations, the customer confirms the first three digits with a backchannel (Ward and Tsukahara, 2000) "okay," enabling the agent to continue transmitting the last four bytes, "MDYK," slowly at a byte per second or less, using a combination of military and conventional spelling: "M. as in Mike," "D. as in delta," etc. When we discuss Figure 1, we will refer to this strategy as slow mode. If the agent were speaking to another agent, she would say, "Mike delta Yankee kilo," quickly with no intervening silences. We will refer to this strategy as fast mode.

Finally, the customer ends the exchange with another backchannel "okay," followed by a quick repetition of all 7 bytes. Again we see that first mentions (bold) take more time than subsequent mentions (italics). In Table 1, the bold first mention of "902MDYK" takes 12.1 = 286.74 − 284.19 + 308.81 − 299.30 seconds, which is considerably longer than the customer's confirmation in italics: 2.4 = 313.18 − 310.77 seconds.

Ticket numbers are also hard for machines. ASR errors don't help. For example, the final "K" in the final repetition was misheard by the machine as "T."

[1] The ASR tools are currently available for public use at https://www.ibm.com/watson/developercloud/text-to-speech.html, and diarization will be released soon.
[2] The estimate of 1.4 bps would be even worse if we included opportunities for compressing tickets to less than 7 bytes.

t0       transcript
344.01   and I do have a hardware case number whenever you're ready for it
348.86   hang on just one moment
353.65   okay go ahead that will be Alfa zero nine
358.18   the number two
359.85   golf Victor Juliet
363.55   I'm sorry what was after golf
366.46   golf and then V. as in Victor J. as in Juliet
370.28   okay
371.86   Alfa zero niner two golf Victor Juliet that is correct Sir you can't do anything else for today

Table 3: An example with a retransmission: 1.7 bits per second to transmit "A082GVJ".

After listening to the audio, it isn't clear if a human could get this right without context because the customer is speaking quickly with an accent. Nevertheless, the confirmation, "correct," makes it clear that the agent believes the dialog was successfully concluded and there is no need for additional confirmations/corrections. Although it is tempting to work on ASR errors forever, we believe there are bigger opportunities for dialog and diarization.

2 Communication Speed

The corpus can be used to measure factors that impact communication speed: given/new, familiarity, shared conventions, dialects, experience, corrections, etc. In Table 1, first mentions are slower than subsequent mentions. Disfluencies (Hindle, 1983) and corrections ("I'm sorry what was after golf") take even more time, as illustrated in Table 3.

Figure 1 shows that familiar phone numbers are quicker than less familiar ticket numbers, especially in slow mode, where each letter is expressed as a separate intonation phrase. Agents speed up when talking to other agents, and slow down for customers, especially when customers need more time. Agents have more experience than customers and are therefore faster.

Agents tend to use slow mode when speaking with customers, especially the first time they say the ticket number. Table 1 showed an example of slow mode. Fast mode tends to be used for confirmations, or when agents are speaking with other agents. Figure 1 shows that fast mode is faster than slow mode, as one would expect.


Figure 1: Time to say phone numbers and tickets (in seconds; categories: phone, ticket (fast), ticket (slow)), computed over a sample of 552 simple/robust matches. The plot shows that phone numbers are faster than ticket numbers. Ticket numbers are typically spoken in one of two ways, which we call fast mode and slow mode. The plot shows that fast mode is faster than slow mode, as one would expect.

Figure 1 gives an optimistic lower-bound view of times. The figure was computed over a small sample of 552 calls where simple (robust) matching methods were sufficient to find the endpoints of the match in the audio. Tables 1 and 3 demonstrate that total times tend to be much longer because of factors not included in Figure 1, such as prompts, confirmations and retransmissions.

Shared knowledge helps. Phone numbers are quicker than tickets because everyone knows their own phone number. In addition, everyone knows that phone numbers are typically 10 digits, parsed 3 + 3 + 4. Communication slows down when phone numbers are expressed in unfamiliar ways such as "double nine" and "triple zero," common in Indian English and Australian English, but not American English.

3 Materials

We are working with a call center corpus of 1.4M calls. Table 4 shows call duration by number of speakers.

Speakers   Calls   Seconds/Call
0          565     20
1          405     61
2          5021    342
3          837     533
4          107     986
5          22      1121
6+         13      1166

Table 4: Most of our calls have two speakers, a customer and an agent, though some have more speakers and some have fewer. The duration of the call tends to increase with the number of speakers. These counts were computed from a relatively small sample of nearly 7k calls that were manually transcribed.

The average call is 5.6 minutes, but most calls are shorter than average, and a few calls are much longer than average. The 50th, 95th and 99th percentiles are 4, 15 and 31 minutes, respectively. The longer calls are likely to involve one or more transfers, and therefore longer calls tend to have more speakers.

A relatively small sample of almost 7k calls was transcribed by a human transcription service, mainly to measure WER (word error rates) for recognition, but it can also be used to measure diarization errors. Unfortunately, ground truth is hard to come by for entity extraction because we didn't ask the service to extract phone numbers and tickets.

Heuristics are introduced to overcome this deficiency. The first 4-5 bytes of the ticket are predictable from side information (timestamps) not available to the dialog participants. Edit distance is used to match the rest with tickets in a database. Matches are confirmed by comparing phone numbers in the database with phone numbers extracted from the audio. Table 2 shows that good ticket matches (top row) are confirmed more often than poor matches (bottom row) [3]. Given these confirmed matches, future work will label bold and italics automatically. An annotated corpus of this kind will motivate future work on the use of dialog and diarization in entity extraction.

[3] The phone matching heuristic is imperfect in a couple of ways. The top row is far from 100% because the customer may use a different phone number than what is in the database. The bottom row contains most of the calls because the entities of interest are quite rare and do not appear in most calls.
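A minimal sketch of this matching-and-confirmation heuristic, with an invented ticket/phone database and a standard Levenshtein edit distance (the actual side-information and matching logic are only summarized in the text above):

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                     # deletion
                           cur[j - 1] + 1,                  # insertion
                           prev[j - 1] + (ca != cb)))       # substitution
        prev = cur
    return prev[-1]

def best_ticket_match(asr_ticket: str, ticket_db: dict) -> tuple:
    """Return the database ticket closest to the ASR hypothesis and its edit distance."""
    return min(((t, levenshtein(asr_ticket, t)) for t in ticket_db), key=lambda x: x[1])

# Invented example: ASR misheard the final K as T (cf. Table 1); phone numbers are made up.
ticket_db = {"902MDYK": "555-0123", "A082GVJ": "555-0456"}   # ticket -> phone on file
ticket, dist = best_ticket_match("902MDYT", ticket_db)
phone_from_audio = "555-0123"
confirmed = dist <= 2 and ticket_db[ticket] == phone_from_audio
print(ticket, dist, confirmed)    # 902MDYK 1 True
```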


4 Conclusions

This paper introduced a deceptively simple entity extraction task intended to encourage more interdisciplinary collaboration between fields that don't normally work together: diarization, dialog and entity extraction. First mentions need to be distinguished from confirmations to avoid undesirable repetition in extracted entities. Dialog theory suggests the use of prosodic cues to distinguish marked first mentions from unmarked subsequent mentions. We saw in Table 1 that first mentions (bold) tend to be slower than subsequent confirmations (italics).

It also helps to determine who said what (diarization), because new information tends to come from one side of a conversation, and confirmations from the other side. While our corpus of 1.4M calls cannot be shared for obvious privacy concerns, the ASR and diarization tools are currently available for public use (or will be available soon). While much has been written on given/new, this corpus-based approach should help establish more precise numerical conclusions in future work.

The corpus can be used to measure a number of additional factors beyond given/new that impact communication speed: familiarity, shared conventions, dialects, experience, corrections, etc. Table 3 shows an example of corrections taking even more time ("I'm sorry what was after golf"). Figure 1 shows that familiar phone numbers are quicker than less familiar ticket numbers, especially in slow mode, where each letter is expressed as a separate intonation phrase. Agents speed up when talking to other agents, and slow down for customers, especially when customers need more time. Agents have more experience than customers and are therefore faster.

References

Xavier Anguera, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals. 2012. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2):356–370.

Herbert H. Clark and Susan E. Haviland. 1977. Comprehension and the given-new contract. In Discourse Production and Comprehension. Discourse Processes: Advances in Research and Theory, 1:1–40.

Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91–134.

Donald Hindle. 1983. Deterministic parsing of syntactic non-fluencies. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pages 123–128. Association for Computational Linguistics.

Francis Kubala, Richard Schwartz, Rebecca Stone, and Ralph Weischedel. 1998. Named entity extraction from speech. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 287–292.

Seung-Jae Moon. 1991. An acoustic and perceptual study of undershoot in clear and citation-form speech. Phonetic Experimental Research at the Institute of Linguistics, University of Stockholm, XIV, pages 153–156.

Stephen Shum. 2011. Unsupervised methods for speaker diarization. Ph.D. thesis, Massachusetts Institute of Technology.

Sue E. Tranter and Douglas A. Reynolds. 2006. An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1557–1565.

Nigel Ward and Wataru Tsukahara. 2000. Prosodic features which cue back-channel responses in English and Japanese. Journal of Pragmatics, 32(8):1177–1207.


Proceedings of EMNLP 2016 Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods, pages 39–42, Austin, TX, November 5, 2016. ©2016 Association for Computational Linguistics

Visualizing the Content of a Children's Story in a Virtual World: Lessons Learned

Quynh Ngoc Thi Do (1), Steven Bethard (2), Marie-Francine Moens (1)

(1) Katholieke Universiteit Leuven, Belgium
(2) University of Arizona, United States

[email protected]@[email protected]

Abstract

We present the problem of "bringing text to life" via 3D interactive storytelling, where natural language processing (NLP) techniques are used to transform narrative text into events in a virtual world that the user can interact with. This is a challenging problem, which requires deep understanding of the semantics of a story and the ability to ground those semantic elements to the actors and events of the 3D world's graphical engine. We show how this problem has motivated interesting extensions to some classic NLP tasks, identify some of the key lessons learned from the work so far, and propose some future research directions.

1 Introduction

Our primary goal is to take as input natural language text, such as children's stories, translate the text into formal knowledge that represents the actions, actors, plots, and surrounding world, and render this formal representation as virtual 3D worlds via a graphical engine. We believe that translating text to another modality (in our case, a visual modality) is a good test case for evaluating language understanding systems.

We have developed an initial approach to this text-to-virtual-world translation problem based on a probabilistic graphical model that maps text and its semantic annotations (generated by more traditional NLP modules, like semantic role labelers or coreference resolvers) to the knowledge representation of the graphical engine, which is defined in predicate logic. In the process, we discovered several failings of traditional NLP systems when faced with this task:

Semantic Role Labeling We observed that current state-of-the-art semantic role labeling (SRL) systems perform poorly on children's stories, failing to recognize many of the expressed argument roles. Much of this is due to the domain mismatch between the available training data (primarily newswire) and our evaluation (stories for 3D visualization).

To address this, we introduced a technique based on recurrent neural networks for automatically generating additional training data that is similar to the target domain (Do et al., 2014; Do et al., 2015b); a simplified sketch of the idea follows this block. For each selected word (predicate or argument head word) from the source domain, a list of replacement words from the target domain, which we believe can occur at the same position as the selected word, is generated using a recurrent neural network (RNN) language model (Mikolov et al., 2010). In addition, linguistic resources such as part-of-speech tags, WordNet (Miller, 1995), and VerbNet (Schuler, 2005) are used as filters to select the best replacement words.

We primarily targeted improving the results for the four circumstance roles AM-LOC, AM-TMP, AM-MNR and AM-DIR, which are important for semantic frame understanding but not well recognized by standard SRL systems. New training examples were generated specifically for the four selected roles. In an experiment with the out-of-domain setting of the CoNLL 2009 shared task and the SRL system of (Bjorkelund et al., 2009), training the semantic role labeller on the expanded training data outperforms the model trained on the original training data by +3.36%, +2.77%, +2.84% and +14% F1 over the roles AM-LOC, AM-TMP, AM-MNR and AM-DIR respectively (Do et al., 2015b), but we still need linguistic resources to filter the words obtained by the language model. In an experiment where the same model was again trained on CoNLL 2009 training data, but the RNN training included a collection of 252 children's stories (mostly fairy tales), we obtained F1 gains of +9.19%, +7.67%, +17.92% and +7.84% respectively over the four selected roles AM-LOC, AM-TMP, AM-MNR and AM-DIR, when testing on the story "The Day Tuk Became a Hunter" (Ronald and Carol, 1967) (Do et al., 2014).
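A much-simplified sketch of the replacement idea referenced above: score in-context candidate words with a language model and keep only those that pass a linguistic filter. Here a bigram count model and a toy POS dictionary stand in for the RNN language model and the WordNet/VerbNet filters used in the actual work; the corpus and words are invented.

```python
from collections import Counter
from typing import List

def train_bigram_counts(corpus: List[List[str]]) -> Counter:
    """Count (previous word, word) pairs in the target-domain corpus."""
    counts = Counter()
    for sent in corpus:
        for prev, word in zip(sent, sent[1:]):
            counts[(prev, word)] += 1
    return counts

def replacement_candidates(prev_word: str, original: str, counts: Counter,
                           allowed_pos: set, pos_of: dict, top_k: int = 3) -> List[str]:
    """Rank target-domain words that can follow `prev_word`, filtering by POS."""
    scored = [(w, c) for (p, w), c in counts.items()
              if p == prev_word and w != original and pos_of.get(w) in allowed_pos]
    return [w for w, _ in sorted(scored, key=lambda x: -x[1])[:top_k]]

# Tiny invented "target domain" (fairy-tale-like) corpus and POS dictionary.
stories = [["the", "hunter", "walked", "into", "the", "forest"],
           ["the", "boy", "ran", "into", "the", "igloo"],
           ["the", "bear", "walked", "into", "the", "cave"]]
pos_of = {"forest": "NN", "igloo": "NN", "cave": "NN", "walked": "VBD"}
counts = train_bigram_counts(stories)

# Generate replacements for the AM-LOC head word "office" in a newswire sentence
# "... walked into the office", to create new in-domain training examples.
print(replacement_candidates("the", "office", counts, {"NN"}, pos_of))
```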

Coreference Resolution We observed that current state-of-the-art coreference resolution systems are ignorant of some constraints that are important in storytelling. For example, a character is often first presented as an indefinite noun phrase (such as "a woman"), then later as a definite noun phrase (such as "the woman"), but this change in definiteness often results in missed coreference links.

To address this, we replaced the inference of the Berkeley coreference resolution system (Durrett and Klein, 2013) with a global inference algorithm which incorporates narrative-specific constraints through integer linear programming (Do et al., 2015a). Our formulation models three phenomena that are important for short narrative stories: local discourse coherence, which we model via centering theory constraints; speaker-listener relations, which we model via direct speech act constraints; and character-naming, which we model via definite noun phrase and exact match constraints. When testing on the UMIREC [1] and N2 [2] corpora with the coreference resolution system of (Durrett and Klein, 2013) trained on OntoNotes [3], our inference substantially improves on the original inference on the CoNLL 2011 AVG score by +5.42 (for UMIREC) and +5.22 (for N2) points when using gold mentions, and by +1.15 (for UMIREC) and +2.36 (for N2) points when using predicted mentions. When testing on the story "The Day Tuk Became a Hunter" (Ronald and Carol, 1967), our inference outperforms the original inference by 4.46 points on the CoNLL 2011 AVG score [4].

[1] http://dspace.mit.edu/handle/1721.1/57507
[2] http://dspace.mit.edu/handle/1721.1/85893
[3] https://catalog.ldc.upenn.edu/LDC2011T03
[4] We only evaluate the entities that are available in our virtual domain, such as tuk, father, mother, bear, sister, igloo, sled, etc.
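As a small illustration of the character-naming idea (not the authors' ILP formulation), the sketch below forces a definite mention to corefer with the closest earlier mention sharing the same head noun; the mentions are invented.

```python
from typing import List, Optional, Tuple

def link_definite_mentions(mentions: List[Tuple[str, str]]) -> List[Optional[int]]:
    """For each mention (surface form, head noun), return the index of its antecedent:
    a definite NP ("the X") is linked to the closest earlier mention whose head noun
    matches exactly (e.g. "a woman" ... "the woman")."""
    antecedents: List[Optional[int]] = []
    for i, (surface, head) in enumerate(mentions):
        link = None
        if surface.lower().startswith("the "):
            for j in range(i - 1, -1, -1):
                if mentions[j][1] == head:
                    link = j
                    break
        antecedents.append(link)
    return antecedents

story_mentions = [("A woman", "woman"), ("a hunter", "hunter"),
                  ("the woman", "woman"), ("The hunter", "hunter")]
print(link_definite_mentions(story_mentions))   # [None, None, 0, 1]
```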

Having corrected some of the more serious failures of NLP systems on stories, we turn to the problem of mapping the semantic analysis of these NLP systems to the knowledge representation of the graphical engine. Our initial approach is implemented as a probabilistic graphical model, where the input is a sentence and its (probabilistic) semantic and coreference annotations, and the output is a set of logical predicate-argument structures. Each structure represents an action and the parameters of that action (e.g., person/object performing the action, location of the action). The domain is bounded by a finite set of actions, actors and objects, representable by the graphical environment. In our implementation, decoding of the model is done through an efficient formulation of a genetic algorithm that exploits conditional independence (Alazzam and Lewis III, 2013) and improves parallel scalability.

In an evaluation on three stories ("The Day Tuk Became a Hunter" (Ronald and Carol, 1967), "The Bear and the Travellers" [5], and "The First Tears" [6]), this model achieved F1 scores of 81% on recognizing the correct graphical engine actions, and above 60% on recognizing the correct action parameters (Ludwig et al., Under review). Example scenes generated by the MUSE software are shown in Figure 1, and a web-based demonstration can be accessed at http://roshi.cs.kuleuven.be/muse_demon/.

[5] http://fairytalesoftheworld.com/quick-reads/the-bear-and-the-travellers/
[6] http://americanfolklore.net/folklore/2010/09/the_first_tears.html

2 Lessons learned

Figure 1: MUSE-generated scenes from "The Day Tuk Became a Hunter" (Ludwig et al., Under review).

Studying the problem of translating natural language narratives to 3D interactive stories has been instructive about the capabilities of current natural language processing for language understanding and the battles that still have to be fought. The truthful rendering of language content in a virtual world acts as a testbed for natural language understanding, making this multimodal translation a real-life evaluation task.

On the positive side, some NLP tasks such as semantic role labeling and coreference resolution proved to be useful for instantiating the correct action frames in the virtual world with their correct actors. However, some NLP tasks that we imagined would be important turned out not to be. For example, temporal relation recognition was not very important, since children's stories have simpler timelines, and since the constraints of the actions in the 3D interactive storytelling representation could sometimes exclude the inappropriate interpretations. Moreover, across all of the NLP tasks, we saw significant drops in performance when applied to narratives like children's stories. While we introduced solutions to some of these problems, much work remains to be done to achieve easy and effective transfer of the learning models to other target texts.

Given the almost complete lack of training data for translating children's stories to the representations needed by the graphical engine (we only used two quite unrelated annotated stories to optimize the inference in the Bayesian network when our system parsed a test story), we had to rely on a pipelined approach. In this way we could exploit the knowledge obtained by the semantic role labeler and coreference resolver, which were trained on other annotated texts and adapted by the novel methods described above. The Bayesian framework of the probabilistic graphical model allows it to realize the most plausible mapping or translation to a knowledge representation given the provided evidence obtained from the features in a sentence and a previous sentence, the (probabilistic) outcome of the semantic role labeler and the (probabilistic) outcome of the coreference resolver, and to model dependencies between the variables of the network. This Bayesian framework for evidence combination makes it possible to recover from errors made by the semantic role labeler or coreference resolver.

Our most striking finding was that the text leaves a large part of its content implicit, but this content is actually needed for a truthful rendering of the text in the virtual world. For instance, often the action words used in the story were more abstract than the actions defined by the graphical engine (e.g., "take care" in reference to a knife, where actually "sharpen" was meant). Sometimes using word similarities based on embeddings (Mikolov et al., 2013) helped in such cases, but more often the meaning of such abstract words depends on the specific previous discourse, which is not captured by general embeddings. Adapting the embeddings, which are trained on a large corpus, to the specific discourse context, as is done in (Deschacht et al., 2012), is advisable. Moreover, certain content was not mentioned in the text, but a human could infer it. For example, given "Tuk and his father tied the last things on the sled and then set off," a human would infer that the two people most likely sat down on the sled. Such knowledge is important for rendering a believable 3D interactive story, but can hardly be inferred from text data, instead needing grounding in the real world, perhaps captured by other modalities, such as images.

Another problem we encountered was scalability.

41

Page 54: Scaling Early Achievements to Robust Methods Workshop ...List/Enumeration 11.7% Tracking, retaining, and list/enumeration of entities or states 78.6% 71.4% Mathematical operations

If the goal is to allow users to bring any text "to life", then all of the parsing and translation to the 3D world needs to happen online. Although the computational complexity when the machine comprehends the story is reduced by limiting the possible actions and actors (e.g., characters, objects) to the ones mentioned in the story and the ones inferred, parsing of the story is still slow. But even with the genetic-algorithm-inspired parallel processing we introduced, our graphical model is still too slow to operate in an online environment. Instead of considering parallel processing, it would be interesting to give priority to the most likely interpretation based on event language models (Do et al., Under review).

Finally, while working closely with researchers in 3D interactive storytelling, we learned that there is little consistency across designers of graphical worlds on the structure of basic actions, actors, objects, etc. Thus a model that has been trained to translate stories that take place in one digital world will produce invalid representations for other digital worlds. Generalizing across different digital worlds is a challenging but interesting future direction. Proposing standards could make a major impact in this field, and in addition could promote cross-modal translation between language and graphical content. We witness an increasing interest in easy-to-program languages for robotics that operate in virtual worlds (e.g., Mindstorms, ROBOTC) and in formalizing knowledge of virtual worlds with ontologies (Mezati et al., 2015). Although valuable, such approaches translate to yet another human-made language. If we really want to test language understanding in a real-life setting, translation to perceptual images and video might be more suited, but more difficult to realize unless we find a way of composing realistic images and video out of primitive visual patterns.

Acknowledgments

This work was funded by the EU ICT FP7 FET project “Machine Understanding for interactive StorytElling” (MUSE). We thank the anonymous reviewers for their valuable comments.

References

Azmi Alazzam and Harold W. Lewis III. 2013. A new optimization algorithm for combinatorial problems. International Journal of Advanced Research in Artificial Intelligence (IJARAI), 2(5).

Anders Björkelund, Love Hafdell, and Pierre Nugues. 2009. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task (CoNLL '09), pages 43–48, Stroudsburg, PA, USA. ACL.

Koen Deschacht, Jan De Belder, and Marie-Francine Moens. 2012. The latent words language model. Computer Speech and Language, 26(5):384–409, October.

Quynh Ngoc Thi Do, Steven Bethard, and Marie-Francine Moens. 2014. Text mining for open domain semi-supervised semantic role labeling. In DMNLP@PKDD/ECML, pages 33–48.

Quynh Ngoc Thi Do, Steven Bethard, and Marie-Francine Moens. 2015a. Adapting coreference resolution for narrative processing. In Proceedings of EMNLP 2015.

Quynh Ngoc Thi Do, Steven Bethard, and Marie-Francine Moens. 2015b. Domain adaptation in semantic role labeling using a neural language model and linguistic resources. IEEE/ACM Transactions on Audio, Speech & Language Processing, 23(11):1812–1823.

Quynh Ngoc Thi Do, Steven Bethard, and Marie-Francine Moens. Recurrent neural semantic frame language model. Under review.

Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In Proceedings of EMNLP 2013.

Oswaldo Ludwig, Quynh Ngoc Thi Do, Cameron Smith, Marc Cavazza, and Marie-Francine Moens. Translating written stories into virtual reality. Under review.

Messaoud Mezati, Foudil Cherif, Cédric Sanza, and Véronique Gaildrat. 2015. An ontology for semantic modelling of virtual world. International Journal of Artificial Intelligence & Applications, 6(1):65–74, January.

Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura, editors, INTERSPEECH, pages 1045–1048. ISCA.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, November.

Ronald Melzack and Carol Jones. 1967. The Day Tuk Became a Hunter and Other Eskimo Stories. Dodd, Mead, New York.

Karin Kipper Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. thesis, Philadelphia, PA, USA. AAI3179808.




Stylistic Transfer in Natural Language Generation Systems Using Recurrent Neural Networks

Jad Kabbara and Jackie Chi Kit Cheung

School of Computer Science
McGill University, Montreal, QC, Canada

[email protected] [email protected]

Abstract

Linguistic style conveys the social context in which communication occurs and defines particular ways of using language to engage with the audiences to which the text is accessible. In this work, we are interested in the task of stylistic transfer in natural language generation (NLG) systems, which could have applications in the dissemination of knowledge across styles, automatic summarization, and author obfuscation. The main challenges in this task involve the lack of parallel training data and the difficulty in using stylistic features to control generation. To address these challenges, we plan to investigate neural network approaches to NLG to automatically learn and incorporate stylistic features in the process of language generation. We identify several evaluation criteria, and propose manual and automatic evaluation approaches.

1 Introduction

Linguistic style is an integral aspect of natural language communication. It conveys the social context in which communication occurs and defines particular ways of using language to engage with the audiences to which the text is accessible.

In this work, we examine the task of stylistic transfer in NLG systems; that is, changing the style or genre of a passage while preserving its semantic content. For example, given a text written in one genre, such as Shakespearean English, we would like a system that can convert it into another, say, that of Simple English Wikipedia. Currently, most knowledge available in textual form is locked into the particular data collection in which it is found. An automatic stylistic transfer system would allow that information to be more generally disseminated. For example, technical articles could be rewritten into a form that is accessible to a broader audience. Alternatively, stylistic transfer could also be useful for security or privacy purposes, such as in author obfuscation, where the style of the text is changed in order to mask the identity of the original author.

One of the main research challenges in stylistic transfer is the difficulty in using linguistic features to signal a certain style. Previous work in computational stylistics has identified a number of stylistic cues (e.g., passive vs. active sentences, repetitive usage of pronouns, ratio of adjectives to nouns, and frequency of uncommon nouns). However, it is unclear how a system would transfer this knowledge into controlling realization decisions in an NLG system. A second challenge is that it is difficult and expensive to obtain adequate training data. Given the large number of stylistic categories, it seems infeasible to collect parallel texts for all, or even a substantial number of, style pairs. Thus, we cannot directly cast this as a machine translation problem in a standard supervised setting.

Recent advances in deep learning provide an opportunity to address these problems. Work in image recognition using deep learning approaches has shown that it is possible to learn representations that separate aspects of the object from the identity of the object. For example, it is possible to learn features that represent the pose of a face (Cheung et al., 2014) or the direction of a chair (Yang et al., 2015), in order to generate images of faces/chairs with new poses/directions.



We plan to design similar recurrent neural network architectures to disentangle the style from the semantic content in text. This setup not only requires less hand-engineering of features, but also allows us to frame stylistic transfer as a weakly supervised problem without parallel data, in which the model learns to disentangle and recombine latent representations of style and semantic content in order to generate output text in the desired style.

In the rest of the paper, we discuss our plans to investigate stylistic transfer with neural networks in more detail. We will also propose several evaluation criteria for stylistic transfer and discuss evaluation methodologies using human user studies.

2 Related Work

Capturing stylistic variation is a long-standing problem in NLP. Sekine (1997) and Ratnaparkhi (1999) consider the different categories in the Brown corpus to be domains. These include general fiction, romance and love story, and press: reportage. Gildea (2001), on the other hand, refers to these categories as genres. Different NLP sub-communities use the terms domain, style, and genre to denote slightly different concepts (Lee, 2001). From a linguistic point of view, domains could be thought of as broad subject fields, while genre can be seen as a category assigned on the basis of external criteria such as intended audience, purpose, and activity type. Style conveys the social context in which communication occurs and defines particular ways of using language to engage with the audiences to which the text is accessible. Some linguists would argue that style and domain are two attributes characterizing genre (e.g., Lee (2001)), while others view genre and domain as aspects representing style (e.g., Moessner (2001)).

The notion of genre has been the focus of related NLP tasks. In genre classification (Petrenz and Webber, 2011; Sharoff et al., 2010; Feldman et al., 2009), the task is to categorize the text into one of several genres. In author identification (Houvardas and Stamatatos, 2006; Chaski, 2001), the goal is to identify the author of a text, while author obfuscation (Kacmarcik and Gamon, 2006; Juola and Vescovi, 2011) consists in modifying aspects of the text so that forensic analysis fails to reveal the identity of the author.

In (Pavlick and Tetreault, 2016), an analysis of formality in online written communication is presented. A set of linguistic features is proposed based on a study of human perceptions of formality across multiple genres. Those features are fed to a statistical model that classifies texts as having a formal or informal style. At the lexical level, Brooke et al. (2010) focused on constructing lexicons of formality that can be used in tasks such as genre classification or sentiment analysis. In (Inkpen and Hirst, 2004), a set of near-synonyms is given for a target word, and one synonym is selected based on several types of preferences, e.g., stylistic (degree of formality). We aim to generalize this work beyond the lexical level.

A similar work is that of Xu et al. (2012), who propose using phrase-based machine translation systems to carry out paraphrasing while targeting a particular writing style. Since the problem is framed as a machine translation problem, it relies on parallel data where the source “language” is the original text to be paraphrased (in that case, Shakespeare texts) and the “translation” is the equivalent modern English version of those Shakespeare texts. Accordingly, for each source sentence, there exists a parallel sentence having the target style. They also present some baselines which do not make use of parallel sentences and instead rely on manually compiled dictionaries of expressions commonly found in Shakespearean English. In more recent work, Sennrich et al. (2016) carry out translation from English to German while controlling the degree of politeness. This is done in the context of neural machine translation by adding side constraints. Specifically, they mark up the source language of the training data (in this case, English) with a feature that encodes the use of honorifics seen in the target language (in this case, German). This allows them to control the honorifics that are produced at test time.

3 Proposed Approach

Recently, RNN-based models have been successfully used in machine translation (Cho et al., 2014b; Cho et al., 2014a; Sutskever et al., 2014) and dialogue systems (Wen et al., 2015).



Thus, we propose to use an LSTM-based RNN model based on the encoder-decoder structure (Cho et al., 2014b) to automatically process stylistic nuances instead of hand-engineering features. The model is a variant of an autoencoder where the latent representation has two separate components: one for style and one for content. The learned stylistic features would be distinct from the content features and specific to each style category, such that they can be swapped between training and testing models to perform stylistic transfer. The separation, or disentanglement, between stylistic and content features is reinforced by modifying the training objective from (Cho et al., 2014b), which maximizes the conditional log-likelihood (of the output given the input). Instead, our model is trained to maximize a training objective that also includes a cross-covariance term dedicated to the disentanglement.
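As a concrete illustration of the disentanglement term, the sketch below (Python/PyTorch) shows one plausible form of a cross-covariance penalty, in the spirit of the XCov objective of Cheung et al. (2014); the latent dimensions, the weighting factor, and the placeholder reconstruction loss are illustrative assumptions, not the actual training setup.

import torch

def cross_covariance_penalty(style, content):
    """Squared Frobenius norm of the batch cross-covariance between the style
    and content halves of the latent code; driving it toward zero encourages
    the two halves to be statistically decorrelated."""
    style_c = style - style.mean(dim=0, keepdim=True)
    content_c = content - content.mean(dim=0, keepdim=True)
    cov = style_c.t() @ content_c / style.size(0)
    return (cov ** 2).sum()

# Toy demonstration on random latent codes: batch of 32, 4 style dims, 16 content dims.
z = torch.randn(32, 20)
style, content = z[:, :4], z[:, 4:]
reconstruction_nll = torch.tensor(0.0)   # placeholder for the encoder-decoder loss
loss = reconstruction_nll + 0.1 * cross_covariance_penalty(style, content)
print(loss.item())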

At a high level, our proposed approach consists of the following steps:

1. For a given style transfer task between two styles A and B, we will first collect relevant corpora for each of those styles.

2. Next, we will train the model on each of the styles (separately). This would allow the system to disentangle the content features from the stylistic features. At the end of this step, we will have (separately) the features that characterize style A and those that characterize style B.

3. During the testing phase, for a transfer, say, from style A to style B, the system is fed texts having style A while the stylistic latent variables of the model are fixed to be those learned for style B (from the previous step). This would force the model to generate text using style B. For a transfer from style B to A, the system is fed texts having style B and we fix the stylistic latent variables of the model to be those learned for style A (see the sketch after this list).
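The following minimal sketch illustrates step 3 above; encode and decode are stand-ins for the trained encoder and decoder, and the style vectors are hypothetical placeholders rather than learned features.

import numpy as np

D_STYLE, D_CONTENT = 4, 8

def encode(text, rng):
    """Stand-in for the trained encoder: returns a [style; content] latent vector."""
    return rng.normal(size=D_STYLE + D_CONTENT)

def decode(z):
    """Stand-in for the trained decoder: here it only reports the style vector used."""
    return f"<text generated with style vector {np.round(z[:D_STYLE], 2)}>"

def transfer_to_style_b(text, style_b_vec, rng):
    z = encode(text, rng)          # latent code of the style-A input
    z[:D_STYLE] = style_b_vec      # clamp the stylistic latent variables to style B
    return decode(z)               # content of A rendered (ideally) in style B

rng = np.random.default_rng(0)
style_b_vec = np.ones(D_STYLE)     # stand-in for the learned style-B features
print(transfer_to_style_b("As I remember, Adam, ...", style_b_vec, rng))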

We intend to apply the model to datasets with reasonably differing styles between training and testing. Examples include the complete works of Shakespeare,1 the Wikipedia Kaggle dataset,2 the Oxford Text Archive (literary texts),3 and Twitter data.

1 http://norvig.com/ngrams/shakespeare.txt
2 https://www.kaggle.com/c/wikichallenge/Data

A future research direction would be to further improve the system to process texts that have differing but similar styles.

4 Evaluation

We first present a simple example that shows the input and output of the system during the testing phase. Assuming the system was trained on texts taken from Simple English Wikipedia, it would learn the stylistic features that are particular to that genre. During the testing phase, if we feed the system the following sentence taken from Shakespeare's play As You Like It (Act 1, Scene 1):

As I remember, Adam, it was upon this fashion bequeathed me by will but poor a thousand crowns, and, as thou sayest, charged my brother on his blessing to breed me well. And there begins my sadness.

we expect the system to produce a version that might be similar to the following:

I remember, Adam, that's exactly why my father only left me a thousand crowns in his will. And as you know, my father asked my brother to make sure that I was brought up well. And that's where my sadness begins.

We see three main criteria for the evaluation of stylistic transfer systems: soundness (i.e., the generated texts being textually entailed with the original version), coherence (e.g., free of grammatical errors, proper word usage, etc.), and effectiveness (i.e., the generated texts actually match the desired style). We propose to evaluate systems using both human and automatic evaluations. Snippets of original and generated texts will be sampled and reviewed by human evaluators, who will judge them on these three criteria using Likert ratings. This type of evaluation technique is also used in related tasks, such as to evaluate author obfuscation systems (Stamatatos et al., 2015).

3 https://ota.ox.ac.uk/



A future research direction is to investigate automatic evaluation measures similar to ROUGE and BLEU, which compare the content of the generated text against human-written gold standards using word or n-gram overlap.
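As a rough illustration of such an overlap-based measure, the sketch below computes a simple n-gram precision between a generated snippet and a reference; it omits refinements such as brevity penalties and multiple references, so it should be read as a starting point rather than a finished metric.

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams that also occur in the reference."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not cand:
        return 0.0
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / sum(cand.values())

print(overlap_precision("and that is where my sadness begins",
                        "and there begins my sadness", n=1))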

5 Conclusion

We present stylistic transfer as a challenging generation task. Our proposed research will address challenges to the task, such as the lack of parallel training data and the difficulty of defining features that represent style. We will exploit deep learning models to extract stylistic features that are relevant to generation without requiring explicit parallel training data between the source and the target styles. We plan to evaluate our methods using human judgments, according to criteria that we propose, derived from related tasks.

References

Julian Brooke, Tong Wang, and Graeme Hirst. 2010. Automatic acquisition of lexical formality. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 90–98. Association for Computational Linguistics.

Carole E. Chaski. 2001. Empirical evaluations of language-based author identification techniques. Forensic Linguistics, 8:1–65.

Brian Cheung, Jesse A. Livezey, Arjun K. Bansal, and Bruno A. Olshausen. 2014. Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

Sergey Feldman, Marius A. Marin, Mari Ostendorf, and Maya R. Gupta. 2009. Part-of-speech histograms for genre classification of text. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4781–4784. IEEE.

Daniel Gildea. 2001. Corpus variation and parser performance. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pages 167–202.

John Houvardas and Efstathios Stamatatos. 2006. N-gram feature selection for authorship identification. In International Conference on Artificial Intelligence: Methodology, Systems, and Applications, pages 77–86. Springer.

Diana Zaiu Inkpen and Graeme Hirst. 2004. Near-synonym choice in natural language generation. In Recent Advances in Natural Language Processing, volume 3, pages 141–152.

Patrick Juola and Darren Vescovi. 2011. Analyzing stylometric approaches to author obfuscation. In IFIP International Conference on Digital Forensics, pages 115–125. Springer.

Gary Kacmarcik and Michael Gamon. 2006. Obfuscating document stylometry to preserve author anonymity. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 444–451. Association for Computational Linguistics.

David Y.W. Lee. 2001. Genres, registers, text types, domains and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology, 5(3):37–72, September.

Lilo Moessner. 2001. Genre, text type, style, register: A terminological maze? European Journal of English Studies, 5(2):131–138.

Ellie Pavlick and Joel Tetreault. 2016. An empirical analysis of formality in online communication. Transactions of the Association for Computational Linguistics, 4:61–74.

Philipp Petrenz and Bonnie Webber. 2011. Stable classification of text genres. Computational Linguistics, 37(2):385–393.

Adwait Ratnaparkhi. 1999. Learning to parse natural language with maximum entropy models. Machine Learning, 34(1-3):151–175.

Satoshi Sekine. 1997. The domain dependence of parsing. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 96–102. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40, San Diego, California, June. Association for Computational Linguistics.

Serge Sharoff, Zhili Wu, and Katja Markert. 2010. The web library of Babel: Evaluating genre collections. In LREC. Citeseer.

Efstathios Stamatatos, Martin Potthast, Francisco Rangel, Paolo Rosso, and Benno Stein. 2015. Overview of the PAN/CLEF 2015 evaluation lab. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 518–538. Springer.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721, Lisbon, Portugal, September. Association for Computational Linguistics.

Wei Xu, Alan Ritter, William B. Dolan, Ralph Grishman, and Colin Cherry. 2012. Paraphrasing for style. In 24th International Conference on Computational Linguistics, COLING 2012.

Jimei Yang, Scott E. Reed, Ming-Hsuan Yang, and Honglak Lee. 2015. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In Advances in Neural Information Processing Systems, pages 1099–1107.




Using Language Groundings for Context-Sensitive Text Prediction

Timothy Lewis, Amy Hurst, Matthew E. Taylor, & Cynthia Matuszek
[email protected] | [email protected] | [email protected] | [email protected]

Abstract

In this paper, we present the concept of using language groundings for context-sensitive text prediction using a semantically informed, context-aware language model. We show initial findings from a preliminary study investigating how users react to a communication interface driven by context-based prediction using a simple language model. We suggest that the results support further exploration using a more informed semantic model and more realistic context.

Keywords: Grounded language, context-sensitive generation, predictive text

1 Introduction

Advances in natural language and world perception have led to a resurgence of work on the language grounding problem. Most work to date has focused on learning a model of language describing the world, then using it to understand novel language, e.g., following directions (Artzi and Zettlemoyer, 2013; MacGlashan et al., 2015) or learning to understand commands in a space of plans or commands (Misra et al., 2014).

Generating language based on context is arguably more difficult, although the additional information provided by context makes this a promising area for natural language generation in general. There is a growing body of work on context-based generation in limited domains, such as sportscasting (Chen and Mooney, 2010), asking questions (Tellex et al., 2014), or generating spatial descriptions or narratives (Huo and Skubic, 2016; Rosenthal et al., 2016). In order to provide communication suggestions for users, it is not necessary to solve the problem of arbitrary natural language generation. Instead, the system must be able to provide predictions that support a predictive language interface, in which a user is continuously provided with a set of suggestions for possible speech.

We propose an approach in which a joint linguistic/perceptual model is used to drive a predictive text tool, targeting augmentative and alternative communication (AAC) tools for wheelchair users with motor apraxia of speech. We propose to use the speaker's environment as context to make more relevant predictions.

Sensed context will be used to drive the probability of predictions and reduce ambiguity; for example, while “button” may refer to a fastener for clothing or a control for an electronic device, someone in front of an elevator is probably referring to the latter, which in turn focuses what they are likely to want to say. Instrumented wheelchairs can capture a large corpus of language paired with context to support development of a user-specific model trained before and during degradation of the ability to speak.

This paper discusses a pilot study using a preliminary language model with simulated context. Participants responded to scenarios using a prototype interface to communicate.



Using results and observations from this user study, we hypothesize that context-based predictive language can improve usability of a predictive text interface and represents a promising direction for future work.

2 Approach

In grounded language acquisition, a combination of language and physical context is used to develop a language model for understanding future utterances (Mooney, 2008). The context can be physical, depending on physical sensors, sometimes on a robot (Fasola and Mataric, 2014), a simulation of some physical context (Chen and Mooney, 2011), or more abstract descriptions (Kress-Gazit and Fainekos, 2008). We propose to collect and learn from a similar set of data, with a language model targeting generation rather than understanding.

2.1 Corpus Collection

In order to learn a model of contextualized speech, it is necessary to collect both spoken language and context describing the environment when communication is occurring. We propose to perform this collection in three stages, from general to user-focused, as we build a better corpus and model.

(1) Crowdsourcing: To gain a better understanding of how people may respond in different situations, Mechanical Turk will be leveraged to solicit responses from users about various scenarios. Each scenario presents a speaker with text describing a certain situation (and images when appropriate) and asks what they would say. This provides us with an initial corpus of typed responses to known scenarios. The preliminary study (see Section 3) was performed on a small-scale crowdsourced corpus.

(2) Telepresence: For the second stage, we will use a telepresence robot (see Figure 1). The Beam robot provides insight into situations that may require assistance (for example, having the robot travel between floors of a building via the elevator, or delivering a package from one office to another). The Beam's existing video cameras and microphone/speaker interactions can be captured to provide a time-aligned corpus.

Figure 1: Telepresence-based context and language. (left) Bystanders push a button in response to a verbal request from the robot. (right) Video feed from the robot's sensors. In this example, the most visually salient elements of context are the elevator and people.

(3) End Users: When a sufficiently robust model exists, we will instrument wheelchairs of proposed end users (e.g., ALS patients). This sensor array must be unobtrusive and relatively low power. It will include one or more time-of-flight Kinect 2 RGB-D cameras, an omni-directional powered microphone, and a high-resolution camera.

2.2 Context Interpretation

While any feature of the environment or actions may provide important context, we will focus on gathering sensor observations describing the most salient elements of the environment. We expect this to be primarily: 1) people in the environment, who will circumscribe the set of things the user is likely to want to say; 2) objects in the environment, including fixed objects such as elevators; and 3) the environment itself.

Identifying elements of a scene is a difficult problem; initial efforts will use crowdsourcing or other low-cost, high-speed annotation of sensor data, but the broader intention is to use recent work on automatic identification of important visual elements (Vondrick et al., 2016) and semantic labeling (Anand et al., 2012).

Existing efforts on building language models from observations collect and train on corpora that are targeted to a particular scenario. Because we are gathering ongoing speech in a variety of settings, we are trying to learn from non-targeted speech, where the connection between the language and the sensor observations may be tenuous or non-existent.



(For example, a person may be talking about medication side effects while navigating.) Gathering data over a long period should allow irrelevant context to be weighted down in the learned model.

2.3 Language Learning

Our approach to text prediction inverts an existing model (Matuszek et al., 2013), in which the authors trained a joint model of language and context, and then treated language as a query against a perceived world state. In that work, the goal is to find a set of groundings in the world referred to by language x. The induced model is then P(G | x, C), given data of the form D = {(x_i, C_i, G_i) | i = 1...n}, where each example i contains a sentence x_i, the world context C_i, and the actual referent (grounding) G_i.

In this work, we treat perceptual context as ‘input’ and language as ‘output.’ Given a similar dataset, the model to be induced is then P(x | G, C). Our intention is to learn a similar model, incorporating semantic parsing and perceptual information and giving a probability distribution over possible generated outputs, as done elsewhere (FitzGerald et al., 2013). However, our initial experiments were performed using an n-gram prediction model.

The generation goal is a list of predictions from which a user can select. Since generated predictions can range in complexity from words to full sentences, generation strategies based on certainty can be applied, where more complex constructs are generated in higher-certainty situations. In this setting, providing a top-n list of results is useful, reducing the need for finding the single best generation.
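The toy sketch below (Python) illustrates the flavor of such context-conditioned, top-n prediction: candidate utterances are scored by how often their words co-occurred with the currently sensed context in a small corpus of (context, utterance) pairs. The tiny corpus and the count-based scoring are stand-ins, not the actual trained model.

from collections import Counter, defaultdict

corpus = [                                   # toy (context, utterance) pairs
    ({"elevator", "person"}, "could you press the button please"),
    ({"elevator", "person"}, "which floor are you going to"),
    ({"kitchen", "bottle"}, "could you hand me the water bottle"),
]

# Count how often each word appears with each context element.
counts = defaultdict(Counter)
for context, utt in corpus:
    for c in context:
        counts[c].update(utt.split())

def top_n(candidates, context, n=2):
    """Rank candidate utterances by summed context-conditioned word counts."""
    def score(utt):
        return sum(counts[c][w] for c in context for w in utt.split())
    return sorted(candidates, key=score, reverse=True)[:n]

print(top_n(["could you press the button please",
             "could you hand me the water bottle"], {"elevator", "person"}))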

3 Preliminary Study

For the preliminary user study, a set of four scenarios was shown to a group of fifteen participants. The scenarios asked each participant what they would say in each of four situations: a social interaction; answering questions from a doctor; asking someone to push an elevator button; and asking someone to retrieve a water bottle. The context was described to participants in text, simplifying out the question of how to represent real-world context. (See the box below for an example scenario and some responses.)

You have been having stomach pains after eating each day for the past week. You are visiting your doctor, who asks how you are doing. What is your response?

– “My stomach has been bothering me after I eat.”

– “My stomach hurts whenever I eat.”

– “I’m ok but I’ve been having stomach issues.”

– “Good aside from the gut pain I’m having after eating.”

– “I have been having stomach pains after eating each day for the past week.”

3.1 Prediction Experiments

An interface was developed to test four different methods for generating predictions, of which three are novel to this work. These methods vary in the length of generated predictions: users were presented with combinations of single words, short phrases, or full sentences (see Table 1). A simulated QWERTY keyboard was available for fallback text entry. A new pool of participants was asked to use the interface to communicate responses to the same four scenarios, rather than typing responses on a keyboard.

In order to generate a predictive language model that is context-driven based on these scenarios, n-gram models were constructed using the Presage predictive text entry program.1 Four different prediction methods were tested using this model (see Table 1).

Method   | Corpus           | W | P | S
StdEng   | Standard English | X | X |
ContWord | Contextual       | X |   |
Context  | Contextual       | X | X |
ContSent | Contextual       | X | X | X

Table 1: The four text prediction methods tested, which vary in whether they generate words (W), phrases (P), and sentences (S), and whether they are based on an existing English corpus or a preliminary contextual corpus.

For each participant/scenario pair, the number of selections (clicks) necessary to communicate was recorded.

1 http://presage.sourceforge.net, 2016-08-01



After each participant completed the tasks, they filled out a survey about the usability of the interface, how effectively they felt they were able to communicate, and the perceived speed of each entry method.

Figure 2: The prototype interface used in the preliminary study. The interface provides various options for text selection including full sentences and a virtual keyboard.

This pilot supported the hypothesis that users communicate faster with context-sensitive prediction (Figure 3); of the most comparable methods, Context was faster than StdEng. While communication is fastest when complete sentences are shown, the users did not qualitatively prefer this option, underscoring the importance of personalized communication.

Figure 3: Participants' qualitative perception of the relative speed of different methods (green), compared to the number of selections actually used (blue). Perceived speed is shown as a weighted average of non-numeric rankings, and aligns closely with the number of selections required to complete a task.

4 Discussion and Future Work

We intend to pursue further experiments using more complete language and grounding models. For this, some simplifications must be addressed. The most immediate are the best way of modeling language and incorporating real-world context; this is necessary to know whether building a semantically informed, context-aware prediction model will present large gains in accuracy and acceptability. We believe this work will be able to contribute to the research community, providing leads and methods for more intelligent and usable language models. Nonetheless, while ambitious, our initial results support the belief that this approach has promise for text prediction and context-aware generation.

References

Abhishek Anand, Hema Swetha Koppula, Thorsten Joachims, and Ashutosh Saxena. 2012. Contextually guided semantic labeling and search for three-dimensional point clouds. The International Journal of Robotics Research.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics (TACL), 1:49–62.

Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

David L. Chen and Raymond J. Mooney. 2010. Training a multilingual sportscaster: Using perceptual context to learn language. Journal of Artificial Intelligence Research, 37:397–435.

David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In AAAI, volume 2, pages 1–2.

Juan Fasola and Maja J. Mataric. 2014. Interpreting instruction sequences in spatial language discourse with pragmatics towards natural human-robot interaction. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 2720–2727. IEEE.

Nicholas FitzGerald, Yoav Artzi, and Luke S. Zettlemoyer. 2013. Learning distributions over logical forms for referring expression generation. In EMNLP, pages 1914–1925.

Zhiyu Huo and Marjorie Skubic. 2016. Natural spatial description generation for human-robot interaction in indoor environments. In 2016 IEEE International Conference on Smart Computing (SMARTCOMP), pages 1–3. IEEE.

Hadas Kress-Gazit and Georgios E. Fainekos. 2008. Translating structured English to robot controllers. Advanced Robotics, 22:1343–1359.

James MacGlashan, Monica Babes-Vroman, Marie desJardins, Michael Littman, Smaranda Muresan, Shawn Squire, Stefanie Tellex, Dilip Arumugam, and Lei Yang. 2015. Grounding English commands to reward functions. In Proceedings of Robotics: Science and Systems, Rome, Italy, July.

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2013. A joint model of language and perception for grounded attribute learning. In Proceedings of the 2012 International Conference on Machine Learning, Edinburgh, Scotland, June.

Dipendra K. Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. 2014. Tell Me Dave: Context-sensitive grounding of natural language to mobile manipulation instructions. In Proceedings of Robotics: Science and Systems. Citeseer.

Raymond J. Mooney. 2008. Learning to connect language and perception. In AAAI, pages 1598–1601.

Stephanie Rosenthal, Sai P. Selvaraj, and Manuela Veloso. 2016. Verbalization: Narration of autonomous mobile robot experience. In 26th International Joint Conference on Artificial Intelligence (IJCAI16), New York City, NY.

Stefanie Tellex, Ross Knepper, Adrian Li, Daniela Rus, and Nicholas Roy. 2014. Asking for help using inverse semantics. In Proceedings of Robotics: Science and Systems, Berkeley, USA.




Towards a continuous modeling of natural language domains

Sebastian Ruder (1,2), Parsa Ghaffari (2), and John G. Breslin (1)

(1) Insight Centre for Data Analytics, National University of Ireland, Galway
{sebastian.ruder,john.breslin}@insight-centre.org

(2) Aylien Ltd., Dublin, Ireland
{sebastian,parsa}@aylien.com

Abstract

Humans continuously adapt their style and language to a variety of domains. However, a reliable definition of ‘domain’ has eluded researchers thus far. Additionally, the notion of discrete domains stands in contrast to the multiplicity of heterogeneous domains that humans navigate, many of which overlap. In order to better understand the change and variation of human language, we draw on research in domain adaptation and extend the notion of discrete domains to the continuous spectrum. We propose representation learning-based models that can adapt to continuous domains and detail how these can be used to investigate variation in language. To this end, we propose to use dialogue modeling as a test bed due to its proximity to language modeling and its social component.

1 Introduction

The notion of domain permeates natural language and human interaction: Humans continuously vary their language depending on the context, in writing, dialogue, and speech. However, the concept of domain is ill-defined, with conflicting definitions aiming to capture the essence of what constitutes a domain. In semantics, a domain is considered a “specific area of cultural emphasis” (Oppenheimer, 2006) that entails a particular terminology, e.g. a specific sport. In sociolinguistics, a domain consists of a group of related social situations, e.g. all human activities that take place at home. In discourse, a domain is a “cognitive construct (that is) created in response to a number of factors” (Douglas, 2004) and includes a variety of registers. Finally, in the context of transfer learning, a domain is defined as consisting of a feature space X and a marginal probability distribution P(X), where X = {x_1, ..., x_n} and x_i is the i-th feature vector (Pan and Yang, 2010).

These definitions, although pertaining to different concepts, have a commonality: They separate the world into stationary domains that have clear boundaries. However, the real world is more ambiguous. Domains permeate each other and humans navigate these changes in domain.

Consequently, it seems only natural to step away from a discrete notion of domain and adopt a continuous notion. Utterances often cannot be naturally separated into discrete domains, but often arise from a continuous underlying process that is reflected in many facets of natural language: The web contains an exponentially growing amount of data, where each document “is potentially its own domain” (McClosky et al., 2010); a second-language learner adapts their style as their command of the language improves; language changes with time and with locality; even the WSJ section of the Penn Treebank – often treated as a single domain – contains different types of documents, such as news, lists of stock prices, etc. Continuity is also an element of real-world applications: In spam detection, spammers continuously change their tactics; in sentiment analysis, sentiment is dependent on trends emerging and falling out of favor.

Drawing on research in domain adaptation, we first compare the notion of continuous natural language domains against mixtures of discrete domains and motivate the choice of using dialogue modeling as a test bed.



We then present a way of representing continuous domains and show how continuous domains can be incorporated into existing models. We finally propose a framework for evaluation.

2 Continuous domains vs. mixtures of discrete domains

In domain adaptation, a novel target domain is traditionally assumed to be discrete and independent of the source domain (Blitzer et al., 2006). Other research uses mixtures to model the target domain based on a single (Daume III and Marcu, 2006) or multiple discrete source domains (Mansour, 2009). We argue that modeling a novel domain as a mixture of existing domains falls short in light of three factors.

Firstly, the diversity of human language makes it unfeasible to restrict oneself to a limited number of source domains from which all target domains are modeled. This is exemplified by the diversity of the web, which contains billions of heterogeneous websites; the Yahoo! Directory1 famously contained thousands of hand-crafted categories in an attempt to separate these. Notably, many sub-categories were cross-linked as they could not be fully separated, and websites often resided in multiple categories.

Similarly, wherever humans come together, the culmination of different profiles and interests gives rise to cliques, interest groups, and niche communities that all demonstrate their own unique behaviors, unspoken rules, and memes. A mixture of existing domains fails to capture these varieties.

Secondly, using discrete domains for soft assignments relies on the assumption that the source domains are clearly defined. However, discrete labels only help to explain domains and make them interpretable, when in reality a domain is a heterogeneous amalgam of texts. Indeed, Plank and van Noord (2011) show that selection based on human-assigned labels fares worse than using automatic domain similarity measures for parsing.

Thirdly, not only are a speaker's style and command of a language changing, but a language itself is continuously evolving. This is amplified in fast-moving media such as social platforms.

1 https://en.wikipedia.org/wiki/Yahoo!_Directory

Therefore, applying a discrete label to a domain merely anchors it in time. A probabilistic model of domains should in turn not be restricted to treating domains as independent points in a space. Rather, such a model should be able to walk the domain manifold and adapt to the underlying process that is producing the data.

3 Dialogue modeling as a test bed for investigating domains

As a domain presupposes a social component and relies on context, we propose to use dialogue modeling as a test bed to gain a more nuanced understanding of how language varies with domain.

Dialogue modeling can be seen as a prototypical task in natural language processing akin to language modeling and should thus expose variations in the underlying language. It allows one to observe the impact of different strategies to model variation in language across domains on a downstream task, while being inherently unsupervised.

In addition, dialogue has been shown to exhibit characteristics that expose how language changes as conversation partners become more linguistically similar to each other over the course of the conversation (Niederhoffer and Pennebaker, 2002; Levitan et al., 2011). Similarly, it has been shown that the linguistic patterns of individual users in online communities adapt to match those of the community they participate in (Nguyen and Rose, 2011; Danescu-Niculescu-Mizil et al., 2013).

For this reason, we have selected reddit as a medium and compiled a dataset from large amounts of reddit data. Reddit comments live in a rich environment that is dependent on a large number of contextual factors, such as community, user, conversation, etc. Similar to Chen et al. (2016), we would like to learn representations that allow us to disentangle factors that are normally intertwined, such as style and genre, and that will allow us to gain more insight about the variation in language. To this end, we are currently training models that condition on different communities, users, and threads.

4 Representing continuous domains

In line with past research (Daume III, 2007; Zhou et al., 2016), we assume that every domain has an inherent low-dimensional structure, which allows its projection into a lower-dimensional subspace.



Figure 1: Transforming a discrete source domain subspace S into a target domain subspace T with a transformation W.

In the discrete setting, we are given two domains, a source domain X_S and a target domain X_T. We represent examples in the source domain X_S as x^S_1, ..., x^S_{n_S} ∈ R^d, where x^S_i is the i-th source example and n_S is the number of examples in X_S. Similarly, we have n_T target domain examples x^T_1, ..., x^T_{n_T} ∈ R^d.

We now seek to learn a transformation W that allows us to transform the examples in X_S so that their distribution is more similar to the distribution of X_T. Equivalently, we can factorize the transformation W into two transformations A and B with W = AB^T that we can use to project the source and target examples into a joint subspace.

We assume that X_S and X_T lie on lower-dimensional orthonormal subspaces S, T ∈ R^{D×d}, which can be represented as points on the Grassmann manifold G(d, D) as in Figure 1, where d ≪ D.

In computer vision, methods such as Subspace Alignment (Fernando et al., 2013) or the Geodesic Flow Kernel (Gong et al., 2012) have been used to find such transformations A and B. Similarly, in natural language processing, CCA (Faruqui and Dyer, 2014) and Procrustes analysis (Mogadala and Rettinger, 2016) have been used to align subspaces pertaining to different languages.
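As an illustration of how such a transformation can be obtained, the sketch below (Python) follows the spirit of Subspace Alignment: the top-d PCA bases of source and target data play the roles of S and T, and the alignment matrix M = A^T B maps the source subspace onto the target one. The random data and the function names are purely illustrative.

import numpy as np

def pca_basis(X, d):
    """Top-d principal directions (columns) of the centered data matrix X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T                                  # shape (D, d)

rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 50))                      # source domain examples
Xt = rng.normal(size=(150, 50)) + 0.5                # target domain examples

d = 10
A, B = pca_basis(Xs, d), pca_basis(Xt, d)            # bases of S and T
M = A.T @ B                                          # aligns S with T
source_proj = (Xs - Xs.mean(axis=0)) @ A @ M         # source in the aligned space
target_proj = (Xt - Xt.mean(axis=0)) @ B             # target in its own subspace
print(source_proj.shape, target_proj.shape)          # (200, 10) (150, 10)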

Many recent approaches using autoencoders (Bousmalis et al., 2016; Zhou et al., 2016) learn such a transformation between discrete domains.

Figure 2: Transforming a source domain subspace S into continuous domain subspaces T_t with a temporally varying transformation W_t.

Similarly, in a sequence-to-sequence dialogue model (Vinyals and Le, 2015), we can not only train the model to predict the source domain response, but also – via a reconstruction loss – its transformations to the target domain.

For continuous domains, we can assume that source domain X_S and target domain X_T are not independent, but that X_T has evolved from X_S based on a continuous process. This process can be indexed by time, e.g. in order to reflect how a language learner's style changes or how language varies as words rise and drop in popularity. We thus seek to learn a time-varying transformation W_t between S and T that allows us to transform between source and target examples dependent on t, as in Figure 2.

Hoffman et al. (2014) assume a stream of observations z_1, ..., z_{n_t} ∈ R^d drawn from a continuously changing domain and regularize W_t by encouraging the new subspace at t to be close to the previous subspace at t − 1. Assuming a stream of (chronologically) ordered input data, a straightforward application of this to a representation-learning based dialogue model trains the parts of the model that auto-encode and transform the original message for each new example – possibly regularized with a smoothness constraint – while keeping the rest of the model fixed.
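The sketch below (Python) illustrates one simple way such a smoothly evolving subspace could be maintained over a stream: the subspace estimated from each new batch is blended with the previous one and re-orthonormalized. The blending-plus-QR update is an illustrative choice, loosely inspired by Hoffman et al. (2014), not their exact method.

import numpy as np

def pca_basis(X, d):
    """Top-d principal directions (columns) of the centered data matrix X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T

def update_subspace(prev_basis, new_batch, d, alpha=0.8):
    """Blend the previous basis with the batch basis, then re-orthonormalize,
    so the subspace at time t stays close to the one at time t-1."""
    batch_basis = pca_basis(new_batch, d)
    blended = alpha * prev_basis + (1 - alpha) * batch_basis
    Q, _ = np.linalg.qr(blended)                      # back to an orthonormal basis
    return Q[:, :d]

rng = np.random.default_rng(1)
basis = pca_basis(rng.normal(size=(100, 30)), d=5)    # initial (source) subspace
for t in range(3):                                    # stream of later batches
    batch = rng.normal(size=(50, 30)) + 0.1 * (t + 1)
    basis = update_subspace(basis, batch, d=5)
print(basis.shape)                                    # (30, 5)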

This can be seen as an unsupervised variant of fine-tuning, a common neural network domain adaptation baseline. As our learned transformation continuously evolves, we run the risk associated with fine-tuning of forgetting the knowledge acquired from the source domain.



Figure 3: Transforming a source domain subspace S into continuous target domain subspaces T_s using a spatially varying transformation W_s. (The figure's axes are latitude and longitude; its panels show dialectal variants of the same exchange for Manchester, Birmingham, and East London.)

For this reason, neural network architectures that are immune to forgetting, such as the recently proposed Progressive Neural Networks (Rusu et al., 2016), are appealing for continuous domain adaptation.

While time is the most obvious dimension along which language evolves, other dimensions are possible: Geographical location influences dialectal variations as in Figure 3; socio-economic status and political affiliation, as well as a domain's purpose or complexity, all influence language and can thus be conceived as axes that span a manifold for embedding domain subspaces.

5 Investigating language change

A continuous notion of domains naturally lends itself to a diachronic study of language. By looking at the representations produced by the model over different time steps, one gains insight into the change of language in a community or another domain. Similarly, observing how a user adapts their style to different users and communities reveals insights about the language of those entities.

Domain mixture models use various domain similarity measures to determine how similar the languages of two domains are, such as Renyi divergence (Van Asch and Daelemans, 2010), Kullback-Leibler (KL) divergence, Jensen-Shannon divergence, and vector similarity metrics (Plank and van Noord, 2011), as well as task-specific measures (Zhou et al., 2016).

While word distributions have traditionally been used to compare domains, embedding domains in a manifold offers the possibility to evaluate the learned subspace representations. For this, cosine similarity as used for comparing word embeddings, or KL divergence as used in the Variational Autoencoder (Kingma and Welling, 2013), are a natural fit.
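For concreteness, the sketch below (Python) computes two of the measures mentioned above: the Jensen-Shannon divergence between the word distributions of two tiny, made-up domains, and the cosine similarity between two hypothetical domain vectors.

import numpy as np
from collections import Counter

def word_dist(tokens, vocab):
    """Smoothed, normalized word distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    p = np.array([counts[w] for w in vocab], dtype=float) + 1e-12
    return p / p.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

domain_a = "so excited to go with my boyfriend".split()   # toy domain samples
domain_b = "so turned up to go with my bae".split()
vocab = sorted(set(domain_a) | set(domain_b))
print(js_divergence(word_dist(domain_a, vocab), word_dist(domain_b, vocab)))
print(cosine(np.array([0.2, 0.9, 0.1]), np.array([0.3, 0.8, 0.2])))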

6 Evaluation

Our evaluation consists of three parts, for evaluating the learned representations, the model, and the variation of language itself.

Firstly, as our models produce new representations for every subspace, we can compare a snapshot of a domain's representation after every n time steps to chart a trajectory of its changes.

Secondly, as we are conducting experiments on dialogue modeling, gold data for evaluation is readily available in the form of the actual response. We can thus train a model on reddit data of a certain period, adapt it to a stream of future conversations, and evaluate its performance with BLEU or another metric that might be more suitable to expose variation in language. At the same time, human evaluations will reveal whether the generated responses are faithful to the target domain.

Finally, the learned representations will allow us to investigate the variations in language. Ideally, we would like to walk the manifold and observe how language changes as we move from one domain to the other, similarly to (Radford et al., 2016).

7 Conclusion

We have proposed a notion of continuous natural language domains along with dialogue modeling as a test bed. We have presented a representation of continuous domains and detailed how this representation can be incorporated into representation learning-based models. Finally, we have outlined how these models can be used to investigate change and variation in language. While our models allow us to shed light on how language changes, models that can adapt to continuous changes are key for personalization and the reality of grappling with an ever-changing world.



References

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain Adaptation with Structural Correspondence Learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP '06), pages 120–128.

Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. 2016. Domain Separation Networks. NIPS.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. arXiv preprint arXiv:1606.03657.

Cristian Danescu-Niculescu-Mizil, Robert West, Dan Jurafsky, and Christopher Potts. 2013. No Country for Old Members: User Lifecycle and Linguistic Change in Online Communities. Proceedings of the 22nd International Conference on World Wide Web, pages 307–317.

Hal Daume III and Daniel Marcu. 2006. Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research, 26:101–126.

Hal Daume III. 2007. Frustratingly Easy Domain Adaptation. Association for Computational Linguistics (ACL), pages 256–263.

Dan Douglas. 2004. Discourse Domains: The Cognitive Context of Speaking. In Diana Boxer and Andrew D. Cohen, editors, Studying Speaking to Inform Second Language Learning. Multilingual Matters.

Manaal Faruqui and Chris Dyer. 2014. Improving Vector Space Word Representations Using Multilingual Correlation. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471.

Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. 2013. Unsupervised Visual Domain Adaptation Using Subspace Alignment. Proceedings of the IEEE International Conference on Computer Vision.

Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. 2012. Geodesic Flow Kernel for Unsupervised Domain Adaptation. 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Judy Hoffman, Trevor Darrell, and Kate Saenko. 2014. Continuous manifold based adaptation for evolving visual domains. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 867–874.

Diederik P. Kingma and Max Welling. 2013. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.

Rivka Levitan, Agustín Gravano, and Julia Hirschberg. 2011. Entrainment in Speech Preceding Backchannels. Annual Meeting of the Association for Computational Linguistics (ACL/HLT), pages 113–117.

Yishay Mansour. 2009. Domain Adaptation with Multiple Sources. NIPS, pages 1–8.

David McClosky, Eugene Charniak, and Mark Johnson. 2010. Automatic domain adaptation for parsing. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 28–36.

Aditya Mogadala and Achim Rettinger. 2016. Bilingual Word Embeddings from Parallel and Non-parallel Corpora for Cross-Language Text Classification. NAACL, pages 692–702.

Dong Nguyen and Carolyn P. Rose. 2011. Language use as a reflection of socialization in online communities. Proceedings of the Workshop on Languages in ..., pages 76–85.

K. G. Niederhoffer and J. W. Pennebaker. 2002. Linguistic Style Matching in Social Interaction. Journal of Language and Social Psychology, 21(4):337–360.

Harriet J. Oppenheimer. 2006. The Anthropology of Language: An Introduction to Linguistic Anthropology. Wadsworth, Belmont (Canada).

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Barbara Plank and Gertjan van Noord. 2011. Effective Measures of Domain Similarity for Parsing. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 1:1566–1576.

Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR, pages 1–15.

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive Neural Networks. arXiv preprint arXiv:1606.04671.

Vincent Van Asch and Walter Daelemans. 2010. Using Domain Similarity for Performance Estimation. Computational Linguistics, pages 31–36.

Oriol Vinyals and Quoc V. Le. 2015. A Neural Conversational Model.

Guangyou Zhou, Zhiwen Xie, Jimmy Xiangji Huang, and Tingting He. 2016. Bi-Transferring Deep Neural Networks for Domain Adaptation. ACL, pages 322–332.



Author Index

Aizawa, Akiko, 1
Allen, James, 27
Bakhshandeh, Omid, 27
Benikova, Darina, 6
Bethard, Steven, 39
Breslin, John G., 53
Cheung, Jackie Chi Kit, 43
Church, Ken, 35
Do, Quynh Ngoc Thi, 39
Eskenazi, Maxine, 32
Ghaffari, Parsa, 53
Hovy, Eduard, 22
Hurst, Amy, 48
Ilievski, Filip, 17
Kabbara, Jad, 43
Lee, Kyusong, 32
Lewis, Timothy, 48
Matuszek, Cynthia, 48
Mitamura, Teruko, 22
Moens, Marie-Francine, 39
Mooney, Raymond, 11
Pelecanos, Jason, 35
Pichotta, Karl, 11
Postma, Marten, 17
Rajagopal, Dheeraj, 22
Ruder, Sebastian, 53
Sugawara, Saku, 1
Taylor, Matthew, 48
van Erp, Marieke, 17
Vossen, Piek, 17
Zesch, Torsten, 6
Zhao, Tiancheng, 32
Zhu, Weizhong, 35
