
Proceedings of the SIGDIAL 2016 Conference, pages 232–241, Los Angeles, USA, 13-15 September 2016. © 2016 Association for Computational Linguistics

Real-Time Understanding of Complex Discriminative Scene Descriptions

Ramesh Manuvinakurike¹, Casey Kennington³*, David DeVault¹ and David Schlangen²
¹ USC Institute for Creative Technologies / Los Angeles, USA
² DSG / CITEC / Bielefeld University / Bielefeld, Germany
³ Boise State University / Boise, USA
¹ [email protected], ² [email protected], ³ [email protected]

* The work was done while at Bielefeld University.

    Abstract

Real-world scenes typically have complex structure, and utterances about them consequently do as well. We devise and evaluate a model that processes descriptions of complex configurations of geometric shapes and can identify the described scenes among a set of candidates, including similar distractors. The model works with raw images of scenes, and by design can work word-by-word incrementally. Hence, it can be used in highly-responsive interactive and situated settings. Using a corpus of descriptions from game-play between human subjects (who found this to be a challenging task), we show that reconstruction of description structure in our system contributes to task success and supports the performance of the word-based model of grounded semantics that we use.

    1 Introduction

In this paper, we present and evaluate a language processing pipeline that enables an automated system to detect and understand complex referential language about visual objects depicted on a screen. This is an important practical capability for present and future interactive spoken dialogue systems. There is a trend toward increasing deployment of spoken dialogue systems for smartphones, tablets, automobiles, TVs, and other settings where information and options are presented on-screen along with an interactive speech channel in which visual items can be discussed (Celikyilmaz et al., 2014). Similarly, for future systems such as smartphones, quadcopters, or self-driving cars that are equipped with cameras, users may wish to discuss objects visible to the system in camera images or video streams.

A challenge in enabling such capabilities for a broad range of applications is that human speakers draw on a diverse set of perceptual and language skills to communicate about objects in situated visual contexts. Consider the example in Figure 1, drawn from the corpus of RDG-Pento games (discussed further in Section 2). In this example, a human in the director role describes the visual scene highlighted in red (the target image) to another human in the matcher role. The scene description is provided in one continuous stream of speech, but it includes three functional segments, each providing different referential information: [this one is kind of a uh a blue T] [and a wooden w sort of] [the T is kind of malformed]. The first and third of these three segments refer to the object at the top left of the target image, while the middle segment refers to the object at bottom right. An ability to detect the individual segments of language that carry information about individual referents is an important part of deciphering a scene description like this. Beyond detection, actually understanding these referential segments in context seems to require perceptual knowledge of vocabulary for colors, shapes, materials and hedged descriptions like kind of a blue T. In other game scenarios, it's important to understand plural references like two brown crosses and relational expressions like this one has the L on top of the T.

A variety of vocabulary knowledge is needed, as different speakers may describe individual objects in very different ways (the object described as kind of a blue T may also be called a blue odd-shaped piece or a facebook). When many scenes are described by the same pair of speakers, the pair tends to entrain or align to each other's vocabulary (Garrod and Anderson, 1987), for example by settling on facebook as a shorthand description for this object type. Finally, to understand a full scene description, the matcher needs to combine all the evidence from multiple referential segments involving a group of objects to identify the target image.

In this paper, we define and evaluate a language processing pipeline that allows many of these perceptual and language skills to be integrated into an automated system for understanding complex scene descriptions. We take the challenging visual reference game RDG-Pento, shown in Figure 1, as our testbed, and we evaluate both human-human and automated system performance in a corpus study. No prior work we are aware of has put forth techniques for grounded understanding of the kinds of noisy, complex, spoken descriptions of visual scenes that can occur in such interactive dialogue settings. This work describes and evaluates an initial approach to this complex problem, and it demonstrates the critical importance of segmentation and entrainment to achieving strong understanding performance. This approach extends the prior work (Kennington and Schlangen, 2015; Han et al., 2015) that assumed either that referential language from users has been pre-segmented, or that visual scenes are given not as raw images but as clean semantic representations, or that visual scenes are simple enough to be described with a one-off referring expression or caption. Our work makes none of these assumptions.

Our automated pipeline, discussed in Section 3, includes components for learning perceptually grounded word meanings, segmenting a stream of speech, identifying the type of referential language in each speech segment, resolving the references in each type of segment, and aggregating evidence across segments to select the most likely target image. Our technical approach enables all of these components to be trained in a supervised manner from annotated, in-domain, human-human reference data. Our quantitative evaluation, presented in Section 4, looks at the performance of the individual components as well as the overall pipeline, and quantifies the strong importance of segmentation, segment type identification, and speaker-specific vocabulary entrainment for improving performance in this task.

    2 The RDG-Pento Game

The RDG-Pento (Rapid Dialogue Game-Pentomino) game is a two-player collaborative game. RDG-Pento is a variant of the RDG-Image game described by Manuvinakurike and DeVault (2015). As in RDG-Image, both players see 8 images on their screen in a 2x4 grid as shown in Figure 1. One person is assigned the role of director and the other person that of matcher. The director's screen has a single target image (TI) highlighted with a red border. The goal of the director is to uniquely describe the TI for the matcher to identify among the distractor images. The 8 images are shown in a different order on the director and matcher screens, so that the TI cannot be identified by grid position. The players can speak freely until the matcher makes a selection. Once the matcher indicates a selection, the director can advance the game. Over time, the gameplay gets progressively more challenging as the images tend to contain more objects that are similar in shape and color. The task is complex by design.

Director: this one is kind of a uh a blue T and a wooden w sort of the T is kind of malformed
Matcher: okay got it

Figure 1: In the game, the director is describing the image highlighted in red (the target image) to the matcher, who tries to identify this image from among the 8 possible images. The figure shows the game interface as seen by the director, including a transcript of the director's speech.

In RDG-Pento, the individual images are taken from a real-world, tabletop scene containing an arrangement of between one and six physical Pentomino objects. Individual images with varying numbers of objects are illustrated in Figure 2. The 8 images at any one time always contain the same number of objects; the number of objects increases as the game progresses. Players play for 5 rounds, alternating roles. Each round has a time limit (about 200 seconds) that creates time pressure for the players, and the time remaining ticks down in a countdown timer.

Figure 2: Example scene descriptions for three TIs

Data Set  The corpus used here was collected using a web framework for crowd-sourced data collection called Pair Me Up (PMU) (Manuvinakurike and DeVault, 2015). To create this corpus, 42 pairs of native English speakers located in the U.S. and Canada were recruited using AMT. Game play and audio data were captured for each pair of speakers (who were not co-located and communicated entirely through their web browsers), and the resulting audio data was transcribed and annotated. 16 pairs completed all 5 game rounds, while the remaining crowd-sourced pairs completed only part of the game for various reasons. As our focus is on understanding individual scene descriptions, our data set here includes data from the 16 complete games as well as partial games. A more complete description and analysis of the corpus can be found in Zarrieß et al. (2016).

Data Annotation  We annotated the transcribed director and matcher speech through a process of segmentation, segment type labeling, and referent identification. The segment types are shown in Table 1, and example annotations are provided in Figure 2. The annotation is carried out on each target image subdialogue in which the director and matcher discuss an individual target image. The segmentation and labeling steps create a complete partition of each speaker's speech into sequences of words with a related semantic function in our framework.¹

Segment type       Label   Examples
Singular           SIN     this is a green t, plus sign
Multiple objects   MUL     two Zs at top, they're all green
Relation           REL     above, in a diagonal
Others             OT      that was tough, lets start

Table 1: Segment types, labels, and examples

Sequences of words that ascribe properties to a single object are joined under the SIN label. Our SIN segment type is not a simple syntactic concept like "singular NP referring expression". The SIN type includes not only simple singular NPs like the blue s but also clauses like it's the blue s and conjoined clauses like it's like a harry potter and it's like maroon (Figure 2). The individuation criterion for SIN is that a SIN segment must ascribe properties only to a single object; as such it may contain word sequences of various syntactic types.

Sequences of words such as the two crosses that ascribe properties to multiple objects are joined into a segment under the MUL label.

Sequences of words that describe a geometric relation between objects are segmented and given a REL label. These are generally prepositional expressions, and include both single-word prepositions (underneath, below) and multi-word complex prepositions (Quirk et al., 1985) which include multiple orthographic words ("next to", "left of", etc.). The REL segments generally describe geometric relations between objects referred to in SIN and MUL segments. An example would be [MUL two crosses] [REL above] [MUL two Ts].

All other word sequences are assigned the type Others and given an OT label. This segment type includes acknowledgments, confirmations, feedback, and laughter, among other dialogue act types not addressed in this work.

For each segment of type SIN, MUL, or REL, the correct referent object or objects within the target image are also annotated.

In the data set, there are a total of 4132 target image speaker transcripts in which either the director or the matcher's transcribed speech for a target image is annotated. There are 8030 annotated segments (5451 director segments and 2579 matcher segments). There are 1372 word types and 55,238 word tokens.

¹ The annotation scheme was developed iteratively while keeping the reference resolution task and the WAC model (see Section 3.3.1) in mind. The annotation was done by an expert annotator.

    3 Language Processing Pipeline

In this section, we present our language processing pipeline for segmentation and understanding of complex scene descriptions. The modules, decision-making, and information flow for the pipeline are visualized in Figure 3. The pipeline modules include a Segmenter (Section 3.1), a Segment Type Classifier (Section 3.2), and a Reference Resolver (Section 3.3).

In this paper, we focus on how our pipeline could be used to automate the role of the matcher in the RDG-Pento game. We consider the task of selecting the correct target image based on a human director's transcribed speech drawn from our RDG-Pento corpus. The pipeline is designed, however, for eventual real-time operation using incremental ASR results, so that in the future it can be incorporated into a real-time interactive dialogue system. We view it as a crucial design constraint on our pipeline modules that the resolution process must take place incrementally; i.e., processing must not be deferred until the end of the user's speech. This is because humans resolve (i.e., comprehend) speech as it unfolds (Tanenhaus, 1995; Spivey et al., 2002), and incremental processing (i.e., processing word by word) is important to developing an efficient and natural speech channel for interactive systems (Skantze and Schlangen, 2009; Paetzel et al., 2015; DeVault et al., 2009; Aist et al., 2007). In the current study, we have therefore provided the human director's correctly transcribed speech as input to our pipeline on a word-by-word basis, as visualized in Figure 3.
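The paper does not publish code; the following is a minimal sketch of how such a word-by-word pipeline could be wired together. All class and method names (IncrementalPipeline, process_word, etc.) are hypothetical and not taken from the authors' implementation.

```python
# Hypothetical skeleton of an incremental understanding pipeline.
# Each module consumes one word at a time and may revise earlier decisions,
# mirroring the word-by-word processing described in this section.

class IncrementalPipeline:
    def __init__(self, segmenter, type_classifier, resolver):
        self.segmenter = segmenter              # proposes segment boundaries
        self.type_classifier = type_classifier  # labels segments SIN/MUL/REL/OT
        self.resolver = resolver                # scores candidate images
        self.words = []

    def process_word(self, word):
        """Feed one transcribed word; return the currently best image id."""
        self.words.append(word)
        # Re-segment the prefix seen so far (decisions may be revised later).
        segments = self.segmenter.segment(self.words)
        typed = [(seg, self.type_classifier.classify(seg)) for seg in segments]
        # The resolver aggregates evidence from all typed segments seen so far.
        return self.resolver.best_image(typed)

# Usage (hypothetical): feed the director's words as they arrive.
# for word in ["this", "one", "is", "kind", "of", "a", "blue", "T"]:
#     current_best = pipeline.process_word(word)
```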

    3.1 Segmenter

The segmenter module is tasked with identifying the boundary points between segments. In our pipeline, this task is performed independently of the determination of segment types, which is handled by a separate classifier (Section 3.2).

Our approach to segmentation is similar to Celikyilmaz et al. (2014), which used CRFs for a similar task. Our pipeline currently uses linear-chain CRFs to find the segment boundaries (implemented with Mallet (McCallum, 2002)). Using a CRF trained on the annotated RDG-Pento data set, we identify the most likely sequence of word-level boundary tags, where each tag indicates if the current word ends the previous segment or not.²

An example segmentation is shown in Figure 3, where the word sequence weird L to the top left of is segmented into two segments, [weird L] and [to the top left of]. The features provided to the CRF include unigrams³, the speaker's role, part-of-speech (POS) tags obtained using the Stanford POS tagger (Toutanova et al., 2003), and information about the scene such as the number of objects.

    3.2 Segment Type Classifier

The segment type classifier assigns each detected segment one of the type labels in Table 1 (SIN, MUL, REL, OT). This label informs the Reference Resolver module in how to proceed with the resolution process, as explained below.

The segment type labeler is an SVM classifier implemented in LIBSVM (Chang and Lin, 2011). Features used include word unigrams, word POS, user role, number of objects in the TI, and the top-level syntactic category of the segment as obtained from the Stanford parser (Klein and Manning, 2003). Figure 3 shows two examples of output from the segment type classifier, which assigns SIN to [weird L] and REL to [to the top left of].
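For illustration only: a rough sketch of a segment-type classifier in the spirit of the one described above, using scikit-learn's LinearSVC and a dictionary feature encoding instead of LIBSVM. Feature names and example values are assumptions; the syntactic-category feature from the Stanford parser is omitted here.

```python
# Sketch of an SVM segment-type classifier (SIN / MUL / REL / OT).
# scikit-learn is substituted for LIBSVM purely for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def segment_features(words, pos_tags, role, n_objects):
    feats = {"role": role, "n_objects": n_objects}
    for w in words:
        feats["unigram=" + w.lower()] = 1   # bag of unigrams
    for p in pos_tags:
        feats["pos=" + p] = 1               # bag of POS tags
    return feats

# Tiny hypothetical training set: (segment features, label) pairs.
train = [
    (segment_features(["weird", "L"], ["JJ", "NN"], "director", 2), "SIN"),
    (segment_features(["to", "the", "top", "left", "of"],
                      ["TO", "DT", "NN", "NN", "IN"], "director", 2), "REL"),
    (segment_features(["two", "red", "crosses"], ["CD", "JJ", "NNS"],
                      "director", 4), "MUL"),
    (segment_features(["okay", "got", "it"], ["UH", "VBD", "PRP"],
                      "matcher", 2), "OT"),
]
X, y = zip(*train)

clf = make_pipeline(DictVectorizer(sparse=True), LinearSVC())
clf.fit(list(X), list(y))
print(clf.predict([segment_features(["weird", "L"], ["JJ", "NN"], "director", 2)]))
```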

    3.3 Reference Resolver

We introduce some notation to help explain the operation of the reference resolver (RR) module. When a scene description is to be resolved, there is a visual context in the game which we encode as a context set $C = \{I_1, \dots, I_8\}$ containing the eight visible images (see Figure 1). Each image $I_k$ contains $n$ objects $\{o^k_1, \dots, o^k_n\}$, where $n$ is fixed per context set, but varies across context sets from $n = 1$ to $n = 6$. The set of all objects in all images is $O = \{o^k_l\}$, with $0 < k \leq 8$, $0 < l \leq n$.

When the RR is invoked, the director has spoken some sequence of words which has been segmented by earlier modules into one or more segments $S_j = w_{1:m_j}$, where each segment has been assigned a segment type $\mathrm{type}(S_j) \in \{\mathrm{SIN}, \mathrm{MUL}, \mathrm{REL}, \mathrm{OT}\}$. For example, $S_1 = \langle \text{weird}, \text{L} \rangle$, $S_2 = \langle \text{to}, \text{the}, \text{top}, \text{left}, \text{of} \rangle$ and $\mathrm{type}(S_1) = \mathrm{SIN}$, $\mathrm{type}(S_2) = \mathrm{REL}$.

² We currently adopt this two-tag approach rather than BIO tagging as our tag-set provides a complete partition of each speaker's speech.

³ Words of low frequency (i.e.,

Figure 3: Information flow during processing of an utterance. The modules operate incrementally, word-by-word; as shown here, this can lead to revisions of decisions.

The RR then tries to understand the individual words, typed segments, and the full scene description in terms of the visible objects $o^k_l$ and the images $I_k$ in the context set. We describe how words, segments, and scene descriptions are understood in the following three sections.

3.3.1 Understanding words

We understand individual words using the Words-as-Classifiers (WAC) model of Kennington and Schlangen (2015). In this model, a classifier is trained for each word $w_p$ in the vocabulary. The model constructs a function from the perceptual features of a given object to a judgment about how well those features "fit" together with the word being understood. Such a function can be learned using a logistic regression classifier, separately for each word.

The inputs to the classifier are the low-level continuous features that represent the object (RGB values, HSV values, number of detected edges, x/y coordinates and radial distance from the center) extracted using OpenCV.⁴ These classifiers are learned from instances of language use, i.e., by observing referring expressions paired with the object referred to. Crucially, once learned, these word classifiers can be applied to any number of objects in a scene.

⁴ http://opencv.org

We trained a WAC model for each of the (non-relational) words in our RDG-Pento corpus, using the annotated correct referent information for our segmented data. After training, words can be applied to objects to yield a score:

$$\mathrm{score}(w_p, o^k_l) = w_p(o^k_l) \quad (1)$$

(Technically, the score is the response of the classifier associated with word $w_p$ applied to the feature representation of object $o^k_l$.)
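As a concrete, simplified illustration of the WAC idea (not the authors' implementation), the sketch below trains one logistic regression classifier per word on object feature vectors and applies it as in Eq. (1). The feature layout, training data, and helper names are assumptions; see Kennington and Schlangen (2015) for the actual model.

```python
# Simplified words-as-classifiers (WAC) sketch: one logistic regression per word.
# The object features here stand in for the RGB/HSV/edge/position features
# extracted with OpenCV in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

class WACModel:
    def __init__(self):
        self.classifiers = {}  # word -> fitted LogisticRegression

    def train(self, examples):
        """examples: list of (word, object_features, is_referent) triples."""
        by_word = {}
        for word, feats, label in examples:
            by_word.setdefault(word, ([], []))
            by_word[word][0].append(feats)
            by_word[word][1].append(label)
        for word, (X, y) in by_word.items():
            clf = LogisticRegression(max_iter=1000)
            clf.fit(np.array(X), np.array(y))
            self.classifiers[word] = clf

    def score(self, word, object_features):
        """Eq. (1): response of the word's classifier for one object."""
        clf = self.classifiers.get(word)
        if clf is None:
            return 1.0  # unknown word: neutral score (an assumption)
        return clf.predict_proba(np.array([object_features]))[0, 1]

# Hypothetical 3-dimensional object features (e.g., mean R, G, B); positives and
# negatives come from annotated referents vs. distractor objects.
examples = [
    ("blue", [0.1, 0.2, 0.9], 1), ("blue", [0.8, 0.1, 0.1], 0),
    ("blue", [0.2, 0.3, 0.8], 1), ("blue", [0.7, 0.7, 0.2], 0),
]
wac = WACModel()
wac.train(examples)
print(wac.score("blue", [0.15, 0.25, 0.85]))  # a high score is expected
```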

Note that relational expressions are trained slightly differently than non-relational words. Examples of relational expressions include underneath, below, next to, left of, right of, above, and diagonal. A WAC classifier is trained for each full relational expression $e_q$ (treated as a single token), and the 'fit' for a relational expression's classifier is a fit for a pair of objects:

$$\mathrm{score}_{rel}(e_q, o^k_{l_1}, o^k_{l_2}) = e_q(o^k_{l_1}, o^k_{l_2}) \quad (2)$$

(The features used for such a classifier are comparative features, such as the euclidean distance between the two objects, as well as x and y distances.) There are about 300 of these expressions in RDG-Pento. [SIN x] [REL r] [SIN y] is resolved as r(x,y), so x and y are jointly constrained. See Kennington and Schlangen (2015) for details on this training.
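A minimal sketch, under the same assumptions as the previous snippet, of how a relational classifier could score a pair of objects with comparative features (Eq. (2)); the specific features, field names, and coordinates are illustrative only.

```python
# Sketch of pairwise features for a relational WAC classifier (Eq. (2)).
# The real model trains one such classifier per relational expression
# (e.g., "above", "left of"); only the feature construction is shown here.
import math

def pair_features(obj1, obj2):
    """Comparative features for (obj1, obj2), given (x, y) centroids."""
    dx = obj2["x"] - obj1["x"]
    dy = obj2["y"] - obj1["y"]
    return [dx, dy, math.hypot(dx, dy)]  # x-distance, y-distance, euclidean

# A trained classifier for "above" (e.g., a LogisticRegression as before)
# would then be applied as: score = clf.predict_proba([pair_features(a, b)])[0, 1]
print(pair_features({"x": 0.2, "y": 0.1}, {"x": 0.2, "y": 0.6}))
```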

3.3.2 Understanding segments

Consider an arbitrary segment $S_j = w_{1:m_j}$ such as $S_1 = \langle \text{weird}, \text{L} \rangle$. For a segment (SIN or MUL), we attempt to understand the segment as referring to some object or set of objects. To do so, we combine the word-level scores for all the words in the segment to yield a segment-level score⁵ for each object $o^k_l$:

$$\mathrm{score}(S_j, o^k_l) = \mathrm{score}(w_1, o^k_l) \odot \dots \odot \mathrm{score}(w_{m_j}, o^k_l) \quad (3)$$

Each segment $S_j = w_{1:m_j}$ hence induces an order $R_j$ on the object set $O$, through the scores assigned to each object $o^k_l$. With these ranked scores, we look at the type of segment to compute a final score $\mathrm{score}^*_k(S_j)$ for each image $I_k$. For SIN segments, $\mathrm{score}^*_k(S_j)$ is the score of the top-scoring object in $I_k$. For MUL segments with a cardinality of two (e.g., two red crosses), $\mathrm{score}^*_k(S_j)$ is the sum of the scores of the top two objects in $I_k$, and so on.
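The following sketch illustrates Eq. (3) and the SIN/MUL image scoring just described, using multiplication as the composition operator ⊙ (as in footnote 5). The data structures and the `wac` scorer are the hypothetical ones from the earlier snippets, not the authors' code.

```python
# Sketch of segment-level scoring (Eq. (3)) and per-image score* computation.
# `wac.score(word, feats)` is the hypothetical word scorer from the WAC sketch;
# an image is represented as a list of object feature vectors.

def segment_object_score(wac, segment_words, object_features):
    """Eq. (3): compose word-level scores by multiplication."""
    score = 1.0
    for word in segment_words:
        score *= wac.score(word, object_features)
    return score

def segment_image_score(wac, segment_words, seg_type, image_objects, cardinality=1):
    """score*_k(S_j): top object for SIN, sum of the top-n objects for MUL."""
    scores = sorted((segment_object_score(wac, segment_words, feats)
                     for feats in image_objects), reverse=True)
    n = 1 if seg_type == "SIN" else cardinality
    return sum(scores[:n])

# Usage (hypothetical): two objects in one candidate image.
# image = [[0.1, 0.2, 0.9], [0.8, 0.1, 0.1]]
# s = segment_image_score(wac, ["blue", "t"], "SIN", image)
```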

Obtaining the final score $\mathrm{score}^*_k(S_j)$ for REL segments is done in a similar manner with some minor differences. Because REL segments express a relation between pairs of objects (referred to in neighboring segments), a score for the relational expression in $S_j$ can be computed for any pair of distinct objects $o^k_{l_1}$ and $o^k_{l_2}$ in image $I_k$ using Eq. (2). We let $\mathrm{score}^*_k(S_j)$ equal the score computed for the top-scoring objects $o^k_{l_1}$ and $o^k_{l_2}$ of the neighboring segments.

3.3.3 Understanding scene descriptions

In general, a scene description consists of segments $S_1, \dots, S_z$. Composition takes segments $S_1, \dots, S_z$ and produces a ranking over images. For this particular task, we make the following assumption: in each segment, the speaker is attempting to refer to a specific object (or set of objects), which from our perspective as matcher could be in any of the images. A good candidate $I_k$ for the target image will have high scoring objects, all drawn from the same image, for all the segments $S_1, \dots, S_z$.

We therefore obtain a final score for each image as shown in Eq. (4):

$$\mathrm{score}(I_k) = \sum_{j=1}^{z} \mathrm{score}^*_k(S_j) \quad (4)$$

The image $I^*_k$ selected by our pipeline for a full scene description is then given by:

$$I^*_k = \mathop{\mathrm{argmax}}_{k} \; \mathrm{score}(I_k) \quad (5)$$

⁵ The composition operator ⊙ is left-associative and hence incremental. In this paper, word-level scores are composed by multiplying them.
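A minimal sketch of the aggregation in Eqs. (4) and (5): per-segment image scores (computed as in the previous sketch) are summed per image and the argmax image is returned. The score table below is hypothetical.

```python
# Sketch of Eq. (4) (sum of per-segment image scores) and Eq. (5) (argmax).
# segment_scores[j][k] holds score*_k(S_j), computed as in the previous sketch
# (hypothetical numbers here for two segments over three candidate images).

def best_image(segment_scores):
    image_ids = segment_scores[0].keys()
    totals = {k: sum(scores[k] for scores in segment_scores)  # Eq. (4)
              for k in image_ids}
    return max(totals, key=totals.get)                        # Eq. (5)

segment_scores = [
    {"I1": 0.7, "I2": 0.2, "I3": 0.1},   # score* for segment S1, per image
    {"I1": 0.5, "I2": 0.6, "I3": 0.1},   # score* for segment S2, per image
]
print(best_image(segment_scores))  # -> "I1"
```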

    4 Experiments & Evaluations

We first evaluate the segmenter and segment type classifier as individual modules. We then evaluate the entire processing pipeline and explore the impact of several factors on pipeline performance.

4.1 Segmenter Evaluation

Task & Data  We used the annotated RDG-Pento data to perform a "hold-one-dialogue-pair-out" cross-validation of the segmenter. The task is to segment each speaker's speech for each target image by tagging each word using the tags SEG and NOSEG. The SEG tag here indicates the last word in the current segment. Figure 3 gives an example of the tagging.

Results  The results are presented in Table 2. These results show that the segmenter is working with some success, with precision 0.85 and recall 0.74 for the SEG tag indicating a word boundary. Note that occasional errors in segment boundaries may not be overly problematic for the overall pipeline, as what we ultimately care most about is accurate target image selection. We evaluate the overall pipeline below (Section 4.3).

Label    Precision   Recall   F-Score
SEG      0.85        0.74     0.79
NOSEG    0.93        0.97     0.95

Table 2: Segmenter performance

4.2 Segment Type Classifier Evaluation

Task & Data  We used the annotated RDG-Pento data to perform a hold-one-pair-out cross-validation of the segment type classifier, training an SVM classifier to predict labels SIN, MUL, REL, and OT using the features described in Section 3.2.

Results  The results are given in Table 3. We also report the percentage of segments that have each label in the corpus. The segment type classifier performs well on most of the class labels. Of slight concern is the low-frequency MUL label. One factor here is that people use number words like two not just to refer to multiple objects, but also to describe individual objects, e.g., the two red crosses (a MUL segment) vs. the one with two sides (a SIN segment).

Label   Precision   Recall   F-score   % of segments
SIN     0.91        0.96     0.93      57
REL     0.97        0.85     0.91       6
MUL     0.86        0.60     0.71       3
OT      0.96        0.97     0.96      34

Table 3: Segment type classifier performance

    4.3 Pipeline Evaluation

We evaluated our pipeline under varied conditions to understand how well it works when segmentation is not performed at all, when the segmentation and type classifier modules produce perfect output (using oracle annotations), and when entrainment to a specific speaker is possible. We evaluate our pipeline on the accuracy of the task of image retrieval given a scene description from our data set.

4.3.1 Three baselines

We compare against a weak random baseline (1/8 = 0.125) as well as a rather strong one, namely the accuracies of the human-human pairs in the RDG-Pento corpus. As Table 4 shows, in the simplest case, with only one object per image, the average human success rate is 85%, but this decreases to 60% when there are four objects/image. It then increases to 68% when 6 objects are present, possibly due to the use of a more structured description ordering in the six object scenes. We leave further analysis of the human strategies for future work. These numbers show that the game is challenging for humans.

We also include in Table 4 a simple Naive Bayes classification approach as an alternative to our entire pipeline. In our study, there were only 40 possible image sets that were fixed in advance. For each possible image set, a different Naive Bayes classifier is trained using Weka (Hall et al., 2009) in a hold-one-pair-out cross-validation. The eight images are treated as atomic classes to be predicted, and unigram features drawn from the union of all (unsegmented) director speech are used to predict the target image. This method is broadly comparable to the NLU model used in (Paetzel et al., 2015) to achieve high performance in resolving references to pictures of single objects. As can be seen, the accuracy for this method is as high as 43% for single object TIs in the RDG-Pento data set, but the accuracy rapidly falls to near the random baseline as the number of objects/image increases. This weak performance for a classifier without segmentation confirms the importance of segmenting complex descriptions into references to individual objects in the RDG-Pento game.

4.3.2 Five versions of the pipeline

Table 4 includes results for 5 versions of our pipeline. The versions differ in terms of which segment boundaries and segment type labels are used, and in the type of cross-validation performed. A first version (I) explores how well the pipeline works if unsegmented scene descriptions are provided and a SIN label is assumed to cover the entire scene description. This model is broadly comparable to the Naive Bayes baseline, but substitutes a WAC-based NLU component. The evaluation of version (I) uses a hold-one-pair-out (HOPO) cross-validation, where all modules are trained on every pair except for the one being used for testing. A second version (II) uses automatically determined segment boundaries and segment type labels, in a HOPO cross-validation, and represents our pipeline as described in Section 3. A third version (III) substitutes in human-annotated or "oracle" segment boundaries and type labels, allowing us to observe the performance loss associated with imperfect segmentation and type labeling in our pipeline. The fourth and fifth versions of the pipeline switch to a hold-one-episode-out (HOEO) cross-validation, where only the specific scene description ("episode") being tested is held out from training. When compared with a HOPO cross-validation, the HOEO setup allows us to investigate the value of learning from and entraining to the specific speaker's vocabulary and speech patterns (such as calling the purple object in Figure 2 a "harry potter").
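For illustration of the HOPO setup only (not the authors' scripts): scikit-learn's LeaveOneGroupOut can implement hold-one-pair-out splitting when each training item is tagged with its dialogue-pair id; grouping items by episode id instead would give the HOEO setup. All variable names below are assumptions.

```python
# Sketch of hold-one-pair-out (HOPO) cross-validation using group-wise splits.
# Each example carries the id of the dialogue pair it came from; one pair is
# held out per fold. Using episode ids instead would give the HOEO setup.
from sklearn.model_selection import LeaveOneGroupOut

X = [[0], [1], [2], [3], [4], [5]]            # placeholder feature rows
y = ["SIN", "OT", "REL", "SIN", "MUL", "OT"]  # placeholder labels
pair_ids = [1, 1, 2, 2, 3, 3]                 # dialogue-pair id per example

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=pair_ids):
    held_out_pair = pair_ids[test_idx[0]]
    print("testing on pair", held_out_pair, "training on", len(train_idx), "examples")
```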

4.3.3 Results

Table 4 summarizes the image retrieval accuracies for our three baselines and five versions of our pipeline. We discuss here some observations from these results. First, in comparing pipeline versions (I) and (II), we observe that the use of automated segmentation and a segment type classifier in (II) leads to a substantial increase in accuracy of 5-20% (p

                                             #objects per TI
                      Seg+lab   X-validation   1     2     3     4     6
Random baseline       –         –              0.13  0.13  0.13  0.13  0.13
Naive Bayes baseline  –         –              0.43  0.20  0.14  0.14  0.13
(I)                   None      HOPO           0.47  0.20  0.24  0.13  0.15
(II)                  Auto      HOPO           0.52  0.40  0.31  0.24  0.23
(III)                 Oracle    HOPO           0.54  0.42  0.32  0.30  0.26
(IV)                  Auto      HOEO           0.60  0.46  0.37  0.25  0.23
(V)                   Oracle    HOEO           0.64  0.50  0.41  0.34  0.44
Human-human baseline  –         –              0.85  0.73  0.66  0.60  0.68

Table 4: Image retrieval accuracies for five versions of the pipeline and three baselines.

number of objects/image. Comparing (II) and (III), we see that if our segmenter and segment type classifier could reproduce the human segment annotations perfectly, an additional improvement of 1-6% (p

5 Related Work

WAC model, akin to referentially afforded concept composition (McNally and Boleda, 2015)). Also related are the recent efforts in automatic image captioning and retrieval, where the task is to generate a description (a caption) for a given image or retrieve one given a description. A frequently taken approach is to use a convolutional neural network to map the image into a dense vector, and then to condition a neural language model on this to produce an output string, or to use it to map the description into the same space (Vinyals et al., 2015; Devlin et al., 2015; Socher et al., 2014). See also Fang et al. (2015), which is more directly related to our model in that they use "word detectors" to propose words for image regions.

    6 Conclusions & Future work

We have presented an approach to understanding complex, multi-utterance references to images containing spatially complex scenes. The approach by design works incrementally, and hence is ready to be used in an interactive system. We presented evaluations that go end-to-end from utterance input to resolution decision (but not yet taking in speech). We have shown that segmentation is a critical component for understanding complex visual scene descriptions. This work opens avenues for future explorations in various directions. Intra- and inter-segment composition (through multiplication and addition, respectively) are approached somewhat simplistically, and we want to explore the consequences of these decisions more deeply in future work. Additionally, as discussed above, there seems to be much implicit information in how speakers go from one reference to the next, which might be possible to capture in a transition model. Finally, in an online setting, there is more than just the decision "this is the referent"; one must also decide when and how to act based on the confidence in the resolution. Lastly, our results have shown that human pairs do align on their conceptual description frames (Garrod and Anderson, 1987). Whether human users would also do this with an artificial interlocutor, if it were able to do the required kind of online learning, is another exciting question for future work, enabled by the work presented here. We also plan to extend our work in the future to include descriptions which contain relations between non-singular objects (e.g., [MUL two red crosses] [REL above] [SIN brown L], [MUL two red crosses] [REL on top of] [MUL two green Ts], etc.). However, such descriptions were very rare in the corpus.

Obtaining samples for training the classifiers is another issue. One source of sparsity is idiosyncratic descriptions like 'harry potter' or 'facebook'. In dialogue (our intended setting), these could be grounded through clarification requests. A more extensive solution would address metaphoric or meronymic usage ("looks like xyz"). We will explore this in future work.

    Acknowledgments

This work was in part supported by the Cluster of Excellence Cognitive Interaction Technology 'CITEC' (EXC 277) at Bielefeld University, which is funded by the German Research Foundation (DFG), and the KogniHome project, funded by BMBF. This work was supported in part by the National Science Foundation under Grant No. IIS-1219253 and by the U.S. Army. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views, position, or policy of the National Science Foundation or the United States Government, and no official endorsement should be inferred.

References

Gregory Aist, James Allen, Ellen Campana, Carlos Gomez Gallo, Scott Stoness, Mary Swift, and Michael K. Tanenhaus. 2007. Incremental dialogue system faster than and preferred to its nonincremental counterpart. In Proceedings of the 29th Annual Conference of the Cognitive Science Society.

Asli Celikyilmaz, Zhaleh Feizollahi, Dilek Hakkani-Tur, and Ruhi Sarikaya. 2014. Resolving referring expressions in conversational dialogs for natural user interfaces. In Proceedings of EMNLP.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27.

David DeVault, Kenji Sagae, and David Traum. 2009. Can I finish? Learning when to respond to incremental interpretation results in interactive dialogue. In Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 11–20. Association for Computational Linguistics.

Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. 2015. Language models for image captioning: The quirks and what works. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 100–105, Beijing, China, July. Association for Computational Linguistics.

Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, Lawrence Zitnick, and Geoffrey Zweig. 2015. From captions to visual concepts and back. In Proceedings of CVPR, Boston, MA, USA, June. IEEE.

Simon Garrod and Anthony Anderson. 1987. Saying what you mean in dialogue: A study in conceptual and semantic co-ordination. Cognition, 27:181–218.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Ting Han, Casey Kennington, and David Schlangen. 2015. Building and applying perceptually-grounded representations of multimodal scene descriptions. In Proceedings of SEMDial, Gothenburg, Sweden.

Casey Kennington and David Schlangen. 2015. Simple learning and compositional application of perceptually grounded word meanings for incremental reference resolution. In Proceedings of the Conference for the Association for Computational Linguistics (ACL).

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics, pages 423–430.

Ramesh Manuvinakurike and David DeVault. 2015. Pair Me Up: A web framework for crowd-sourced spoken dialogue collection. In Natural Language Dialog Systems and Intelligent Assistants, pages 189–201. Springer.

Mitchell P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third Workshop on Very Large Corpora, pages 82–94.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Louise McNally and Gemma Boleda. 2015. Conceptual vs. referential affordance in concept composition. In Compositionality and Concepts in Linguistics and Psychology, pages 1–20.

Dmitrijs Milajevs, Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and Matthew Purver. 2014. Evaluating neural word representations in tensor-based compositional settings. In Proceedings of EMNLP, pages 708–719.

Maike Paetzel, Ramesh Manuvinakurike, and David DeVault. 2015. "So, which one is it?" The effect of alternative incremental architectures in a high-performance game-playing agent. In The 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL).

R. Quirk, S. Greenbaum, G. Leech, and J. Svartvik. 1985. A Comprehensive Grammar of the English Language. General Grammar Series. Longman.

Gabriel Skantze and David Schlangen. 2009. Incremental dialogue processing in a micro-domain. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 745–753. Association for Computational Linguistics.

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the ACL (TACL).

Michael J. Spivey, Michael K. Tanenhaus, Kathleen M. Eberhard, and Julie C. Sedivy. 2002. Eye movements and spoken language comprehension: Effects of visual context on syntactic ambiguity resolution. Cognitive Psychology, 45(4):447–481.

Michael Tanenhaus. 1995. Integration of visual and linguistic information in spoken language comprehension. Science, 268:1632–1634.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 173–180. Association for Computational Linguistics.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR).

Sina Zarrieß, Julian Hough, Casey Kennington, Ramesh Manuvinakurike, David DeVault, and David Schlangen. 2016. PentoRef: A corpus of spoken references in task-oriented dialogues.