
Proceedings of NAACL-HLT 2019, pages 3772–3783, Minneapolis, Minnesota, June 2 - June 7, 2019. ©2019 Association for Computational Linguistics


What do Entity-Centric Models Learn? Insights from Entity Linking in Multi-Party Dialogue

Laura Aina∗ Carina Silberer∗ Ionut-Teodor Sorodoc∗

Matthijs Westera∗ Gemma Boleda

Universitat Pompeu Fabra, Barcelona, Spain

{firstname.lastname}@upf.edu

Abstract

Humans use language to refer to entities in the external world. Motivated by this, in recent years several models that incorporate a bias towards learning entity representations have been proposed. Such entity-centric models have shown empirical success, but we still know little about why.

In this paper we analyze the behavior of two recently proposed entity-centric models in a referential task, Entity Linking in Multi-party Dialogue (SemEval 2018 Task 4). We show that these models outperform the state of the art on this task, and that they do better on lower frequency entities than a counterpart model that is not entity-centric, with the same model size. We argue that making models entity-centric naturally fosters good architectural decisions. However, we also show that these models do not really build entity representations and that they make poor use of linguistic context. These negative results underscore the need for model analysis, to test whether the motivations for particular architectures are borne out in how models behave when deployed.

1 Introduction

Modeling reference to entities is arguably crucial for language understanding, as humans use language to talk about things in the world. A hypothesis in recent work on referential tasks such as co-reference resolution and entity linking (Haghighi and Klein, 2010; Clark and Manning, 2016; Henaff et al., 2017; Aina et al., 2018; Clark et al., 2018) is that encouraging models to learn and use entity representations will help them better carry out referential tasks. To illustrate, creating an entity representation with the relevant information upon reading a woman should make it easier to

∗ denotes equal contribution.

JOEY TRIBBIANI (183): ". . . see Ross [335], because I [183] think you [335] love her [306]."

Figure 1: Character identification: example.

resolve a pronoun mention like she.1 In the mentioned work, several models have been proposed that incorporate an explicit bias towards entity representations. Such entity-centric models have shown empirical success, but we still know little about what it is that they effectively learn to model. In this analysis paper, we adapt two previous entity-centric models (Henaff et al., 2017; Aina et al., 2018) for a recently proposed referential task and show that, despite their strengths, they are still very far from modeling entities.2

The task is character identification on multi-party dialogue as posed in SemEval 2018 Task 4 (Choi and Chen, 2018).3 Models are given dialogues from the TV show Friends and asked to link entity mentions (nominal expressions like I, she or the woman) to the characters to which they refer in each case. Figure 1 shows an example, where the mentions Ross and you are linked to entity 335, mention I to entity 183, etc. Since the TV series revolves around a set of entities that recur over many scenes and episodes, it is a good benchmark to analyze whether entity-centric models learn and use entity representations for referential tasks.

Our contributions are three-fold: First, we adapt two previous entity-centric models and show that they do better on lower frequency entities

1 Note the analogy with traditional models in formal linguistics like Discourse Representation Theory (Kamp and Reyle, 2013).

2 Source code for our model, the training procedure and the new dataset is published at https://github.com/amore-upf/analysis-entity-centric-nns.

3 https://competitions.codalab.org/competitions/17310.


(a significant challenge for current data-hungry models) than a counterpart model that is not entity-centric, with the same model size. Second, through analysis we provide insights into how they achieve these improvements, and argue that making models entity-centric fosters architectural decisions that result in good inductive biases. Third, we create a dataset and task to evaluate the models' ability to encode entity information such as gender, and show that models fail at it. More generally, our paper underscores the need for the analysis of model behavior, not only through ablation studies, but also through the targeted probing of model representations (Linzen et al., 2016; Conneau et al., 2018).

2 Related Work

Modeling. Various memory architectures have been proposed that are not specifically for entity-centric models, but could in principle be employed in them (Graves et al., 2014; Sukhbaatar et al., 2015; Joulin and Mikolov, 2015; Bansal et al., 2017). The two models we base our results on (Henaff et al., 2017; Aina et al., 2018) were explicitly motivated as entity-centric. We show that our adaptations yield good results and provide a closer analysis of their behavior.

Tasks. The task of entity linking has been formalized as resolving entity mentions to referential entity entries in a knowledge repository, mostly Wikipedia (Bunescu and Pasca, 2006; Mihalcea and Csomai, 2007, and much subsequent work; for recent approaches see Francis-Landau et al., 2016; Chen et al., 2018). In the present entity linking task, only a list of entities is given, without associated encyclopedic entries, and information about the entities needs to be acquired from scratch through the task; note the analogy to how a human audience might get familiar with the TV show characters by watching it. Moreover, it addresses multi-party dialogue (as opposed to, typically, narrative text), where speaker information is crucial. A task closely related to entity linking is coreference resolution, i.e., predicting which portions of a text refer to the same entity (e.g., Marie Curie and the scientist). This typically requires clustering mentions that refer to the same entity (Pradhan et al., 2011). Mention clusters essentially correspond to entities, and recent work on coreference and language modeling has started exploiting an explicit notion of entity (Haghighi and Klein, 2010; Clark and Manning, 2016; Yang et al., 2017). Previous

work both on entity linking and on coreference resolution (cited above, as well as Wiseman et al., 2016) often presents more complex models that incorporate, e.g., hand-engineered features. In contrast, we keep our underlying model basic, since we want to systematically analyze how certain architectural decisions affect performance. For the same reason we deviate from previous work on entity linking that uses a specialized coreference resolution module (e.g., Chen et al., 2017).

Analysis of Neural Network Models. Our work joins a recent strand in NLP that systematically analyzes what different neural network models learn about language (Linzen et al., 2016; Kadar et al., 2017; Conneau et al., 2018; Gulordava et al., 2018b; Nematzadeh et al., 2018, a.o.). This work, like ours, has yielded both positive and negative results: There is evidence that they learn complex linguistic phenomena of morphological and syntactic nature, like long distance agreement (Gulordava et al., 2018b; Giulianelli et al., 2018), but less evidence that they learn how language relates to situations; for instance, Nematzadeh et al. (2018) show that memory-augmented neural models fail on tasks that require keeping track of inconsistent states of the world.

3 Models

We approach character identification as a classification task, and compare a baseline LSTM (Hochreiter and Schmidhuber, 1997) with two models that enrich the LSTM with a memory module designed to learn and use entity representations. LSTMs are the workhorse for text processing, and thus a good baseline to assess the contribution of this module. The LSTM processes text of dialogue scenes one token at a time, and the output is a probability distribution over the entities (the set of entity IDs is given).

3.1 Baseline: BILSTM

The BILSTM model is depicted in Figure 2. It is a standard bidirectional LSTM (Graves et al., 2005), with the difference from most uses of LSTMs in NLP that we incorporate speaker information in addition to the linguistic content of utterances.

The model is given chunks of dialogue (see Appendix for hyperparameter settings such as the chunk size). At each time step i, one-hot vectors for token ti and speaker entities si are embedded


Figure 2: BILSTM applied to ". . . think you love . . ." as spoken by Joey (from Figure 1), outputting class scores for mention "you" (bias bo not depicted).

via two distinct matrices Wt and We and concatenated to form a vector xi (Eq. 1, where ‖ denotes concatenation; note that in case of multiple simultaneous speakers Si their embeddings are summed).

$$x_i = W_t\, t_i \,\|\, \sum_{s \in S_i} W_e\, s \qquad (1)$$

The vector xi is fed through the nonlinear activation function tanh and input to a bidirectional LSTM. The hidden state $\overrightarrow{h}_i$ of a unidirectional LSTM for the ith input is recursively defined as a combination of that input with the LSTM's previous hidden state $\overrightarrow{h}_{i-1}$. For a bidirectional LSTM, the hidden state hi is the concatenation of the hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ of two unidirectional LSTMs which process the data in opposite directions (Eqs. 2-4).

$$\overrightarrow{h}_i = \mathrm{LSTM}(\tanh(x_i),\, \overrightarrow{h}_{i-1}) \qquad (2)$$
$$\overleftarrow{h}_i = \mathrm{LSTM}(\tanh(x_i),\, \overleftarrow{h}_{i+1}) \qquad (3)$$
$$h_i = \overrightarrow{h}_i \,\|\, \overleftarrow{h}_i \qquad (4)$$

For every entity mention ti (i.e., every token4 that is tagged as referring to an entity), we obtain a distribution over all entities, $o_i \in [0, 1]^{1 \times N}$, by applying a linear transformation to its hidden state hi (Eq. 5), and feeding the resulting gi to a softmax classifier (Eq. 6).

$$g_i = W_o\, h_i + b_o \qquad (5)$$
$$o_i = \mathrm{softmax}(g_i) \qquad (6)$$

Eq. 5 is where the other models will diverge.
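To make the computation in Eqs. 1-6 concrete, here is a minimal PyTorch sketch of the baseline; module names and the assumption of one speaker per token are ours (the paper sums embeddings of simultaneous speakers), and layer sizes are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BiLSTMBaseline(nn.Module):
    """Minimal sketch of the BILSTM baseline (Eqs. 1-6); sizes are illustrative."""
    def __init__(self, vocab_size, n_entities, tok_dim=300, ent_dim=150, hidden=250):
        super().__init__()
        self.W_t = nn.Embedding(vocab_size, tok_dim)   # token embeddings
        self.W_e = nn.Embedding(n_entities, ent_dim)   # speaker/entity embeddings
        self.lstm = nn.LSTM(tok_dim + ent_dim, hidden,
                            bidirectional=True, batch_first=True)
        self.W_o = nn.Linear(2 * hidden, n_entities)   # Eq. 5 (bias b_o included)

    def forward(self, tokens, speakers):
        # tokens, speakers: (batch, seq) index tensors; one speaker id per token
        x = torch.cat([self.W_t(tokens), self.W_e(speakers)], dim=-1)  # Eq. 1
        h, _ = self.lstm(torch.tanh(x))                                # Eqs. 2-4
        g = self.W_o(h)                                                # Eq. 5
        return torch.softmax(g, dim=-1)                                # Eq. 6
```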

Figure 3: ENTLIB; everything before hi, omitted here, is the same as in Figure 2.

3.2 ENTLIB (Static Memory)

The ENTLIB model (Figure 3) is an adaptation of our previous work in Aina et al. (2018), which was the winner of the SemEval 2018 Task 4 competition. This model adds a simple memory module that is expected to represent entities because its vectors are tied to the output classes (accordingly, Aina et al., 2018, call this module entity library). We call this memory 'static', since it is updated only during training, after which it remains fixed.

Where BILSTM maps the hidden state hi to class scores oi with a single transformation (plus softmax), ENTLIB instead takes two steps: It first transforms hi into a 'query' vector qi (Eq. 7) that it will then use to query the entity library. As we will see, this mechanism helps divide the labor between representing the context (hidden layer) and doing the prediction task (query layer).

$$q_i = W_q\, h_i + b_q \qquad (7)$$

A weight matrix We is used as the entity library, which is the same as the speaker embedding in Eq. 1: the query vector $q_i \in \mathbb{R}^{1 \times k}$ is compared to each vector in We (cosine), and a gate vector gi is obtained by applying the ReLU function to the cosine similarity scores (Eq. 8).5 Thus, the query extracted from the LSTM's hidden state is used as a soft pointer over the model's representation of the entities.

$$g_i = \mathrm{ReLU}(\cos(W_e, q_i)) \qquad (8)$$

As before, a softmax over gi then yields the distribution over entities (Eq. 6).

4 For multi-word mentions this is done only for the last token in the mention.

5 In Aina et al. (2018), the gate did not include the ReLU nonlinear activation function. Adding it improved results.


Figure 4: ENTNET; everything before hi, omitted here, is the same as in Figure 2.

So, in the ENTLIB model, Eqs. 7 and 8 together take the place of Eq. 5 in the BILSTM model.
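A minimal sketch of this querying step, assuming the BiLSTM hidden state from the baseline above and reusing its speaker/referent embedding as the entity library; names and sizes are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityLibraryHead(nn.Module):
    """Sketch of the ENTLIB output head (Eqs. 7-8)."""
    def __init__(self, W_e: nn.Embedding, hidden_dim=500, query_dim=150):
        super().__init__()
        self.W_e = W_e                               # shared with the speaker embedding (Eq. 1)
        self.W_q = nn.Linear(hidden_dim, query_dim)  # Eq. 7: query extractor (includes b_q)

    def forward(self, h):
        # h: (batch, hidden_dim) BiLSTM state at a mention token
        q = self.W_q(h)                                                   # Eq. 7
        sims = F.cosine_similarity(q.unsqueeze(1),                        # (batch, 1, k)
                                   self.W_e.weight.unsqueeze(0), dim=-1)  # -> (batch, N)
        g = F.relu(sims)                                                  # Eq. 8: gate over entities
        return torch.softmax(g, dim=-1)                                   # Eq. 6
```

For instance, the head could be constructed as `EntityLibraryHead(baseline.W_e)` so that speakers and referents share one matrix, which is the parameter sharing discussed below.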

Our implementation differs from Aina et al. (2018) in one important point that we will show to be relevant for modeling less frequent entities (training also differs, see Section 4): The original model did not do parameter sharing between speakers and referents, but used two distinct weight matrices.

Note that the contents of the entity library in ENTLIB do not change during forward propagation of activations, but only during backpropagation of errors, i.e., during training, when the weights of We are updated. If anything, they will encode permanent properties of entities, not properties that change within a scene or between scenes or episodes, which should be useful for reference. The next model attempts to overcome this limitation.

3.3 ENTNET (Dynamic Memory)

ENTNET (Figure 4) is an adaptation of Recurrent Entity Networks (Henaff et al., 2017) to the task. Instead of representing each entity by a single vector, as in ENTLIB, here each entity is represented jointly by a context-invariant or 'static' key and a context-dependent or 'dynamic' value. For the keys the entity embedding We is used, just like the entity library of ENTLIB. But the values Vi can be dynamically updated throughout a scene.

As before, an entity query qi is first obtained from the BILSTM (Eq. 7). Then, ENTNET computes gate values gi by estimating the query's similarity to both keys and values, as in Eq. 9 (replacing Eq. 8 of ENTLIB).6 Output scores oi are computed

6 Two small changes with respect to the original model (motivated by empirical results in the hyperparameter search)

as in the previous models (Eq. 6).

$$g_i = \mathrm{ReLU}(\cos(W_e, q_i) + \cos(V_i, q_i)) \qquad (9)$$

The values Vi are initialized at the start of every scene (i = 0) as being identical to the keys (V0 = We). After processing the ith token, new information can be added to the values. Eq. 10 computes this new information $\tilde{V}_{i,j}$ for the jth entity, where Q, R and S are learned linear transformations and PReLU denotes the parameterized rectified linear unit (He et al., 2015):

$$\tilde{V}_{i,j} = \mathrm{PReLU}(Q\, W_{e_j} + R\, V_{i,j} + S\, q_i) \qquad (10)$$

This information $\tilde{V}_{i,j}$, multiplied by the respective gate gi,j, is added to the values to be used when processing the next (i+1th) token (Eq. 11), and the result is normalized (Eq. 12):

$$V_{i+1,j} = V_{i,j} + g_{i,j} \cdot \tilde{V}_{i,j} \qquad (11)$$
$$V_{i+1,j} = \frac{V_{i+1,j}}{\|V_{i+1,j}\|} \qquad (12)$$

Our adaptation of the Recurrent Entity Network involves two changes. First, we use a biLSTM to process the linguistic utterance, while Henaff et al. (2017) used a simple multiplicative mask (we have natural dialogue, while their main evaluation was on bAbI, a synthetic dataset). Second, in the original model the gates were used to retrieve and output information about the query, whereas we use them directly as output scores because our task is referential. This also allows us to tie the keys to the characters of the Friends series as in the previous model, and thus have them represent entities (in the original model, the keys represented entity types, not instances).
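The per-token memory update of Eqs. 9-12 can be sketched as follows; this is an illustrative reading of the adapted model, not the released code, and the class and method names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicEntityMemory(nn.Module):
    """Sketch of the ENTNET memory (Eqs. 9-12): keys are the shared entity
    embedding W_e; values V are reset to the keys at the start of each scene."""
    def __init__(self, W_e: nn.Embedding, dim=150):
        super().__init__()
        self.W_e = W_e
        self.Q = nn.Linear(dim, dim, bias=False)
        self.R = nn.Linear(dim, dim, bias=False)
        self.S = nn.Linear(dim, dim, bias=False)
        self.prelu = nn.PReLU()

    def init_values(self):
        return self.W_e.weight.clone()                   # V_0 = W_e

    def step(self, q, V):
        # q: (dim,) entity query for the current token; V: (n_entities, dim)
        keys = self.W_e.weight
        g = F.relu(F.cosine_similarity(keys, q.unsqueeze(0), dim=-1)
                   + F.cosine_similarity(V, q.unsqueeze(0), dim=-1))             # Eq. 9
        V_tilde = self.prelu(self.Q(keys) + self.R(V) + self.S(q).unsqueeze(0))  # Eq. 10
        V_new = V + g.unsqueeze(-1) * V_tilde                                    # Eq. 11
        V_new = F.normalize(V_new, dim=-1)                                       # Eq. 12
        scores = torch.softmax(g, dim=-1)                                        # Eq. 6 output
        return scores, V_new
```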

4 Character Identification

The training and test data for the task span the first two seasons of Friends, divided into scenes and episodes, which were in turn divided into utterances (and tokens) annotated with speaker identity.7 The set of all possible entities to refer to is given, as well as the set of mentions to resolve. Only the dialogues and speaker information are available (e.g., no video or descriptive text). Indeed,

are that we compute the gate using cosine similarity instead of dot product, and the obtained similarities are fed through a ReLU nonlinearity instead of sigmoid.

7 The dataset also includes automatic linguistic annotations, e.g., PoS tags, which our models do not use.


                      all (78)          main (7)
models      #par      F1      Acc       F1      Acc
SemEv-1st   -         41.1    74.7      79.4    77.2
SemEv-2nd   -         13.5    68.6      83.4    82.1
BILSTM      3.4M      34.4    74.6      85.0    83.5
ENTLIB      3.3M      49.6∗   77.6∗     84.9    84.2
ENTNET      3.4M      52.5∗   77.5∗     84.8    83.9

Table 1: Model parameters and results on the character identification task. First block: top systems at SemEval 2018. Results in the second block marked with ∗ are statistically significantly better than BILSTM at p < 0.001 (approximate randomization tests, Noreen, 1989).

one of the most interesting aspects of the SemEval data is the fact that it is dialogue (even if scripted), which allows us to explore the role of speaker information, one of the aspects of the extralinguistic context of utterance that is crucial for reference. We additionally used the publicly available 300-dimensional word vectors that were pre-trained on a Google News corpus with the word2vec Skip-gram model (Mikolov et al., 2013a) to represent the input tokens. Entity (speaker/referent) embeddings were randomly initialized.

We train the models with backpropagation, using the standard negative log-likelihood loss function. For each of the three model architectures we performed a random search (> 1500 models) over the hyperparameters using cross-validation (see Appendix for details), and report the results of the best settings after retraining without cross-validation. The findings we report are representative of the model populations.

Results. We follow the evaluation defined in the SemEval task. Metrics are macro-average F1-score (which computes the F1-score for each entity separately and then averages these over all entities) and accuracy, in two conditions: all entities, with 78 classes (77 for entities that are mentioned in both training and test set of the SemEval Task, and one grouping all others), and main entities, with 7 classes (6 for the main characters and one for all the others). Macro-average F1-score on all entities, the most stringent metric, was the criterion to define the leaderboard.
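As an illustration of the metric (not the official SemEval scorer), macro-averaged F1 and accuracy can be computed with scikit-learn; the entity IDs below are toy values.

```python
from sklearn.metrics import f1_score, accuracy_score

gold = [335, 183, 335, 306, 183]   # hypothetical gold entity IDs, one per mention
pred = [335, 183, 306, 306, 183]   # hypothetical model predictions

macro_f1 = f1_score(gold, pred, average="macro")  # per-entity F1, then averaged
acc = accuracy_score(gold, pred)
print(f"macro F1 = {macro_f1:.3f}, accuracy = {acc:.3f}")
```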

Table 1 gives our results in the two evaluations, comparing the models described in Section 3 to the best performing models in the SemEval 2018 Task 4 competition (Aina et al., 2018; Park et al.,

Figure 5: Accuracy on entities with high (>1000), medium (20–1000), and low (<20) frequency.

2018). Recall that our goal in this paper is not to optimize performance, but to understand model behavior; however, results show that these models are worth analyzing, as they outperform the state of the art. All models perform on a par on main entities, but entity-centric models outperform BILSTM by a substantial margin when all characters are to be predicted (the difference between ENTLIB and ENTNET is not significant).

The architectures of ENTLIB and ENTNET help with lower frequency characters, while not hurting performance on main characters. Indeed, Figure 5 shows that the accuracy of BILSTM rapidly deteriorates for less frequent entities, whereas ENTLIB and ENTNET degrade more gracefully. Deep learning approaches are data-hungry, and entity mentions follow the Zipfian distribution typical of language, with very few high frequency and many lower-frequency items, so this is a welcome result. Moreover, these improvements do not come at the cost of model complexity in terms of number of parameters, since all models have roughly the same number of parameters (3.3–3.4 million).8

Given these results and the motivations for the model architectures, it would be tempting to conclude that encouraging models to learn and use entity representations helps in this referential task. However, a closer look at the models' behavior reveals a much more nuanced picture.

Figure 6 suggests that: (1) models are quite good at using speaker information, as the best performance is for first person pronouns and determiners (I, my, etc.); (2) instead, models do not seem to be very good at handling other contextual information or entity-specific properties, as the worst

8 See Appendix for a computation of the models' parameters.


Figure 6: F1-score (all entities condition) of the three models, per mention type, and token frequency of each mention type.

performance is for third person mentions and common nouns, which require both;9 (3) ENTLIB and ENTNET behave quite similarly, with performance boosts in (1) and smaller but consistent improvements in (2). Our analyses in the next two sections confirm this picture and relate it to the models' architectures.

5 Analysis: Architecture

We examine how the entity-centric architectures improve over the BILSTM baseline on the reference task, then move to entity representations (Section 6).

Shared speaker/referent representation. We found that an important advantage of the entity-centric models, in particular for handling low-frequency entities, lies in the integrated representations they enable of entities both in their role of speakers and in their role of referents. This explains the boost in first person pronoun and proper noun mentions, as follows.

Recall that the integrated representation is achieved by parameter sharing, using the same weight matrix We as speaker embedding and as entity library/keys. This enables entity-centric models to learn the linguistic rule "a first person pronoun (I, me, etc.) refers to the speaker" regardless of whether they have a meaningful representation of this particular entity: It is enough that speaker representations are distinct, and they are because they have been randomly initialized. In contrast, the

9 1st person: I, me, my, myself, mine; 2nd person: you, your, yourself, yours; 3rd person: she, her, herself, hers, he, him, himself, his, it, itself, its.

model type   main    all
BILSTM       0.39    0.02
ENTLIB       0.82    0.13
ENTNET       0.92    0.16
#pairs       21      22155

Table 2: RSA correlation between speaker/referent embeddings We and token embeddings Wt of the entities' names, for main entities vs. all entities.

simple BILSTM baseline needs to independently learn the mapping between speaker embedding and output entities, and so it can only learn to resolve even first-person pronouns for entities for which it has enough data.

For proper nouns (character names), entity-centric models learn to align the token embeddings with the entity representations (identical to the speaker embeddings). We show this by using Representational Similarity Analysis (RSA) (Kriegeskorte et al., 2008), which measures how topologically similar two different spaces are as the Spearman correlation between the pair-wise similarities of points in each space (this is necessary because entities and tokens are in different spaces). For instance, if the two spaces are topologically similar, the relationship of entities 183 and 335 in the entity library will be analogous to the relationship between the names Joey and Ross in the token space. Table 2 shows the topological similarities between the two spaces, for the different model types.10 This reveals that in entity-centric models the space of speaker/referent embeddings is topologically very similar to the space of token embeddings restricted to the entities' names, and more so than in the BILSTM baseline. We hypothesize that entity-centric models can do the alignment better because referent (and hence speaker) embeddings are closer to the error signal, and thus backpropagation is more effective (this again helps with lower-frequency entities).
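RSA itself is a short computation; a sketch under our reading of the analysis (the function name and the illustrative call are ours, not the released code):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa(space_a: np.ndarray, space_b: np.ndarray) -> float:
    """Representational Similarity Analysis: Spearman correlation between the
    pairwise cosine similarities of two spaces; rows of both arrays must
    correspond to the same items (here: entities and their names)."""
    sims_a = 1.0 - pdist(space_a, metric="cosine")  # condensed pairwise similarities
    sims_b = 1.0 - pdist(space_b, metric="cosine")
    return spearmanr(sims_a, sims_b).correlation

# Illustrative call: compare the speaker/referent space with the token space
# restricted to each entity's most frequent name.
# rsa(W_e[entity_ids], W_t[name_token_ids])
```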

Further analysis revealed that in entity-centric models the beneficial effect of weight sharing between the speaker embedding and the entity representations (both We) is actually restricted to first-person pronouns. For other expressions, having

10As an entity’s name we here take the proper noun that ismost frequently used to refer to the entity in the training data.Note that for the all entities condition the absolute values arelower, but the space is much larger (over 22K pairs). Alsonote that this is an instance of slow learning; models are notencoding the fact that a proper noun like Rachel can refer todifferent people.


Figure 7: ENTLIB, 2D TSNE projections of the activations for first-person mentions in the test set, colored by the entity referred to. The mentions cluster into entities already in the hidden layer hi (left graph; query layer qi shown in the right graph). Best viewed in color.

Figure 8: ENTLIB, 2D TSNE projections of the activations for mentions in the test set (excluding first person mentions), colored by the entity referred to. While there is already some structure in the hidden layer hi (left graph), the mentions cluster into entities much more clearly in the query qi (right graph). Best viewed in color.

two distinct matrices yielded almost the same performance as having one (but still higher than the BILSTM, thanks to the other architectural advantage that we discuss below).

In the case of first-person pronouns, the speaker embedding given as input corresponds to the target entity. This information is already accessible in the hidden state of the LSTM. Therefore, mentions cluster into entities already at the hidden layer hi, with no real difference with the query layer qi (see Figure 7).

Advantage of query layer. The entity querying mechanism described above entails having an extra transformation after the hidden layer, with the query layer q. Part of the improved performance of entity-centric models, compared to the BILSTM baseline, is due not to their bias towards 'entity representations' per se, but due to the presence of

this extra layer. Recall that the BILSTM baseline maps the LSTM's hidden state hi to output scores oi with a single transformation. Gulordava et al. (2018a) observe in the context of Language Modeling that this creates a tension between two conflicting requirements for the LSTM: keeping track of contextual information across time steps, and encoding information useful for prediction in the current timestep. The intermediate query layer q in entity-centric models alleviates this tension. This explains the improvements in context-dependent mentions like common nouns or second and third person pronouns.

We show this effect in two ways. First, we compare the average mean similarity s of mention pairs $T_e = \{(t_k, t_{k'}) \mid t_k \rightarrow e \wedge k \neq k'\}$ referring to the same entity e in the hidden layer (Eq. 13) and in the query layer.11


       BILSTM    ENTLIB           ENTNET
       hi        hi      qi       hi      qi
       0.34      0.24    0.48     0.27    0.60

Table 3: Average cosine similarity of mentions with the same referent.

$$s = \frac{1}{|E|}\sum_{e \in E} \frac{1}{|T_e|} \sum_{(t_k, t_{k'}) \in T_e} \cos(h_{t_k}, h_{t_{k'}}) \qquad (13)$$

Table 3 shows that, in entity-centric models, this similarity is lower in the hidden layer hi than in the case of the BILSTM baseline, but in the query layer qi it is instead much higher. The hidden layer thus is representing other information than referent-specific knowledge, and the query layer can be seen as extracting referent-specific information from the hidden layer. Figure 8 visually illustrates the division of labor between the hidden and query layers. Second, we compared the models to variants where the cosine-similarity comparison is replaced by an ordinary dot-product transformation, which converts the querying mechanism into a simple further layer. These variants performed almost as well on the reference task, albeit with a slight but consistent edge for the models using cosine similarity.
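A sketch of the Eq. 13 computation (function and argument names are ours; the hedged assumption is that mention vectors have been collected per entity beforehand):

```python
import itertools
import torch
import torch.nn.functional as F

def avg_same_referent_similarity(vectors_by_entity):
    """Eq. 13: for each entity, the mean pairwise cosine similarity of the mention
    vectors (hidden states h or queries q) that refer to it, averaged over entities.
    vectors_by_entity: dict entity_id -> list of 1-D tensors."""
    per_entity = []
    for vecs in vectors_by_entity.values():
        pairs = list(itertools.combinations(vecs, 2))
        if not pairs:
            continue
        sims = [F.cosine_similarity(a, b, dim=0).item() for a, b in pairs]
        per_entity.append(sum(sims) / len(sims))
    return sum(per_entity) / len(per_entity)
```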

No dynamic updates in ENTNET. A surprising negative finding is that ENTNET is not using its dynamic potential on the referential task. We confirmed this in two ways. First, we tracked the values Vi of the entity representations and found that the pointwise difference in Vi at any two adjacent time steps i tended to zero. Second, we simply switched off the update mechanism during testing and did not observe any score decrease on the reference task. ENTNET is thus only using the part of the entity memory that it shares with ENTLIB, i.e., the keys We, which explains their similar performance.
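A sketch of the first diagnostic, under the assumption that the value memory is snapshotted after every token (not the authors' code):

```python
def max_value_drift(V_trace):
    """V_trace: list of (n_entities, dim) tensors, one snapshot of V per time step.
    Returns the largest pointwise change between adjacent steps; a value near
    zero means the dynamic memory is effectively unused."""
    drifts = [(V_trace[i + 1] - V_trace[i]).abs().max().item()
              for i in range(len(V_trace) - 1)]
    return max(drifts) if drifts else 0.0
```

The second diagnostic amounts to returning the old values unchanged at test time and checking that task scores do not drop.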

This finding is markedly different from Henaff et al. (2017), where for instance the bAbI tasks could be solved only by dynamically updating the entity representations. This may reflect our different language modules: since our LSTM module already has a form of dynamic memory, unlike the simpler sentence processing module in Henaff et al. (2017), it may be that the LSTM takes this burden off of the entity module. An alternative is that it is due to differences in the datasets.

11 For the query layer, Eq. 13 is equivalent, with $\cos(q_{t_k}, q_{t_{k'}})$.

This person is {a/an/the} <PROPERTY> [and {a/an/the} <PROPERTY>]{0,2}.
This person is the brother of Monica Geller.
This person is a paleontologist and a man.

Figure 9: Patterns and examples (in italics) of the dataset for information extraction as entity linking.

We leave an empirical comparison of these potential explanations for future work, and focus in Section 6 on the static entity representations We that ENTNET essentially shares with ENTLIB.

6 Analysis: Entity Representations

The foregoing demonstrates that entity-centric architectures help in a reference task, but not that the induced representations in fact contain meaningful entity information. In this section we deploy these representations on a new dataset, showing that they do not—not even for basic information about entities such as gender.

Method. We evaluate entity representations with an information extraction task including attributes and relations, using information from an independent, unstructured knowledge base—the Friends Central Wikia.12 To be able to use the models as is, we set up the task in terms of entity linking, asking models to resolve the reference of natural language descriptions that uniquely identify an entity. For instance, given This person is the brother of Monica Geller., the task is to determine that person refers to Ross Geller, based on the information in the sentence.13 The information in the descriptions was in turn extracted from the Wikia. We do not retrain the models for this task in any way—we simply deploy them.

We linked the entities from the Friends dataset used above to the Wikia through a semi-automatic procedure that yielded 93 entities, and parsed the Wikia to extract their attributes (gender and job) and relations (e.g., sister, mother-in-law; see Appendix for details). We automatically generate the natural language descriptions with a simple pattern (Figure 9) from combinations of properties that uniquely identify a given entity within the set of Friends characters.14 We

12 http://friends.wikia.com.
13 The referring expression is the whole DP, This person, but we follow the method in Aina et al. (2018) of asking for reference resolution at the head noun.
14 Models require inputting a speaker; we use speaker UNKNOWN.


model     description   gender   job
RANDOM    1.5           50       20
BILSTM    0.4           -        -
ENTLIB    2.2           55       27
ENTNET    1.3           61       24

Table 4: Results on the attribute and relation prediction task: percentage accuracy for natural language descriptions, mean reciprocal rank of characters for single attributes (lower is worse).

consider unique descriptions comprising at most 3 properties. Each property is expressed by a noun phrase, whereas the article is adapted (definite or indefinite) depending on whether that property applies to one or several entities in our data. This yields 231 unique natural language descriptions of 66 characters, created on the basis of overall 61 relation types and 56 attribute values.
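A sketch of the description generator under the pattern of Figure 9; the function and data structures are hypothetical, and a/an agreement is omitted for brevity.

```python
from itertools import combinations

def describe(entity, properties, property_counts, max_props=3):
    """Generate descriptions from up to three properties that jointly single out
    `entity`. properties: dict entity -> set of noun-phrase properties;
    property_counts: dict property -> number of entities it applies to
    (1 -> definite article 'the', otherwise indefinite)."""
    descriptions = []
    for n in range(1, max_props + 1):
        for combo in combinations(sorted(properties[entity]), n):
            # keep the combination only if no other entity has all these properties
            if any(set(combo) <= properties[other]
                   for other in properties if other != entity):
                continue
            phrases = [("the " if property_counts[p] == 1 else "a ") + p for p in combo]
            descriptions.append("This person is " + " and ".join(phrases) + ".")
    return descriptions
```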

Results. The results of this experiment are negative: The first column of Table 4 shows that models get accuracies near 0.

A possibility is that models do encode information in the entity representations, but it doesn't get used in this task because of how the utterance is encoded in the hidden layer, or that results are due to some quirk in the specific setup of the task. However, we replicated the results in a setup that does not encode whole utterances but works with single attributes and relations. While the methodological details are in the Appendix, the 'gender' and 'job' columns of Table 4 show that results are a bit better in this case, but models still perform quite poorly: Even in the case of an attribute like gender, which is crucial for the resolution of third person pronouns (he/she), the models' results are quite close to those of a random baseline.

Thus, we take it to be a robust result that entity-centric models trained on the SemEval data do not learn or use entity information—at least as recoverable from language cues. This, together with the remainder of the results in the paper, suggests that models rely crucially on speaker information, but hardly on information from the linguistic context.15

Future work should explore alternatives such as pre-training with a language modeling task, which could improve the use of context.

15 Note that 44% of the mentions in the dataset are first person, for which linguistic context is irrelevant and the models only need to recover the relevant speaker embedding to succeed. However, downsampling first person mentions did not improve results on the other mention types.

7 Conclusions

Recall that the motivation for entity-centric models is the hypothesis that incorporating entity representations into the model will help it better model the language we use to talk about them. We still think that this hypothesis is plausible. However, the architectures tested do not yet provide convincing support for it, at least for the data analyzed in this paper.

On the positive side, we have shown that framing models from an entity-centric perspective makes it very natural to adopt architectural decisions that are good inductive biases. In particular, by exploiting the fact that both speakers and referents are entities, these models can do more with the same model size, improving results on less frequent entities and emulating rule-based behavior such as "a first person expression refers to the speaker". On the negative side, we have also shown that they do not yield operational entity representations, and that they are not making good use of contextual information for the referential task.

More generally, our paper underscores the need for model analysis to test whether the motivations for particular architectures are borne out in how the model actually behaves when it is deployed.

Acknowledgments

We gratefully acknowledge Kristina Gulordava and Marco Baroni for their feedback, advice and support. We are also grateful to the anonymous reviewers for their valuable comments. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 715154), and from the Spanish Ramon y Cajal programme (grant RYC-2015-18907). We are grateful to the NVIDIA Corporation for the donation of GPUs used for this research. We are also very grateful to the PyTorch developers. This paper reflects the authors' view only, and the EU is not responsible for any use that may be made of the information it contains.


References

Laura Aina, Carina Silberer, Ionut-Teodor Sorodoc, Matthijs Westera, and Gemma Boleda. 2018. AMORE-UPF at SemEval-2018 Task 4: BiLSTM with Entity Library. In Proc. of SemEval.

Trapit Bansal, Arvind Neelakantan, and Andrew McCallum. 2017. RelNet: End-to-End Modeling of Entities & Relations. CoRR, abs/1706.07179.

Razvan Bunescu and Marius Pasca. 2006. Using Encyclopedic Knowledge for Named Entity Disambiguation. In Proc. of EACL.

Henry Y. Chen, Ethan Zhou, and Jinho D. Choi. 2017. Robust Coreference Resolution and Entity Linking on Dialogues: Character Identification on TV Show Transcripts. In Proc. of CoNLL 2017.

Hui Chen, Baogang Wei, Yonghuai Liu, Yiming Li, Jifang Yu, and Wenhao Zhu. 2018. Bilinear Joint Learning of Word and Entity Embeddings for Entity Linking. Neurocomputing, 294:12–18.

Jinho D. Choi and Henry Y. Chen. 2018. SemEval 2018 Task 4: Character Identification on Multiparty Dialogues. In Proc. of SemEval.

Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural Text Generation in Stories Using Entity Representations as Context. In Proc. of NAACL.

Kevin Clark and Christopher D. Manning. 2016. Improving Coreference Resolution by Learning Entity-Level Distributed Representations. In Proc. of ACL.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loic Barrault, and Marco Baroni. 2018. What You Can Cram into a Single $&!#* Vector: Probing Sentence Embeddings for Linguistic Properties. In Proc. of ACL.

Nick Craswell. 2009. Mean Reciprocal Rank. In Encyclopedia of Database Systems, pages 1703–1703. Springer.

Matthew Francis-Landau, Greg Durrett, and Dan Klein. 2016. Capturing Semantic Similarity for Entity Linking with Convolutional Neural Networks. In Proc. of NAACL:HLT.

Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. 2018. Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information. In Proc. of EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Alex Graves, Santiago Fernandez, and Jurgen Schmidhuber. 2005. Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. In Proc. of ICANN.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing Machines. CoRR, abs/1410.5401.

Kristina Gulordava, Laura Aina, and Gemma Boleda. 2018a. How to Represent a Word and Predict it, too: Improving Tied Architectures for Language Modelling. In Proc. of EMNLP.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018b. Colorless Green Recurrent Networks Dream Hierarchically. In Proc. of NAACL.

Aria Haghighi and Dan Klein. 2010. Coreference Resolution in a Modular, Entity-Centered Model. In Proc. of NAACL.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proc. of ICCV.

Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. 2017. Tracking the World State with Recurrent Entity Networks. In Proc. of ICLR.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Armand Joulin and Tomas Mikolov. 2015. Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets. In Proc. of NIPS.

Akos Kadar, Grzegorz Chrupała, and Afra Alishahi. 2017. Representation of Linguistic Form and Function in Recurrent Neural Networks. Computational Linguistics, 43(4):761–780.

Hans Kamp and Uwe Reyle. 2013. From Discourse to Logic: Introduction to Model-theoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory, volume 42. Springer Science+Business Media.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980.

Nikolaus Kriegeskorte, Marieke Mur, and Peter A. Bandettini. 2008. Representational Similarity Analysis – Connecting the Branches of Systems Neuroscience. Frontiers in Systems Neuroscience, 2:4.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies. TACL, 4(1):521–535.

Rada Mihalcea and Andras Csomai. 2007. Wikify!: Linking Documents to Encyclopedic Knowledge. In Proc. of CIKM.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Distributed Representations of Words and Phrases and Their Compositionality. In Proc. of NIPS.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic Regularities in Continuous Space Word Representations. In Proc. of NAACL.

Aida Nematzadeh, Kaylee Burns, Erin Grant, Alison Gopnik, and Tom Griffiths. 2018. Evaluating Theory of Mind in Question Answering. In Proc. of EMNLP.

E.W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses: An Introduction. Wiley.

Cheon-Eum Park, Heejun Song, and Changki Lee. 2018. KNU CI System at SemEval-2018 Task 4: Character Identification by Solving Sequence-Labeling Problem. In Proc. of SemEval.

Sameer Pradhan, Lance Ramshaw, Mitchell Marcus, Martha Palmer, Ralph Weischedel, and Nianwen Xue. 2011. CoNLL-2011 Shared Task: Modeling Unrestricted Coreference in OntoNotes. In Proc. of CoNLL.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-End Memory Networks. In Proc. of NIPS.

Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber. 2016. Learning Global Features for Coreference Resolution. In Proc. of NAACL:HLT.

Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. 2017. Reference-Aware Language Models. In Proc. of EMNLP.

A Appendices

A.1 Hyperparameter search

Besides the LSTM parameters, we optimize the token embeddings Wt, the entity/speaker embeddings We, as well as Wo, Wq, and their corresponding biases, where applicable (see Section 3). We used five-fold cross-validation with early stopping based on the validation score. We found that most hyperparameters could be safely fixed the same way for all three model types. Specifically, our final models were all trained in batch mode using the Adam optimizer (Kingma and Ba, 2014), with each batch covering 25 scenes given to the model in chunks of 750 tokens. The token embeddings (Wt) are initialized with the 300-dimensional word2vec vectors, hi is set to 500 units, and entity (or speaker) embeddings (We) to k = 150 units. With this hyperparameter setting, ENTLIB has fewer parameters than BILSTM: the linear map Wo of the latter (500×401) is replaced by the query extractor Wq (500×150) followed by (non-parameterized) similarity computations. This holds even if we take into account that the entity embedding We used

in both models contains 274 entities that are never speakers and that are, hence, used by ENTLIB but not by BILSTM.

Our search also considered different types of activation functions in different places, with the architecture presented above, i.e., tanh before the LSTM and ReLU in the gate, robustly yielding the best results. Other settings tested—randomly initialized token embeddings, self-attention on the input layer, and a uni-directional LSTM—did not improve performance.

We then performed another random search (> 200 models) over the remaining hyperparameters: learning rate (sampled from the logarithmic interval 0.001–0.05), dropout before and after the LSTM (sampled from 0.0–0.3 and 0.0–0.1, respectively), weight decay (sampled from 10^-6–10^-2) and penalization, i.e., whether to decrease the relative impact of frequent entities by dividing the loss for an entity by the square root of its frequency. This paper reports the best model of each type, i.e., BILSTM, ENTLIB, and ENTNET, after training on all the training data without cross-validation for 20, 80 and 80 epochs respectively (numbers selected based on tendencies in training histories). These models had the following parameters:

                 BILSTM    ENTLIB    ENTNET
learning rate:   0.0080    0.0011    0.0014
dropout pre:     0.2       0.2       0.0
dropout post:    0.0       0.02      0.08
weight decay:    1.8e-6    4.3e-6    1.0e-5
penalization:    no        yes       yes

B Attribute and relation extraction

B.1 Details of the dataset

We performed a two-step procedure to extract all the available data for the SemEval characters. First, using simple word overlap, we automatically mapped the 401 SemEval names to the characters in the database. In a second, manual step, we corrected these mappings and added links that were not found automatically due to name alternatives, ambiguities or misspellings (e.g., SemEval Dana was mapped to Dana Keystone, and Janitor to The Zoo Employee). In total, we found 93 SemEval entities in Friends Central, and we extracted their attributes (gender and job) and their mutual relationships (relatives).


Model     Gender (93;2)          Occupation (24;17)   Relatives (56;24)
          (wo)man     (s)he
RANDOM    .50         .50        .20                  .16
ENTLIB    .55         .58        .27                  .22
ENTNET    .61         .56        .24                  .26

Table 5: Results on the attribute prediction task (mean reciprocal rank; from 0 (worst) to 1 (best)). The number of considered test items and candidate values, respectively, are given in the parentheses. For gender, we used (wo)man and (s)he as word cues for the values (fe)male.

B.2 Alternative setup

We use the same models, i.e., ENTLIB and ENTNET trained on Experiment 1, and (without further training) extract representations for the entities from them. The entity representations are directly obtained from the entity embedding We of each model.

In the attribute prediction task, we are given an attribute (e.g., gender) and all its possible values V (e.g., V = {woman, man}). We formulate the task as, given a character (e.g., Rachel), producing a ranking of the possible values in descending order of their similarity to the character, where similarity is computed by measuring the cosine of the angle between their respective vector representations in the entity space. We obtain representations of attribute values, in the same space as the entities, by inputting each attribute value as a separate utterance to the models and extracting the corresponding entity query (qi). Since the models also expect a speaker for each utterance, we set it to either all entities, main entities, a random entity, or no entity (i.e., a speaker embedding with zero in all units), and report the best results.

We evaluate the rankings produced for both tasks in terms of mean reciprocal rank (Craswell, 2009), scoring from 0 to 1 (from worst to best) the position of the target labels in the ranking. The first two columns of Table 5 present the results. Our models generally perform poorly on the tasks, though they outperform a random baseline. Even in the case of an attribute like gender, which is crucial for the resolution of third person pronouns, the models' results are still very close to those of the random baseline.
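A sketch of the attribute ranking and its MRR evaluation as we understand the setup; the function and dictionary names are illustrative, not the released code.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def attribute_mrr(entity_vecs, value_vecs, gold):
    """Rank attribute values by cosine similarity to each entity vector and
    return the mean reciprocal rank of the gold value.
    entity_vecs: dict entity -> vector (rows of W_e);
    value_vecs: dict value -> query vector q_i extracted for that value;
    gold: dict entity -> correct value."""
    reciprocal_ranks = []
    for ent, e_vec in entity_vecs.items():
        ranked = sorted(value_vecs, key=lambda v: cosine(e_vec, value_vecs[v]),
                        reverse=True)
        reciprocal_ranks.append(1.0 / (ranked.index(gold[ent]) + 1))
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```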

The task of relation prediction is instead, given a pair of characters (e.g., Ross and Monica), to predict the relation R which links them (e.g., sister, brother-in-law, nephew; we found 24 relations that applied to at least two pairs). We approach this following the vector offset method introduced by Mikolov et al. (2013b) for semantic relations between words. This leverages regularities in the embedding space, taking the embeddings of pairs that are connected by the same relation to have analogous spatial relations. For two pairs of characters (a, b) and (c, d) which bear the same relation R, we assume a − b ≈ c − d to hold for their vector representations. For a target pair (a, b) and a relation R, we then compute the following measure:

$$s_{\mathrm{rel}}((a,b), R) = \frac{\sum_{(x,y) \in R} \cos(a - b,\; x - y)}{|R|} \qquad (14)$$

Equation (14) computes the average relational similarity between the target character pair and the exemplars of that relation (excluding the target itself), where the relational similarity is estimated as the cosine between the vector differences of the two pairs of entity representations, respectively. Due to this setup, we restrict ourselves to predicting relation types that apply to at least two pairs of entities. For each target pair (a, b), we produce a ranking of candidate relations in descending order of their scores srel. Table 5 contains the results, again above baseline but clearly very poor.
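A short sketch of the Eq. 14 offset scoring, with illustrative names and the assumption that the target pair has already been removed from the exemplar list:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def relation_score(a, b, exemplars):
    """Eq. 14: average cosine between the offset of the target pair (a - b) and
    the offsets of the exemplar pairs of a relation.
    a, b: entity vectors; exemplars: list of (x_vec, y_vec) pairs."""
    offsets = [x - y for x, y in exemplars]
    return sum(cosine(a - b, off) for off in offsets) / len(offsets)

# Ranking sketch: score every candidate relation for a target pair and sort.
# ranked = sorted(relations, key=lambda R: relation_score(a, b, exemplars_of[R]),
#                 reverse=True)
```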