
Proceedings of NAACL-HLT 2019, pages 3467–3476, Minneapolis, Minnesota, June 2 - June 7, 2019. ©2019 Association for Computational Linguistics


Learning to Describe Unknown Phrases with Local and Global Contexts

Shonosuke Ishiwatari† Hiroaki Hayashi‡ Naoki Yoshinaga§ Graham Neubig‡
Shoetsu Sato† Masashi Toyoda§ Masaru Kitsuregawa¶§

† The University of Tokyo ‡ Carnegie Mellon University
§ Institute of Industrial Science, the University of Tokyo ¶ National Institute of Informatics

{ishiwatari,ynaga,shoetsu,toyoda,kitsure}@tkl.iis.u-tokyo.ac.jp
{hiroakih, gneubig}@cs.cmu.edu

Abstract

When reading a text, it is common to become stuck on unfamiliar words and phrases, such as polysemous words with novel senses, rarely used idioms, internet slang, or emerging entities. If we humans cannot figure out the meaning of those expressions from the immediate local context, we consult dictionaries for definitions or search documents or the web to find other global context to help in interpretation. Can machines help us do this work? Which type of context is more important for machines to solve the problem? To answer these questions, we undertake a task of describing a given phrase in natural language based on its local and global contexts. To solve this task, we propose a neural description model that consists of two context encoders and a description decoder. In contrast to the existing methods for non-standard English explanation (Ni and Wang, 2017) and definition generation (Noraset et al., 2017; Gadetsky et al., 2018), our model appropriately takes important clues from both local and global contexts. Experimental results on three existing datasets (including WordNet, Oxford and Urban Dictionaries) and a dataset newly created from Wikipedia demonstrate the effectiveness of our method over previous work.

1 Introduction

When we read news text with emerging entities, text in unfamiliar domains, or text in foreign languages, we often encounter expressions (words or phrases) whose senses we do not understand. In such cases, we may first try to figure out the meanings of those expressions by reading the surrounding words (local context) carefully. Failing to do so, we may consult dictionaries, and in the case of polysemous words, choose an appropriate meaning based on the context. Learning novel word senses via dictionary definitions is known to be

Figure 1: Local & Global Context-aware Description generator (LOG-CaD).

more effective than contextual guessing (Fraser, 1998; Chen, 2012). However, very often, hand-crafted dictionaries do not contain definitions of expressions that are rarely used or newly created. Ultimately, we may need to read through the entire document or even search the web to find other occurrences of the expression (global context) so that we can guess its meaning.

Can machines help us do this work? Ni and Wang (2017) have proposed a task of generating a definition for a phrase given its local context. However, they follow the strict assumption that the target phrase is newly emerged and there is only a single local context available for the phrase, which makes the task of generating an accurate and coherent definition difficult (perhaps as difficult as a human comprehending the phrase itself). On the other hand, Noraset et al. (2017) attempted to generate a definition of a word from an embedding induced from massive text (which can be seen as global context). This is followed by Gadetsky et al. (2018) that refers to a local context to disambiguate polysemous words by choosing relevant dimensions of their word embeddings. Although these research efforts revealed that both local and global contexts are useful in generating definitions, none of these studies exploited both contexts directly to describe unknown phrases.

In this study, we tackle the task of describing (defining) a phrase given its local and global contexts. We present LOG-CaD, a neural description generator (Figure 1), to directly solve this task. Given an unknown phrase without sense definitions, our model obtains a phrase embedding as its global context by composing word embeddings, while also encoding the local context. The model then combines both pieces of information to generate a natural language description.

Considering various applications where we need definitions of expressions, we evaluated our method with four datasets including WordNet (Noraset et al., 2017) for general words, the Oxford dictionary (Gadetsky et al., 2018) for polysemous words, Urban Dictionary (Ni and Wang, 2017) for rare idioms or slang, and a newly-created Wikipedia dataset for entities.

Our contributions are as follows:

• We propose a general task of defining unknown phrases given their contexts. This task is a generalization of three related tasks (Noraset et al., 2017; Ni and Wang, 2017; Gadetsky et al., 2018) and involves various situations where we need definitions of unknown phrases (§ 2).

• We propose a method for generating natural language descriptions for unknown phrases with local and global contexts (§ 3).

• As a benchmark to evaluate the ability of the models to describe entities, we build a large-scale dataset from Wikipedia and Wikidata for the proposed task. We release our dataset and the code1 to promote the reproducibility of the experiments (§ 4).

• The proposed method achieves state-of-the-art performance on our new dataset and the three existing datasets used in the related studies (Noraset et al., 2017; Ni and Wang, 2017; Gadetsky et al., 2018) (§ 5).

1https://github.com/shonosuke/ishiwatari-naacl2019

2 Context-aware Phrase Description Generation

In this section, we define our task of describing a phrase in a specific context. Given an undefined phrase Xtrg = {xj, ..., xk} with its context X = {x1, ..., xI} (1 ≤ j ≤ k ≤ I), our task is to output a description Y = {y1, ..., yT}. Here, Xtrg can be a word or a short phrase and is included in X. Y is a definition-like concrete and concise sentence that describes Xtrg.

For example, given a phrase "sonic boom" with its context "the shock wave may be caused by sonic boom or by explosion," the task is to generate a description such as "sound created by an object moving fast." If the given context is changed to "this is the first official tour to support the band's latest studio effort, 2009's Sonic Boom," then the appropriate output would be "album by Kiss."
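To make the input/output concrete, a single entry of this task can be represented as follows. The field names and the span convention are our own illustration, not a format defined in the paper:

```python
# One hypothetical entry: the target phrase Xtrg, its span inside the
# tokenized local context X, and the reference description Y.
entry = {
    "phrase": "sonic boom",
    "context": "the shock wave may be caused by sonic boom or by explosion".split(),
    "span": (7, 9),  # Xtrg = context[7:9] == ["sonic", "boom"]
    "description": "sound created by an object moving fast",
}
print(entry["context"][entry["span"][0]:entry["span"][1]])  # ['sonic', 'boom']
```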

The process of description generation can be modeled with a conditional language model as

p(Y | X, Xtrg) = ∏_{t=1}^{T} p(yt | y<t, X, Xtrg).   (1)
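As a toy numeric illustration of Eq. (1) (the per-step probabilities below are invented for illustration, not model outputs), the probability of a description is the product of its per-step conditionals:

```python
import math

def sequence_log_prob(step_probs):
    """log p(Y | X, Xtrg) under Eq. (1): the description probability is the
    product of per-step conditionals p(yt | y<t, X, Xtrg), one per word."""
    return sum(math.log(p) for p in step_probs)

# Invented per-step conditionals for a 7-word description.
step_probs = [0.4, 0.5, 0.6, 0.7, 0.5, 0.6, 0.8]
print(round(math.exp(sequence_log_prob(step_probs)), 5))  # 0.02016
```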

3 LOG-CaD: Local & Global Context-aware Description Generator

In this section, we describe our idea of utilizing local and global contexts in the description generation task, and present the details of our model.

3.1 Local & global contexts

When we find an unfamiliar phrase in text and it is not defined in dictionaries, how can we humans come up with its meaning? As discussed in Section 1, we may first try to figure out the meaning of the phrase from the immediate context, and then read through the entire document or search the web to understand implicit information behind the text.

In this paper, we refer to the explicit contextual information included in a given sentence with the target phrase (i.e., the X in Eq. (1)) as "local context," and the implicit contextual information in massive text as "global context." While both local and global contexts are crucial for humans to understand unfamiliar phrases, are they also useful for machines to generate descriptions? To verify this idea, we propose to incorporate both local and global contexts to describe an unknown phrase.


3.2 Proposed model

Figure 1 shows an illustration of our LOG-CaD model. Similarly to the standard encoder-decoder model with attention (Bahdanau et al., 2015; Luong and Manning, 2016), it has a context encoder and a description decoder. The challenge here is that the decoder needs to be conditioned not only on the local context, but also on its global context. To incorporate the different types of contexts, we propose to use a gate function similar to Noraset et al. (2017) to dynamically control how the global and local contexts influence the description.

Local & global context encoders  We first describe how to model local and global contexts. Given a sentence X and a phrase Xtrg, a bi-directional LSTM (Gers et al., 1999) encoder generates a sequence of continuous vectors H = {h1, ..., hI} as

hi = Bi-LSTM(hi−1, hi+1, xi),   (2)

where xi is the word embedding of word xi. In addition to the local context, we also utilize the global context obtained from massive text. This can be achieved by feeding a phrase embedding xtrg to initialize the decoder (Noraset et al., 2017) as

y0 = xtrg.   (3)

Here, the phrase embedding xtrg is calculated by simply summing up all the embeddings of the words that constitute the phrase Xtrg. Note that we use a randomly-initialized vector if no pre-trained embedding is available for the words in Xtrg.
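A minimal stdlib-only sketch of this phrase-embedding step (the toy 3-dimensional vectors below are invented; the paper uses 300-dimensional pre-trained CBOW vectors):

```python
import random

def phrase_embedding(phrase, word_vectors, dim=300, seed=0):
    """Global-context vector x_trg: the sum of the embeddings of the words in
    the phrase; a word with no pre-trained vector gets a randomly initialized
    one instead (as the paper does for unknown words)."""
    rng = random.Random(seed)
    total = [0.0] * dim
    for word in phrase.split():
        vec = word_vectors.get(word)
        if vec is None:  # no pre-trained embedding available
            vec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        total = [t + v for t, v in zip(total, vec)]
    return total

# Toy 3-dimensional vectors (invented; not real CBOW values).
toy = {"sonic": [1.0, 0.0, 2.0], "boom": [0.5, 1.0, -1.0]}
print(phrase_embedding("sonic boom", toy, dim=3))  # [1.5, 1.0, 1.0]
```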

Description decoder  Using the local and global contexts, a description decoder computes the conditional probability of a description Y with Eq. (1), which can be approximated with another LSTM as

st = LSTM(yt−1, s′t−1),   (4)
dt = ATTENTION(H, st),   (5)
ctrg = CNN(Xtrg),   (6)
s′t = GATE(st, xtrg, ctrg, dt),   (7)
p(yt | y<t, X, Xtrg) = softmax(Ws′ s′t + bs′),   (8)

where st is a hidden state of the decoder LSTM (s0 = 0), and yt−1 is a jointly-trained word embedding of the previous output word yt−1. In what follows, we explain each equation in detail.

Attention on local context  Considering the fact that the local context can be relatively long (e.g., around 20 words on average in our Wikipedia dataset introduced in Section 4), it is hard for the decoder to focus on important words in local contexts. In order to deal with this problem, the ATTENTION(·) function in Eq. (5) decides which words in the local context X to focus on at each time step. dt is computed with an attention mechanism (Luong and Manning, 2016) as

dt = Σ_{i=1}^{I} αi hi,   (9)
αi = softmax((Uh hi)ᵀ (Us st)),   (10)

where Uh and Us are matrices that map the encoder and decoder hidden states into a common space, respectively.
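A stdlib-only sketch of Eqs. (9)–(10); for brevity the projection matrices Uh and Us are taken to be the identity, so the score reduces to a dot product between each encoder state and the decoder state:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden_states, query):
    """Score each encoder state h_i against the decoder state s_t by dot
    product, normalize with softmax (Eq. 10), and return the weighted
    sum d_t (Eq. 9)."""
    scores = [sum(h * q for h, q in zip(h_i, query)) for h_i in hidden_states]
    alphas = softmax(scores)
    return [sum(a * h_i[k] for a, h_i in zip(alphas, hidden_states))
            for k in range(len(query))]

H = [[1.0, 0.0], [0.0, 1.0]]
d_t = attend(H, [10.0, 0.0])   # attention concentrates on the first state
print([round(v, 3) for v in d_t])  # [1.0, 0.0]
```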

Use of character information  In order to capture the surface information of Xtrg, we construct character-level CNNs (Eq. (6)) following Noraset et al. (2017). Note that the input to the CNNs is the sequence of words in Xtrg concatenated with the special character "_", such as "sonic_boom." Following Noraset et al. (2017), we set the CNN kernels to lengths 2-6 and sizes 10, 30, 40, 40, 40, respectively, with a stride of 1 to obtain a 160-dimensional vector ctrg.
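A sketch of the character-level CNN of Eq. (6) in pure Python, with random untrained filters (the real model learns these weights; the 5-dimensional character embeddings below are invented):

```python
import random

def char_cnn(word, char_emb, widths=(2, 3, 4, 5, 6),
             channels=(10, 30, 40, 40, 40), seed=0):
    """Character-level CNN sketch: for each kernel width, slide a filter over
    the character-embedding sequence with stride 1, max-pool over positions,
    and concatenate all feature maps (10+30+40+40+40 = 160 dimensions)."""
    rng = random.Random(seed)
    dim = len(next(iter(char_emb.values())))
    zero = [0.0] * dim
    chars = list(word)
    features = []
    for width, n_filters in zip(widths, channels):
        for _ in range(n_filters):
            filt = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                    for _ in range(width)]
            acts = [sum(filt[j][k] * char_emb.get(ch, zero)[k]
                        for j, ch in enumerate(chars[start:start + width])
                        for k in range(dim))
                    for start in range(max(1, len(chars) - width + 1))]
            features.append(max(acts))  # max-pooling over positions
    return features

# Invented 5-dimensional character embeddings covering "sonic_boom".
rng = random.Random(1)
char_emb = {ch: [rng.uniform(-1, 1) for _ in range(5)] for ch in "sonic_bm"}
c_trg = char_cnn("sonic_boom", char_emb)
print(len(c_trg))  # 160
```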

Gate function to control local & global contexts  In order to capture the interaction between the local and global contexts, we adopt a GATE(·) function (Eq. (7)) which is similar to Noraset et al. (2017). The GATE(·) function updates the LSTM output st to s′t depending on the global context xtrg, local context dt, and character-level information ctrg as

ft = [xtrg; dt; ctrg],   (11)
zt = σ(Wz [ft; st] + bz),   (12)
rt = σ(Wr [ft; st] + br),   (13)
s̃t = tanh(Ws [(rt ⊙ ft); st] + bs),   (14)
s′t = (1 − zt) ⊙ st + zt ⊙ s̃t,   (15)

where σ(·), ⊙ and ; denote the sigmoid function, element-wise multiplication, and vector concatenation, respectively. W∗ and b∗ are weight matrices and bias terms, respectively. Here, the update gate zt controls how much the original hidden state st is to be changed, and the reset gate rt controls how much the information from ft contributes to word generation at each time step.
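The gate of Eqs. (11)–(15) can be sketched with plain Python lists as follows (toy dimensions and weights; in the real model Wz, Wr, and Ws are learned):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gate(s_t, f_t, W_z, b_z, W_r, b_r, W_s, b_s):
    """GRU-like gate of Eqs. (11)-(15): the update gate z_t decides how much
    of the original decoder state s_t is kept; the reset gate r_t decides how
    much of the context features f_t = [x_trg; d_t; c_trg] enters the
    candidate state."""
    fs = f_t + s_t                                    # [f_t; s_t]
    z_t = [sigmoid(a + b) for a, b in zip(matvec(W_z, fs), b_z)]
    r_t = [sigmoid(a + b) for a, b in zip(matvec(W_r, fs), b_r)]
    rf_s = [r * f for r, f in zip(r_t, f_t)] + s_t    # [(r_t . f_t); s_t]
    s_cand = [math.tanh(a + b) for a, b in zip(matvec(W_s, rf_s), b_s)]
    return [(1 - z) * s + z * c for z, s, c in zip(z_t, s_t, s_cand)]

# Toy dimensions: |s_t| = |f_t| = 2, so every weight matrix is 2x4.
W = [[0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1]]
b = [0.0, 0.0]
s_next = gate([0.5, -0.5], [1.0, 2.0], W, b, W, b, W, b)
print(len(s_next))  # 2
```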


Figure 2: Context-aware description dataset extracted from Wikipedia and Wikidata.

4 Wikipedia Dataset

Our goal is to let machines describe unfamiliar words and phrases, such as polysemous words, rarely used idioms, or emerging entities. Among the three existing datasets, WordNet and the Oxford dictionary mainly target words but not phrases, and thus are not perfect test beds for this goal. On the other hand, although the Urban Dictionary dataset contains descriptions of rarely-used phrases, the domain of its targeted words and phrases is limited to Internet slang.

In order to confirm that our model can generate descriptions of entities as well as polysemous words and slang, we constructed a new dataset for context-aware phrase description generation from Wikipedia2 and Wikidata,3 which contain a wide variety of entity descriptions with contexts. An overview of the data extraction process is shown in Figure 2. Each entry in the dataset consists of (1) a phrase, (2) its description, and (3) a context (a sentence).

For preprocessing, we applied the Stanford Tokenizer4 to the descriptions of Wikidata items and the articles in Wikipedia. Next, we removed phrases in parentheses from the Wikipedia articles, since they tend to be paraphrases in other languages and work as noise. To obtain the contexts of each item in Wikidata, we extracted the

2 https://dumps.wikimedia.org/enwiki/20170720/
3 https://dumps.wikimedia.org/wikidatawiki/entities/20170802/
4 https://nlp.stanford.edu/software/tokenizer.shtml

sentence which has a link referring to the item through all the first paragraphs of Wikipedia articles and replaced the phrase of the link with a special token [TRG]. Wikidata items with no description or no contexts are ignored. This utilization of links makes it possible to resolve the ambiguity of words and phrases in a sentence without human annotations, which is a major advantage of using Wikipedia. Note that we used only links whose anchor texts are identical to the titles of the Wikipedia articles, since the users of Wikipedia sometimes link mentions to related articles.
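The link-based context extraction can be sketched as follows. We use HTML-style anchors purely for illustration (the actual pipeline parses Wikipedia dump markup), and the function name is our own:

```python
import re

def make_context(sentence, title):
    """Keep a sentence only if it links to the target article with an anchor
    text identical to the article title, and replace that mention with the
    special token [TRG]; other links keep only their surface text."""
    pattern = re.compile(r'<a href="([^"]*)">([^<]*)</a>')
    found = False

    def repl(match):
        nonlocal found
        if match.group(1) == title and match.group(2) == title:
            found = True
            return "[TRG]"
        return match.group(2)  # non-target link: drop the markup only

    text = pattern.sub(repl, sentence)
    return text if found else None

s = 'the shock wave may be caused by <a href="sonic boom">sonic boom</a> or by explosion'
print(make_context(s, "sonic boom"))
# the shock wave may be caused by [TRG] or by explosion
```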

5 Experiments

We evaluate our method by applying it to describe words in WordNet5 (Miller, 1995) and Oxford Dictionary,6 and phrases in Urban Dictionary7 and Wikipedia/Wikidata.8 For all of these datasets, a given word or phrase has an inventory of senses with corresponding definitions and usage examples. These definitions are regarded as ground-truth descriptions.

Datasets  To evaluate our model on the word description task on WordNet, we followed Noraset et al. (2017) and extracted data from WordNet using the dict-definition9 toolkit. Each entry in the data consists of three elements: (1) a word, (2) its definition, and (3) a usage example of the

5 https://wordnet.princeton.edu/
6 https://en.oxforddictionaries.com/
7 https://www.urbandictionary.com/
8 https://www.wikidata.org
9 https://github.com/NorThanapon/dict-definition


Corpus             Split  #Phrases  #Entries  Phrase len.  Context len.  Desc. len.
WordNet            Train     7,938    13,883         1.00          5.81        6.61
                   Valid       998     1,752         1.00          5.64        6.61
                   Test      1,001     1,775         1.00          5.77        6.85
Oxford Dictionary  Train    33,128    97,855         1.00         17.74       11.02
                   Valid     8,867    12,232         1.00         17.80       10.99
                   Test      8,850    12,232         1.00         17.56       10.95
Urban Dictionary   Train   190,696   411,384         1.54         10.89       10.99
                   Valid    26,876    57,883         1.54         10.86       10.95
                   Test     26,875    38,371         1.68         11.14       11.50
Wikipedia          Train   151,995   887,455         2.10         18.79        5.89
                   Valid     8,361    44,003         2.11         19.21        6.31
                   Test      8,397    57,232         2.10         19.02        6.94

Table 1: Statistics of the word/phrase description datasets.

Corpus             Domain          Inputs   Cov. emb.
WordNet            General         words      100.00%
Oxford Dictionary  General         words       83.04%
Urban Dictionary   Internet slang  phrases     21.00%
Wikipedia          Proper nouns    phrases     26.79%

Table 2: Domains, expressions to be described, and the coverage of pre-trained embeddings of the expressions to be described.

word. We split this dataset to obtain Train, Validation, and Test sets. If a word has multiple definitions/examples, we treat them as different entries. Note that the words are mutually exclusive across the three sets. The only difference between our dataset and theirs is that we extract the tuples only if the words have their usage examples in WordNet. Since not all entries in WordNet have usage examples, our dataset is a small subset of Noraset et al. (2017).

In addition to WordNet, we use the Oxford Dictionary following Gadetsky et al. (2018), the Urban Dictionary following Ni and Wang (2017), and our Wikipedia dataset described in the previous section. Table 1 and Table 2 show the properties and statistics of the four datasets, respectively.

To simulate a situation in a real application where we might not have access to global context for the target phrases, we did not train domain-specific word embeddings on each dataset. Instead, for all of the four datasets, we use the same

                         Global  Local  I-Attn.  LOG-CaD
# Layers of Enc-LSTMs         -      2        2        2
Dim. of Enc-LSTMs             -    600      600      600
Dim. of Attn. vectors         -    300      300      300
Dim. of input word emb.     300      -      300      300
Dim. of char. emb.          160    160        -      160
# Layers of Dec-LSTMs         2      2        2        2
Dim. of Dec-LSTMs           300    300      300      300
Vocabulary size             10k    10k      10k      10k
Dropout rate                0.5    0.5      0.5      0.5

Table 3: Hyperparameters of the models.

pre-trained CBOW10 vectors trained on the Google News corpus as global context, following previous work (Noraset et al., 2017; Gadetsky et al., 2018). If the expression to be described consists of multiple words, its phrase embedding is calculated by simply summing up all the CBOW vectors of the words in the phrase, such as "sonic" and "boom" (see Figure 1). If pre-trained CBOW embeddings are unavailable, we instead use a special [UNK] vector (which is randomly initialized with a uniform distribution) as word embeddings. Note that our pre-trained embeddings cover only 26.79% of the words in the expressions to be described in our Wikipedia dataset, while they cover all words in the WordNet dataset (see Table 2). Even if no reliable word embeddings are available, all models can capture character information through character-level CNNs (see Figure 1).

Models  We implemented four methods: (1) Global (Noraset et al., 2017), (2) Local (Ni and Wang, 2017) with CNN, (3) I-Attention (Gadetsky et al., 2018), and our proposed model, (4) LOG-CaD. The Global model is our reimplementation of the best model (S + G + CH) in Noraset et al. (2017). It can access the global context of a phrase to be described, but has no ability to read the local context. The Local model is the reimplementation of the best model (dual encoder) in Ni and Wang (2017). In order to make a fair comparison of the effectiveness of local and global contexts, we slightly modify the original implementation by Ni and Wang (2017); as the character-level encoder in the Local model, we adopt CNNs that are exactly the same as in the other two models instead of the original LSTMs.

The I-Attention is our reimplementation of the best model (S + I-Attention) in Gadetsky

10 GoogleNews-vectors-negative300.bin.gz at https://code.google.com/archive/p/word2vec/


Model        WordNet  Oxford  Urban  Wikipedia
Global         24.10   15.05   6.05      44.77
Local          22.34   17.90   9.03      52.94
I-Attention    23.77   17.25  10.40      44.71
LOG-CaD        24.79   18.53  10.55      53.85

Table 4: BLEU scores on four datasets.

Model     Annotated score
Local               2.717
LOG-CaD             3.008

Table 5: Averaged human annotated scores on the Wikipedia dataset.

et al. (2018). Similar to our model, it uses both local and global contexts. Unlike our model, however, it does not use character information to predict descriptions. Also, it cannot directly use the local context to predict the words in descriptions. This is because the I-Attention model indirectly uses the local context only to disambiguate the phrase embedding xtrg as

x′trg = xtrg ⊙ m,   (16)
m = σ(Wm (Σ_{i=1}^{I} FFNN(hi)) / I + bm).   (17)

Here, the FFNN(·) function is a feed-forward neural network that maps the encoded local contexts hi to another space. The mapped local contexts are then averaged over the length of the sentence X to obtain a representation of the local context. This is followed by a linear layer and a sigmoid function to obtain the soft binary mask m which can filter out the unrelated information included in the global context. Finally, the disambiguated phrase embedding x′trg is used to update the decoder hidden state as

st = LSTM([yt−1; x′trg], st−1).   (18)
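A stdlib-only sketch of the masking step in Eqs. (16)–(17) (toy dimensions; the FFNN is passed in as a callable and is the identity in the example below):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def i_attention_mask(x_trg, encoded_context, ffnn, W_m, b_m):
    """Map each encoder state with FFNN, average over the sentence length I,
    apply a linear layer + sigmoid to get the soft binary mask m (Eq. 17),
    and filter the phrase embedding element-wise (Eq. 16)."""
    I = len(encoded_context)
    dim = len(x_trg)
    mapped = [ffnn(h) for h in encoded_context]
    avg = [sum(m_i[k] for m_i in mapped) / I for k in range(dim)]
    m = [sigmoid(sum(W_m[k][j] * avg[j] for j in range(dim)) + b_m[k])
         for k in range(dim)]
    return [x * m_k for x, m_k in zip(x_trg, m)]

# Toy example: identity FFNN, two 2-dimensional encoder states (I = 2).
x_trg = [2.0, -2.0]
H = [[1.0, 1.0], [3.0, -1.0]]
x_prime = i_attention_mask(x_trg, H, lambda h: h,
                           W_m=[[10.0, 0.0], [0.0, -10.0]], b_m=[0.0, 0.0])
print([round(v, 3) for v in x_prime])  # [2.0, -1.0]
```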

All four models (Table 3) are implemented with the PyTorch framework (Ver. 1.0.0).11

Automatic Evaluation  Table 4 shows the BLEU (Papineni et al., 2002) scores of the output descriptions. We can see that the LOG-CaD model consistently outperforms the three baselines on all four datasets. This result indicates that using both local and global contexts helps describe the unknown words/phrases correctly. While the

11 http://pytorch.org/

Input: waste

Context #1: if the effort brings no compensating gain it is a waste
Context #2: We waste the dirty water by channeling it into the sewer

             #1                                                   #2
Reference:   useless or profitless activity                       to get rid of
Global:      to give a liquid for a liquid
Local:       a state of being assigned to a particular purpose    to make a break of a wooden instrument
I-Attention: a person who makes something that can be be be done  to remove or remove the contents of
LOG-CaD:     a source of something that is done or done           to remove a liquid

Table 6: Descriptions for a word in WordNet.

Input: daniel o'neill

Context #1: after being enlarged by publisher daniel o'neill it was reportedly one of the largest and most prosperous newspapers in the united states.
Context #2: in 1967 he returned to belfast where he met fellow belfast artist daniel o'neill.

             #1                   #2
Reference:   american journalist  irish artist
Global:      american musician
Local:       american publisher   british musician
I-Attention: american musician    american musician
LOG-CaD:     american writer      british musician

Table 7: Descriptions for a phrase in Wikipedia.

I-Attention model also uses local and global contexts, its performance was always lower than the LOG-CaD model. This result shows that using local context to predict the description is more effective than using it to disambiguate the meanings in global context.

In particular, the low BLEU scores of the Global and I-Attention models on the Wikipedia dataset suggest that it is necessary to learn to ignore the noisy information in global context if the coverage of pre-trained word embeddings is extremely low (see the third and fourth rows in Table 2). We suspect that the Urban Dictionary task is too difficult and the results are unreliable, considering its extremely low BLEU scores and the high ratio of unknown tokens in generated descriptions.


Input: q

Context #1: q-lets and co. is a filipino and english informative children's show on q in the philippines.
Context #2: she was a founding producer of the cbc radio one show "q".
Context #3: the q awards are the uk's annual music awards run by the music magazine "q".
Context #4: charles fraser-smith was an author and one-time missionary who is widely credited as being the inspiration for ian fleming's james bond quartermaster q.

             #1                                     #2                   #3                                         #4
Reference:   philippine tv network                  canadian radio show  british music magazine                     fictional character from james bond
Global:      american rapper
Local:       television channel                     television show      show magazine                              american writer
I-Attention: american rapper                        american rapper      american rapper                            american rapper
LOG-CaD:     television station in the philippines  television program   british weekly music journalism magazine   [unk] [unk]

Table 8: Descriptions for a word in Wikipedia.

Manual Evaluation  To compare the proposed model and the strongest baseline in Table 4 (i.e., the Local model), we performed a human evaluation on our dataset. We randomly selected 100 samples from the test set of the Wikipedia dataset and asked three native English speakers to rate the output descriptions from 1 to 5 points as: 1) completely wrong or self-definition, 2) correct topic with wrong information, 3) correct but incomplete, 4) small details missing, 5) correct. The averaged scores are reported in Table 5. A pair-wise bootstrap resampling test (Koehn, 2004) on the annotated scores showed that the superiority of LOG-CaD over the Local model is statistically significant (p < 0.01).
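The significance test can be sketched as a paired bootstrap over per-sample ratings (Koehn, 2004); the ratings below are invented for illustration, not the actual annotations:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10000, seed=0):
    """Paired bootstrap resampling: resample the test set with replacement
    many times and count how often system A's total score beats system B's
    on the same resampled indices."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Invented per-sample ratings (1-5) for two systems on the same 10 samples.
log_cad = [4, 3, 5, 3, 4, 2, 5, 3, 4, 4]
local   = [3, 3, 4, 2, 3, 2, 4, 3, 3, 3]
print(paired_bootstrap(log_cad, local))  # close to 1.0 here
```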

Qualitative Analysis  Table 6 shows an example for a word in WordNet, while Table 7 and Table 8 show examples of entities in Wikipedia. Comparing the two datasets, the quality of the generated descriptions on the Wikipedia dataset is significantly better than on the WordNet dataset. The main reason for this result is that the training data of the Wikipedia dataset is 64x larger than that of the WordNet dataset (see Table 1).

For all examples in the three tables, the Global model can only generate a single description for each input word/phrase because it cannot access any local context. In the WordNet dataset, only the I-Attention and LOG-CaD models can successfully generate the concept of "remove" given context #2. This result suggests that considering both local and global contexts is essential to generate correct descriptions. In our Wikipedia dataset, both the Local and LOG-CaD models can describe the word/phrase considering its local context. For example, both the Local and LOG-CaD models could generate "american" in the description for "daniel o'neill" given "united states" in context #1, while they could generate "british" given "belfast" in context #2. A similar trend can also be observed in Table 8, where LOG-CaD could generate locational expressions such as "philippines" and "british" given the different contexts. On the other hand, the I-Attention model could not describe the two phrases taking into account the local contexts. We will present an analysis of this phenomenon in the next section.

6 Discussion

In this section, we present analyses of how the local and global contexts contribute to the description generation task. First, we discuss how the local context helps the models describe a phrase. Then, we analyze the impact of global context in situations where the local context is unreliable.

6.1 How do the models utilize local contexts?

Local context helps us (1) disambiguate polysemous words and (2) infer the meanings of unknown expressions. Can machines also utilize the local context? In this section, we discuss the two roles of local context in description generation.

Considering that the pre-trained word embeddings are obtained from word-level co-occurrences in massive text, more information is mixed up into a single vector as the word has more senses. While Gadetsky et al. (2018) designed the I-Attention model to filter out unrelated meanings in the global context given the local context, they did not discuss the impact the number of senses has on the performance of definition generation. To understand the influence of the ambiguity of phrases to be defined on the generation performance, we did an analysis on our Wikipedia dataset. Figure 3(a) shows that the description generation task becomes harder as the phrases to be described become more ambiguous. In particular, when a phrase has an extremely large number of senses (i.e., #senses ≥ 4), the Global model drops its performance significantly. This result indicates that the local context is necessary to disambiguate the meanings in global context.

Figure 3: Impact of various parameters of a phrase to be described on BLEU scores of the generated descriptions. (a) Number of senses of the phrase. (b) Unknown word ratio in the phrase. (c) Length of the local context.

As shown in Table 2, a large proportion of the phrases in our Wikipedia dataset include unknown words (i.e., only 26.79% of the words in the phrases have pre-trained embeddings). This fact indicates that the global context in this dataset is not fully reliable. Our next question is then: how does the lack of information from the global context affect the performance of phrase description? Figure 3(b) shows the impact of unknown words in the phrases to be described on the performance. As the result shows, the advantage of the LOG-CaD and Local models over the Global and I-Attention models grows as the number of unknown words increases. This result suggests that we need to fully utilize local contexts, especially in practical applications where the phrases to be defined contain many unknown words. Figure 3(b) also shows a counterintuitive phenomenon: BLEU scores increase as the ratio of unknown words in a phrase increases. This is mainly because unknown phrases tend to be person names such as writers, actors, or movie directors. Since these entities have less ambiguity in their categories, they can be described in extremely short sentences that are easy for all four models to decode (e.g., “finnish writer” or “american television producer”).
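The unknown-word analysis above amounts to a simple embedding-coverage check per phrase. The sketch below (with hypothetical variable names and a toy vocabulary; the actual experiment uses the full pre-trained embedding vocabulary) computes the fraction of a phrase's words that lack a pre-trained embedding and buckets phrases by that ratio, as in Figure 3(b).

```python
from collections import defaultdict

def unknown_ratio(phrase, embedding_vocab):
    """Fraction of words in the phrase that have no pre-trained embedding."""
    words = phrase.lower().split()
    unknown = sum(1 for w in words if w not in embedding_vocab)
    return unknown / len(words)

def bucket_by_unknown_ratio(phrases, embedding_vocab, num_buckets=4):
    """Group phrases into equal-width buckets over [0, 1] by unknown-word ratio."""
    buckets = defaultdict(list)
    for phrase in phrases:
        r = unknown_ratio(phrase, embedding_vocab)
        # a ratio of exactly 1.0 falls into the last bucket
        idx = min(int(r * num_buckets), num_buckets - 1)
        buckets[idx].append(phrase)
    return buckets

# Toy illustration (not the real embedding vocabulary)
vocab = {"american", "television", "producer", "writer"}
phrases = ["american television producer", "boswellia sacra", "finnish writer"]
buckets = bucket_by_unknown_ratio(phrases, vocab)
```

Per-bucket BLEU averages over such groups yield curves like those in Figure 3(b).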

6.2 How do the models utilize global contexts?

As discussed earlier, local contexts are important for describing unknown expressions, but how about global contexts? Assuming a situation where we cannot obtain much information from the local context (e.g., inferring the meaning of “boswellia” from a short local context “Here is a boswellia”), the global context should be essential for understanding the meaning. To confirm this hypothesis, we analyzed the impact of the length of the local context on BLEU scores. Figure 3(c) shows that when the local context is extremely short (l ≤ 10), the LOG-CaD model becomes much stronger than the Local model. This result indicates that not only local context but also global context helps models describe the meanings of phrases.
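An analysis like Figure 3(c) can be organized by binning test examples by local-context length and averaging a sentence-level score per bin. The sketch below uses hypothetical data and a unigram-precision stand-in for the sentence score to stay self-contained; the paper's experiments use BLEU (Papineni et al., 2002).

```python
from collections import defaultdict

def unigram_precision(hypothesis, reference):
    """Toy stand-in for a sentence-level score (the real analysis uses BLEU)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    ref_counts = defaultdict(int)
    for w in ref:
        ref_counts[w] += 1
    matches = 0
    for w in hyp:
        if ref_counts[w] > 0:      # clipped matching, as in BLEU's n-gram counts
            ref_counts[w] -= 1
            matches += 1
    return matches / len(hyp)

def score_by_context_length(examples, bin_width=10):
    """Average score per local-context-length bin (l <= 10, 10 < l <= 20, ...)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for context, hypothesis, reference in examples:
        length = len(context.split())
        b = (max(length - 1, 0) // bin_width) * bin_width + bin_width  # bin's upper edge
        sums[b] += unigram_precision(hypothesis, reference)
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sums}

# Hypothetical (context, generated description, reference description) triples
examples = [
    ("here is a boswellia", "a tree", "a genus of trees"),
    ("the " * 12 + "phrase", "finnish writer", "finnish writer"),
]
avg = score_by_context_length(examples)
```

Plotting the per-bin averages for each model gives curves comparable to Figure 3(c).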

7 Related Work

In this study, we address the task of describing a given phrase with its context. In what follows, we explain existing tasks that are related to our work.

Our task is closely related to word sense disambiguation (WSD) (Navigli, 2009), which identifies a pre-defined sense for a target word given its context. Although WSD could solve our task by retrieving the definition sentence for the identified sense, it requires a substantial amount of training data to handle the different sets of meanings of each word, and it cannot handle words (or senses) that are not registered in the dictionary. Although some studies have attempted to detect novel senses of words in given contexts (Erk, 2006; Lau et al., 2014), they do not provide definition sentences. Our task avoids these difficulties in WSD by directly generating descriptions for phrases or words. It also allows us to flexibly tailor a fine-grained definition to the specific context.

Paraphrasing (Androutsopoulos and Malakasiotis, 2010; Madnani and Dorr, 2010) (or text simplification (Siddharthan, 2014)) can be used to rephrase words with unknown senses. However, the targets of paraphrase acquisition are words/phrases with no specified context. Although a few studies (Connor and Roth, 2007; Max, 2009; Max et al., 2012) consider sub-sentential (context-sensitive) paraphrases, they do not aim to obtain a definition-like description as a paraphrase of a word.

Recently, Noraset et al. (2017) introduced the task of generating a definition sentence for a word from its pre-trained embedding. Since their task does not take the local contexts of words as input, their method cannot generate an appropriate definition for a polysemous word in a specific context. To cope with this problem, Gadetsky et al. (2018) proposed a definition generation method that works with polysemous words in dictionaries. They presented a model that utilizes the local context to filter out unrelated meanings from a pre-trained word embedding in a specific context. While their method uses the local context to disambiguate the meanings that are mixed up in word embeddings, the information from the local context cannot be utilized if the pre-trained embeddings are unavailable or unreliable. In contrast, our method can fully utilize the local context through an attention mechanism, even if reliable word embeddings are unavailable.

The work most closely related to this paper is Ni and Wang (2017). Focusing on non-standard English phrases, they proposed a model that generates explanations solely from the local context. They followed the strict assumption that the target phrase is newly emerged and that only a single local context is available, which makes the task of generating an accurate and coherent definition difficult. Our proposed task and model are more general and practical than those of Ni and Wang (2017) in that (1) we use Wikipedia, which includes expressions from various domains, and (2) our model takes advantage of global contexts when they are available.

Our task of describing phrases with their contexts is a generalization of these three tasks (Noraset et al., 2017; Ni and Wang, 2017; Gadetsky et al., 2018), and the proposed method utilizes both the local and global contexts of the expression in question.

8 Conclusions

This paper set up the task of generating a natural language description for an unknown phrase in a specific context, aiming to help us acquire unknown word senses when reading text. We approached this task with a variant of encoder-decoder models that captures the given local context with the encoder and the global context with the decoder, initialized by the target phrase's embedding induced from massive text. We performed experiments on three existing datasets and one newly built from Wikipedia and Wikidata. The experimental results confirmed that the local and global contexts complement one another and are both essential: global contexts are crucial when local contexts are short and vague, while the local context is important when the target phrase is polysemous, rare, or unseen.

As future work, we plan to modify our model to use multiple contexts in text to improve the quality of descriptions, considering the “one sense per discourse” hypothesis (Gale et al., 1992). We will release the newly built Wikipedia dataset and the experimental code for the academic and industrial communities at https://github.com/shonosuke/ishiwatari-naacl2019 to facilitate the reproducibility of our results and their use in various application contexts.

Acknowledgements

The authors are grateful to Thanapon Noraset for sharing the details of his implementation of the previous work. We also thank the anonymous reviewers for their careful reading of our paper and insightful comments, and the members of the Kitsuregawa-Toyoda-Nemoto-Yoshinaga-Goda laboratory at the University of Tokyo for proofreading the draft.

This work was partially supported by Grant-in-Aid for JSPS Fellows (Grant Number 17J06394) and Commissioned Research (201) of the National Institute of Information and Communications Technology of Japan.

References

Ion Androutsopoulos and Prodromos Malakasiotis. 2010. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38:135–187.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the Third International Conference on Learning Representations (ICLR).

Yuzhen Chen. 2012. Dictionary use and vocabulary learning in the context of reading. International Journal of Lexicography, 25(2):216–247.

Michael Connor and Dan Roth. 2007. Context sensitive paraphrasing with a global unsupervised classifier. In Proceedings of the 18th European Conference on Machine Learning (ECML), pages 104–115.

Katrin Erk. 2006. Unknown word sense detection as outlier detection. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (NAACL), pages 128–135.

Carol A. Fraser. 1998. The role of consulting a dictionary in reading and vocabulary learning. Canadian Journal of Applied Linguistics, 2(1-2):73–89.

Artyom Gadetsky, Ilya Yakubovskiy, and Dmitry Vetrov. 2018. Conditional generators of words definitions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Short Papers, pages 266–271.

William A. Gale, Kenneth W. Church, and David Yarowsky. 1992. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, HLT, pages 233–237.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to forget: Continual prediction with LSTM. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN), pages 850–855.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 388–395.

Jey Han Lau, Paul Cook, Diana McCarthy, Spandana Gella, and Timothy Baldwin. 2014. Learning word sense distributions, detecting unattested senses and identifying novel senses using topic models. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 259–270.

Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1054–1063.

Nitin Madnani and Bonnie J. Dorr. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3):341–387.

Aurélien Max. 2009. Sub-sentencial paraphrasing by contextual pivot translation. In Proceedings of the 2009 Workshop on Applied Textual Inference, pages 18–26.

Aurélien Max, Houda Bouamor, and Anne Vilnat. 2012. Generalizing sub-sentential paraphrase acquisition across original signal type of text pairs. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 721–731.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):1–69.

Ke Ni and William Yang Wang. 2017. Learning to explain non-standard English words and phrases. In Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP), pages 413–417.

Thanapon Noraset, Chen Liang, Larry Birnbaum, and Doug Downey. 2017. Definition modeling: Learning to define word embeddings in natural language. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), pages 3259–3266.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318.

Advaith Siddharthan. 2014. A survey of research on text simplification. International Journal of Applied Linguistics, 165(2):259–298.