
Discrete-State Variational Autoencoders for Joint Discovery and Factorization of Relations

Diego Marcheggiani
ILLC, University of Amsterdam
[email protected]

Ivan Titov
ILLC, University of Amsterdam
[email protected]

Abstract

We present a method for unsupervised open-domain relation discovery. In contrast to previous (mostly generative and agglomerative clustering) approaches, our model relies on rich contextual features and makes minimal independence assumptions. The model is composed of two parts: a feature-rich relation extractor, which predicts a semantic relation between two entities, and a factorization model, which reconstructs arguments (i.e., the entities) relying on the predicted relation. The two components are estimated jointly so as to minimize errors in recovering arguments. We study factorization models inspired by previous work in relation factorization and selectional preference modeling. Our models substantially outperform the generative and agglomerative-clustering counterparts and achieve state-of-the-art performance.

1 Introduction

The task of Relation Extraction (RE) consists of detecting and classifying semantic relations present in text. As is common, in this work we limit ourselves to considering only binary relations between entities occurring in the same sentence. RE has been shown to benefit a wide range of NLP tasks, such as information retrieval (Liu et al., 2014), question answering (Ravichandran and Hovy, 2002) and textual entailment (Szpektor et al., 2004).

Supervised approaches to RE have been successful when small, restricted sets of relations are considered. However, human annotation is expensive and time-consuming, and, consequently, these approaches do not scale well to the open-domain setting, where a large number of relations needs to be detected from a heterogeneous text collection (e.g., the entire Web). Though weakly-supervised approaches, such as distantly supervised methods and bootstrapping (Mintz et al., 2009; Agichtein and Gravano, 2000), further reduce the amount of necessary supervision, they still require examples for every relation considered.

These limitations led to the emergence of unsupervised approaches to RE. These methods extract surface or syntactic patterns between two entities and either directly use these patterns as substitutes for semantic relations (Banko et al., 2007; Banko and Etzioni, 2008) or cluster the patterns (sometimes in a context-sensitive way) to form relations (Lin and Pantel, 2001; Yao et al., 2011; Nakashole et al., 2012; Yao et al., 2012). The existing methods, given their generative (or agglomerative clustering) nature, rely on simpler features than their supervised counterparts and also make strong modeling assumptions (e.g., assuming that arguments are conditionally independent of each other given the relation). These shortcomings are likely to harm their performance.

In this work, we tackle the aforementioned challenges and introduce a new model for unsupervised relation extraction. We also describe an efficient estimation algorithm which lets us experiment with large unannotated collections. Our model is composed of two components:

• an encoding component: a feature-rich relation extractor which predicts a semantic relation between two entities in a specific sentence given contextual features;

• a reconstruction component: a factorization model which reconstructs arguments (i.e., the entities) relying on the predicted relation.

The two components are estimated jointly so as to minimize errors in reconstructing arguments. While learning to predict left-out arguments, the inference algorithm will search for latent relations which simplify the argument prediction task as much as possible. Roughly, such an objective will favour inducing relations which maximally constrain the set of admissible argument pairs. Our hypothesis is that relations induced in this way will be interpretable by humans and useful in practical applications. Why is this hypothesis plausible? Primarily because humans typically define relations as an abstraction capturing the essence of the underlying situation. And the underlying situation (rather than surface linguistic details like syntactic functions) is precisely what imposes constraints on admissible argument pairs.

This framework allows us to both exploit rich features (in the encoding component) and capture interdependencies between arguments in a flexible way (both in the reconstruction and encoding components).

Interestingly, the use of a reconstruction-error objective, previously considered primarily in the context of training neural autoencoders (Hinton, 1989; Vincent et al., 2008), gives us an opportunity to borrow ideas from the well-established area of statistical relational learning (Getoor and Taskar, 2007), or, more specifically, relation factorization. In this area, tensor and matrix factorization methods have been shown to be effective in inferring missing facts in knowledge bases (Bordes et al., 2011; Riedel et al., 2013; Chang et al., 2014; Bordes et al., 2014; Sutskever et al., 2009). In our work, we also adopt a fairly standard RESCAL factorization (Nickel et al., 2011) and use it within our reconstruction component.

Though there is a clear analogy between statistical relational learning and our setting, there is a very significant difference. In contrast to relational learning, rather than factorizing existing relations (an existing 'database'), our method simultaneously discovers the relational schema (i.e., an inventory of relations) and a mapping from text to the relations (i.e., a relation extractor), and it does so in such a way as to maximize performance on reconstruction (i.e., inference) tasks. This analogy also highlights one important property of our framework: unlike generative models, we explicitly force our semantic representations to be useful for at least the most basic form of semantic inference (i.e., inferring an argument based on the relation and another argument). It is important to note that the model is completely agnostic about the real semantic relation between two arguments, as the relational schema is discovered during learning.

We consider both the factorization method inspired by previous research on knowledge base modeling (as discussed above) and another, even simpler one, based on ideas from previous research on modeling selectional preferences (e.g., (Resnik, 1997; Seaghdha, 2010; Van de Cruys, 2010)), plus their combination. Our models are applied to a version of the New York Times corpus (Sandhaus, 2008). In order to evaluate our approach, we follow Yao et al. (2011) and align named entities in our collection to Freebase (Bollacker et al., 2008), a large collaborative knowledge base. In this way we can evaluate a subset of our induced relations against relations in Freebase. Note that Freebase has not been used during learning, making this a fair evaluation scenario for an unsupervised relation induction method. We also qualitatively evaluate our model by both considering several examples of induced relations (both appearing and not appearing in Freebase) and visualizing embeddings of named entities induced by our model. As expected, the choice of a factorization model affects the model performance. Our best models substantially outperform the state-of-the-art generative Rel-LDA model of Yao et al. (2011): 35.8% F1 and 29.6% F1 for our best model and Rel-LDA, respectively.

The rest of the paper is structured as follows. In the following section, we formally describe the problem. In Section 3, we motivate our approach. In Section 4, we formally describe the method. In Section 5, we describe our experimental setting and discuss the results. We give more background on RE, knowledge base completion and autoencoders in Section 6.

2 Problem Definition

In the most standard form of RE considered in this work, an extractor, given a sentence and a pair of named entities e1 and e2, needs to predict the underlying semantic relation r between these entities.

Figure 1: Inducing relations with discrete-state autoencoders.

For example, in the sentence

Roger Ebert wrote a review of The Fall

we have two entities e1 = Roger Ebert and e2 = The Fall, and the extractor should predict the semantic relation r = REVIEWED.1 The standard approach to this task is to either rely on human-annotated data (i.e., supervised learning) or use data generated automatically by aligning knowledge bases (e.g., Freebase) with text (called distantly-supervised methods). Both classes of approaches assume a predefined inventory of relations and a manually constructed resource.

In contrast, the focus of this paper is on open-domain unsupervised RE (also known as relation discovery), where no fixed inventory of relations is provided to the learner. The methods induce relations from the data itself. Previous work on this task (Banko et al., 2007), as well as on its generalization called unsupervised semantic parsing (Poon and Domingos, 2009; Titov and Klementiev, 2011), groups patterns between entity pairs (e.g., wrote a review, wrote a critique and reviewed) and uses these clusters as relations. Other approaches (e.g., (Shinyama and Sekine, 2006; Yao et al., 2011; Yao et al., 2012; de Lacalle and Lapata, 2013)), including the one introduced in this paper, perform context-sensitive clustering, that is, they treat relations as latent variables and induce them for each entity-pair occurrence individually. Rather than relying solely on a pattern between entity pairs, the latter class of methods can use additional context to decide that Napoleon reviewed the Old Guard and the above sentence about Roger Ebert should not be labeled with the same relation.

1 In some of our examples we will use relation names, although our method, as virtually any other latent variable model, will not induce names but only indices.

Unsupervised relation discovery is an important task because existing knowledge bases (e.g., Freebase, Yago (Suchanek et al., 2007), DBpedia (Auer et al., 2007)) do not have perfect coverage even for the most standard domains (e.g., music or sports), but arguably more importantly because there are many domains not covered by these resources. Though one option is to provide a list of relations with seed examples for each of them and then use bootstrapping (Agichtein and Gravano, 2000), this requires domain knowledge and may thus be problematic. In these cases unsupervised relation discovery is the only non-labour-intensive way to construct a relation extractor. Moreover, unsupervised methods can also aid in building new knowledge bases by providing an initial set of relations which can then be refined.

We focus only on extracting semantic relations, assuming that named entities have already been recognized by an external method (Finkel et al., 2005). As in previous work (Yao et al., 2011), we are not trying to detect whether there is a relation between two entities or not; our aim is to detect a relation between each pair of entities appearing in a sentence. In principle, heuristics (e.g., based on the syntactic dependency paths connecting arguments) can be used to get rid of unlikely pairs.

3 Our Approach

We approach the problem by introducing a latent variable model which defines the interactions between a latent relation r and the observables: the entity pair (e1, e2) and other features of the sentence x. The idea which underlies much of latent variable modeling is that a good latent representation is one which helps us to reconstruct the input (i.e., x, including (e1, e2)). In practice, we are not interested in predicting x, as x is observable, but rather in inducing an appropriate latent representation (i.e., r). Thus, it is crucial to design the model in such a way that a good r (one predictive of x) indeed encodes relations rather than some other form of abstraction.

In our approach, we encode this reconstruction idea very explicitly. As a motivating example, consider the following sentence:

Ebert is the first journalist to win the Pulitzer prize.

As shown in Figure 1, let us assume that we hide one argument, chosen at random: for example, e2 = Pulitzer prize. Now the purpose of the reconstruction component is to reconstruct (i.e., infer) this argument relying on the other argument (e1 = Ebert), the latent relation r and nothing else. At learning time, our inference algorithm will search through the space of potential relation clusterings to find the one which makes these reconstruction tasks as simple as possible. For example, if the algorithm clusters the expression is the first journalist to win together with was awarded, the prediction is likely to be successful, assuming that the passage Ebert was awarded the Pulitzer prize has been observed elsewhere in the training data. On the contrary, if the algorithm clustered is the first journalist to win with presented, we are likely to make a wrong inference (e.g., predict Golden Thumb award). Given that we optimize the reconstruction objective, the former clustering is much more likely than the latter. Reconstruction can be seen as a knowledge base factorization approach similar to the ones of Bordes et al. (2014). Notice that the model's final goal is to learn a good relation clustering, and that the reconstruction objective is used as a means to reach this goal. For reasons which will become clear in a moment, we will refer to the model performing prediction of entities relying on other entities and relations as a decoder (aka the reconstruction component).

Though in the previous paragraph we described it as clustering of patterns, we are inducing clusters in a context-sensitive way. In other words, we are learning an encoder: a feature-rich classifier which predicts a relation for a given sentence and an entity pair in this sentence. Clearly, this is a better approach because some of the patterns between entities are ambiguous and require extra features to disambiguate them (recall the example from the previous section), whereas other patterns may not be frequent enough to induce a reliable clustering (e.g., is the first journalist to win). The encoding and reconstruction components are learned jointly so as to minimize the prediction error. In this way, the encoder is specialized to the defined reconstruction problem.

4 Reconstruction Error Minimization

In order to implement the desiderata sketched in the previous section, we take inspiration from a framework popular in the neural network community, namely autoencoders (Hinton, 1989). Autoencoders are composed of two components: an encoder which predicts a latent representation y from an input x, and a decoder which relies on the latent representation y to recover the input (x). In the learning phase, the parameters of both the encoding and reconstruction parts are chosen so as to minimize a reconstruction error (e.g., the Euclidean distance $\|x - \hat{x}\|^2$ between the input and its reconstruction $\hat{x}$).

Though popular within the neural network community (i.e., with y defined as a real-valued vector), autoencoders have recently been applied to the discrete-state setting (i.e., with y defined as a categorical random variable, a tuple of variables or a graph). For example, such models have been used in the context of dependency parsing (Daume III, 2009), and in the context of POS tagging and word alignment (Ammar et al., 2014; Lin et al., 2015a). The most closely related previous work (Titov and Khoddam, 2015) considers induction of semantic roles of verbal arguments (e.g., an agent, a performer of an action, vs. a patient, an affected entity), though no grouping of predicates into relations was considered. We refer to such models as discrete-state autoencoders.

We use different model families for the encoding and reconstruction components. The encoding part is a log-linear feature-rich model, while the reconstruction part is a tensor (or matrix) factorization model which seeks to reconstruct entities relying on the outcome of the encoding component.

4.1 Encoding component

The encoding component, that is, the actual relation extractor that will be used to process new sentences, is a feature-rich classifier that, given a set of features extracted from the sentence, predicts the corresponding semantic relation r ∈ R. We use a log-linear model ('softmax regression')

$q(r \mid x, w) = \frac{\exp(w^\top g(r, x))}{\sum_{r' \in \mathcal{R}} \exp(w^\top g(r', x))}$,   (1)

where g(r, x) is a high-dimensional feature representation and w is the corresponding vector of parameters. In principle, the encoding model can be any model as long as the relation posteriors q(r|x, w) and their gradients can be efficiently computed or approximated. We discuss the features we use in the experimental section (Section 5).

4.2 Reconstruction component

In the reconstruction component (i.e., decoder), we seek to predict an entity ei ∈ E in a specific position i ∈ {1, 2} given the relation r and another entity e−i, where e−i denotes the complement {e1, e2} \ {ei}. Note that this model does not have access to any features of the sentence; this is crucial, as in this way we ensure that all the essential information is encoded by the relation variable. This bottleneck is needed as it forces the learning algorithm to induce informative relations rather than cluster relation occurrences in a random fashion or assign them all to the same relation.

To simplify our notation, let us assume that we predict e1; the model for e2 will be analogous. We write the conditional probability models in the following form

$p(e_1 \mid e_2, r, \theta) = \frac{\exp(\psi(e_1, e_2, r, \theta))}{\sum_{e' \in \mathcal{E}} \exp(\psi(e', e_2, r, \theta))}$,   (2)

where E is the set of all entities, ψ is a general scoring function which, as we will show, can be instantiated in several ways, and θ represents its parameters. The actual set of parameters represented by θ will depend on the choice of scoring function. However, in all the cases we consider in this paper, the parameters will include entity embeddings ($u_e \in \mathbb{R}^d$ for every e ∈ E). These embeddings will be learned within our model.

In this work we explore three different factorizations for the decoding component: a tensor factorization model, inspired by previous work on relation factorization; a simple selectional-preference model, which scores each argument independently from the other; and a third model, which is a combination of the first two.

4.2.1 ψ_RS: RESCAL

The first reconstruction model we consider is RESCAL, a model very successful in the relational modeling context (Nickel et al., 2011; Chang et al., 2014). It is a restricted version of the classic Tucker tensor decomposition (Tucker, 1966; Kolda and Bader, 2009) and is defined as

$\psi_{RS}(e_1, e_2, r, \theta) = u_{e_1}^\top C_r u_{e_2}$,   (3)

where $u_{e_1}, u_{e_2} \in \mathbb{R}^d$ are the entity embeddings corresponding to the entities e1 and e2. $C_r \in \mathbb{R}^{d \times d}$ is a matrix associated with the latent semantic relation r; it evaluates (i.e., scores) the compatibility between the two arguments of the relation.

4.2.2 ψ_SP: Selectional preferences

The following factorization scores how well each argument fits the selectional preferences of a given relation r:

$\psi_{SP}(e_1, e_2, r, \theta) = \sum_{i=1}^{2} u_{e_i}^\top c_{ir}$,   (4)

where $c_{1r}, c_{2r} \in \mathbb{R}^d$ encode selectional preferences for the first and second argument of the relation r, respectively. This factorization is also known as model E in Riedel et al. (2013). Differently from the previous model, it does not model the interaction between arguments: it is easy to see that p(e1|e2, r, θ) for this model (expression (2)) does not depend on e2 (i.e., on $u_{e_2}$ and $c_{2r}$). Consequently, such a decoder would be more similar to generative models of relations, which typically assume that arguments are conditionally independent (Yao et al., 2011). Note though that our joint model can still capture argument interdependencies in the encoding component. Still, this approach does not fully implement the desiderata described in the previous section, so we generally expect this model to be weaker on reasonably-sized collections (as confirmed in our experimental evaluation).

4.2.3 ψ_HY: Hybrid model

The RESCAL model may be too expressive to be accurately estimated for infrequent relations, whereas the selectional preference model cannot, in turn, capture interdependencies between arguments, so it seems natural to hope that their combination ψ_HY will be more accurate overall:

$\psi_{HY}(e_1, e_2, r, \theta) = u_{e_1}^\top C_r u_{e_2} + \sum_{i=1}^{2} u_{e_i}^\top c_{ir}$.   (5)

This model is very similar to the tensor factorization approach proposed in Socher et al. (2013).
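As a summary of Sections 4.2.1-4.2.3, a minimal numpy sketch of the three scoring functions of expressions (3)-(5); the dimensionalities follow Section 5.3 (d = 30, 100 relations), while the number of entities and the random initialization are placeholder assumptions for illustration.

```python
import numpy as np

d, num_rel, num_ent = 30, 100, 50_000            # d and |R| as in Section 5.3
rng = np.random.default_rng(0)
U  = 0.01 * rng.standard_normal((num_ent, d))    # entity embeddings u_e
C  = 0.01 * rng.standard_normal((num_rel, d, d)) # RESCAL matrices C_r
c1 = 0.01 * rng.standard_normal((num_rel, d))    # selectional preferences, argument 1
c2 = 0.01 * rng.standard_normal((num_rel, d))    # selectional preferences, argument 2

def psi_rescal(e1, e2, r):
    # expression (3): u_{e1}^T C_r u_{e2}
    return U[e1] @ C[r] @ U[e2]

def psi_selpref(e1, e2, r):
    # expression (4): u_{e1}^T c_{1r} + u_{e2}^T c_{2r}
    return U[e1] @ c1[r] + U[e2] @ c2[r]

def psi_hybrid(e1, e2, r):
    # expression (5): sum of the RESCAL and selectional-preference scores
    return psi_rescal(e1, e2, r) + psi_selpref(e1, e2, r)
```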

4.3 Learning

We first provide an intuition behind the objective we optimize. We derive it more formally in the subsequent section, where we show that it can be regarded as a variational lower bound on the pseudolikelihood (Section 4.3.1). As the resulting objective is still computationally expensive to optimize (due to a summation over all potential entities), we introduce further approximations in Section 4.3.2.

The parameters of the encoding and decoding components (i.e., w and θ) are estimated jointly. Our general idea is to optimize the quality of argument prediction while averaging over relations:

$\sum_{i=1}^{2} \sum_{r \in \mathcal{R}} q(r \mid x, w) \log p(e_i \mid e_{-i}, r, \theta)$.   (6)

Though this objective seems natural, it has one serious drawback: the induced posteriors q(r|x, w) end up being extremely sharp, which, in turn, makes the search algorithm more prone to getting stuck in local minima. As we will see in the experimental results, this version of the objective results in lower average performance. As we will discuss in the next section, this behaviour can be explained by drawing connections with variational inference. Roughly speaking, direct optimization of the above objective behaves very much like using hard EM for generative latent-variable models.

Intuitively, one solution is, instead of optimizing expression (6), to consider an entropy-regularized version that favours more uniform posterior distributions q(r|x, w):

$\sum_{i=1}^{2} \sum_{r \in \mathcal{R}} q(r \mid x, w) \log p(e_i \mid e_{-i}, r, \theta) + H(q(\cdot \mid x, w))$,   (7)

where the last term H denotes the entropy. The entropy term can be seen as posterior regularization (Ganchev et al., 2010) which pushes the posterior q(r|x, w) to be more uniform. As we will see in a moment, this approach can be formally justified by drawing connections to variational inference (Jaakkola and Jordan, 1996) and, more specifically, to variational autoencoders (Kingma and Welling, 2014).

4.3.1 Variational inference

This subsection presents a justification for the objectives (6) and (7); a reader not interested in this explanation can safely skip it and proceed directly to Section 4.3.2.

For the moment, let us assume that we perform generative modeling and consider optimization of the following pseudo-likelihood (Besag, 1975) objective:

$\sum_{i=1}^{2} \log \sum_{r \in \mathcal{R}} p(e_i \mid e_{-i}, r, \theta) \, p_u(r)$,   (8)

where $p_u(r)$ is the uniform distribution over relations. Note that currently the encoding model is not part of this objective. The pseudo-likelihood can (by Jensen's inequality) be lower-bounded by the following variational bound:

$\sum_{i=1}^{2} \sum_{r \in \mathcal{R}} q_i(r) \log \big( p(e_i \mid e_{-i}, r, \theta) \, p_u(r) \big) + H(q_i)$,   (9)

where $q_i$ is an arbitrary distribution over relations. Note that $p_u(r)$ can be dropped from the expression, as it is a constant with respect to the choice of both the variational distributions $q_i$ and the (reconstruction) model parameters θ.

In variational inference, the maximization of the original (pseudo-)likelihood objective (8) is replaced with the maximization of expression (9) both with respect to $q_i$ and θ. This is typically achieved with an EM-like step-wise procedure: steps where $q_i$ is selected for a given θ are alternated with steps where the parameters θ are updated while keeping $q_i$ fixed. One idea, introduced by Kingma and Welling (2014) for the continuous case, is to replace the search for an optimal $q_i$ with a predictor (a classifier in our discrete case) trained within the same optimization procedure. Our encoding model q(r|x, w) is exactly such a predictor. With these two modifications (dropping the nuisance term $p_u$ and replacing $q_i$ with q(r|x, w)), we obtain the objective (7).

4.3.2 Approximation

The objective (7) cannot be efficiently optimized in its exact form, as the partition function of expression (2) requires a summation over the entire set of possible entities E. In order to deal with this challenge we rely on the negative sampling approach of Mikolov et al. (2013). Specifically, we avoid the softmax in expression (2) and substitute log p(e1|e2, r, θ) in the objective (7) with the following expression:

$\log \sigma(\psi(e_1, e_2, r, \theta)) + \sum_{e_1^{neg} \in S} \log \sigma(-\psi(e_1^{neg}, e_2, r, \theta))$,

where S is a random sample of n entities from the distribution of entities in the collection and σ is the sigmoid function. Intuitively, this expression pushes up the scores of arguments seen in the text and pushes down the scores of 'negative' arguments. When there are multiple entities e1 which satisfy the relation r with e2, for example, Natasha Obama and Malia Ann Obama in the relation CHILD OF with Barack Obama, the scores for all such entities will be pushed up. Assuming both daughters are mentioned with a similar frequency, they will get similar scores. Generally, arguments more frequently mentioned in text will get higher scores.

In the end, instead of directly optimizing expression (7), we use the following objective:

$\sum_{i=1}^{2} \mathbb{E}_{q(\cdot \mid x, w)} \Big[ \log \sigma(\psi(e_i, e_{-i}, r, \theta)) + \sum_{e_i^{neg} \in S} \log \sigma(-\psi(e_i^{neg}, e_{-i}, r, \theta)) \Big] + \alpha H(q(\cdot \mid x, w))$,   (10)

where $\mathbb{E}_{q(\cdot \mid x, w)}[\ldots]$ denotes an expectation computed with respect to the encoder distribution q(r|x, w). Note the non-negative parameter α: after substituting the softmax with the negative sampling term, the entropy term and the expectation are not on the same scale anymore. Though we could try estimating the scaling parameter α, we chose to tune it on the validation set.
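A sketch of the per-occurrence objective (10), reusing the hypothetical psi_hybrid scoring function from the earlier sketch; the exact expectation over the (small) relation set and the plain Python loops are for clarity only, not an efficient implementation.

```python
import numpy as np

def log_sigmoid(x):
    # numerically stable log sigma(x) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def occurrence_objective(q, e1, e2, negatives, psi, alpha):
    """Objective (10) for one (sentence, e1, e2) occurrence.

    q:         encoder posterior q(r|x, w), one entry per relation
    negatives: entity indices sampled from the corpus entity distribution (the set S)
    psi:       a scoring function such as psi_hybrid from the sketch above
    alpha:     weight of the entropy regularizer
    """
    total = 0.0
    for r, q_r in enumerate(q):
        recon = 2.0 * log_sigmoid(psi(e1, e2, r))     # positive terms for i = 1 and i = 2
        for e_neg in negatives:
            recon += log_sigmoid(-psi(e_neg, e2, r))  # negative sample replacing e1
            recon += log_sigmoid(-psi(e1, e_neg, r))  # negative sample replacing e2
        total += q_r * recon                          # expectation under q(.|x, w)
    entropy = -np.sum(q * np.log(q + 1e-12))          # H(q(.|x, w))
    return total + alpha * entropy
```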

The gradients of the above objective can be calculated using backpropagation. With the above approximation, their computation is quite efficient, since the reconstruction model has a fairly simple form (e.g., bilinear) and learning the encoder is no more expensive than learning a supervised classifier. We used AdaGrad (Duchi et al., 2011) as the optimization algorithm.
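For reference, a minimal sketch of the AdaGrad update (Duchi et al., 2011); the learning rate matches the value reported in Section 5.3, and the epsilon constant is a standard numerical-stability assumption.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad step (gradient ascent on the objective): each parameter gets
    its own step size, shrinking with the accumulated squared gradients."""
    accum += grad ** 2
    param += lr * grad / (np.sqrt(accum) + eps)
    return param, accum
```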

5 Experiments

In this work we evaluate how effective our model is in discovering relations between pairs of entities in a sentence. We consider the unsupervised setting, so we use clustering measures for evaluation.

Since we want to directly compare to Rel-LDA (Yao et al., 2011), we use the transductive set-up: we train our model on the entire training set (with labels removed) and we evaluate the estimated model on a subset of the training set. Given that we train the relation classifier (i.e., the encoding model), unlike some of the previous approaches, there is nothing in our approach which prevents us from applying it in an inductive scenario (i.e., to unseen data).

Towards the end of this section we also provide some qualitative evaluation of the induced relations and entity embeddings.

5.1 Data and evaluation measures

We tested our model on the New York Times corpus (Sandhaus, 2008), using articles from 2000 to 2007. We use the same filtering and preprocessing steps (POS tagging, NER and syntactic parsing) as the ones described in Yao et al. (2011). In that way we obtained about 2 million entity pairs (i.e., potential relation realizations).

In order to evaluate our models, we aligned each entity pair with Freebase and, as in Yao et al. (2012), we discarded unaligned ones from the evaluation. We consider Freebase relations as gold-standard clusterings and evaluated induced relations against them. Note that we use the micro-reading scenario (Nakashole and Mitchell, 2014), that is, we predict a relation on the basis of a single occurrence of an entity pair rather than aggregating information across all the occurrences of the pair in the corpus. Though this is likely to harm our performance when evaluating against Freebase, it is a deliberate choice, as we believe that extracting relations about less frequent entities (where there is little redundancy in a collection) and modelling the content of specific documents is a more challenging and important research direction. Moreover, feature-rich models are likely to be especially beneficial in these scenarios, as for micro-reading the information extraction systems cannot fall back to easier, non-ambiguous contexts.

As the scoring function, we use the B3 metric (Bagga and Baldwin, 1998). B3 is a standard measure for evaluating the precision and recall of clustering tasks (Yao et al., 2012). As the final evaluation score we use F1, the harmonic mean of precision and recall.
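A sketch of the B3 computation (Bagga and Baldwin, 1998) over entity-pair occurrences, with induced relations as the predicted clustering and Freebase relations as the gold clustering; the flat-list interface is an assumption made for illustration.

```python
from collections import defaultdict

def b_cubed_f1(predicted, gold):
    """B^3 F1 for two flat clusterings, given as equal-length lists of
    cluster labels (one label per entity-pair occurrence)."""
    pred_clusters, gold_clusters = defaultdict(set), defaultdict(set)
    for i, (p, g) in enumerate(zip(predicted, gold)):
        pred_clusters[p].add(i)
        gold_clusters[g].add(i)
    precision = recall = 0.0
    for i, (p, g) in enumerate(zip(predicted, gold)):
        overlap = len(pred_clusters[p] & gold_clusters[g])
        precision += overlap / len(pred_clusters[p])   # item-level precision
        recall += overlap / len(gold_clusters[g])      # item-level recall
    precision /= len(predicted)
    recall /= len(predicted)
    return 2 * precision * recall / (precision + recall)

# b_cubed_f1([0, 0, 1, 1], ['a', 'a', 'a', 'b'])  ->  precision 0.75, recall ~0.67, F1 ~0.71
```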

5.2 Features

The crucial characteristic of the learning method we propose is the ability to handle a rich (and overlapping) set of features. With this in mind, we adopted the following set of features:

1. bag of words between e1 and e2;

2. the surface form of e1 and e2;

3. the lemma of the 'trigger'2 (i.e., for the passage Microsoft is based in Redmond, the trigger is based and its lemma is base);

4. the part-of-speech sequence between e1 and e2;

5. the entity type of e1 and e2 (as a pair);

6. the entity type of e1;

7. the entity type of e2;

8. words on the syntactic dependency path between e1 and e2, i.e., the lexicalized path between the entities, stripped of the dependency relations and their direction.

2 We define triggers as in Yao et al. (2011), namely "all the words on the dependency path except stop words".

For example, from the sentence

Stephen Moore, director of fiscal policy studies at the conservative Cato Institute,

we would extract the following features:

1. BOW:director, BOW:of, BOW:fiscal, BOW:policy, BOW:studies, BOW:at, BOW:the;

2. E1:Stephen Moore, E2:Cato Institute;

3. Trigger:director;

4. PoS:NN IN JJ NN NNS IN DT JJ;

5. PairType:PERSON ORGANIZATION;

6. E1Type:PERSON;

7. E2Type:ORGANIZATION;

8. Path:director at.
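A sketch of how these feature templates could be instantiated for one entity-pair occurrence; the token spans, PoS tags, entity types, trigger lemma and dependency-path words are assumed to be supplied by the external preprocessing pipeline (Section 5.1), and the string formats simply mirror the example above.

```python
def extract_features(tokens, pos, e1, e2, e1_type, e2_type, trigger, dep_path):
    """Return the feature strings of Section 5.2 for one occurrence.

    e1, e2 are (start, end) token spans of the two entities; all other fields
    are assumed outputs of the tagger, NER system and parser."""
    between = slice(e1[1], e2[0])                       # span between the two entities
    feats = [f"BOW:{w}" for w in tokens[between]]       # 1. bag of words
    feats += [
        "E1:" + " ".join(tokens[slice(*e1)]),           # 2. surface forms
        "E2:" + " ".join(tokens[slice(*e2)]),
        f"Trigger:{trigger}",                           # 3. trigger lemma
        "PoS:" + " ".join(pos[between]),                # 4. PoS sequence
        f"PairType:{e1_type} {e2_type}",                # 5. entity-type pair
        f"E1Type:{e1_type}",                            # 6. entity type of e1
        f"E2Type:{e2_type}",                            # 7. entity type of e2
        "Path:" + " ".join(dep_path),                   # 8. lexicalized dependency path
    ]
    return feats
```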

5.3 Parameters and baselines

All model parameters (w, θ) are initialized randomly. The embedding dimensionality d was set to 30. The number of relations to induce is 100, the same as used for Rel-LDA in Yao et al. (2011). We also set the mini-batch size to 100, the initial learning rate of AdaGrad to 0.1 and the number of negative samples n to 20. The results reported in Table 1 are the average results of three runs, obtained after 5 iterations over the entire training set. For each model we tuned the weight for the L2 regularization penalty and chose 0.1, as it worked well across all the models. We tuned the α coefficient (i.e., the weight of the entropy term in expression (10)) for each model and chose 0.25 for RESCAL, 0.01 for the selectional preference model and 0.1 for the hybrid model. All the model selection was performed on a validation set: as a validation set we selected a random 20% of the entire dataset and considered entity pairs aligned to Freebase. The final evaluation was done on the remaining 80%.
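For reference, the hyperparameter values reported above collected in one place; the dictionary layout and names are ours, the values are those from this section.

```python
# Hyperparameters reported in Section 5.3 (names are ours, values are the paper's).
HYPERPARAMS = {
    "embedding_dim": 30,            # d
    "num_relations": 100,           # same as Rel-LDA in Yao et al. (2011)
    "minibatch_size": 100,
    "adagrad_learning_rate": 0.1,
    "negative_samples": 20,         # n
    "l2_penalty": 0.1,
    "epochs": 5,                    # iterations over the training set
    "alpha": {"RESCAL": 0.25, "SelectionalPref": 0.01, "Hybrid": 0.1},
}
```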

Model                               F1 (%)
RESCAL                              34.5 ± 1.3
Selectional Pref.                   33.4 ± 1.1
Hybrid                              35.8 ± 2.0
Rel-LDA (our feats)                 29.6 ± 0.9
Rel-LDA (Yao et al., 2012 feats)    26.3 ± 0.8
HAC (DIRT)                          28.3

Table 1: Average F1 results (%), and the standard deviation, across 3 runs of different models on the test set.

In order to compare our models with the state of the art in unsupervised RE, we used as a baseline the Rel-LDA model introduced in Yao et al. (2011). Rel-LDA is an application of the LDA topic model (Blei et al., 2003) to the relation discovery task. In Rel-LDA, topics correspond to latent relations and, instead of relying on words as LDA does, Rel-LDA uses predefined features, including argument words. Similarly to our selectional-preference decoder, it assumes that arguments are conditionally independent given the relation. As another baseline, following Yao et al. (2012), we used hierarchical agglomerative clustering (HAC). This baseline is very similar to the standard unsupervised relation extraction method DIRT (Lin and Pantel, 2001). The HAC cut-off parameter was set to 0.95 based on the development set performance. We used the same feature representation for all the models, including the baselines. We also report results of Rel-LDA using the features from Yao et al. (2012).3

5.4 Results and discussion

The results we report in Table 1 are averages across 3 runs with different random initializations of the parameters (except for the deterministic HAC approach); we also report the standard deviation. First, we can observe that using richer features is beneficial for the generative baseline: it led to a substantial improvement in F1 (from 26.3% to 29.6% F1). Also, the HAC baseline is outperformed by Rel-LDA (29.6% vs. 28.3% F1). However, all the proposed models substantially outperform both baselines: the best result is 35.8% F1.

The selectional preference model on average performs better than the baseline (33.4% vs. 29.6% F1). As we predicted in Section 4, compared with the RESCAL model, the selectional preference model has slightly lower performance (34.5% vs. 33.4% F1). This is not surprising, as the argument independence assumption is very strong, and the general motivation we provided in Section 2 does not really apply to the selectional preference model.

3 Yao et al. (2012) is a follow-up work for Yao et al. (2011).

Figure 2: Results of the hybrid model on the validation set, with different α.

Combining the RESCAL and selectional preference models, as we expected, gave some advantage in terms of performance. The hybrid model is the best performing model with 35.8% F1, and it is, on average, 6.2% more accurate than Rel-LDA.

Introducing the entropy term in expression (7) does not only add an extra justification to the objective we optimize, but also helps to improve the models' performance. In fact, as shown in Figure 2 for the Hybrid model, including or excluding the entropy term makes a big difference: performance goes from 23.9% (without regularization) to 34.3% F1 (with regularization). Note that the method is quite stable within the range α ∈ [0.1, 1]; however, more fine-grained tuning of α seems only mildly beneficial. The performance with small values of α (0.01) is more problematic: Hybrid both does not outperform Rel-LDA and has a large variance across runs. Somewhat counter-intuitively, with α = 0 (no entropy regularization) the variance is almost negligible. However, given the low performance in this regime, this probably just means that we consistently get stuck in equally bad local minima.

Though it may seem that the need to tune the entropy term weight is an unfortunate side effect of using the non-probabilistic objective (Section 4.3.2), the reality is more subtle. In fact, even for fully probabilistic variational autoencoders with real-valued states y, using the weight of 1, as prescribed by their variational interpretation (see Section 4.3.1), does not lead to stable performance (Bowman et al., 2015). Instead, annealing over α seems necessary. Though annealing is likely to benefit our method as well, we leave it for future work.

Relation 66    Relation 62           Relation 19
president      review                professor
director       review restaurant     dean
chairman       review production     graduate
executive      review book           director
spokesman      review performance    specialist
manager        column review         attend
analyst        review concert        expert
owner          review revival        professor study
professor      review rise           chairman

Table 2: Relation clusters ordered from left to right by their frequency.

Since the proposed model is unsupervised, it is interesting to inspect the relations induced by our best model. In order to visualize them, we select the most likely relation according to our relation extractor (i.e., the encoding model) for every context in the validation set and then, for every relation, we count the occurrences of every trigger. The most frequent triggers for three induced relations are presented in Table 2. Relation 62 encodes the relation REVIEWED (not present in Freebase), as in

Anthony Tommasini reviews Metropolitan Opera's production of Cosi Fan Tutte.

Clusters 19 and 66 are examples of more coarse relations. Relation 19 represents a broader ACADEMIC relation, as in the passage

Dr. Susan Merritt, dean of the School of Computer Science and Information Systems.

or as in the passage

George Myers graduated from Yale University.

Cluster 66 instead groups together expressions such as leads or president (of) (so it can vaguely be described as a LEADERSHIP relation), but it also contains the relation triggered by the word professor (in). In fact, this is the most frequent relation induced by our model. We can check further by looking into the learned embeddings of named entities, visualized with the t-SNE algorithm (Van der Maaten and Hinton, 2008). In Figure 3, we can see that some entities representing universities and other non-academic organizations end up being very close in the embedding space. This artefact is likely to be related to the coarseness of Relations 66 and 19, though it does not really provide a good explanation for why this has happened, as the entity embeddings are also induced within our model.

Semisup RESCAL              62.3
Semisup Selectional Pref.   58.1
Semisup Hybrid              61.5
Unsup Hybrid                34.3

Table 3: Average F1 results (%) for semi-supervised and unsupervised models, across 3 runs of different models tested on Te.

However, overlaps in embeddings do not seem to be a general problem: the t-SNE visualization shows that most entities are well clustered into fine-grained types, for example, football teams, nations, and music critics.

Figure 3: t-SNE visualization of entity embeddings learned during the training process (cluster labels in the figure include political organizations, universities, and general organizations).
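A sketch of the visualization step with scikit-learn's t-SNE implementation; the trained entity-embedding matrix U (one row per entity) and the entity-name labels are assumed to come from the model, and the subsampling of labels is just to keep the plot readable.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_entity_embeddings(U, labels):
    """Project the learned entity embeddings (num_entities x d) to 2D with
    t-SNE and scatter-plot them, annotating a subset of the points."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(U)
    plt.figure(figsize=(8, 8))
    plt.scatter(coords[:, 0], coords[:, 1], s=3)
    step = max(1, len(labels) // 50)            # annotate roughly 50 entities
    for i in range(0, len(labels), step):
        plt.annotate(labels[i], coords[i])
    plt.show()
```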

5.5 Decoder influence

In order to shed some light on the influence of the decoder on model performance, we performed additional experiments in a more controlled setting. We reduced the dataset to entity pairs participating in Freebase relations, ending up with a total of about 42,000 relation realizations. We randomly split the dataset in two. We used the first half as a test set Te, while we used the second half as a training set Tr. We further randomly split the training set Tr in two parts, Tr1 and Tr2. We use Tr1 as a (distantly) labeled dataset to learn only the decoding part of each proposed model. To make it comparable to our unsupervised models with 100 induced relations, we trained the decoder on the 99 most frequent Freebase relations plus a further OTHER relation, which is a union of the remaining, less frequent, relations. This approach is similar to the KB factorization adopted in Bordes et al. (2011). With the decoder learned and fixed, we trained the encoder part on unlabeled examples in Tr2, while leveraging the previously trained decoder. In other words, we optimize the objective (10) on Tr2 but update only the encoder parameters w.4 In this setting the decoder provides a learning signal to the encoder. The better the generalization properties of the decoder are, the better the resulting encoder should be. We expect more expressive decoders (i.e., RESCAL and Hybrid) to be able to capture relations better than the selectional preference model and, thus, yield better encoders. In order to have a lower bound for the semi-supervised models, we also trained our best model from the previous experiments (Hybrid) on Tr2 in a fully unsupervised way. All these models are tested on the test set Te.

As expected, all the models with the supervised decoder are much more accurate than the unsupervised model (Table 3). The best results with a supervised decoder are obtained by the RESCAL model with 62.3% F1, while the result of the unsupervised hybrid model is 34.3% F1. As expected, the RESCAL and Hybrid models outperform the selectional preference model in this setting (62.3% and 61.5% vs. 58.1% F1, respectively). Somewhat surprisingly, the RESCAL model is slightly more accurate (by 0.8% F1) than the hybrid model. These experiments confirm that more accurate decoder models lead to better performance of the encoder. The results also hint at a potential extension of our approach to a more realistic semi-supervised setting, though we leave any serious investigation of this set-up for future work.

4 In fact, we were also updating the embeddings of entities not appearing in Tr1.

6 Additional Related Work

In this section, we mostly consider lines of related work not discussed in other sections of the paper, and we emphasize their relation to our approach.

Distant supervision. These methods can be regarded as a halfway point between unsupervised and supervised methods. In order to train their models, they use data generated automatically by aligning knowledge bases (e.g., Freebase and Wikipedia infoboxes) with text (Mintz et al., 2009; Riedel et al., 2010; Surdeanu et al., 2012; Zeng et al., 2015). Similarly to our method, they can use feature-rich models without the need to manually annotate data. However, a relation extractor trained in this way will only be able to predict relations already present in a knowledge base. These methods cannot be used to discover new relations. The framework we propose is completely unsupervised and does not have this shortcoming.

Bootstrapping. Bootstrapping methods for relation extraction (Agichtein and Gravano, 2000; Brin, 1998; Batista et al., 2015) iteratively label new examples by finding the ones which are the most similar, according to some similarity function, to a seed set of labeled examples. The process continues until some convergence criterion is met. Even though this approach is not very labor-intensive (i.e., it requires only a few manually labeled examples for the initial seed set), it requires some domain knowledge from the model designer. In contrast, unsupervised models are domain-agnostic and require only unlabeled text.

Knowledge base factorization. Knowledge base completion via matrix or tensor factorization has received a lot of attention in the past years (Bordes et al., 2011; Jenatton et al., 2012; Weston et al., 2013; Bordes et al., 2013; Socher et al., 2013; García-Durán et al., 2014; Bordes et al., 2014; Lin et al., 2015b; Chang et al., 2014; Nickel et al., 2011). However, differently from what we propose here, namely induction of new relations, these models factorize relations already present in knowledge bases.

Universal schema methods (Riedel et al., 2013) use factorization models to infer facts (e.g., predict missing entities); however, they do not attempt to induce relations. In other words, they consider each given context as a relation and induce an embedding for each of them. They do not attempt to induce a clustering over the contexts. Our work can be regarded as an extension of these methods.

Autoencoders with discrete states. Aside from the work cited above (Daume III, 2009; Ammar et al., 2014; Titov and Khoddam, 2015; Lin et al., 2015a), we are not aware of previous work using autoencoders with discrete states (i.e., a categorical latent variable or a graph). The semi-supervised version of variational autoencoders (Kingma et al., 2014) used a combination of a real-valued vector and a categorical variable as its hidden representation and yielded impressive results on the MNIST image classification task. However, their approach cannot be directly applied to unsupervised classification, as there is no reason to believe that latent classes would be captured by the categorical variable rather than in some way represented by the real-valued vector.

The only other application of variational autoencoders to natural language is the very recent work of Bowman et al. (2015). They study language modeling with recurrent language models and consider only real-valued vectors as states. Generative models with rich features have also been considered in the past (Berg-Kirkpatrick et al., 2010). However, autoencoders are potentially more flexible than generative models, as they can use very different encoding and decoding components and can be faster to train.

7 Conclusions and Discussion

We presented a new method for unsupervised relation extraction.5 The model consists of a feature-rich classifier, which predicts relations, and a tensor factorization component, which relies on relations to infer left-out arguments. These models are jointly estimated by optimizing the argument reconstruction objective.

We studied three alternative factorization models building on ideas from knowledge base factorization and selectional preference modeling. We empirically showed that our factorization models yield relation extractors more accurate than state-of-the-art generative and agglomerative clustering baselines.

As the proposed modeling framework is quite flexible, the model can be extended in many different ways. Our approach can be regarded as learning semantic representations so as to be informative for basic inference tasks (in our case, the inference task was recovering individual arguments). More general classes of inference tasks can be considered in future work. Moreover, it would be interesting to evaluate the model on how accurately it infers these facts (rather than only on the quality of the induced latent representations). The work presented in this paper can also be combined with the approach of Titov and Khoddam (2015) to induce both relations and semantic roles (i.e., essentially to induce semantic frames (Fillmore, 1976)). Another potential direction is the use of labeled data: our feature-rich model (namely its discriminative encoding component) is likely to have much better asymptotic performance than its generative counterpart, and, consequently, labeled data should be much more beneficial.

5 github.com/diegma/relation-autoencoder

Acknowledgments

This work is supported by NWO Vidi Grant 016.153.327, a Google Focused Award on Natural Language Understanding, and partially supported by an ISTI Grant for Young Mobility. The authors thank the action editor and the anonymous reviewers for their valuable suggestions and Limin Yao for answering our questions about data and baselines.

References

Eugene Agichtein and Luis Gravano. 2000. Snowball: extracting relations from large plain-text collections. In ACM DL.

Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In NIPS.

Soren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. 2007. DBpedia: A nucleus for a web of open data. In ISWC.

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In LREC.

Michele Banko and Oren Etzioni. 2008. The tradeoffs between open and traditional relation extraction. In ACL.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In IJCAI.

David S. Batista, Bruno Martins, and Mario J. Silva. 2015. Semi-supervised bootstrapping of relationship extractors with distributional semantics. In EMNLP.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Cote, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In HLT-NAACL.

Julian Besag. 1975. Statistical analysis of non-lattice data. The Statistician, pages 179-195.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.

Kurt D. Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD.

Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. 2011. Learning structured embeddings of knowledge bases. In AAAI.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Irreflexive and hierarchical relations as translations. In SLG.

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2014. A semantic matching energy function for learning with multi-relational data - application to word-sense disambiguation. Journal of Machine Learning, 94(2):233-259.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. CoRR, abs/1511.06349.

Sergey Brin. 1998. Extracting patterns and relations from the world wide web. In WebDB.

Kai-Wei Chang, Wen-tau Yih, Bishan Yang, and Christopher Meek. 2014. Typed tensor decomposition of knowledge bases for relation extraction. In EMNLP.

Hal Daume III. 2009. Unsupervised search-based structured prediction. In ICML.

Oier Lopez de Lacalle and Mirella Lapata. 2013. Unsupervised relation extraction with general domain knowledge. In EMNLP.

John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159.

Charles J. Fillmore. 1976. Frame semantics and the nature of language. Annals of the New York Academy of Sciences, 280(1):20-32.

Jenny Rose Finkel, Trond Grenager, and Christopher D. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL.

Kuzman Ganchev, Joao Graca, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001-2049.

Alberto García-Durán, Antoine Bordes, and Nicolas Usunier. 2014. Effective blending of two and three-way interactions for modeling multi-relational data. In ECML-PKDD.

Lise Getoor and Ben Taskar. 2007. Introduction to statistical relational learning. MIT Press.

Geoffrey E. Hinton. 1989. Connectionist learning procedures. Artificial Intelligence, 40(1-3):185-234.

Tommi S. Jaakkola and Michael I. Jordan. 1996. Computing upper and lower bounds on likelihoods in intractable networks. In UAI.

Rodolphe Jenatton, Nicolas Le Roux, Antoine Bordes, and Guillaume Obozinski. 2012. A latent factor model for highly multi-relational data. In NIPS.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In ICLR.

Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In NIPS.

Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review, 51(3):455-500.

Dekang Lin and Patrick Pantel. 2001. DIRT - discovery of inference rules from text. In SIGKDD.

Chu-Cheng Lin, Waleed Ammar, Chris Dyer, and Lori S. Levin. 2015a. Unsupervised POS induction with word embeddings. In NAACL HLT.

Yankai Lin, Zhiyuan Liu, Huan-Bo Luan, Maosong Sun, Siwei Rao, and Song Liu. 2015b. Modeling relation paths for representation learning of knowledge bases. In EMNLP.

Xitong Liu, Fei Chen, Hui Fang, and Min Wang. 2014. Exploiting entity relationship for query expansion in enterprise search. Information Retrieval, 17(3):265-294.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In ICLR.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL-AFNLP.

Ndapandula Nakashole and Tom M. Mitchell. 2014. Micro reading with priors: Towards second generation machine readers. In AKBC at NIPS.

Ndapandula Nakashole, Gerhard Weikum, and Fabian M. Suchanek. 2012. PATTY: A taxonomy of relational patterns with semantic types. In EMNLP.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In ICML.

Hoifung Poon and Pedro M. Domingos. 2009. Unsupervised semantic parsing. In EMNLP.

Deepak Ravichandran and Eduard H. Hovy. 2002. Learning surface text patterns for a question answering system. In ACL.

Philip Resnik. 1997. Selectional preference and sense disambiguation. In ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In ECML-PKDD.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In NAACL.

Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia, 6(12).

Diarmuid O Seaghdha. 2010. Latent variable models of selectional preference. In ACL.

Yusuke Shinyama and Satoshi Sekine. 2006. Preemptive information extraction using unrestricted relation discovery. In NAACL HLT.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In WWW.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In EMNLP-CoNLL.

Ilya Sutskever, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. 2009. Modelling relational data using Bayesian clustered tensor factorization. In NIPS.

Idan Szpektor, Hristo Tanev, Ido Dagan, and Bonaventura Coppola. 2004. Scaling web-based acquisition of entailment relations. In EMNLP.

Ivan Titov and Ehsan Khoddam. 2015. Unsupervised induction of semantic roles within a reconstruction-error minimization framework. In NAACL.

Ivan Titov and Alexandre Klementiev. 2011. A Bayesian model for unsupervised semantic parsing. In ACL.

L. R. Tucker. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279-311.

Tim Van de Cruys. 2010. A non-negative tensor factorization model for selectional preference induction. Journal of Natural Language Engineering, 16(4):417-437.

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(2579-2605):85.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In ICML.

Jason Weston, Antoine Bordes, Oksana Yakhnenko, and Nicolas Usunier. 2013. Connecting language and knowledge bases with embedding models for relation extraction. In EMNLP.

Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. 2011. Structured relation discovery using generative models. In EMNLP.

Limin Yao, Sebastian Riedel, and Andrew McCallum. 2012. Unsupervised relation discovery with sense disambiguation. In ACL.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP.