PACLIC 29: The 29th Pacific Asia Conference on Language, Information and Computation, pages 96–105, Shanghai, China, October 30 – November 1, 2015. Copyright 2015 by Sho Takase, Naoaki Okazaki and Kentaro Inui.


Fast and Large-scale Unsupervised Relation Extraction

Sho Takase† Naoaki Okazaki†‡ Kentaro Inui†

†Graduate School of Information Sciences, Tohoku University

‡Japan Science and Technology Agency (JST)

    {takase, okazaki, inui}@ecei.tohoku.ac.jp

    Abstract

A common approach to unsupervised relation extraction builds clusters of patterns expressing the same relation. In order to obtain clusters of relational patterns of good quality, we face two major challenges: the semantic representation of relational patterns and the scalability to large data. In this paper, we explore various methods for modeling the meaning of a pattern and for computing the similarity of patterns mined from huge data. In order to achieve this goal, we apply algorithms for approximate frequency counting and efficient dimension reduction to unsupervised relation extraction. The experimental results show that approximate frequency counting and dimension reduction not only speed up similarity computation but also improve the quality of pattern vectors.

    1 Introduction

Semantic relations between entities are essential for many NLP applications such as question answering, textual inference and information extraction (Ravichandran and Hovy, 2002; Szpektor et al., 2004). Therefore, it is important to build a comprehensive knowledge base consisting of instances of semantic relations (e.g., authorOf) such as authorOf⟨Franz Kafka, The Metamorphosis⟩. To recognize these instances in a corpus, we need to obtain patterns (e.g., “X write Y”) that signal instances of the semantic relations.

For a long time, many studies have targeted the extraction of instances and patterns of specific relations (Riloff, 1996; Pantel and Pennacchiotti, 2006; De Saeger et al., 2009). In recent years, to acquire a wider range of knowledge, Open Information Extraction (Open IE) has received much attention (Banko et al., 2007). Open IE identifies relational patterns and instances automatically without predefined target relations (Banko et al., 2007; Wu and Weld, 2010; Fader et al., 2011; Mausam et al., 2012). In other words, Open IE acquires knowledge to handle open domains. In the Open IE paradigm, it is necessary to enumerate semantic relations in open domains and to learn mappings between surface patterns and semantic relations. This task is called unsupervised relation extraction (Hasegawa et al., 2004; Shinyama and Sekine, 2006; Rosenfeld and Feldman, 2007).

A common approach to unsupervised relation extraction builds clusters of patterns that represent the same relation (Hasegawa et al., 2004; Shinyama and Sekine, 2006; Yao et al., 2011; Min et al., 2012; Rosenfeld and Feldman, 2007; Nakashole et al., 2012). In brief, each cluster includes patterns corresponding to a semantic relation. For example, consider three patterns, “X write Y”, “X is author of Y” and “X is located in Y”. When we group these patterns into clusters representing the same relation, the patterns “X write Y” and “X is author of Y” form a cluster representing the relation authorOf, and the pattern “X is located in Y” forms a cluster for locatedIn. In order to obtain these clusters, we need to know the similarity between patterns. The better we model the similarity of patterns, the better a clustering result corresponds to semantic relations. Thus, the similarity computation between patterns is crucial for unsupervised relation extraction.


We face two major challenges in computing the similarity of patterns. First, it is not clear how to represent the semantic meaning of a relational pattern. Previous studies define a feature space for patterns and express the meaning of patterns using, for example, the co-occurrence statistics between a pattern and an entity pair, e.g., co-occurrence frequency and pointwise mutual information (PMI) (Lin and Pantel, 2001). Some studies employed vector representations of a fixed dimension, e.g., Principal Component Analysis (PCA) (Collins et al., 2002) and Latent Dirichlet Allocation (LDA) (Yao et al., 2011; Riedel et al., 2013). However, the previous work did not compare the effectiveness of these representations when applied to a collection of large-scale unstructured texts.

Second, we need to design a method that is scalable to large data. In Open IE, we utilize a large amount of data in order to improve the quality of unsupervised relation extraction. For this reason, we cannot use a complex and inefficient algorithm that consumes excessive computation time and memory. In this paper, we explore methods for computing pattern similarity of good quality that are scalable to huge data, for example, with several billion sentences. In order to achieve this goal, we utilize approximate frequency counting and dimension reduction. Our contributions are threefold.

• We build a system for unsupervised relation extraction that is practical and scalable to large data.

• Even though the proposed system introduces approximations, we demonstrate that it exhibits performance comparable to the system without approximations.

• Comparing several representations of pattern vectors, we discuss a reasonable design for representing the meaning of a pattern.

    2 Methods

    2.1 Overview

As mentioned in Section 1, the semantic representation of relational patterns is key to unsupervised relation extraction. Based on the distributional hypothesis (Harris, 1954), we model the meaning of a relational pattern with the distribution of entity pairs co-occurring with the pattern. For example, the meaning of the relational pattern “X write Y” is represented by the distribution of the entity pairs that fill the variables (X, Y) in a corpus. By using vector representations of relational patterns, we can compute the semantic similarity of two relational patterns; for example, we can infer that the patterns “X write Y” and “X is author of Y” have similar meanings if the distribution of entity pairs for the pattern “X write Y” is similar to that for the pattern “X is author of Y”.

Researchers have explored various approaches to vector representations of relational patterns (Lin and Pantel, 2001; Rosenfeld and Feldman, 2007; Yao et al., 2011; Riedel et al., 2013). The simplest approach is to define a vector of a relational pattern in which each element presents the co-occurrence frequency between the pattern and an entity pair. However, the use of raw frequency counts may be inappropriate when some entity pairs co-occur with a large number of patterns. A solution to this problem is to use a refined co-occurrence measure such as PMI (Lin and Pantel, 2001). In addition, we may compress a vector representation with a dimensionality reduction method such as PCA because pattern vectors tend to be sparse and high dimensional (Yao et al., 2011; Riedel et al., 2013).

Meanwhile, it may be difficult to implement the above procedures so that they can handle a large amount of data. Consider the situation where we find 1.1 million entity pairs and 0.7 million relational patterns in a corpus with 15 billion sentences. Even though the vector space is sparse, we need to keep a huge number of frequency counts that record co-occurrences of the entity pairs and relational patterns in the corpus.
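To see the scale concretely (an illustrative back-of-the-envelope estimate, not a figure reported in the paper): a dense count matrix over 0.7 million patterns and 1.1 million entity pairs would contain 0.7 × 10^6 × 1.1 × 10^6 ≈ 7.7 × 10^11 cells, i.e., roughly 3 TB even at 4 bytes per count, so the counts must be kept sparse, and even the sparse record of roughly 1.5 billion triple mentions is costly to hold in memory.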

In order to acquire pattern vectors from a large amount of data, we explore two approaches in this study. One approach is to apply an algorithm for approximate counting so that we can discard unimportant information in preparing pattern vectors. Another approach is to utilize distributed representations of words so that we can work in a semantic space of a fixed and compact size. In short, the former approach reduces the memory usage for computing statistics, whereas the latter compresses the vector space beforehand.


[Figure 1: Overview of the system for unsupervised relation extraction. From a large corpus (15 billion sentences) the system extracts entities (one million), entity pairs (1.1 million), patterns (0.7 million), and triples (1.5 billion), and also obtains word vectors. Exact co-occurrence counting followed by dimension reduction and similarity calculation is the non-scalable route; approximate co-occurrence counting and building vectors from word vectors are the scalable methods.]

Figure 1 gives an overview of the system for unsupervised relation extraction presented in this paper. We extract a collection of triples, each of which consists of an entity pair and a relational pattern (Section 2.2). Because this step may extract meaningless triples, we identify entity pairs and relational patterns occurring frequently in the corpus. We then compute co-occurrence statistics of entity pairs and relational patterns to obtain pattern vectors. Section 2.3 describes this process, followed by an efficient variant of PCA in Section 2.4. Furthermore, we present two approaches that improve the scalability to large data in Section 2.5.

2.2 Extracting triples

In this study, we define a triple as a combination of an entity pair and a relational pattern that connects the two entities. In order to extract meaningful triples from a corpus, we mine a set of entities and relational patterns in an unsupervised fashion.

2.2.1 Extracting entities

We define an entity mention as a sequence of nouns. Because quite a few entity mentions consist of two or more nouns (e.g., “Roadside station” and “Franz Kafka”), we adapt a simple statistical method (Mikolov et al., 2013) to recognize noun phrases. Equation 1 computes the score of a noun bigram wiwj:

score(wi, wj) = cor(wi, wj) × dis(wi, wj),   (1)

cor(wi, wj) = log( (f(wi, wj) − δ) / (f(wi) × f(wj)) ),   (2)

dis(wi, wj) = ( f(wi, wj) / (f(wi, wj) + 1) ) × ( min{f(wi), f(wj)} / (min{f(wi), f(wj)} + 1) ).   (3)

Here, f(wi) denotes the frequency of the noun wi, and f(wi, wj) denotes the frequency of the noun bigram wiwj. The parameter δ is a constant value to remove infrequent noun sequences. Consequently, cor(wi, wj) represents the degree of the connection between wi and wj. However, cor(wi, wj) becomes undesirably large if either f(wi) or f(wj) is small. We introduce the function dis(wi, wj) to ‘discount’ such sequences.

We form noun phrases whose scores are greater than a threshold. In order to obtain noun phrases longer than two words, we run the procedure four times, decreasing the threshold value1. In this way, we can find, for example, “Franz Kafka” as an entity in the first run and “Franz Kafka works” in the second run. After identifying a set of noun phrases, we count the frequency of the noun phrases in the corpus and extract noun phrases occurring no less than 1,000 times as the set of entities.
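As an illustration of this step, the following is a minimal Python sketch of the scoring and merging pass (our own illustration, not the authors' implementation). It assumes the corpus is already tokenized and restricted to candidate noun sequences; the value of δ is a placeholder, and the threshold values follow footnote 1.

import math
from collections import Counter

def phrase_score(f_i, f_j, f_ij, delta):
    # Equations (1)-(3): connection strength discounted for infrequent words.
    if f_ij <= delta:
        return float("-inf")
    cor = math.log((f_ij - delta) / (f_i * f_j))            # Eq. (2)
    m = min(f_i, f_j)
    dis = (f_ij / (f_ij + 1.0)) * (m / (m + 1.0))           # Eq. (3)
    return cor * dis                                        # Eq. (1)

def merge_noun_phrases(sentences, delta=5, thresholds=(10, 5, 0, 0)):
    # One merging pass per threshold, so phrases longer than two words
    # can be built up incrementally ("Franz Kafka" -> "Franz Kafka works").
    for threshold in thresholds:
        unigram, bigram = Counter(), Counter()
        for tokens in sentences:
            unigram.update(tokens)
            bigram.update(zip(tokens, tokens[1:]))
        merged_corpus = []
        for tokens in sentences:
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and phrase_score(
                        unigram[tokens[i]], unigram[tokens[i + 1]],
                        bigram[(tokens[i], tokens[i + 1])], delta) > threshold:
                    out.append(tokens[i] + " " + tokens[i + 1])  # merge the bigram
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            merged_corpus.append(out)
        sentences = merged_corpus
    return sentences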

2.2.2 Extracting entity pairs

After determining a set of entities, we discover entity pairs that may have semantic relationships in order to locate relational patterns. In this study, we extract a pair of entities if the entities co-occur in more than 5,000 sentences. We denote the set of entity pairs extracted by this procedure as E.

2.2.3 Extracting patterns

As a relational pattern, this study employs the shortest path between two entities in a dependency tree, following previous work (Wu and Weld, 2010; Mausam et al., 2012; Akbik et al., 2012). Here, we introduce a restriction that a relational pattern must include a predicate in order to reject semantically ambiguous patterns such as “X of Y”.

1The threshold values are 10 (first time), 5 (second time), and 0 (third and fourth times).


[Figure 2: Example of a parsed sentence and extracted patterns. The sentence “Kafka wrote The Metamorphosis in Germany.” has the dependency edges nsubj (Kafka ← wrote), dobj (The Metamorphosis ← wrote), prep_in (Germany ← wrote), and det; the extracted patterns are “X ←nsubj− wrote −dobj→ Y”, “X ←nsubj− wrote −prep_in→ Y”, and “X ←dobj− wrote −prep_in→ Y”.]

Additionally, we convert each entity into a variable (i.e., X or Y). Consider the sentence shown in Figure 2 as an example. An arrow between words expresses a dependency relationship. This sentence contains three entities: “Kafka”, “The Metamorphosis” and “Germany”. Therefore, we obtain three patterns: “X ←nsubj− wrote −dobj→ Y”, “X ←nsubj− wrote −prep_in→ Y” and “X ←dobj− wrote −prep_in→ Y”2. Counting the frequency of each pattern, we extract those appearing no less than 1,500 times in the corpus. We denote the set of relational patterns P hereafter.

    2.3 Building pattern vectors

We define the vector of a relational pattern as the distribution of entity pairs co-occurring with the pattern. Processing the whole corpus, we extract mentions of triples (p, e), p ∈ P, e ∈ E. For example, we obtain the triple

p = X ←nsubj− wrote −dobj→ Y,
e = ⟨“Franz Kafka”, “The Metamorphosis”⟩,

from the sentence “Franz Kafka wrote The Metamorphosis.” The meaning of the pattern p is represented by the distribution of the entity pairs co-occurring with the pattern.

In this study, we compare two statistical measures of co-occurrence: the raw frequency (FREQ) and PMI (PMI). In the FREQ setting, a relational pattern p is represented by a vector whose elements present the co-occurrence frequency f(p, e) with every entity pair e ∈ E.

2In the experiments, we use a collection of Japanese Web pages. However, we explain the procedure with an English sentence because the procedure for extracting patterns is universal to other languages.

PMI refines the strength of co-occurrence with the following equation:

PMI(p, e) = log( (f(p, e)/M) / ( (Σi∈P f(i, e)/M) × (Σj∈E f(p, j)/M) ) ) × dis(p, e).   (4)

Here, f(p, e) presents the frequency of co-occurrence between a pattern p and an entity pair e, and M = Σi∈P Σj∈E f(i, j). The discount factor dis(p, e) is defined similarly to Equation 3. In the PMI setting, we set the value for an entity pair e to zero if PMI(p, e) < 0.
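The following is a minimal sketch of this weighting (our illustration; the variable names are ours, and the discount is one plausible reading of “defined similarly to Equation 3” using the marginal counts):

import math
from collections import defaultdict

def pmi_vectors(counts):
    # counts: dict mapping (pattern, entity_pair) -> co-occurrence frequency f(p, e).
    # Returns dict mapping pattern -> {entity_pair: PMI weight}, following Equation (4).
    M = float(sum(counts.values()))
    f_pattern, f_pair = defaultdict(float), defaultdict(float)
    for (p, e), f in counts.items():
        f_pattern[p] += f           # Σ_j f(p, j)
        f_pair[e] += f              # Σ_i f(i, e)

    vectors = defaultdict(dict)
    for (p, e), f in counts.items():
        pmi = math.log((f / M) / ((f_pair[e] / M) * (f_pattern[p] / M)))
        m = min(f_pattern[p], f_pair[e])
        dis = (f / (f + 1.0)) * (m / (m + 1.0))   # discount, analogous to Eq. (3)
        value = pmi * dis
        if value > 0:               # negative PMI values are set to zero, i.e. dropped
            vectors[p][e] = value
    return vectors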

2.4 Dimensionality reduction for pattern vectors

The vector space defined in Section 2.3 is extremely high dimensional and sparse, encoding all entity pairs as separate dimensions. The space may be too sparse to represent the semantic meaning of relational patterns; for example, the entity pairs (“Franz Kafka”, “the Metamorphosis”) and (“Kafka”, “the Metamorphosis”) occupy two different dimensions even though “Franz Kafka” and “Kafka” refer to the same person. In addition, we need an associative array to compute the similarity of two sparse vectors.

In order to map the sparse and high dimensional space into a dense and compact space, we use Principal Component Analysis (PCA). In essence, PCA is a statistical procedure that finds principal components and scores of a matrix. PCA is closely related to Singular Value Decomposition (SVD), which factors an m × n matrix A as

A = UΣVᵀ.   (5)

Here, U is an m × m orthogonal matrix, V is an n × n orthogonal matrix, and Σ is an m × n diagonal matrix storing singular values. Each column of V corresponds to a principal component, and each column of UΣ corresponds to the scores of a principal component of A.

However, a full SVD requires heavy computation, while we only need the principal components corresponding to the top r singular values of A. This hinders the scalability of the system, which obtains a huge co-occurrence matrix between patterns and entity pairs. We solve this issue by using the randomized algorithm proposed by Halko et al. (2011).


Algorithm 1 Space Saving for each pattern
Input: N: counter size for each pattern
Input: D: a set of triples (p, e)
Output: cp,e: counters for each pattern p
1: for all (p, e) ∈ D do
2:   if Tp does not exist then
3:     Tp ← ∅
4:   end if
5:   if e ∈ Tp then
6:     cp,e ← cp,e + 1
7:   else if |Tp| < N then
8:     Tp ← Tp ∪ {e}
9:     cp,e ← 1
10:  else
11:    i ← argmin_{i∈Tp} cp,i
12:    cp,e ← cp,i + 1
13:    Tp ← (Tp ∪ {e}) \ {i}
14:  end if
15: end for

The goal of this algorithm is to find an r × n matrix B storing the compressed information of the rows of A. We first draw an n × r Gaussian random matrix Ω. Next, we derive the m × r matrix Y = AΩ. We then construct an m × r matrix Q whose columns form an orthonormal basis for the range of Y. Here, QQᵀA ≈ A is satisfied. Finally, we obtain the matrix B = QᵀA, in which Qᵀ compresses the rows of A.

We compute principal component scores for r dimensions by applying SVD to B. The computation is easy because r ≪ m. In this study, we used redsvd3, an implementation of Halko et al. (2011). We represent the meaning of a pattern with the scores of the r principal components.
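The following NumPy sketch illustrates this randomized scheme (our illustration of Halko et al. (2011), not the redsvd code). A is the pattern-by-entity-pair co-occurrence or PMI matrix, assumed dense here for simplicity; mean-centering is omitted.

import numpy as np

def randomized_pca_scores(A, r, seed=0):
    # Approximate r-dimensional principal component scores of the rows of A:
    # project onto a random subspace, orthonormalize, then run a small SVD.
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Omega = rng.standard_normal((n, r))   # n x r Gaussian random matrix
    Y = A @ Omega                         # m x r sample of the range of A
    Q, _ = np.linalg.qr(Y)                # m x r orthonormal basis, Q Q^T A ≈ A
    B = Q.T @ A                           # r x n compressed matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub) * s                   # ≈ UΣ restricted to the top r components

A production version would operate on a sparse co-occurrence matrix, but the sequence of products is the same.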

    2.5 Improving the scalability to large data

As described previously, it may be difficult and inefficient to count the exact numbers of co-occurrences in a large amount of data. In this study, we explore two approaches: approximate counting (Section 2.5.1) and distributed representations of words (Section 2.5.2).

2.5.1 Approximate counting

We probably do not need exact counts of co-occurrences for representing pattern vectors, because only a small number of elements in a pattern vector greatly influence the similarity computation.

    3https://code.google.com/p/redsvd/wiki/English

In other words, it may be enough to find the top-k entity pairs with larger co-occurrence counts for each pattern and to ignore other entity pairs with smaller counts. The task of finding the top-k frequent items has been studied extensively in the literature on approximate counting algorithms.

We employ Space Saving (Metwally et al., 2005), which is the most efficient algorithm for obtaining the top-k frequent items. Algorithm 1 outlines the Space Saving algorithm adapted for counting frequencies of co-occurrences. The Space Saving algorithm maintains at most N counters of co-occurrences for each pattern p.

For each triple (p, e), where p and e present a pattern and an entity pair, respectively, the algorithm checks whether a co-occurrence counter for the triple (p, e) is available. If it is available (Line 5), we increment the counter as usual (Line 6). If it is unavailable but the number of counters kept for the pattern is less than N (Line 7), we initialize the counter to one (Lines 8 and 9). If the counter is unavailable and the pattern already maintains N counters, the algorithm removes the counter cp,i with the least value and creates a new counter for the entity pair e with the approximated count cp,i + 1 (Lines 11–13).

This algorithm has the nice property that the error of the frequency count of a frequent triple is within a range specified by the number of counters N. In addition, Metwally et al. (2005) describe a data structure for efficiently finding the least frequent item in Line 11. We can obtain the frequency counts of the top-k frequent triples if we set the number of counters N much larger than k. In this way, we find the frequency counts f(p, e) of the top-k frequent triples approximately, and assume the frequency counts of all other triples to be zero.
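A compact Python rendering of Algorithm 1 follows (an illustrative sketch; a production version would use the stream-summary structure of Metwally et al. (2005) to find the minimum-valued counter in constant time rather than scanning):

def space_saving_counts(triples, N):
    # Approximate per-pattern co-occurrence counts with at most N counters per pattern.
    # triples: iterable of (pattern, entity_pair) mentions.
    counters = {}                            # pattern -> {entity_pair: approximate count}
    for p, e in triples:
        c = counters.setdefault(p, {})
        if e in c:
            c[e] += 1                        # exact increment (Line 6)
        elif len(c) < N:
            c[e] = 1                         # new counter (Lines 8-9)
        else:
            victim = min(c, key=c.get)       # least-valued counter (Line 11)
            c[e] = c.pop(victim) + 1         # replace it, over-estimating e (Lines 12-13)
    return counters

With N = 10,240, as used in Section 3.2, each pattern keeps at most 10,240 candidate entity pairs, from which the top 5,120 are taken as features.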

2.5.2 Building pattern vectors from word vectors

A number of NLP researchers have explored approaches to representing the meaning of a word with fixed-length vectors (Bengio et al., 2003; Mikolov et al., 2013). In particular, word2vec4, an implementation of Mikolov et al. (2013), received much attention in the NLP community.

    4https://code.google.com/p/word2vec/


Switching our attention to the pattern feature vector, our goal is to express the semantic meaning of a relational pattern with a distribution of entity pairs. Here, we explore the use of low-dimensional word vectors learned by word2vec from the large corpus: the meaning of a pattern is represented by the distribution of entity vectors. Thus, we obtain the vector representation of a relational pattern p as

p = Σe∈E f(p, e) [ve0 ; ve1].   (6)

Here, ve0 denotes the vector for one entity e0 in the entity pair e, ve1 denotes the vector for the other entity e1 in the pair, and [ve0 ; ve1] is their concatenation.
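A small sketch of Equation 6 (our illustration; the dictionary names and the embedding lookup are hypothetical), concatenating the frequency-weighted sums of the two entities' word vectors:

import numpy as np

def pattern_vector(pattern, cooccurrence, word_vec, dim):
    # Equation (6): concatenate the f(p, e)-weighted sums of the vectors of the
    # first and second entity of every pair e co-occurring with the pattern.
    left = np.zeros(dim)
    right = np.zeros(dim)
    for (e0, e1), freq in cooccurrence[pattern].items():
        left += freq * word_vec[e0]       # v_{e0}, weighted by f(p, e)
        right += freq * word_vec[e1]      # v_{e1}, weighted by f(p, e)
    return np.concatenate([left, right])  # 2*dim-dimensional pattern vector

The PMI-weighted variant (EXACT-PMI+WORD2VEC in Section 3.2) would replace freq with the PMI weight of Equation 4.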

    3 Experiments

3.1 Data

For our experimental corpus, we collected 15 billion Japanese sentences by crawling web pages. To remove noise such as spam and non-Japanese sentences, we applied a filter that checks the length of a sentence, determines whether the sentence contains specific Japanese characters (hiragana), and checks the number of symbols. As a result of the filtering, we obtained 6.3 billion sentences. We then parsed these sentences using CaboCha5, a Japanese dependency parser. In preprocessing, we extracted 1 million entities, 1.1 million entity pairs, and 0.7 million patterns. Finally, we extracted about 1.5 billion triples from the corpus.

We then manually checked some of the pattern pairs extracted from Wikipedia that were also contained within the 6.3 billion sentence corpus. Specifically, we first extracted frequent patterns from some domains in Wikipedia in order to obtain patterns representing a specific relation; we selected patterns referring to an illness, an author, and an architecture as target domains. Next, we gathered patterns sharing many entity pairs, because these patterns may represent the same relation. We obtained 527 patterns and 4,531 pattern pairs. Four annotators classified these pairs as expressing the same relation or not. We randomly sampled 90 pairs and found an average Cohen's kappa of 0.63 between two annotators. Each of the 4,531 pairs was annotated by two or more annotators; therefore, if two or more annotators labeled a pair with the same relation, we regarded the pair as expressing the same relation.

    5https://code.google.com/p/cabocha/

Finally, we acquired 720 pairs expressing the same relation. We applied each method to the 4,531 pairs and identified the pattern pairs whose similarity was higher than a threshold. We then investigated whether each identified pair was included in the set of same-relation pairs.
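To make the evaluation protocol concrete, the following small sketch (ours, not part of the paper) computes precision and recall at a given similarity threshold; the curves in Figure 3 are obtained by varying the threshold:

def precision_recall(similarities, gold_pairs, threshold):
    # similarities: dict mapping a pattern pair to its cosine similarity.
    # gold_pairs: set of pattern pairs annotated as expressing the same relation.
    predicted = {pair for pair, sim in similarities.items() if sim > threshold}
    true_positives = len(predicted & gold_pairs)
    precision = true_positives / len(predicted) if predicted else 1.0  # convention when nothing is predicted
    recall = true_positives / len(gold_pairs)
    return precision, recall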

    3.2 Experimental settings

We evaluated the quality of the pattern vectors built by the proposed approaches on the similarity calculation. Concretely, we investigated the impact on accuracy and computation time. For the evaluation, we computed the cosine similarity based on the feature vectors obtained by each method. We compared the following methods.

Exact counting (baseline): We counted the co-occurrence frequency between an entity pair and a pattern in the triples using a machine with 256GB of memory (EXACT-FREQ). In addition, we calculated the PMI defined in Equation 4 based on the co-occurrence frequency (EXACT-PMI).

Exact counting + PCA: Using PCA, we converted EXACT-FREQ and EXACT-PMI into the fixed-dimensional vectors EXACT-FREQ+PCA and EXACT-PMI+PCA. We set the number of dimensions to 1,024 based on comparisons6 between 256, 512, 1,024, and 2,048.

Approximate counting: We counted the co-occurrence frequency using the approximate counting explained in Section 2.5.1 (APPROX-FREQ). The counter size N was 10,240, and we used the top 5,120 frequent entity pairs as features. Moreover, we obtained PMI based on the result of approximate counting (APPROX-PMI).

Approximate counting + PCA: Using PCA, we converted APPROX-FREQ and APPROX-PMI into the fixed-dimensional vectors APPROX-FREQ+PCA and APPROX-PMI+PCA. As in Exact counting + PCA, we set the dimension of the feature vectors to 1,024.

Exact counting + word2vec: We obtained pattern feature vectors using the result of word2vec (EXACT-FREQ+WORD2VEC). Moreover, instead of calculating the feature vector from the co-occurrence frequency, we weighted the entity vectors with PMI (EXACT-PMI+WORD2VEC).

6The result of the 1,024-dimensional vectors is close to that of 2,048. We selected 1,024 because the smaller the number of feature dimensions, the faster we can calculate similarity.


Figure 3: Precision and recall of each method

We trained word2vec using all entities, verbs, and adverbs in the corpus on four AMD Opteron 6174 processors (12-core, 2.2GHz). It took about 130 hours to train word2vec (the number of threads was 42 and the window size was 5). As with PCA, we set the dimension of each word vector to 512; the dimension of the pattern vectors was thus 1,024 because of the concatenation.

3.3 Evaluating accuracy of each method

Figure 3 shows the precision and recall of each method. We plot each curve by varying the threshold value. We first focus on the difference between exact counting and approximate counting. The results of EXACT-FREQ and APPROX-FREQ were about the same. On the other hand, APPROX-PMI outperformed EXACT-PMI in most areas. These results demonstrate that approximate counting is sufficient for computing the co-occurrence frequency between a pattern and an entity pair. The results also suggest that approximate counting can even improve the performance.

Comparing the results with and without PCA, the figure shows that PCA does not always improve the performance. However, APPROX-PMI+PCA achieved the best performance in most areas. As for computation time, we verify in Section 3.4 that PCA is important in this respect.

In contrast, feature vectors based on word2vec worsened the performance compared with not only approximate counting but also exact counting. Although we expected that the vectors formed by word2vec would be appropriate for representing the meaning of a pattern, EXACT-FREQ+WORD2VEC was worse than all the other methods in Figure 3.

Method              10k    100k    all (664k)
EXACT-PMI           55m    121hr   8,499hr
APPROX-PMI          38m    110hr   7,441hr
APPROX-PMI+PCA      4m     7hr     785hr

Table 1: Similarity calculation time of each method with one thread

We suspect that this result was caused by separating entity pairs into individual entities during feature generation. Concretely, to obtain pattern feature vectors using word2vec, we concatenate the sum of the vectors assigned to one side of the entity pairs and the sum of the vectors assigned to the other side. Therefore, we may obtain pattern pairs with a high similarity when both patterns merely share the same entity type on one side. In other words, we need to encode not entities separately but entity pairs as a whole in a pattern feature vector.

From Figure 3, approximate counting is effective for the similarity calculation. In addition, PCA is useful for representing the meaning of a pattern in a compact space.

    3.4 Evaluating computation time

Table 1 shows the similarity calculation time of EXACT-PMI, APPROX-PMI and APPROX-PMI+PCA for processing 10k patterns, 100k patterns, and 664k patterns (the maximum). We executed a program written in C++ on four AMD Opteron 6174 processors (12-core, 2.2GHz) with 256GB of memory, and measured the calculation time using a single thread. Note that we split the calculation targets into groups and estimated the total computation time from the result of one group and the number of groups, because completing the calculation for 100k and 664k patterns requires a very long time. For 100k patterns, we split the calculation targets into 48 groups; for 664k patterns, into 4,096 groups.

APPROX-PMI executed more quickly than EXACT-PMI because APPROX-PMI decreased the number of non-zero features for each pattern. Nevertheless, APPROX-PMI still took a large amount of time for the similarity calculation. On the other hand, the computation time of APPROX-PMI+PCA was much smaller than that of EXACT-PMI and APPROX-PMI. It is thus necessary to reduce the number of dimensions, because APPROX-PMI


would take 7,441 hours (about a year) to calculate all pattern similarities with one thread. We conclude that it is necessary to prepare low-dimensional feature vectors, using dimension reduction or word vectors, in order to complete the similarity calculation in a realistic time.

    4 Related work

Unsupervised relation extraction poses three major challenges: extracting relation instances, representing the meaning of relational patterns, and computing similarity efficiently. A great number of studies have proposed methods for extracting relation instances (Wu and Weld, 2010; Fader et al., 2011; Akbik et al., 2012). We do not describe the details of these studies, which are out of the scope of this paper.

Previous studies explored various approaches to representing the meaning of relational patterns (Lin and Pantel, 2001; Yao et al., 2012; Mikolov et al., 2013). Lin and Pantel (2001) used co-occurrence statistics of PMI between an entity and a relational pattern. Even though the goal of their research is not relation extraction but paraphrase (inference rule) discovery, the work had a great impact on the research on unsupervised relation extraction. Yao et al. (2012) modeled sentence themes and document themes by using LDA, and represented the meaning of a pattern with the themes together with the co-occurrence statistics between patterns and entities. Recently, methods inspired by neural language modeling have received much attention for representation learning (Bengio et al., 2003; Mikolov et al., 2010; Mikolov et al., 2013). In this study, we compared raw frequency counts, PMI, and word embeddings (Mikolov et al., 2013).

In order to achieve efficient similarity computation, some researchers used entity types, for example, regarding “Franz Kafka” as an instance of the type writer (Min et al., 2012; Nakashole et al., 2012). Min et al. (2012) obtained entity types by clustering entities in a corpus. When computing the similarity values of patterns, they restricted the target pattern pairs to the ones sharing the same entity types. In this way, they reduced the number of similarity computations. Nakashole et al. (2012) also reduced the number of similarity computations by using entity types obtained from existing knowledge bases such as Yago (Suchanek et al., 2007) and Freebase (Bollacker et al., 2008). However, it is not so straightforward to determine the semantic type of an entity in advance because the semantic type may depend on the context. For example, “Woody Allen” can stand for an actor, a movie director, or a writer depending on the context. Therefore, we think it is also important to reduce the computation time for pattern similarities by simplifying the semantic representation of relational patterns.

The closest work to ours is probably Goyal et al. (2012). Their paper proposes to use count-min sketch (approximate counting) and approximate nearest neighbor search algorithms for NLP tasks. They applied these techniques to obtaining feature vectors, reducing the dimension of the vector space, and searching for items similar to a query. Even though they demonstrated the efficiency of the algorithms, they did not demonstrate the effectiveness of the approach in a specific NLP task.

    5 Conclusion

In this paper, we presented several approaches to unsupervised relation extraction on a large amount of data. In order to handle large data, we explored three approaches: dimension reduction, approximate counting, and vector representations of words. The experimental results showed that approximate frequency counting and dimension reduction not only speed up similarity computation but also improve the quality of pattern vectors.

The use of vector representations of words did not show an improvement. This is probably because we need to learn vector representations specialized for patterns that encode the distributions of entity pairs. A future direction of this research is to establish a method to learn the representations of patterns jointly with the representations of words. Furthermore, it would be interesting to incorporate the meaning of the constituent words of a pattern into the representation.

    Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers 26·5820 and 15H05318, and by the Japan Science and Technology Agency (JST).


References

Alan Akbik, Larysa Visengeriyeva, Priska Herger, Holmer Hemsen, and Alexander Löser. 2012. Unsupervised discovery of relations and discriminative extraction patterns. In Proceedings of the 24th International Conference on Computational Linguistics (COLING2012), pages 17–32.

Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI2007), pages 2670–2676.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD '08), pages 1247–1250.

Michael Collins, Sanjoy Dasgupta, and Robert E. Schapire. 2002. A generalization of principal components analysis to the exponential family. In Advances in Neural Information Processing Systems 14 (NIPS2002), pages 617–624.

Stijn De Saeger, Kentaro Torisawa, Jun'ichi Kazama, Kow Kuroda, and Masaki Murata. 2009. Large scale relation acquisition using class dependent patterns. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining (ICDM2009), pages 764–769.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP2011), pages 1535–1545.

Amit Goyal, Hal Daumé III, and Raul Guerra. 2012. Fast large-scale approximate graph construction for NLP. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL2012), pages 1069–1080.

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. 2004. Discovering relations among named entities from large corpora. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (ACL2004), pages 415–422.

Dekang Lin and Patrick Pantel. 2001. DIRT - discovery of inference rules from text. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2001), pages 323–328.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP2012), pages 523–534.

Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2005. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the 10th International Conference on Database Theory (ICDT2005), pages 398–412.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the Neural Information Processing Systems Conference and Workshops (NIPS2013), pages 3111–3119.

Bonan Min, Shuming Shi, Ralph Grishman, and Chin-Yew Lin. 2012. Ensemble semantics for large-scale unsupervised relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL2012), pages 1027–1037.

Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. 2012. PATTY: A taxonomy of relational patterns with semantic types. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL2012), pages 1135–1145.

Patrick Pantel and Marco Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (ACL2006), pages 113–120.

Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL2002), pages 41–47.


Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL2013), pages 74–84.

Ellen Riloff. 1996. Automatically generating extraction patterns from untagged text. In Proceedings of the 13th National Conference on Artificial Intelligence - Volume 2 (AAAI96), pages 1044–1049.

Benjamin Rosenfeld and Ronen Feldman. 2007. Clustering for unsupervised relation identification. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM2007), pages 411–418.

Yusuke Shinyama and Satoshi Sekine. 2006. Preemptive information extraction using unrestricted relation discovery. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL2006), pages 304–311.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web (WWW 2007), pages 697–706.

Idan Szpektor, Hristo Tanev, Ido Dagan, and Bonaventura Coppola. 2004. Scaling web-based acquisition of entailment relations. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP2004), pages 41–48.

Fei Wu and Daniel S. Weld. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL2010), pages 118–127.

Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. 2011. Structured relation discovery using generative models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP2011), pages 1456–1466.

Limin Yao, Sebastian Riedel, and Andrew McCallum. 2012. Unsupervised relation discovery with sense disambiguation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL2012), pages 712–720.
