
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Creating Knowledge Graphs using Distributional Semantic Models

HUGO SANDELIUS

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Creating Knowledge Graphs using Distributional Semantic Models

Skapande av kunskapsgrafer med hjälp av distributionella semantiska modeller

Master's Thesis in Computer Science

Author: Hugo Sandelius ([email protected])
Supervisor: Jussi Karlgren ([email protected])
Examiner: Viggo Kann ([email protected])
Principal: Gavagai
Supervisor at principal: Magnus Sahlgren ([email protected])

October 9, 2016


Abstract

This report investigates a method for creating knowledge graphs, a specific way of structuring information, using distributional semantic models. Two different algorithms for selecting graph edges and two different algorithms for labelling edges are tried, and variations of those are evaluated. We perform experiments comparing our knowledge graphs with existing, manually constructed knowledge graphs of high quality, with respect to graph structure and edge labels. We find that the algorithms usually produce graphs with a structure similar to that of manually constructed knowledge graphs, as long as the data set is sufficiently large and general, and that the similarity of edge labels to manually chosen edge labels varies widely depending on the input.


Sammanfattning

This report investigates the possibility of creating knowledge graphs, a way of structuring information, using distributional semantic models. Two different algorithms for selecting edges in the graph and two algorithms for labelling edges are evaluated, along with variations of these. We perform experiments comparing our knowledge graphs with manually constructed knowledge graphs, with respect to graph structure and edge labels. We find that the algorithms produce graphs with a structure similar to that of manually constructed knowledge graphs, as long as the input data is sufficiently large and general, and that the similarity of edge labels to manually chosen edge labels varies widely depending on the input.


Contents

1 Introduction

2 Background
  2.1 Knowledge Graphs
  2.2 Distributional Semantic Models

3 Related Work

4 Implementation
  4.1 Node selection
  4.2 Edge selection
  4.3 Edge labelling
    4.3.1 Node grouping
    4.3.2 kNN-list merging
    4.3.3 Vector multiplication
    4.3.4 Distance measuring

5 Evaluation
  5.1 Distance measures
  5.2 Knowledge graph
    5.2.1 GENIA ontology
    5.2.2 Wikidata

6 Results
  6.1 Distance measures
  6.2 GENIA
  6.3 Wikidata structure evaluation
  6.4 Wikidata labelling evaluation

7 Discussion

8 Future Work


1 Introduction

With the advent and growth of the internet, there is an abundance of easily available knowledge. However, most of this information is only available in an unstructured form, which is difficult for computers to use directly and hard for humans to get an overview of. For this reason, attempts have been made at building knowledge bases with a defined structure. With knowledge in a structured format, writing algorithms for extracting knowledge and drawing conclusions becomes easier, and the knowledge is easier to visualize for humans. One such structure which has become popular in the last few years is the knowledge graph, in which entities and the relations between them are encoded (Singhal, 2012). Many different ways of constructing knowledge graphs have been proposed, from manual large-scale crowdsourcing to automatic analysis of written text.

Knowledge graphs enable us to write useful applications, e.g. for information retrieval purposes, which benefits society at large as information becomes faster and easier to access. For example, one could build a query engine on top of a knowledge graph which could directly answer questions like “What is the capital of Uganda?” instead of users having to look through text documents to find the answer.

Distributional semantics is a research area within computational linguistics where the semantics of a word are said to be defined by the words commonly used together with it.

This report investigates a method for automatically building knowledge graphs using distributional semantic models. We want to know whether this method can build knowledge graphs of high quality using only plain text input, and how closely these graphs resemble graphs constructed manually by humans.

The report was made at Gavagai, a natural language processing company built on distributional semantic model technology.


2 Background

2.1 Knowledge Graphs

A knowledge graph is a graph structure where the nodes represent real-world entities, which can be physical, like Stockholm or Barack Obama, or abstract, like Weather or Happiness, and the edges represent relations between these entities. Each edge typically has a label describing the relation. For example, in a knowledge graph Barack Obama might have an edge with a born-in label leading to Honolulu, and Honolulu might in turn have an edge with a capital-of label leading to Hawaii, etc. More abstract relationships might also be modelled; for example, water might have an edge with a warmer-than label leading to ice. (Singhal, 2012)

Figure 1: Example of a knowledge graph
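To make the representation concrete, here is a minimal sketch of a knowledge graph stored as labelled (subject, relation, object) triples, mirroring the examples above. It is purely illustrative and not code from the thesis; the class and all names are hypothetical.

```python
from collections import defaultdict

# A knowledge graph stored as a set of (subject, relation, object) triples,
# with an adjacency index for quick lookup of a node's outgoing edges.
class KnowledgeGraph:
    def __init__(self):
        self.triples = set()
        self.out_edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))
        self.out_edges[subject].append((relation, obj))

kg = KnowledgeGraph()
kg.add("Barack Obama", "born-in", "Honolulu")
kg.add("Honolulu", "capital-of", "Hawaii")
kg.add("water", "warmer-than", "ice")

print(kg.out_edges["Barack Obama"])  # [('born-in', 'Honolulu')]
```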

2.2 Distributional Semantic Models

Distributional semantic models are linguistic models where the semantic similarity between linguistic items - words, noun phrases or documents, for example - is said to be based on their distributional properties in large amounts of text. They are based on the distributional hypothesis, which says that linguistic items with similar distributions have similar meanings (Turney and Pantel, 2010). In other words, two words are considered semantically similar if they occur in the same contexts. A context can be defined in different ways. Common contexts are documents, sentences, or an n-word window sliding through the text, where words occurring within n words of each other are considered to be in the same context. Typically, in distributional semantic models every linguistic item is modelled as a high-dimensional vector. The distance between these vectors, using some distance measure, is taken to be the semantic distance between the items. A common way of creating these vectors is by populating a matrix and taking either the rows or the columns of the matrix as vectors. There are different types of matrices, including term-document matrices, where every row represents a term and every column represents a document (Salton, Wong, and Yang, 1975; Deerwester et al., 1990), and co-occurrence matrices, where both rows and columns represent terms and the number at an index represents the number of co-occurrences between two terms in some context (Lund and Burgess, 1996).
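As an illustration of the co-occurrence matrix variant described above, the following sketch counts co-occurrences within a sliding window of n words and takes the rows of the resulting matrix as term vectors. The toy corpus and all names are hypothetical.

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab, n=2):
    """Count co-occurrences of words within n positions of each other."""
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - n), min(len(tokens), i + n + 1)):
            if i != j and word in index and tokens[j] in index:
                matrix[index[word], index[tokens[j]]] += 1
    return matrix

tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(tokens))
M = cooccurrence_matrix(tokens, vocab, n=2)
# Each row of M can now be taken as the distributional vector of a term.
```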


These matrices quickly grow very large when constructed from larger amounts of text, which means they are unwieldy to deal with in practice. For this reason, dimension-reduction techniques such as Singular Value Decomposition (Turney and Pantel, 2010) are used.

More recently, models where the vectors are trained using neural networks have been proposed (Mikolov, Chen, et al., 2013).

Random Indexing

The distributional semantic model used for the experiments in this report is the Random Indexing model (Sahlgren, 2005). In the Random Indexing model, vectors are created directly, instead of creating large matrices which need to be reduced in size. In Random Indexing, every linguistic item is represented by two vectors:

• An index vector, which is a vector containing mostly zeros, with just a few +1s and -1s in random positions. An index vector for a word is populated the first time that word is encountered, and never changes after that.

• A context vector, which represents the sum of all the contexts the particular term has occurred in. A context vector is initially all zeros, and is then changed as more occurrences of the corresponding word are found in the text. Every time a word a occurs together with a word b in some context, a's context vector is updated by adding b's index vector to it, and vice versa.

The dimensionality of these vectors is a model parameter and is not dependent on the number of terms; thus the word space grows only linearly with the number of unique terms, regardless of the size of the text corpus.
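The following is a minimal sketch of the Random Indexing procedure just described: each word receives a fixed sparse ternary index vector the first time it is seen, and its context vector accumulates the index vectors of the words it co-occurs with inside a sliding window. The dimensionality, number of non-zero elements and window size are hypothetical parameters, not the ones used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, NONZERO, WINDOW = 2000, 8, 2

index_vectors, context_vectors = {}, {}

def index_vector(word):
    # Created once per word: mostly zeros, a few random +1s and -1s.
    if word not in index_vectors:
        v = np.zeros(DIM)
        positions = rng.choice(DIM, NONZERO, replace=False)
        v[positions] = rng.choice([-1.0, 1.0], NONZERO)
        index_vectors[word] = v
        context_vectors[word] = np.zeros(DIM)
    return index_vectors[word]

def train(tokens):
    for i, word in enumerate(tokens):
        index_vector(word)
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if i != j:
                # a's context vector accumulates b's index vector, and vice versa.
                context_vectors[word] += index_vector(tokens[j])

train("drink hot coffee every morning drink cold water every night".split())
```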

Similarity measures

A number of different semantic similarity measures can be computed in a Random Indexing word space.

Paradigmatic similarity is high between words which can be substituted for one another in a sentence while keeping the sentence valid, such as “quick” and “fast”, or “eat” and “drink”. This can be measured by measuring the distance between two context vectors in the word space. The shorter the distance, the more paradigmatically similar the words are.

Syntagmatic similarity is high between words that often co-occur, but with different grammatical roles, and are not necessarily substitutable, e.g. “coffee” and “drink” or “sun” and “hot”. This can be found in a Random Indexing word space by looking at the distance between one word's context vector and another word's index vector. For example, we would expect “coffee”'s context vector to be closer to “drink”'s index vector than to “knee”'s index vector.

Topical similarity is high between words related to a shared topic. These can be found by giving each document, instead of each word, an index vector, adding a document's index vector to a word's context vector when the word occurs in the document, and then looking at the distance between context vectors as in paradigmatic similarity. (Turney and Pantel, 2010)

Different similarity measures between words exist; a common measure which has been found to perform well is the cosine distance (Robertson and Sparck Jones, 1988). In a geometrical interpretation, it corresponds to one minus the cosine of the angle between the two vectors.


3 Related Work

This section presents an overview of related work on creating knowledge graphs, especially in relation to word space models, as well as other related research.

Knowledge Graph construction

Most previous efforts in creating knowledge graphs have used either a collaborative human editing approach, or an approach based on lexical parsing of sentences.

One example of a manually constructed knowledge graph is Wikidata, a collaboratively-edited knowledge graph-like database under the Wikimedia foundation which the public is invited to expand (Wikidata, 2016). WordNet operates in a similar way, but focuses on linguistic knowledge about words, encoding word classes and word senses (Wordnet, 2016).

Agichtein and Gravano present “Snowball” (Agichtein and Gravano, 2000), a system for creating knowledge graphs from plain-text documents by parsing DIPRE patterns, a technique where hard-coded text patterns containing tuples are matched against large amounts of plain text to find relations. For example, the text string “<A>'s headquarters in <B>” might be provided to the system, along with the information that this string represents a “headquarter-in” relation between A and B, and the system then finds pairs of terms with a “headquarter-in” relation. A similar method is used by Etzioni et al. (2008) in their TextRunner software.

Embedding knowledge graphs into word space

A related problem which has received a lot of research attention is what could be called the opposite problem - embedding existing knowledge graphs into word space. One benefit of this is that existing algorithms for statistical learning are typically defined on feature vectors and not on graphs, and thus the information needs to be available in a word space format (Bordes, Weston, et al., 2011).

Although it does not directly touch upon knowledge graphs, Paccanaro and Hinton (Paccanaro and Hinton, 2001) have done perhaps the first research in this area. They try to solve the problem of predicting unobserved instances of relationships between concepts, given data consisting of concepts and relations among concepts. The proposed model represents concepts as n-dimensional vectors and relations as (n × n) matrices. If a relation modelled by the matrix R holds between two vectors A and B, then AR ≈ B, using ordinary matrix-vector multiplication.

Bordes et al. (Bordes, Weston, et al., 2011) propose a similar idea, applied to knowledge graphs, where each entity in the graph is modelled as a low-dimensional vector E, and every relation type is modelled as a different similarity measure operating on these vectors. These similarity measures are modelled by assigning, for every relation, a pair of matrices $(R_{lhs}, R_{rhs})$ and defining the similarity function as $S(E_i, E_j) = \|R_{lhs} E_i - R_{rhs} E_j\|_p$ using some p-norm. This is then modelled as a neural network, and the parameters $R_{lhs}$, $R_{rhs}$ and E are trained using stochastic gradient descent. The paper notes that the approach could potentially also be used for extracting knowledge from raw text.

They later expand on this idea with a simple method which they dub TransE (Bordes, Usunier, et al., 2013), where each entity is modelled as a low-dimensional vector, and each relation is also modelled as a low-dimensional vector. If (h, r, t) holds, that is, entity h has relation r to entity t, then for the corresponding vectors h + r ≈ t. The entity and relation vectors are then learned by using stochastic gradient descent to minimize an objective function.
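A sketch of the TransE scoring idea: a triple (h, r, t) is considered plausible when the translated head vector h + r lies close to the tail vector t. The embeddings below are random stand-ins for illustration; in the actual method they are learned with stochastic gradient descent over a margin-based objective.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
entity = {e: rng.normal(size=dim) for e in ["Paris", "France", "Tokyo", "Japan"]}
relation = {"capital-of": rng.normal(size=dim)}

def transe_score(h, r, t, norm=1):
    # Lower score = more plausible triple, since h + r should be close to t.
    return np.linalg.norm(entity[h] + relation[r] - entity[t], ord=norm)

print(transe_score("Paris", "capital-of", "France"))
```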

Wang et al. (Wang et al., 2014b) argue that while TransE works well for modelling one-to-one relations, it falls short when modelling one-to-many, many-to-one, many-to-many or reflexive relations. For this reason they expand on TransE with something they call TransH, where every relation is modelled by a hyperplane and a translation vector in that hyperplane. If a relation with the hyperplane R′ and the translation vector d′ holds between two entity vectors h and t, then the projections of h and t onto R′ should be connected by d′, or at least be very close to it.

He et al. (He et al., 2015) address a problem with these models: they assume that all relations are definitely true, leaving no room for uncertainty in relations. A model called KG2E is proposed, where every entity and relation is represented by a Gaussian distribution, whose mean denotes the position of the vector and whose covariance matrix represents its uncertainty.

Wang et al. (Wang et al., 2014a) propose an idea for jointly embedding information both from an existing knowledge graph and from unstructured text into a word space with encoded relations. The presented idea is to use a model composed of three components: a knowledge model, based on the knowledge graph, a text model, based on the unstructured text, and an alignment model, aligning the knowledge model and the text model. The goal is then to optimize a combined objective over the three component models. By jointly embedding plain text and a knowledge graph, the hope is to be able to complete the knowledge graph by finding relations missing in the knowledge graph but supported by the plain text. A similar idea is proposed in (Fan et al., 2015), but there a single model is used instead, based on both the knowledge graph and the unstructured text.

Word Sense Induction

Word sense induction is the problem of trying to deduce different senses, or meanings, of words. For example, the word “suit” has both a “clothes” sense and a “legal” sense. There has been some research into solving the word sense induction problem using graphs. This could be relevant to the research question as it also concerns creating graphs from word vectors, although the end goal is different.

Dorow and Widdows (Dorow and Widdows, 2003) propose a graph clustering method for word sense induction. Initially, an edge is created between every pair of words co-occurring enough times in a specific lexical context. Then, the Markov Cluster Algorithm (S. Van Dongen, 2000) is used to create graph clusters, where each cluster corresponds to words (mostly co-hyponyms) related to one word sense.

Gyllensten and Sahlgren (Gyllensten and Sahlgren, 2015) propose a method for creating graphs describing the local semantic neighborhood of a word. The idea is to add an edge between two terms A and B if there is no term vector anywhere between the two corresponding vectors of A and B. A vector V is defined as being between A and B if it is closer to both A and B than A and B are to each other. The time complexity of the proposed algorithm is $O(|V|^3)$ in general, but if the graph is restricted to only include the k nearest neighbors, we get an algorithm taking only $O(k^2 + |V| \lg k)$ time. The report shows results indicating that this can be used with good effect for word sense induction, as terms related to one sense of a term are connected over one edge, and terms related to another sense over another edge.

Taxonomy construction

There has also been some research specifically into building taxonomies from word vectors. Taxonomies can be seen as a limited form of knowledge graph, encoding only is-a relations.

Lenci and Benotto (Lenci and Benotto, 2012) propose a method for finding hypernyms in distributional semantic spaces by using the Distributional Inclusion Hypothesis, which says that u is a semantically narrower term than v if u appears in a subset of the contexts that v appears in.

Finding relations

Mikolov et al. (Mikolov, Chen, et al., 2013) propose two neural-network based models for vector representations of words and show that semantic relation equivalences, called analogies, can be extracted from these vector representations by using simple vector algebra. For example, using vector subtraction, the vectors France − Paris, Italy − Rome and Japan − Tokyo are all close to each other.
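The analogy property can be probed with plain vector arithmetic, as in this sketch: the offset between two related words is added to a third word and the nearest remaining vector is retrieved. The helper names are hypothetical and the vectors are assumed to come from some trained model.

```python
import numpy as np

def most_similar(query, vectors, exclude):
    # Return the word whose vector has the highest cosine similarity to the query.
    best, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

def analogy(a, b, c, vectors):
    # "a is to b as c is to ?": b - a + c should lie close to the answer.
    return most_similar(vectors[b] - vectors[a] + vectors[c], vectors, {a, b, c})

# analogy("Paris", "France", "Tokyo", word_vectors) would be expected to return "Japan".
```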

Pennington et al. (Pennington, Socher, and Manning, 2014) propose another model, explicitly designed to have a strong analogy property, using a weighted least squares regression model. The analogy capacity is evaluated by comparing to human-annotated analogy examples, e.g. “Athens is to Greece as Berlin is to _?”, and the results show that it finds the right analogy in a high percentage of cases.


4 Implementation

This section will describe the final implemented algorithms and their variations, and discuss some of the choices made. The algorithms take a word space as input and produce a knowledge graph as output, and thus assume that some distributional semantic model has previously been applied to plain text to produce vectors.

The problem of constructing a knowledge graph can in general be divided into three sub-problems:

• Node selection, selecting nodes for inclusion in the graph.

• Edge selection, selecting which nodes should have edges, representing relations, between them.

• Edge labelling, putting a label on the edge between two nodes.

The work in this report primarily deals with the two latter problems. The algorithms were developed with Random Indexing word spaces in mind and tested exclusively on this model, although parts of the algorithms can also work with other word space models.

4.1 Node selection

The node selection in our algorithm is done in a simple fashion. Each node simply corresponds to one vector in the word space. To find out which vectors should be included in the word space, the word2phrase tool (Mikolov, Sutskever, et al., 2013) is run on the input corpus to find commonly occurring n-gram phrases in the text, e.g. New York or nordic countries. These n-gram phrases, as well as all the words not considered to be part of any phrase, are taken as nodes. word2phrase uses a simple approach where each bi-gram is given a score:

$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \cdot \mathrm{count}(w_j)}$$

where δ is a coefficient used to prevent forming phrases consisting of very infrequent words. Bi-grams with a score above a certain threshold are formed into phrases. This algorithm is run on the corpus several times in order to form n-grams longer than two words.
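A sketch of the bi-gram scoring pass described above; the discount δ and the threshold are hypothetical, illustrative values, and the real word2phrase tool repeats this pass over the corpus to build longer n-grams.

```python
from collections import Counter

def find_phrases(tokens, delta=5, threshold=0.0001):
    """One scoring pass: mark bi-grams whose score exceeds the threshold as phrases."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    phrases = set()
    for (w1, w2), count in bigram.items():
        score = (count - delta) / (unigram[w1] * unigram[w2])
        if score > threshold:
            phrases.add((w1, w2))  # e.g. ("new", "york") becomes the node "new_york"
    return phrases
```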

4.2 Edge selection

Two different methods of edge selection were tried.

k-Nearest Neighbor

k-Nearest Neighbor is a simple edge selection algorithm, where an edge is created from every node to its k nearest nodes, using the cosine distance. k is a constant which can be varied to achieve different results.
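A sketch of k-Nearest Neighbor edge selection over a dictionary of word vectors, using the cosine distance; the helper names are hypothetical and the brute-force search is only meant to illustrate the definition.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def knn_edges(vectors, k=5):
    """Create an edge from every node to its k nearest nodes."""
    edges = set()
    for word, vec in vectors.items():
        dists = sorted(
            (cosine_distance(vec, other), neighbour)
            for neighbour, other in vectors.items() if neighbour != word
        )
        for _, neighbour in dists[:k]:
            edges.add((word, neighbour))
    return edges
```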

Relative Neighborhood Graph

In this algorithm, inspired by Gyllensten and Sahlgren (Gyllensten and Sahlgren, 2015), we begin, as in the k-Nearest Neighbor algorithm, by adding edges to the k closest nodes. After this, we remove all edges from our root node r to nodes n where any node exists in between r and n. A node a is defined as being in between nodes b and c if the distance between a and b, and the distance between a and c, are both shorter than the distance between b and c. We use the cosine distance between vectors for distance measuring.
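A sketch of the relative-neighborhood pruning applied on top of the kNN edges, reusing the hypothetical helpers from the previous sketch: an edge (r, n) is dropped whenever some third node lies in between r and n in the sense defined above.

```python
def rng_edges(vectors, k=5):
    edges = knn_edges(vectors, k)
    pruned = set()
    for r, n in edges:
        d_rn = cosine_distance(vectors[r], vectors[n])
        # Keep the edge only if no node a is closer to both r and n
        # than r and n are to each other.
        blocked = any(
            a not in (r, n)
            and cosine_distance(vectors[a], vectors[r]) < d_rn
            and cosine_distance(vectors[a], vectors[n]) < d_rn
            for a in vectors
        )
        if not blocked:
            pruned.add((r, n))
    return pruned
```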

4.3 Edge labelling

In our constructed graph, we can see which terms are related based on their distance in the graph. However, it would be interesting to also see how terms are related. For this reason, we try to develop an algorithm for labelling relations. Two different labelling algorithms were tried and evaluated. What they both have in common is that they are based on word co-occurrence, more precisely the hypothesis that words frequently occurring near both word A and word B are often a good fit for a label on the relation between A and B.

4.3.1 Node grouping

Instead of looking only at the local relations between individual nodes, one can look at a whole inter-related cluster of nodes, and try to find a label for the relation between all of these nodes. This label can then be put on all the local relations within the cluster. E.g. instead of trying to find a label for the relation between Volvo and BMW, we try to find the common relation between Volvo, BMW, Skoda, Mercedes and Toyota. It is hypothesized that this will reduce noise and lead to better relational labels. However, for the labelling algorithm to work, we hypothesize that the terms in the group must all be closely related. The detection of groups is done by applying the Girvan-Newman community detection algorithm to the graph.

Girvan-Newman

Girvan-Newman is a community detection algorithm for graphs, i.e. given a graph, it finds a number of internally densely connected parts of the graph called communities. Girvan-Newman works by progressively removing edges from the network, and taking the remaining connected components to be communities.

Girvan-Newman uses the concept of the edge betweenness of an edge. This is defined as the number of shortest paths between any pairs of nodes that run through that edge. If there is more than one shortest path between two nodes, each path is given equal weight such that the total weight of all the paths is equal to unity. Edges with high edge betweenness are considered to be very central edges in the graph.

The Girvan-Newman algorithm follows four steps:

1. The betweenness of all edges in the graph is calculated.

2. The edge with the highest betweenness is removed.

3. The betweenness of all edges is recalculated (disregarding the just removed edge).

4. Steps 2 and 3 are repeated for the desired number of iterations.

The Girvan-Newman algorithm begins by considering the entire graph one community, and then splits it into smaller communities, each of which is in turn split into smaller communities. Thus, the longer the algorithm runs, the more communities are detected. In our usage of the algorithm, we decide beforehand on a specific number of communities, and let the algorithm run until this number of communities is detected. (Girvan and Newman, 2002)
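Community detection as used here can be reproduced with the girvan_newman implementation in networkx, which yields successively finer partitions; the sketch below stops once the desired number of communities has been reached. The example graph is hypothetical.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

def communities(graph, target):
    """Run Girvan-Newman until the partition has `target` communities."""
    for partition in girvan_newman(graph):
        if len(partition) >= target:
            return [set(c) for c in partition]
    return [set(graph.nodes)]

G = nx.Graph()
G.add_edges_from([("volvo", "bmw"), ("bmw", "toyota"), ("volvo", "toyota"),
                  ("apple", "banana"), ("banana", "pear"), ("apple", "pear"),
                  ("toyota", "apple")])  # one weak edge between the two clusters
print(communities(G, 2))
```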

4.3.2 kNN-list merging

When finding a label for the relation between a group of terms, this algorithm looks at the lists of nearest syntagmatic neighbors of each term, i.e. the most syntagmatically similar words, and tries to find the best label by merging these lists. For simplicity's sake, in the remainder of this section we assume we want to find a label for a relation between only two words, but the method extends to relations between more words in the obvious way. A simple labelling function is to find the label L between the terms a and b through:

$$L(a, b) = \arg\max_{l \in W} \big(\mathrm{cooc}(l, a) + \mathrm{cooc}(l, b)\big)$$

where W is the set of all terms, and cooc(a, b) is the number of co-occurrences of a and b within the same context in the corpus. However, there is a problem with this function. More common words will also have higher co-occurrence counts, so if one word in a word pair is much more common, that word's neighbors will dominate the labelling. For this reason, we want to somehow normalize the co-occurrences of both words. One simple way of doing that is to instead look at the rank of the co-occurring word among all co-occurring words, and optimize for the minimum rank. The new labelling function is defined as:

$$L(a, b) = \arg\min_{l \in W} \big(\mathrm{cooc\_rank}(l, a) + \mathrm{cooc\_rank}(l, b)\big)$$

where cooc_rank(l, x) = n if l is the nth most common co-occurrence of x. Finding the co-occurrence counts for a certain word requires reading through the entire corpus; alternatively, a co-occurrence matrix may be constructed by going through the corpus once, but the size of this matrix is quadratic in the number of unique terms. Both these methods are unfeasible in practice on larger corpora, so we instead try to approximate the co-occurrence counts from the word space. The intuitive way of approximating a co-occurrence count between a term a and a term b is to look at the syntagmatic distance, i.e. the distance between a's context vector and b's index vector. We re-formulate the labelling function again:

$$L(a, b) = \arg\min_{l \in W} \big(\mathrm{dist\_rank}(l, a) + \mathrm{dist\_rank}(l, b)\big)$$

where W, as before, is the set of all terms, and dist_rank(a, b) = n if a is the nth closest neighbor of b using some distance function between the context vector representing a and the index vector representing b.

It was noted that many of the words most commonly co-occurring with a word are also paradigmatic neighbors of that word, leading to many labels being paradigmatic neighbors of one or both of the labelled nodes. Reviewing candidates for labels manually, we found that close paradigmatic neighbors are usually a bad description of the relation between words; for example, the label on the relation between “Sweden” and “Norway” would often be “Denmark” instead of something more descriptive like “country” or “Scandinavia”. For this reason, words too paradigmatically related to either of the labelled words are disallowed as labels. Our final labelling function is thus as follows:

$$L(a, b) = \arg\min_{l \in W'} \big(\mathrm{dist\_rank}(l, a) + \mathrm{dist\_rank}(l, b)\big)$$

where $W' = \{w \mid w \in W \wedge \mathrm{pdist}(w, a) > t \wedge \mathrm{pdist}(w, b) > t\}$, pdist is some paradigmatic distance function between context vectors, and t is a constant representing the minimum allowed distance. pdist in our experiments is the cosine distance, but other distances are also possible.
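A sketch of this final labelling function: candidate labels are ranked by their distance rank with respect to each word in the group, and candidates that are paradigmatically too close to any of the words are filtered out. The distance functions are passed in as parameters and the threshold value is hypothetical; in the thesis, dist_rank is approximated with one of the context-vector/index-vector measures described in section 4.3.4.

```python
def dist_rank(label, word, syn_dist, vocabulary):
    # Rank of `label` among all terms, ordered by syntagmatic distance to `word`.
    ranked = sorted(vocabulary, key=lambda term: syn_dist(term, word))
    return ranked.index(label)

def best_label(words, vocabulary, syn_dist, para_dist, t=0.4):
    """Label a group of related words with the candidate minimising the summed rank."""
    candidates = [
        label for label in vocabulary
        if label not in words and all(para_dist(label, w) > t for w in words)
    ]
    return min(candidates,
               key=lambda label: sum(dist_rank(label, w, syn_dist, vocabulary)
                                     for w in words))
```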

4.3.3 Vector multiplication

Merging kNN lists is problematic in that a nearest-neighbor search needs to be done for every term to be labelled, which takes a lot of time, as nearest-neighbor search in the high-dimensional case is O(n), where n is the number of terms in the word space (Roussopoulos, Kelley, and Vincent, 1995). The nearest neighbors of a single term can also have a large impact on the labelling, which makes the labelling brittle and noisy. A different approach is tried which solves the first problem and is hypothesized to solve the second. In this approach, the context vectors of the terms to be labelled are multiplied in a specific fashion, and we then take the nearest index vector to the vector resulting from the multiplication as the label.

A set of vectors V is multiplied as follows:

1. V is split into Vpos, where all non-positive elements in every vector in V are set to 1, and Vneg, where all non-negative elements in every vector in V are set to -1.

2. All vectors in Vpos are multiplied together into vpos using element-wise vector multiplication, and all vectors in Vneg are multiplied together into vneg using element-wise vector multiplication. We then sum vpos and vneg using element-wise vector addition, and get our result vector.

The reason for this rather contrived multiplication, instead of simply adding the vectors together, is that we want to strengthen elements which are high-value in both vectors, and which we see as strong signals, and conversely filter out weaker noise. Multiplication is intuitively a good way of doing exactly this. However, if we simply take the product of the vectors, the results we get when multiplying negative elements with positive elements do not make sense given how Random Indexing populates the vectors. For example, consider a case where the context vector for Stockholm has very high positive values in positions 13, 29 and 88, and the context vector for Gothenburg has very high negative values in the same positions. If we multiply Stockholm and Gothenburg together with element-wise multiplication, we will get very high negative values in these positions in our result vector r. Index vectors close to r will be index vectors with negative values in these positions, which are likely to have co-occurred with Gothenburg, but unlikely to have co-occurred with Stockholm. This is clearly not what we want.

When we have our result vector, the nearest index vectors to this result vector are the candidate labels, with closer index vectors being stronger candidates.
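A sketch of the sign-separated multiplication described above, assuming numpy context vectors; the split uses the multiplicative identities 1 and -1 so that, within each part, only elements carrying a consistent strong signal end up large in the product.

```python
import numpy as np

def combine(vectors):
    """Combine context vectors by sign-separated element-wise multiplication."""
    pos = [np.where(v > 0, v, 1.0) for v in vectors]   # non-positive elements -> 1
    neg = [np.where(v < 0, v, -1.0) for v in vectors]  # non-negative elements -> -1
    v_pos = np.prod(pos, axis=0)
    v_neg = np.prod(neg, axis=0)
    return v_pos + v_neg

# The label candidates are then the words whose index vectors lie closest
# to combine([context_vector_a, context_vector_b]).
```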

4.3.4 Distance measuring

These labelling functions, in particular the kNN-list merging, rely heavily on a distance measure between index vectors and context vectors that approximates co-occurrence counts well. For this reason, a few different distance measures were tried and evaluated.


cosine

The cosine distance between two vectors is defined as

$$\mathrm{cosdist}(a, b) = 1 - \frac{a \cdot b}{\|a\| \cdot \|b\|}$$

Geometrically, this is equivalent to 1 minus the cosine of the angle between a and b. (Robertson and Sparck Jones, 1988)

andcos

This distance measure uses a logical and operation on vectors, defined as $\mathrm{and}(a, b) = \{a_i \mid a_i \neq 0 \wedge b_i \neq 0\}$, where $x_i$ is the i:th element of the vector x. The distance is then defined as $\mathrm{andcos}(a, b) = \mathrm{cosdist}(\mathrm{and}(a, b), b)$.

The intuition behind this is that only those elements which are non-zero in the index vector are changed in the context vector when two words co-occur, so we should only look at those elements.

andnorm

In the andnorm distance measure, the euclidean norm of the logical and between the vectors is taken; formally, $\mathrm{andnorm}(a, b) = \|\mathrm{and}(a, b)\|_2$.

This is an evolution of the idea behind andcos, noting that although we are looking to approximate co-occurrences, the cosine distance ignores the length of the vector, so here the vector length is used directly.

andmedian

This distance measure uses an element-wise absolute value operation on vectors, defined as $\mathrm{abs}(a) = \{|a_i| \mid a_i \in a\}$, where $a_i$ is the i:th element of the vector a. The distance is then defined as $\mathrm{andmedian}(a, b) = \mathrm{median}(\mathrm{abs}(\mathrm{and}(a, b)))$.

This is based on the empirical observation that andnorm often performs poorly because one or two elements are much larger or smaller than they would have been if there had been no overlapping dimensions between index vectors. Intuitively, the median element is the one least likely to be changed due to noise from other index vectors with overlapping dimensions.

andsum

andsum takes the sum of the absolute values of the logical and between the index vector and context vector, defined as $\mathrm{andsum}(a, b) = \mathrm{sum}(\mathrm{abs}(\mathrm{and}(a, b)))$.

Filtering by sign

It was noted that when comparing the context vector of a word with the index vectors of that word's most often co-occurring words in a corpus, for the vast majority of the words all non-zero elements in the index vector had the same sign as the corresponding context vector element. For this reason, we try filtering out words where this is not the case. When filtering by sign, if any non-zero element in the index vector has a different sign than the corresponding context vector element, that word is considered to be at the maximum distance.
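The measures above can be summarised in a few lines of numpy, as sketched below with ctx a context vector and idx an index vector; note that, unlike the cosine distance, the and* measures grow with co-occurrence strength. The sign filter returns infinity, i.e. maximum distance, when any element that is non-zero in the index vector disagrees in sign with the context vector.

```python
import numpy as np

def cos_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def overlap(ctx, idx):
    # Positions where both the context vector and the index vector are non-zero.
    return (ctx != 0) & (idx != 0)

def andcos(ctx, idx):
    kept = np.where(overlap(ctx, idx), ctx, 0.0)  # and(a, b), zero elsewhere
    return cos_dist(kept, idx)

def andnorm(ctx, idx):
    return np.linalg.norm(ctx[overlap(ctx, idx)])

def andmedian(ctx, idx):
    return np.median(np.abs(ctx[overlap(ctx, idx)]))

def andsum(ctx, idx):
    return np.abs(ctx[overlap(ctx, idx)]).sum()

def sign_filtered(measure, ctx, idx):
    # Any sign disagreement on a non-zero index position -> maximum distance.
    mask = idx != 0
    if np.any(np.sign(ctx[mask]) != np.sign(idx[mask])):
        return np.inf
    return measure(ctx, idx)
```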


5 Evaluation

This section will describe the procedure used for evaluating the implemented algorithms.

5.1 Distance measures

Having a distance measure between context vectors and index vectors which accurately approximates actual co-occurrence counts is vital for the performance of our labelling algorithms. We compare a few different distance measures between context vectors and index vectors with regard to how well they approximate actual co-occurrence counts. We go through 50 randomly selected words and, for each of these words, find the 25 most commonly co-occurring words within a 5-word sliding text window in some corpus, which is taken as the gold standard. A Random Indexing word space is then constructed from this corpus. We then find the 25 nearest index vectors to the context vector corresponding to each of our 50 words, using the different distance measures in this word space. The overlap between this result and the gold standard is calculated, and is normalized by dividing by the number of words in the gold standard. We also want to test how the approximation performance varies with the dimensionality and the number of words in the word space, so a few different word spaces with different dimensionalities and numbers of words are compared. All word spaces are constructed from a 2010 plain text dump of the English Wikipedia. To create word spaces of different sizes, different occurrence thresholds for words are used, i.e. different values n, where n is the minimum number of times a word must occur to be included in the word space.

5.2 Knowledge graph

The knowledge graph construction algorithm is evaluated by comparing it to existing knowledge graphs and ontologies. The experiments are done on two different datasets: one smaller-scale dataset with a strong specific focus, and one larger, more general dataset.

5.2.1 GENIA ontology

Ontologies are often used in medicine for organizing the enormous amounts of knowledge in the area. Since the amount of knowledge is so vast, and the work needed to create such ontologies manually is substantial, many attempts have been made at creating them automatically. For the purpose of evaluating automatic construction of such ontologies, the GENIA corpus was created (Kim et al., 2003). The GENIA corpus consists of 1,999 abstracts from the medical publication database MEDLINE. The corpus has an attached ontology where a large number of medical terms and concepts from the corpus are hierarchically arranged in labelled categories. We want to see if we, using our algorithm, can create a knowledge graph from this corpus which resembles the manually constructed reference ontology.

We construct a Random Indexing word space from the corpus, and a knowledge graph is constructed from this word space. We then evaluate the quality of the knowledge graph's structure by comparing it to the gold standard ontology.


The GENIA corpus contains the text divided by abstract, with a sentence giving a short description of each abstract, and the bibliographic reference of the article corresponding to each abstract. The corpus, in addition to the plain text from the abstracts, contains some metadata. N-grams representing leaf nodes in the ontology are tagged with their lexical standard form, which consists of spaces being replaced by underscores and the words being lemmatized, and are also tagged with their parent node in the ontology. We preprocess the corpus by replacing each tagged n-gram with its lexical standard form, and all information besides the abstract text is dropped, so that the corpus after pre-processing consists only of plain text.

We construct a Random Indexing word space from this pre-processed version of the corpus. All vectors in the word space are then removed except those corresponding to leaf nodes in the ontology. This is done since we are in this experiment mostly interested in the structural similarity of our knowledge graph to the gold standard ontology, and this evaluation is clearer when the knowledge graph contains exactly the same nodes. From this word space we construct a knowledge graph using a few different variations of the described algorithm.

We then use the Girvan-Newman algorithm, which groups nodes that should share an internal label, to find exactly 28 communities in the knowledge graph. These 28 communities are compared with the 28 categories in the lowest level of the hierarchy in the GENIA ontology, using the adjusted rand score.

We vary some parameters of the graph-constructing algorithm and evaluate how the results vary. In particular, we try our two different edge selection algorithms, RNG and kNN, and also vary the k in kNN. To put our results in context, we compare them to the adjusted rand score we get if we start with the gold standard and randomly swap some of the elements. More precisely, we swap a fraction k of the elements from their original community to a random other one.

Adjusted rand score

The rand score is used for measuring the similarity of two data clusterings. Given a set of elements $T = \{o_1, ..., o_n\}$ and two partitions of T, $A = \{X_1, ..., X_r\}$ and $B = \{Y_1, ..., Y_r\}$, we define:

a as the number of pairs of elements of T that are in the same set in A and in the same set in B,
b as the number of pairs of elements of T that are in different sets in A and in different sets in B,
c as the number of pairs of elements of T that are in the same set in A and in different sets in B,
d as the number of pairs of elements of T that are in different sets in A and in the same set in B.

The Rand score R is now defined as:

$$R = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}}$$

However, this is not adjusted for chance, meaning that depending on the size of the set of elements and the partition of the gold standard, the score for a random assignment of elements to partitions can vary. It would be desirable for the score to be zero or negative for a random assignment of elements, so that we have a point of reference for the score. In order to accomplish this, we first define a contingency table. Given that A contains $\{A_1, A_2, ..., A_r\}$ and B contains $\{B_1, B_2, ..., B_s\}$, we define $n_{ij}$ as the number of objects in common between $A_i$ and $B_j$, $a_i = \sum_j n_{ij}$ and $b_j = \sum_i n_{ij}$. The adjusted rand index is now defined as:

$$\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{n}{2}}$$

(Rand, 1971)
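In practice the adjusted rand score can be computed directly with scikit-learn, as in this sketch; the label assignments are hypothetical.

```python
from sklearn.metrics import adjusted_rand_score

# Gold-standard category per entity vs. the community found by Girvan-Newman.
gold =      [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted = [0, 0, 1, 1, 1, 1, 2, 2, 0]
print(adjusted_rand_score(gold, predicted))  # 1.0 = identical clusterings, ~0 = random
```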

5.2.2 Wikidata

Wikidata is a large, general-purpose knowledge base, aimed at being a structured-knowledge complement to Wikipedia (Wikidata, 2016). Wikidata has been constructed manually by volunteers over many person-hours. We want to see if we can automatically construct a knowledge graph from plain Wikipedia text, and how closely this knowledge graph resembles the manually constructed one.

We construct a word space from a 2010 plain text dump of Wikipedia and from this create a knowledge graph. This knowledge graph is then compared with Wikidata to evaluate its quality. We want to evaluate two different features of the knowledge graph: its general structure and the labelling of relations. We vary the construction algorithm in order to test how the results vary with the different variations of the algorithm.

Structure

To evaluate the structure of our knowledge graph, we begin by constructing a word space from a plain text dump of Wikipedia. We then take seven different categories from Wikidata (country, city, fruit, animal, color, profession and band), and take 10 entities from each of these categories whose names exist in our word space. This partition of the entities into categories is taken as the gold standard. We then create a knowledge graph from all these entities in our word space, and use the Girvan-Newman community detection algorithm to divide the graph into four communities. We then measure the overlap between our community clustering and the gold standard using the Adjusted Rand Index. As in our GENIA evaluation, we put the result into context by comparing it to a “destroyed” gold standard with elements randomly switched between communities. We also repeat this procedure for each possible pair of categories, so we can see how much the clustering quality varies depending on the communities. The graph was in this experiment constructed using the RNG algorithm; during testing, it was discovered that on such a small dataset, using kNN methods for clustering did not work very well.

Labelling

We also want to test how closely our labels on the relations match those in Wikidata. We do this by constructing a word space from Wikipedia; we then take all entities within a certain category in Wikidata which also exist in our word space, and take the name of that category as the gold standard label for the relation between these entities. We then use our labelling algorithm to find our label for the relation, and compare it to the gold standard. Since the label will probably not match the gold standard label exactly very often, we want to see how far away our label is from the gold standard label. We do this in two ways: by comparing the gold standard label not only with the top label but with the top 10 label candidates from our labelling algorithm, and by using a measure of semantic word distance. In our case, we use the cosine distance in Gavagai's core word space for this distance measure. Other alternatives, like the number of steps between the terms in WordNet, would be possible.

This evaluation is done on a number of different categories (country, capital, city, fruit, animal, color, profession and band). Averages as well as minimum and maximum values of the following results are reported:

1. The paradigmatic distance (in Gavagai's core word space) from our algorithm's top labelling choice to the gold standard label. (topdist)

2. The shortest paradigmatic distance (in Gavagai's core word space) from the gold standard to any top 10 candidate from our labelling algorithm. (shortdist)


6 Results

This section will present the evaluation results.

6.1 Distance measures

This section presents the results from our distance measure evaluation. In the tables below, the rows indicate word spaces of different sizes (number of words × dimensions) in thousands, and the columns indicate the different measures. Measures suffixed with an asterisk use filtering by sign.

               cosine  cosine*  andcos  andcos*  andnorm  andnorm*  andmedian  andmedian*  andsum*
10000x500      0.250   0.323    0.325   0.318    0.078    0.245     0.186      0.406       0.323
10000x2000     0.442   0.623    0.535   0.533    0.132    0.487     0.651      0.754       0.623
10000x5000     0.527   0.759    0.579   0.577    0.180    0.599     0.849      0.854       0.760
10000x10000    0.591   0.811    0.512   0.512    0.183    0.677     0.891      0.887       0.811
20000x500      0.207   0.244    0.276   0.272    0.057    0.165     0.151      0.358       0.244
20000x2000     0.406   0.577    0.519   0.515    0.119    0.396     0.575      0.752       0.577
20000x5000     0.496   0.744    0.581   0.579    0.162    0.541     0.847      0.864       0.744
20000x10000    0.538   0.806    0.522   0.523    0.167    0.627     0.892      0.890       0.806
40000x500      0.170   0.202    0.225   0.223    0.045    0.115     0.113      0.305       0.202
40000x2000     0.364   0.505    0.496   0.494    0.097    0.319     0.498      0.751       0.505
40000x5000     0.456   0.695    0.576   0.575    0.139    0.472     0.812      0.873       0.695
80000x50       0.023   0.014    0.013   0.020    0.014    0.015     0.011      0.014       0.014
80000x100      0.040   0.038    0.037   0.034    0.011    0.031     0.019      0.041       0.038
80000x200      0.065   0.072    0.067   0.071    0.016    0.048     0.022      0.088       0.072
80000x500      0.139   0.168    0.184   0.184    0.034    0.093     0.087      0.267       0.168
80000x2000     0.328   0.409    0.472   0.471    0.076    0.201     0.427      0.740       0.409
80000x5000     0.429   0.626    0.570   0.571    0.123    0.382     0.760      0.878       0.626

Table 1: Mean

               cosine  cosine*  andcos  andcos*  andnorm  andnorm*  andmedian  andmedian*  andsum*
10000x500      0.132   0.117    0.141   0.137    0.082    0.111     0.084      0.134       0.117
10000x2000     0.235   0.138    0.103   0.107    0.125    0.156     0.113      0.095       0.138
10000x5000     0.271   0.114    0.116   0.114    0.173    0.117     0.093      0.079       0.114
10000x10000    0.262   0.103    0.108   0.108    0.194    0.123     0.062      0.059       0.103
20000x500      0.126   0.118    0.131   0.130    0.077    0.101     0.069      0.124       0.118
20000x2000     0.223   0.156    0.105   0.108    0.117    0.157     0.081      0.091       0.156
20000x5000     0.277   0.123    0.117   0.117    0.170    0.126     0.069      0.072       0.123
20000x10000    0.294   0.110    0.112   0.112    0.183    0.143     0.069      0.059       0.110
40000x500      0.107   0.105    0.125   0.115    0.056    0.086     0.046      0.115       0.105
40000x2000     0.220   0.175    0.109   0.106    0.098    0.155     0.093      0.094       0.175
40000x5000     0.269   0.152    0.110   0.109    0.146    0.166     0.111      0.059       0.152
80000x50       0.027   0.022    0.026   0.025    0.019    0.025     0.020      0.021       0.022
80000x100      0.041   0.043    0.042   0.040    0.021    0.035     0.022      0.038       0.043
80000x200      0.056   0.046    0.061   0.064    0.025    0.043     0.026      0.043       0.046
80000x500      0.095   0.090    0.112   0.105    0.042    0.068     0.040      0.089       0.090
80000x2000     0.205   0.211    0.116   0.114    0.077    0.157     0.094      0.083       0.211
80000x5000     0.265   0.185    0.106   0.106    0.130    0.158     0.128      0.054       0.185

Table 2: Standard Deviation


               cosine  cosine*  andcos  andcos*  andnorm  andnorm*  andmedian  andmedian*  andsum*
10000x500      11      9        8       0        0        0         1          21          0
10000x2000     4       5        1       0        0        0         7          33          0
10000x5000     1       1        1       0        0        0         34         13          0
10000x10000    7       4        0       0        0        0         32         7           0
20000x500      8       6        12      0        0        0         1          23          0
20000x2000     3       6        1       0        0        0         1          39          0
20000x5000     4       0        2       0        0        0         18         26          0
20000x10000    5       4        0       0        0        0         33         8           0
40000x500      10      4        9       0        0        0         0          27          0
40000x2000     3       2        0       0        0        0         1          44          0
40000x5000     3       3        0       0        0        0         14         30          0
80000x50       30      1        9       5        3        0         2          0           0
80000x100      23      4        17      2        0        0         0          4           0
80000x200      14      8        12      5        0        1         0          10          0
80000x500      8       4        10      1        0        0         0          27          0
80000x2000     1       2        0       0        0        0         0          47          0
80000x5000     2       0        0       0        0        0         8          40          0

Table 3: Number of words for which a distance measure was the best

6.2 GENIA

This section presents the results from the GENIA evaluation. Each line in the table represents an algorithm variation and its adjusted rand score. For context, the adjusted rand scores for the gold standard with randomly introduced noise are included. rng means the Relative Neighborhood Graph and knn k-Nearest Neighbor.

rng                  0.07
knn, k=5             0.04
knn, k=10            0.11
knn, k=15            0.15
knn, k=20            0.17
knn, k=25            0.24
knn, k=50            0.10
knn, k=75            0.08
50% random swaps     0.31
60% random swaps     0.21
70% random swaps     0.12
80% random swaps     0.05
90% random swaps     0.01
100% random swaps    0.00

Table 4: Adjusted rand score for different edge selection algorithms

6.3 Wikidata structure evaluation

This section presents the results from the Wikidata structure evaluation.

Clustering of all entities

In the following table, all is the adjusted rand score for the clustering of all categories, compared to the gold standard from Wikidata. For context, the adjusted rand scores for the gold standard with randomly-introduced noise are included.


all                 0.410
10% random swaps    0.801
20% random swaps    0.624
30% random swaps    0.480
40% random swaps    0.356

Table 5: Results from structure comparison to Wikidata

Clustering of category pairs

This paragraph presents the results of clustering each possible pair of categories. max-pair is the adjusted rand score for the pair with the best adjusted rand score compared to the gold standard; min-pair, average-pair and std-dev-pair are the worst score, the average score, and the standard deviation of the adjusted rand scores, respectively.

max-pair        1.00
min-pair       -0.01
average-pair    0.61
std-dev-pair    0.37

Table 6: Results from pairwise category clustering compared to Wikidata

6.4 Wikidata labelling evaluation

The following four tables present the results from the Wikidata labelling evaluation. Each table uses one version of the algorithm: one of kNN-list merging and vector multiplication for label finding, and one of cosine distance or andmedian with filtering by sign for the distance measuring in these algorithms. average topdist is the average cosine distance from the algorithm's top label candidate to the gold standard candidate, in Gavagai's core word space. max topdist and min topdist are the maximum and minimum such distances. average shortdist is the average of the shortest cosine distance from any of the algorithm's top 10 label candidates to the gold standard candidate in Gavagai's core word space. max shortdist and min shortdist are the maximum and minimum such shortest distances. In the following tables, lower numbers are better.

kNN-list merging (cosine)
average topdist               0.55
max topdist                   0.73
min topdist                   0.00
average shortdist in top 10   0.42
max shortdist in top 10       0.59
min shortdist in top 10       0.00


kNN-list merging (andmedian*)
average topdist               0.53
max topdist                   0.65
min topdist                   0.00
average shortdist in top 10   0.41
max shortdist in top 10       0.53
min shortdist in top 10       0.00

Vector multiplication (cosine)
average topdist               0.68
max topdist                   0.74
min topdist                   0.57
average shortdist in top 10   0.57
max shortdist in top 10       0.67
min shortdist in top 10       0.45

Vector multiplication (andmedian*)
average topdist               0.69
max topdist                   0.88
min topdist                   0.45
average shortdist in top 10   0.43
max shortdist in top 10       0.62
min shortdist in top 10       0.00


7 Discussion

There are a number of conclusions to be drawn from the evaluation results. Concerning the distance measure test, we can see that the andmedian distance measure with filtering by sign performs best out of our tested distance measures in almost all cases, and significantly outperforms the cosine distance baseline. This shows that the method of measuring distances matters a lot if we want to estimate co-occurrence counts from a word space, and that it has a big impact on the ordering of candidates. Also notable is that although andmedian generally outperforms cosine, for larger word spaces its performance approaches that of the cosine distance. It is also quite possible that even better distance measures exist, which is a possible direction for future research.

The algorithm did not perform very well for the GENIA evaluation. In thebest case, the algorithm’s clustering was slightly better than the clustering givenby taking the gold standard and randomly changing community of 60% of theentities. This is considerably better than a random clustering, but probably notgood enough for most use cases. This could possibly be because of a relativelysmall data size - only 1999 abstracts of a few hundred words each. Distributionalsemantic models tend to work best on larger datasets. Medical abstract data isalso a rather special form of text, and many of the entities in the ontology wasarguably very semantically similar to each other, e.g. different types of genes.

What we can also deduce is that at least for this purpose, creating thegraphs using RNG performs worse than using a simple k-Nearest Neighbor.RNGs are interesting for looking at the neighborhood of a single entity, butwhen doing community detection on the created graph, they perform notablyworse than kNN, at least in this particular evaluation. This might be becausean RNG graph has considerably fewer edges between nodes, and the communitydetection algorithm therefore has a lot less to work with.

Also notable is that the k in kNN has a large impact on the performance of the clustering. Of our tested values of k, k = 25 performs best in this case.

The Wikidata structure evaluation shows that the algorithm generally performs well for clustering clearly separated categories when run on general textual data, like that found in Wikipedia, although the clustering quality varies widely depending on the categories. In the best case it could separate two categories from each other with no mistakes, while in the worst case it was no better than random.

The Wikidata labelling evaluation shows that the performance of the labelling algorithm varies widely depending on the input. In the best case it managed to find the exact same label that is used by Wikidata; in the worst and average cases it is quite far from the gold standard label. While the fact that it in some cases finds a very good label is promising, this instability likely makes the algorithm in its current implementation not good enough for most use cases. It can also be deduced that kNN-list merging performs noticeably better than vector multiplication, although since vector multiplication is faster, a trade-off exists between the two. As we saw in our other experiment, andmedian with filtering by sign performed considerably better than cosine distance for approximating co-occurrence counts. As expected, this makes andmedian the better choice in our labelling algorithm as well.

There are many challenges in evaluating the quality of knowledge graphs, which is why the evaluations in this report are fairly complicated and not entirely easy to interpret. We have focused our evaluations on comparing our generated knowledge graphs to existing, manually constructed knowledge graphs. An inherent problem with this is that our knowledge graphs could, when their quality is judged manually, be as good as or even better than the manually constructed knowledge graphs despite not resembling them. However, if our knowledge graphs show some similarity to manually constructed knowledge graphs, that is at least an indication that they are of some quality.

Knowledge graphs contribute to easier and faster access to information, as well as possibly uncovering previously hidden information. Like many innovations in information technology, this could be used for unethical purposes; one might, for example, build a privacy-invading knowledge graph with information retained in unethical ways. On the other hand, better access to information could also lead to new, positive ideas and innovations, e.g. contributing to sustainable development.

In conclusion, through running these evaluations we have attempted to answer the research question of whether knowledge graphs of high quality can be constructed using this method. The results show that the knowledge graphs currently created are not consistent enough to be considered high quality, but there are some promising results, and future research in this direction could be fruitful.

8 Future Work

The work in this report has been exploratory in nature, and there are many threads that could be followed further and expanded on. There seems to be at least some merit to the idea that a word co-occurring often near two different words is a good label for the relation between those words: good labels are often quite high up in the list of words most often co-occurring with both words. However, further filtering of this list would be necessary in order to consistently find a good label for the relation.
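A rough sketch of that candidate-generation idea, in the spirit of the kNN-list merging approach, is to rank the words that appear high in the nearest-neighbor (or co-occurrence) lists of both endpoint words. In the sketch below, neighbors is an assumed helper returning a ranked list of a word's nearest neighbors in the word space, and the combined-rank scoring is illustrative rather than the exact scoring used in the thesis.

    def label_candidates(word_a, word_b, neighbors, k=100):
        # Words that rank highly in both neighbor lists are candidate labels
        # for the relation between word_a and word_b.
        rank_a = {w: r for r, w in enumerate(neighbors(word_a, k))}
        rank_b = {w: r for r, w in enumerate(neighbors(word_b, k))}
        shared = set(rank_a) & set(rank_b)
        # A lower combined rank means the word is strongly associated with both.
        return sorted(shared, key=lambda w: rank_a[w] + rank_b[w])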

This report has exclusively looked at unsupervised methods; however, there would probably be merit in looking at supervised or semi-supervised methods which might take inspiration from the reported algorithms. Using the analogy property, present in for example word2vec, where subtracting the context vector for queen from the one for king yields a vector close to the one obtained by subtracting woman from man, might be an interesting path.
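A minimal illustration of the analogy property, assuming a dictionary of word2vec-style vectors keyed by word (hypothetical data, not from the thesis): the offset between two words can be compared with the offset between two other words, and a high cosine similarity between the offsets suggests the pairs express a similar relation.

    import numpy as np

    def cos_sim(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def relation_match(vectors, pair_a, pair_b):
        # Compare the offset vectors of two word pairs, e.g.
        # ("king", "queen") versus ("man", "woman").
        offset_a = vectors[pair_a[0]] - vectors[pair_a[1]]
        offset_b = vectors[pair_b[0]] - vectors[pair_b[1]]
        return cos_sim(offset_a, offset_b)

    # In a well-trained space, relation_match(vectors, ("king", "queen"),
    # ("man", "woman")) should be high, which a supervised or semi-supervised
    # method could exploit when labelling relations.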

The evaluation in this report has focused on comparison with existing knowledge graphs. It might also be interesting to look at inherent properties of the generated knowledge graphs, e.g. what the general structure looks like, or how much the knowledge graphs change when slight noise is introduced in the data.

In our evaluation we used clustering with Girvan-Newman to group edges which should have a common label, and also for evaluating the graph structure. The choice of clustering algorithm likely influences the results heavily, but in this report no alternative algorithms, or variations of Girvan-Newman, have been tried. Looking at how varying the clustering algorithm would affect the results would be an interesting follow-up study.
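Trying alternative algorithms would be straightforward with a library such as networkx, which provides Girvan-Newman alongside several other community-detection methods. The sketch below is an illustrative way to obtain a partition from an edge list, not the implementation used in this work.

    import networkx as nx
    from networkx.algorithms.community import girvan_newman

    def communities(edges, splits=1):
        # Build the graph from (node, node) edge pairs and run Girvan-Newman,
        # taking the partition after the requested number of splits.
        graph = nx.Graph()
        graph.add_edges_from(edges)
        partitions = girvan_newman(graph)
        for _ in range(splits - 1):
            next(partitions)
        return [sorted(c) for c in next(partitions)]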

It would be interesting to have the results of the GENIA clustering manually inspected by a domain expert. Doing a deeper evaluation of the results without significant knowledge of medicine is difficult.



References

Rand, William M. (1971). "Objective Criteria for the Evaluation of Clustering Methods". In: Journal of the American Statistical Association 66.336, pp. 846–850. issn: 01621459. url: http://www.jstor.org/stable/2284239.

Salton, G., A. Wong, and C. S. Yang (1975). "A Vector Space Model for Automatic Indexing". In: Commun. ACM 18.11, pp. 613–620. issn: 0001-0782. doi: 10.1145/361219.361220. url: http://doi.acm.org/10.1145/361219.361220.

Robertson, Stephen E. and Karen Sparck Jones (1988). "Document Retrieval Systems". In: ed. by Peter Willett. London, UK: Taylor Graham Publishing. Chap. Relevance Weighting of Search Terms, pp. 143–160. isbn: 0-947568-21-2. url: http://dl.acm.org/citation.cfm?id=106765.106783.

Deerwester, Scott et al. (1990). "Indexing by latent semantic analysis". In: Journal of the American Society for Information Science 41.6, pp. 391–407.

Roussopoulos, Nick, Stephen Kelley, and Frédéric Vincent (1995). "Nearest Neighbor Queries". In: SIGMOD Rec. 24.2, pp. 71–79. issn: 0163-5808. doi: 10.1145/568271.223794. url: http://doi.acm.org/10.1145/568271.223794.

Lund, Kevin and Curt Burgess (1996). "Producing high-dimensional semantic spaces from lexical co-occurrence". In: Behavior Research Methods, Instruments, & Computers 28.2, pp. 203–208. issn: 1532-5970. doi: 10.3758/BF03204766. url: http://dx.doi.org/10.3758/BF03204766.

Agichtein, Eugene and Luis Gravano (2000). "Snowball: Extracting Relations from Large Plain-text Collections". In: Proceedings of the Fifth ACM Conference on Digital Libraries. DL '00. San Antonio, Texas, USA: ACM, pp. 85–94. isbn: 1-58113-231-X. doi: 10.1145/336597.336644. url: http://doi.acm.org/10.1145/336597.336644.

Paccanaro, Alberto and Geoffrey E. Hinton (2001). "Learning Distributed Representations of Concepts Using Linear Relational Embedding". In: IEEE Trans. on Knowl. and Data Eng. 13.2, pp. 232–244. issn: 1041-4347. doi: 10.1109/69.917563. url: http://dx.doi.org/10.1109/69.917563.

Girvan, M. and M. E. J. Newman (2002). "Community structure in social and biological networks". In: Proceedings of the National Academy of Sciences 99.12, pp. 7821–7826. doi: 10.1073/pnas.122653799. eprint: http://www.pnas.org/content/99/12/7821.full.pdf. url: http://www.pnas.org/content/99/12/7821.abstract.

Dorow, Beate and Dominic Widdows (2003). "Discovering Corpus-specific Word Senses". In: Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics - Volume 2. EACL '03. Budapest, Hungary: Association for Computational Linguistics, pp. 79–82. isbn: 1-111-56789-0. doi: 10.3115/1067737.1067753. url: http://dx.doi.org/10.3115/1067737.1067753.

Kim, J.-D. et al. (2003). "GENIA corpus—a semantically annotated corpus for bio-textmining". In: Bioinformatics 19.suppl 1, pp. i180–i182. doi: 10.1093/bioinformatics/btg1023. eprint: http://bioinformatics.oxfordjournals.org/content/19/suppl_1/i180.full.pdf+html. url: http://bioinformatics.oxfordjournals.org/content/19/suppl_1/i180.abstract.



Sahlgren, Magnus (2005). "An introduction to random indexing". In: Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005.

Etzioni, Oren et al. (2008). "Open Information Extraction from the Web". In: Commun. ACM 51.12, pp. 68–74. issn: 0001-0782. doi: 10.1145/1409360.1409378. url: http://doi.acm.org/10.1145/1409360.1409378.

Turney, Peter D. and Patrick Pantel (2010). "From Frequency to Meaning: Vector Space Models of Semantics". In: J. Artif. Int. Res. 37.1, pp. 141–188. issn: 1076-9757. url: http://dl.acm.org/citation.cfm?id=1861751.1861756.

Bordes, Antoine, Jason Weston, et al. (2011). "Learning Structured Embeddings of Knowledge Bases". In: Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence. AAAI'11. San Francisco, California: AAAI Press, pp. 301–306. url: http://dl.acm.org/citation.cfm?id=2900423.2900470.

Lenci, Alessandro and Giulia Benotto (2012). "Identifying Hypernyms in Distributional Semantic Spaces". In: Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation. SemEval '12. Montréal, Canada: Association for Computational Linguistics, pp. 75–79. url: http://dl.acm.org/citation.cfm?id=2387636.2387650.

Singhal, Amit (2012). Introducing Knowledge Graph: things, not strings. url: https://googleblog.blogspot.se/2012/05/introducing-knowledge-graph-things-not.html.

Bordes, Antoine, Nicolas Usunier, et al. (2013). "Translating Embeddings for Modeling Multi-relational Data". In: Advances in Neural Information Processing Systems 26. Ed. by C. J. C. Burges et al. Curran Associates, Inc., pp. 2787–2795. url: http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf.

Mikolov, Tomas, Kai Chen, et al. (2013). "Efficient Estimation of Word Representations in Vector Space". In: CoRR abs/1301.3781. url: http://arxiv.org/abs/1301.3781.

Mikolov, Tomas, Ilya Sutskever, et al. (2013). "Distributed Representations of Words and Phrases and their Compositionality". In: CoRR abs/1310.4546. url: http://arxiv.org/abs/1310.4546.

Pennington, Jeffrey, Richard Socher, and Christopher Manning (2014). "Glove: Global Vectors for Word Representation". In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, pp. 1532–1543. url: http://www.aclweb.org/anthology/D14-1162.

Wang, Zhen et al. (2014a). "Knowledge graph and text jointly embedding". In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

— (2014b). Knowledge Graph Embedding by Translating on Hyperplanes. url: http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8531.

Fan, Miao et al. (2015). "Jointly Embedding Relations and Mentions for Knowledge Population". In: CoRR abs/1504.01683. url: http://arxiv.org/abs/1504.01683.



Gyllensten, Amaru Cuba and Magnus Sahlgren (2015). "Navigating the Semantic Horizon using Relative Neighborhood Graphs". In: CoRR abs/1501.02670. url: http://arxiv.org/abs/1501.02670.

He, Shizhu et al. (2015). "Learning to Represent Knowledge Graphs with Gaussian Embedding". In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. CIKM '15. Melbourne, Australia: ACM, pp. 623–632. isbn: 978-1-4503-3794-6. doi: 10.1145/2806416.2806502. url: http://doi.acm.org/10.1145/2806416.2806502.

Wikidata (2016). url: https://www.wikidata.org/wiki/Wikidata:Main_Page.

Wordnet (2016). url: https://wordnet.princeton.edu/.


