RC-NET: A General Framework for Incorporating Knowledge into Word Representations

Chang Xu
College of Computer and Control Engineering, Nankai University
Tianjin, 300071, P. R. China
[email protected]

Yalong Bai
School of Computer Science and Technology, Harbin Institute of Technology
Harbin, 150001, P. R. China
[email protected]

Jiang Bian, Bin Gao
Microsoft Research
13F, Bldg 2, No. 5, Danling St, Beijing, 100080, P. R. China
{jibian,bingao}@microsoft.com

Gang Wang, Xiaoguang Liu
College of Computer and Control Engineering, Nankai University
Tianjin, 300071, P. R. China
{liuxg,wgzwp}@nbjl.nankai.edu.cn

Tie-Yan Liu
Microsoft Research
13F, Bldg 2, No. 5, Danling St, Beijing, 100080, P. R. China
[email protected]

ABSTRACT

Representing words as vectors in a continuous space can form a powerful basis for generating high-quality textual features for many text mining and natural language processing tasks. Some recent efforts, such as the skip-gram model, have attempted to learn word representations that capture both syntactic and semantic information from text corpora. However, they still cannot encode the properties of words and the complex relationships among words very well, since text itself often contains incomplete and ambiguous information. Fortunately, knowledge graphs provide a gold mine for enhancing the quality of learned word representations. In particular, a knowledge graph, usually composed of entities (words, phrases, etc.), relations between entities, and some corresponding meta information, can supply invaluable relational knowledge that encodes the relationships between entities as well as categorical knowledge that encodes the attributes or properties of entities. Hence, in this paper, we introduce a novel framework called RC-NET that leverages both relational and categorical knowledge to produce word representations of higher quality. Specifically, we build the relational knowledge and the categorical knowledge into two separate regularization functions, and combine both of them with the original objective function of the skip-gram model. By solving this combined optimization problem using back propagation neural networks, we can obtain word representations enhanced by the knowledge graph. Experiments on popular text mining and natural language processing tasks, including analogical reasoning, word similarity, and topic prediction, have all demonstrated that our model can significantly improve the quality of word representations.



Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning; I.5.1 [Pattern Recognition]: Models

General Terms

Algorithms, Experimentation

Keywords

Distributed word representations, deep learning, knowledge graph

1. INTRODUCTION

Deep learning techniques have been frequently used to solve natural language processing (NLP) tasks [8, 1, 12, 21, 22]. Their main purpose is to learn distributed representations of words (i.e., word embeddings) from text and use them as components or as the basis for generating textual features for solving NLP tasks. Recently, some efficient methods, such as the continuous bag-of-words model (CBOW) and the continuous skip-gram model [18], have been proposed to leverage the context of each word in text streams to learn word embeddings, which can capture both the syntactic and the semantic information among words. The principle behind these models is that words that are syntactically or semantically similar should also have similar context words.

Although these works have demonstrated their effectiveness in a number of NLP tasks, they still suffer from some limitations. In particular, as these works learn word representations mainly from word co-occurrence information, the obtained word embeddings cannot capture the relationship between two syntactically or semantically similar words if either of them yields very little context information. On the other hand, even a sufficient amount of context can be noisy or biased, so that it does not reflect the inherent relationships between words and further misleads the training process. To solve these problems, we propose to incorporate the information from knowledge graphs into the learning process in order to produce better word representations.

A knowledge graph is a kind of knowledge base, which has been widely used to store complex structured and unstructured knowledge. It usually takes the form of a directed or undirected graph that uses vertices and edges to represent entities (words, phrases, etc.) and their relationships, respectively. Knowledge graphs such as Freebase [10] and WordNet [24] have started playing important roles in many text mining and NLP applications, including expert systems, question answering systems, etc. A knowledge graph commonly contains two forms of knowledge: relational knowledge and categorical knowledge. Specifically, relational knowledge (like is-a, part-of, child-of, etc.) encodes the relationships between entities so as to differentiate word pairs with analogy relationships; categorical knowledge (like gender, location, etc.) encodes the attributes and properties of entities, according to which similar words can be grouped into meaningful categories. Both relational and categorical knowledge extracted from a knowledge graph, an example of which is shown in Figure 1, can serve as valuable external information to enhance the learning of word representations. Specifically, in the learning process, the relational knowledge can be leveraged to infer certain explicit connections between the embeddings of related words, and the categorical knowledge can be used to reflect coherence between the embeddings of words with the same attributes, even if some of them yield very little context information, or biased/noisy context information.

In this paper, we propose a novel framework that takes advantage of both relational and categorical knowledge to produce high-quality word representations. The framework is built upon the skip-gram model [18], whose objective function we extend by incorporating the external knowledge as regularization functions. In particular, to leverage the relational knowledge, we define a corresponding regularization function following a recent study on multi-relation models [5], which characterizes the relationships between entities by interpreting them as translations in the low-dimensional embedding space of the entities. To incorporate the categorical knowledge, we define another regularization function that minimizes the weighted distance between words with the same attributes. Then, we combine these two regularization functions with the original objective function of the skip-gram model. After solving this combined optimization problem via back propagation neural networks, we obtain the continuous representations of words. We call the proposed framework RC-NET, indicating the incorporation of both Relational and Categorical knowledge into neural NETworks to learn word embeddings. We have conducted empirical experiments on three popular text mining and NLP tasks, including analogical reasoning, word similarity, and topic prediction, with large-scale public datasets, and the results all demonstrate that, compared with the state-of-the-art methods, our proposed approach can significantly improve the quality of word representations by encoding both the word co-occurrence information and the external knowledge.

The rest of the paper is organized as follows. We briefly review the related work on learning word embeddings via deep neural networks in Section 2. In Section 3, we describe the proposed framework for incorporating relational and categorical knowledge into the learning of word representations. The experimental setup and results are reported in Section 4. The paper is concluded in Section 5.

2. RELATED WORK

Building distributed word representations [14] has attracted increasing attention in the area of machine learning. Recently, to show their effectiveness in a variety of text mining and NLP tasks, a series of works have applied deep learning techniques to learn high-quality word representations. For example, Collobert et al. [7, 8] proposed a neural network that can learn unified word representations suited for several NLP tasks simultaneously. Furthermore, Mikolov et al. proposed efficient neural network models for learning word representations, i.e., word2vec [18]. This work introduced two specific models, the continuous bag-of-words model (CBOW) and the continuous skip-gram model (skip-gram), both of which are unsupervised models learned from large-scale text corpora. Under the assumption that similar words yield similar contexts, these models maximize the log likelihood of each word given its context words within a sliding window. The learned word representations remarkably capture both syntactic and semantic regularities.

Nevertheless, since most existing works learn word representations mainly from word co-occurrence information, it is quite difficult to obtain high-quality embeddings for words with very little context information; on the other hand, a large amount of noisy or biased context can also give rise to ineffective word embeddings. Therefore, it is necessary to introduce extra knowledge into the learning process to regularize the quality of the word embeddings. Unfortunately, there are very few previous studies that attempt to explore knowledge-powered word embedding.

Some efforts have focused on learning word embeddings in order to address knowledge base completion and enhancement [6, 21, 23]; however, they did not investigate the other side of the coin, i.e., leveraging knowledge to enhance word representations. Recently, there have been some early attempts in this direction. For example, Luong et al. [16] proposed a neural model that learns morphologically-aware word representations by combining a recursive neural network with a neural language model. In this model, they explicitly utilize knowledge about the morphological structure inside a word and regard each morpheme as a basic unit. Being restricted to morpheme-level knowledge, this attempt has not investigated more important word-level knowledge, such as analogical relations between words. In contrast, in this paper we mainly explore how to take advantage of word-level knowledge to enhance word embeddings.

Most recently, Yu et al. [25] attempted to use knowledge to improve word representations. In particular, they proposed a new learning objective that incorporates both a neural language model objective and a semantic prior knowledge objective. By leveraging knowledge in terms of semantic similarity between words during the learning process, they demonstrate that their new method yields improvements on three tasks: language modeling, measuring semantic similarity, and predicting human judgments. However, this model is specialized for incorporating semantic knowledge, and it does not explicitly distinguish between different kinds of relational knowledge.

[Figure 1 depicts an example knowledge graph around the British royal family. Relational knowledge includes triples such as King_of <George VI, Britain>, Queen_of <Elizabeth II, Britain>, Wife_of <Elizabeth II, Prince Philip>, Wife_of <Diana, Charles>, and Child_of <Elizabeth II, George VI>, <Charles, Elizabeth II>, <Charles, Prince Philip>. Categorical knowledge includes the synonyms of U.K. (Britain, United Kingdom, Great Britain, UK, U.K., GB, GBR, United Kingdom of Great Britain and Ireland) and the gender categories Male (George VI, Prince Philip, Charles) and Female (Elizabeth II, Diana).]

Figure 1: A knowledge graph contains two forms of knowledge: relational knowledge and categorical knowledge.

Bian et al. [2] recently proposed to leverage morphological, syntactic, and semantic knowledge to advance the learning of word embeddings. In particular, they explored these types of knowledge to define new bases for word representation, provide additional input information, and serve as auxiliary supervision in the learning process. Although they conducted an intensive empirical study, they did not make model-level innovations to leverage external knowledge for improving word representations.

In contrast to all the aforementioned works, in this paper we present a general method for incorporating various types of knowledge into the learning of word representations. With the goal of incorporating more extensive forms of knowledge, we define a new learning objective that combines the objective on the raw text with that on the external knowledge. As a result, our new model is able to learn word representations that encode both contextual information and extra knowledge, which is much more general and flexible than previous works.

3. KNOWLEDGE POWERED WORD REPRESENTATIONS

In this section, we first introduce the continuous skip-gram model, which serves as the basis of the proposed framework. Then, we describe how we model relational knowledge and categorical knowledge as regularization functions. After that, we introduce the proposed RC-NET framework, which incorporates these regularization functions into the skip-gram model to strengthen the learning of word representations. Finally, we describe how we solve the optimization problem in the proposed framework.

3.1 Skip-gram

We take the continuous skip-gram model [18] as the basis of our proposed framework.¹

¹Note that although we use the skip-gram model as an example to illustrate our framework, a similar framework can be developed on the basis of any other word embedding model.

Skip-gram is a recently proposed algorithm for learning word representations using a neural network model, whose underlying principle is that similar words should have similar contexts. In the skip-gram model (see Figure 2), a sliding window is employed on the input text stream to generate the training samples. In each sliding window, the model uses the central word as input to predict the surrounding words. Specifically, the input word is represented in the 1-of-V format, where V is the size of the entire vocabulary of the training data and each word in the vocabulary is represented as a long vector with only one non-zero element. In the feed-forward process, the input word is first mapped into the embedding space by the weight matrix M. After that, the embedding vector is mapped back to the 1-of-V space by another weight matrix M′, and the resulting vector is used to predict the surrounding words after applying the softmax function. In the back-propagation process, the prediction errors are propagated back through the network to update the two weight matrices. After the training process converges, the weight matrix M is taken as the learned word representations.

[Figure 2 shows the skip-gram architecture: a sliding window of size 2N + 1 over the text stream w_1, ..., w_K; the central word w_k is mapped into the embedding space by the weight matrix M and back to the 1-of-V space by M′, followed by a softmax that predicts the surrounding words w_{k−N}, ..., w_{k−1}, w_{k+1}, ..., w_{k+N}.]

Figure 2: The continuous skip-gram model.

Specifically, given a training text stream w_1, w_2, w_3, ..., w_K, the objective of the skip-gram model is to maximize the following average log probability:

L = \frac{1}{K} \sum_{k=1}^{K} \sum_{-N \le j \le N,\, j \ne 0} \log p(w_{k+j} \mid w_k)    (1)

where w_k is the central word, w_{k+j} is a surrounding word, and N determines the sliding window size 2N + 1. The conditional probability p(w_{k+j} | w_k) is defined by the following softmax function:

p(w_{k+j} \mid w_k) = \frac{\exp({v'_{w_{k+j}}}^{\top} v_{w_k})}{\sum_{w=1}^{V} \exp({v'_{w}}^{\top} v_{w_k})}    (2)

where v_w and v'_w are the input and output latent variables, i.e., the input and output representation vectors of w, and V is the vocabulary size.

To calculate the prediction errors for back propagation, we need to compute the derivative of p(w_{k+j} | w_k), whose computation cost is proportional to the vocabulary size V. As V is often very large, it is difficult and sometimes impractical to compute the derivative directly. A typical method to solve this problem is noise-contrastive estimation (NCE) [13], which aims at fitting unnormalized probabilistic models. NCE approximates the log probability of the softmax by performing logistic regression to discriminate between the observed data and some artificially generated noise. It has been applied to the neural probabilistic language model [20] and the inverse vector log-bilinear model [19]. A simpler method is negative sampling (NEG) [18], which generates l noise samples for each input word to estimate the target word, where l is a very small number compared with V. Therefore, the training time scales linearly with the number of noise samples and becomes independent of the vocabulary size. If the frequency of word w is u(w), then the probability of sampling w is usually set to p(w) ∝ u(w)^{3/4} [18].

3.2 Relational Knowledge Powered Model

Having introduced the skip-gram model, we now describe how we equip it with relational knowledge. As illustrated in the left part of Figure 3, relational knowledge encodes the relationships between words. Inspired by a recent study on multi-relation models [5], which builds relationships between entities by interpreting them as translation operations on the low-dimensional representations of the entities, we propose to use a function E_r, described below, to capture the relational knowledge.

Specifically, relational knowledge in knowledge graphs is usually represented as triplets (head, relation, tail) (denoted by (h, r, t) ∈ S, where S is the set of relational knowledge), each consisting of two words h, t ∈ W (W is the set of words) and a relationship r ∈ R (R is the set of relationships). To learn the relation representations, we assume that relationships between words can be interpreted as translation operations and can be represented by vectors. The principle in our model is that if the relationship (h, r, t) holds, the representation of the tail word t should be close to the representation of the head word h plus the representation vector of the relationship r, i.e., h + r; otherwise, h + r should be far away from t. Note that this model learns word representations and relation representations in the same continuous embedding space.

According to the above principle, we define E_r as a margin-based regularization function over the set of relational knowledge S:

E_r = \sum_{(h,r,t) \in S} \; \sum_{(h',r,t') \in S'_{(h,r,t)}} \left[ \gamma + d(h + r, t) - d(h' + r, t') \right]_+    (3)

In the above formulation, [x]_+ denotes the positive part of x, γ > 0 is a margin hyperparameter, and d(x, y) is the distance measure between words in the embedding space. For simplicity, we define d(x, y) as the Euclidean distance between x and y. The set of corrupted triplets S'_{(h,r,t)} is defined as:

S'_{(h,r,t)} = \{ (h', r, t) \mid h' \in W \} \cup \{ (h, r, t') \mid t' \in W \}    (4)

which is constructed from S by replacing either the head word or the tail word with another randomly selected word, such that S ∩ S' = ∅.
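To make the margin-based regularizer of Eq. (3) concrete, here is a minimal numpy sketch that evaluates E_r over a set of triplets, sampling corrupted triplets by replacing the head or tail with a random word. The matrices W (word embeddings) and R (relation embeddings) and all names are hypothetical placeholders, not the authors' code.

```python
import numpy as np

def relational_regularizer(W, R, triplets, vocab_size, gamma=1.0,
                           num_corrupted=1, seed=0):
    """E_r over relational triplets (h, r, t) with margin gamma.

    W: (V, d) word embedding matrix; R: (num_relations, d) relation embeddings.
    triplets: iterable of (head_id, relation_id, tail_id) from the set S.
    """
    rng = np.random.default_rng(seed)

    def dist(x, y):                          # Euclidean distance d(x, y)
        return np.linalg.norm(x - y)

    total = 0.0
    for h, r, t in triplets:
        pos = dist(W[h] + R[r], W[t])        # d(h + r, t) for the true triplet
        for _ in range(num_corrupted):       # sample a corrupted triplet from S'
            if rng.random() < 0.5:
                h2, t2 = rng.integers(vocab_size), t   # replace the head word
            else:
                h2, t2 = h, rng.integers(vocab_size)   # replace the tail word
            neg = dist(W[h2] + R[r], W[t2])
            total += max(0.0, gamma + pos - neg)       # hinge [gamma + d_pos - d_neg]_+
    return total
```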

Note that the optimization process might trivially minimize E_r by simply increasing the norms of the word and relation representations. To avoid this problem, we impose an additional constraint on the norms, which is a commonly used trick in the literature [5, 4, 6, 15]. However, instead of enforcing the L2-norm of the word representations to be 1, we adopt a soft norm constraint on the relation representations as follows:

r_i = 2σ(x_i) − 1    (5)

where σ(·) is the sigmoid function σ(x_i) = 1/(1 + e^{−x_i}), r_i is the i-th dimension of the relation vector r, and x_i is a latent variable. This guarantees that every dimension of the relation representation vector lies within the range (−1, 1).

By combining the skip-gram objective function and the regularization function derived from the relational knowledge, we obtain the following combined objective J_r that incorporates relational knowledge into the word representation learning system:

J_r = αE_r − L    (6)

where α is the combination coefficient. Our goal is to minimize the combined objective J_r, which can be optimized using back propagation neural networks. We call this model the Relational Knowledge Powered Model and denote it by R-NET for ease of reference.

3.3 Categorical Knowledge Powered Model

After introducing R-NET, we describe how we equip the skip-gram model with categorical knowledge. As illustrated in the right part of Figure 3, categorical knowledge encodes the attributes or properties of words, from which we can group similar words according to their attributes. We then require the representations of words that belong to the same category to be close to each other.

A specific kind of categorical knowledge can be represented by a similarity matrix whose element s(w_i, w_j) is the similarity score between w_i and w_j. Note that many kinds of categorical knowledge can be mined from knowledge graphs, and they may vary a lot in their similarity properties.

[Figure 3 sketches how both kinds of knowledge constrain the vector space: relational knowledge (semantic relations such as "capital of", syntactic relations such as "singular verb to plural verb", etc.) and categorical knowledge (word sets from a thesaurus, synonyms, domain knowledge, etc.).]

Figure 3: RC-NET: leveraging relational knowledge and categorical knowledge to improve the quality of word representations.

For example, in Figure 1, we can see that the categorical knowledge "synonyms of United Kingdom" consists of only a few entities, including United Kingdom, Great Britain, U.K., etc., and these words are strongly similar to each other since they are all aliases of the United Kingdom. In the same figure, we can also find that the categorical knowledge "Male" or "Female" consists of a massive number of person names, but these persons are similar only because they are all men or women, which is a very weak similarity. From these examples, we observe that, in most cases, categorical knowledge with smaller capacity is more likely to contain specific information, so we are more confident in pulling the words in that category close to each other. On the contrary, categorical knowledge with larger capacity is more likely to contain general information, so we are less confident in grouping the corresponding words. We use this heuristic to constrain the similarity scores:

\sum_{j=1}^{V} s(w_i, w_j) = 1,    (7)

so that if a word shares the same category with many other words, their mutual similarity scores become small. Then, we encode the categorical knowledge using another regularization function E_c:

E_c = \sum_{i=1}^{V} \sum_{j=1}^{V} s(w_i, w_j) \, d(w_i, w_j)    (8)

where d(w_i, w_j) is the distance between the words in the embedding space and s(w_i, w_j) serves as a weighting function. Again, for simplicity, we define d(w_i, w_j) as the Euclidean distance between w_i and w_j.
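The following minimal sketch, under the same assumptions as the previous one, builds the similarity weights of Eq. (7) by spreading each word's total weight of 1 uniformly over its category neighbors, and then evaluates E_c of Eq. (8). The helper names and data layout are hypothetical.

```python
import numpy as np

def category_weights(categories, vocab_size):
    """s(w_i, w_j): uniform weights over category neighbors, summing to 1 per row."""
    S = np.zeros((vocab_size, vocab_size))
    for members in categories:               # each category is a list of word ids
        for i in members:
            for j in members:
                if i != j:
                    S[i, j] += 1.0
    row_sums = S.sum(axis=1, keepdims=True)
    np.divide(S, row_sums, out=S, where=row_sums > 0)   # enforce sum_j s(w_i, w_j) = 1
    return S

def categorical_regularizer(W, S):
    """E_c = sum_ij s(w_i, w_j) * ||w_i - w_j|| (weighted Euclidean distances)."""
    total = 0.0
    for i, j in zip(*np.nonzero(S)):
        total += S[i, j] * np.linalg.norm(W[i] - W[j])
    return total
```

Note how a small category (e.g., the aliases of U.K.) yields large pairwise weights, while a huge category such as "Male" spreads each word's unit weight thinly, matching the capacity heuristic above.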

By combining the skip-gram objective function and the regularization function derived from the categorical knowledge, we obtain the following combined objective J_c that incorporates categorical knowledge into the word representation learning system:

J_c = βE_c − L    (9)

where β is the combination coefficient. Our goal is to minimize the combined objective J_c, which can be optimized using back propagation neural networks. We call this model the Categorical Knowledge Powered Model and denote it by C-NET for ease of reference.

3.4 Joint Knowledge Powered Model

After describing the R-NET and C-NET models, it is natural to combine them into a global framework that leverages both relational and categorical knowledge to learn word representations. Specifically, in the global framework, we minimize the following combined objective function:

J = αE_r + βE_c − L.    (10)

We call this framework the Joint Knowledge Powered Model and denote it by RC-NET for ease of reference.

Figure 4 shows the architecture of the proposed RC-NET framework. Compared to either R-NET or C-NET alone, RC-NET has clear advantages. Relational knowledge mainly helps build the global structure of the learned word representations by utilizing the relationships between different words, while categorical knowledge helps improve the local structure of the learned word representations by clustering similar words together. Hence, RC-NET is likely to yield a more structured embedding space and to reduce the randomness of word representations caused by incomplete or biased training information. Moreover, within the RC-NET framework, the relational knowledge and categorical knowledge can compensate for each other. On one hand, the relational knowledge of some words might be absent, but we can obtain their similar words from categorical knowledge and then infer their relations from the relationships of those similar words. On the other hand, the categorical knowledge of a word might be missing; however, if the word shares the same relationships with a number of other words, we can infer its categorical knowledge from the categories of these related words.

3.5 Optimization Procedure

[Figure 4 depicts three branches sharing the same word representations: the R-NET branch measures d(w_{k,r}, w_{k,r'}) using the relation embedding r; the skip-gram branch predicts the context words w_{k−N}, ..., w_{k+N} from the central word w_k; and the C-NET branch measures the weighted distances d(w_k, w_1), ..., d(w_k, w_V).]

Figure 4: The architecture of RC-NET. The objective is to learn word representations and relation representations based on the text stream, relational knowledge, and categorical knowledge.

In the implementation, we optimize the regularization functions derived from both the relational knowledge and the categorical knowledge along with the training process of the skip-gram model. While learning the word representations from the context words in the sliding window, if the central word w_k hits the external knowledge, the optimization of the corresponding knowledge-based regularization function is activated.

• For the branch of relational knowledge, according to the objective function (6), we minimize the Euclidean distance between the representation vector of the real tail word t and the predicted vector w_k + r, and then we update the central word representation as well as the relation representation. For efficiency, we sample a subset of S' with a fixed size (which is also a parameter) instead of using the whole set S'. Note that the central word w_k may appear as either the head word or the tail word of a relational knowledge triplet.

• For the branch of categorical knowledge, we minimize the weighted Euclidean distance between the representation of the central word and those of its similar words, according to the objective function (9).

The two knowledge branches and the skip-gram branch share the same word representations during learning, as sketched below. In our implementation, the optimization is conducted by stochastic gradient descent in a mini-batch mode, whose computational complexity is comparable to that of the optimization process of the skip-gram model.
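The sketch below illustrates one knowledge-regularization SGD step for a central word, in the spirit of the procedure above: the relational branch pulls W[t] toward W[h] + R[r], and the categorical branch pulls the central word toward its weighted neighbors. For simplicity it uses the squared Euclidean distance (whose gradient is the raw difference vector), omits the margin and corrupted-triplet terms of Eq. (3), and leaves the skip-gram gradient to a separate routine; it is an interpretation under stated assumptions, not the authors' implementation.

```python
import numpy as np

def sgd_knowledge_step(W, R, center, rel_triplets, sim_neighbors,
                       alpha=1.0, beta=0.0001, lr=0.025):
    """One knowledge step for central word `center`, sharing the matrix W.

    rel_triplets: triplets (h, r, t) from S in which `center` appears.
    sim_neighbors: list of (word_id, s_weight) sharing a category with `center`.
    """
    # Relational branch: reduce ||(W[h] + R[r]) - W[t]||^2.
    for h, r, t in rel_triplets:
        diff = (W[h] + R[r]) - W[t]          # residual of the translation h + r ~ t
        W[h] -= lr * alpha * diff
        R[r] -= lr * alpha * diff
        W[t] += lr * alpha * diff
    # Categorical branch: reduce the weighted distances to similar words.
    for j, s in sim_neighbors:
        diff = W[center] - W[j]
        W[center] -= lr * beta * s * diff
        W[j] += lr * beta * s * diff
    return W, R
```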

4. EXPERIMENTS

In this section, we conduct experiments to examine whether incorporating relational knowledge and categorical knowledge into the learning of continuous word representations can significantly improve the quality of word embeddings. In particular, we compare the performance of our knowledge powered models against state-of-the-art baselines by evaluating the quality of the learned word embeddings on three text mining and NLP tasks: analogical reasoning, word similarity, and topic prediction. In the rest of this section, we first introduce the experimental setup, and then report evaluation results and further analysis on the analogical reasoning task, the word similarity task, and the topic prediction task, respectively.

4.1 Experimental Setup

4.1.1 Training Data

In our experiments, we trained word embeddings on a publicly available text corpus (http://mattmahoney.net/dc/enwik9.zip), a dataset containing the first billion characters from Wikipedia. After pre-processing by removing all the HTML meta-data and hyper-links and converting digits into English words, the final training corpus contains 123.4 million words in total, and the number of unique words, i.e., the vocabulary size, is about 220 thousand.

4.1.2 Parameter Setting for Compared Methods

In the following experiments, we compare four methods: Skip-gram (the baseline), R-NET, C-NET, and RC-NET. To train the word embeddings with these four methods, we apply the same setting for their common parameters. Specifically, the number of negative samples was set to 3; the context window size was set to 5; each model was trained for 1 epoch; the learning rate was initialized to 0.025 and decreased linearly so that it approached zero at the end of training.

Moreover, the combination weights in R-NET, C-NET, and RC-NET also play a critical role in producing high-quality word embeddings. Overemphasizing the weight of the original Skip-gram objective may weaken the influence of the knowledge, while putting too large a weight on the knowledge powered objectives may hurt the generality of the learned word embeddings. According to our empirical experience, it is better to decide the combination weights of the Skip-gram objective, the relational knowledge, and the categorical knowledge based on the scale of their respective derivatives during optimization. Specifically, it is better to set the objective weight of C-NET (β) to a smaller value


than the objective weights of Skip-gram and R-NET, since the derivative of the C-NET objective usually has a larger scale than those of Skip-gram and R-NET. In the following experiments, we set α = 1 for R-NET, β = 0.001 for C-NET, and α = 1, β = 0.0001 for RC-NET. Note that this parameter setting may not be optimal for different training corpora or tasks, but the following experiments illustrate its robustness to some extent.

4.2 Analogical Reasoning Task

4.2.1 Task Description

The analogical reasoning task was originally introduced by Mikolov et al. [18, 17]; it defines a comprehensive test set that contains five types of semantic analogies and nine types of syntactic analogies (http://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt). For example, to solve a semantic analogy such as Germany : Berlin = France : ?, we need to find the word x whose embedding vec(x) is closest to vec(Berlin) − vec(Germany) + vec(France) according to the cosine distance. This specific example is considered to be answered correctly if x is Paris. Another example, of a syntactic analogy, is quick : quickly = slow : ?, whose correct answer is slowly.
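The vector-offset procedure just described can be written down compactly as follows. This is a minimal sketch under assumed data structures (an embedding matrix W and hypothetical lookup tables word2id / id2word), not the evaluation code used by the authors.

```python
import numpy as np

def answer_analogy(W, word2id, id2word, a, b, c):
    """Solve 'a : b = c : ?' by finding x whose vector is closest (by cosine
    similarity) to vec(b) - vec(a) + vec(c), excluding the three query words."""
    target = W[word2id[b]] - W[word2id[a]] + W[word2id[c]]
    target /= np.linalg.norm(target)
    norms = np.maximum(np.linalg.norm(W, axis=1), 1e-12)
    scores = (W @ target) / norms            # cosine similarity to every word
    for q in (a, b, c):
        scores[word2id[q]] = -np.inf         # exclude the query words themselves
    return id2word[int(np.argmax(scores))]

# e.g. answer_analogy(W, word2id, id2word, "Germany", "Berlin", "France") -> "Paris"
```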

In our experiments, we use an enlarged dataset called WordRep [11], which extends the original evaluation dataset of the analogical reasoning task. This larger dataset is generated by extracting more analogy pairs from the Longman dictionary (http://www.longmandictionariesonline.com/), and contains 34,773 relevant word pairs in total. We split the whole dataset into two parts with a ratio of 4:1, using the larger part for training and the smaller part for testing. To form the test set from the smaller part, we connect every two word pairs from the same relation to generate four-word tuples as analogical questions. Note that we avoid using a word pair for training if at least one of its two words also appears in the test set.

4.2.2 Applied Knowledge

R-NET. For training R-NET, we directly use the relation pairs and relation types as supervised information to learn representation vectors for the different relations.

C-NET. For training C-NET, we extract categorical knowledge from those relations. Specifically, given one relation, the set of head words extracted from all pairs of this relation forms one category, and the collection of tail words forms another category. For instance, there are 1,467 "city-in-state" word pairs in the training part of the WordRep dataset. We split them into two categories: one is the collection of cities, while the other corresponds to the set of states. Each of them is treated as a type of categorical knowledge for training C-NET.
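A tiny sketch of this head/tail splitting, with hypothetical names and toy pairs only for illustration:

```python
def categories_from_relation(pairs):
    """Split the word pairs of one relation (e.g. "city-in-state") into two
    categories: the set of head words and the set of tail words."""
    heads = sorted({h for h, t in pairs})   # e.g. the cities
    tails = sorted({t for h, t in pairs})   # e.g. the states
    return heads, tails

heads, tails = categories_from_relation([("Chicago", "Illinois"), ("Houston", "Texas")])
```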

RC-NET. Finally, we employ all the relational and categorical knowledge applied for training R-NET and C-NET in the learning process of RC-NET.

4.2.3 Experimental Results

In our experiments on the analogical reasoning task, we compared the baseline word embeddings trained by Skip-gram


against those trained by R-NET, C-NET, and RC-NET. The dimension of the word embeddings is set to 100 and 300. Table 1 reports the semantic, syntactic, and total accuracy of the four methods. From this table, we can see that all of the knowledge powered models outperform the baseline skip-gram model, and RC-NET yields the largest improvements. These results indicate that the knowledge powered word embeddings are of higher quality than those of the baseline model trained without knowledge regularization.

From Table 1, we can also observe that incorporating relational knowledge into the skip-gram model increases the accuracy on all three sub-types of the analogical reasoning task, while incorporating the categorical knowledge yields higher accuracy on semantic analogies but lower performance on syntactic analogies. We hypothesize that the reason is that, for some syntactic relationships such as "opposite", the head or tail word collections do not strictly form word groups representing coherent categories.

To better understand why our new methods learn higher-quality word embeddings, we perform case studies on a specific syntactic relationship, "Adjective to Adverb", and a specific semantic relationship, "Male to Female". In particular, we apply a two-dimensional PCA projection to the 100-dimensional learned embeddings of randomly selected word pairs. Figure 5 reveals RC-NET's capability of learning the representations of relational knowledge and of shaping the distribution of words in the embedding space. In other words, it is easy to see from this figure that, by incorporating relational knowledge, R-NET produces word embeddings such that the offset vectors of word pairs in the same relationship tend to share a common direction with similar length, while by incorporating categorical knowledge, C-NET generates word embeddings such that words belonging to the same topic or domain tend to be close to each other.
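The two-dimensional projection used in this case study can be reproduced with a standard PCA routine; the sketch below assumes scikit-learn is available and uses hypothetical lookup tables, so it is only one possible way to generate such plots.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_2d(W, word2id, words):
    """Project the selected word vectors onto their first two principal components."""
    X = np.stack([W[word2id[w]] for w in words])
    return PCA(n_components=2).fit_transform(X)   # shape: (len(words), 2)

coords = project_2d(W, word2id, ["oral", "orally", "lazy", "lazily", "cheap", "cheaply"])
```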

4.3 Word Similarity Task

4.3.1 Task Description

Another standard dataset for evaluating vector space models is the WordSim-353 dataset [9], which consists of 353 pairs of nouns. Each pair is presented without context and associated with 13 to 16 human judgments of similarity and relatedness on a scale from 0 to 10. For example, (cup, drink) received an average score of 7.25, while (cup, substance) received an average score of 1.92. To evaluate the quality of the learned word embeddings, we compute Spearman's ρ correlation between the similarity scores computed from the learned word embeddings and the human judgments.
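A minimal sketch of this evaluation, assuming scipy is available and using hypothetical lookup tables; it scores each pair by cosine similarity and correlates the scores with the human judgments.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_wordsim(W, word2id, pairs, human_scores):
    """Spearman's rho between model cosine similarities and human judgments."""
    model_scores = []
    for w1, w2 in pairs:
        v1, v2 = W[word2id[w1]], W[word2id[w2]]
        cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        model_scores.append(cos)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# e.g. pairs = [("cup", "drink"), ("cup", "substance")], human_scores = [7.25, 1.92]
```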

Since this task expects similar or highly correlated words to be close to each other, it only requires incorporating the categorical knowledge extracted from such words. Thus, we only evaluate the effectiveness of C-NET on this task.

4.3.2 Applied Knowledge

To train C-NET, it is necessary to collect categorical knowledge that reflects the topic or concept information of words. In our experiments, we extract such knowledge from Freebase [3]. As a structured knowledge base, Freebase organizes words according to a couple of basic relations.

Table 1: Performance of using relational knowledge and categorical knowledge on the analogical reasoning task based on our proposed models.

  Model      | Vector Dimensionality | Accuracy [%]
             |                       | Semantic  Syntactic  Total
  Skip-gram  | 100                   | 25.06     36.49      31.30
             | 300                   | 28.76     40.31      35.07
  R-NET      | 100                   | 26.91     39.37      33.56
             | 300                   | 32.64     43.46      38.55
  C-NET      | 100                   | 29.67     36.12      33.19
             | 300                   | 37.07     40.06      39.00
  RC-NET     | 100                   | 32.02     43.92      38.52
             | 300                   | 34.36     44.42      39.85

[Figure 5 shows four two-dimensional PCA scatter plots of word pairs: a) Skip-gram, "Adjective-to-Adverb"; b) RC-NET, "Adjective-to-Adverb"; c) Skip-gram, "Male-to-Female"; d) RC-NET, "Male-to-Female".]

Figure 5: Two-dimensional PCA projection of 100-dimensional Skip-gram vectors and our proposed RC-NET word vectors for the syntactic relation "Adjective to Adverb" and the semantic relation "Male to Female". All of these word pairs were chosen randomly.

Among them, we take advantage of the "type of" relation to generate the categorical knowledge, since this relation naturally reflects the correlation between entities (words) and topics.

While Freebase contains many domain-specific words, such as professional terminology and names (of persons, locations, and businesses), these words are so rare in the general training corpus that they make quite a limited contribution to improving word embedding quality. Therefore, to address this problem, we only collect the categorical knowledge related to a pre-defined vocabulary, which contains only the common nouns in the Longman Dictionary and filters out all multi-word phrases and non-alphabetic characters.

Table 2: Results obtained by the different methods on the word similarity task.

  Method     | Vector Dimensionality | Spearman's ρ correlation
  Skip-gram  | 100                   | 0.652
             | 300                   | 0.678
  C-NET      | 100                   | 0.661
             | 300                   | 0.683

Finally, we select the top 10 human-rated topic sets, including astronomy, biology, boats, chemistry, computer, fashion, food, geology, interests, and language, as the categorical knowledge for training; the resulting vocabulary size is 3,174.

4.3.3 Experimental Results

Table 2 compares the performance of C-NET against Skip-gram on the word similarity task. From this table, we can see that C-NET achieves better performance than Skip-gram regardless of whether the vector dimension is 100 or 300. These results indicate that C-NET more effectively pulls words that are similar in topic or concept close to each other in the learned representation space, as it explicitly incorporates the categorical knowledge extracted from Freebase into the learning process, thereby encoding the similarity and correlation between words into the word representations.

4.4 Topic Prediction Task

4.4.1 Task Description

In many text mining and NLP applications, it is important to identify the topic of a given word, since the topic provides useful semantic information. For instance, both the words "star" and "earth" correspond to the topic "astronomy", and both "cell" and "neuron" belong to the topic "biology". In the rest of this section, we evaluate word embeddings via the topic prediction task, whose goal is to find the most related topic word for a given word.

Our proposed methods, especially R-NET and RC-NET, can be naturally applied to this task. In particular, since R-NET and RC-NET learn relation embeddings in addition to word embeddings, given a word h and the topic relation embedding r, we can predict the topic word t as the one whose embedding has the shortest Euclidean distance to h + r over the whole vocabulary. Although Skip-gram and C-NET do not explicitly produce relation embeddings during training, for a specific relation r we can compute the average offset vector t − h over the word pairs 〈h, t〉 belonging to this relation and use it as the embedding of r. We can then follow the same procedure as R-NET and RC-NET to solve the topic prediction task.
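The prediction rule just described amounts to a nearest-neighbor search around h + r; a minimal sketch, with hypothetical lookup tables and including the average-offset fallback for models without explicit relation embeddings, follows.

```python
import numpy as np

def predict_topic(W, R, word2id, id2word, head_word, relation_id):
    """Predict the topic word t whose embedding has the smallest Euclidean
    distance to W[h] + R[r] over the whole vocabulary."""
    query = W[word2id[head_word]] + R[relation_id]
    dists = np.linalg.norm(W - query, axis=1)
    dists[word2id[head_word]] = np.inf       # exclude the query word itself
    return id2word[int(np.argmin(dists))]

def relation_from_offsets(W, word2id, pairs):
    """For Skip-gram / C-NET: approximate a relation embedding as the average
    offset vec(t) - vec(h) over the training pairs of that relation."""
    offsets = [W[word2id[t]] - W[word2id[h]] for h, t in pairs]
    return np.mean(offsets, axis=0)
```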

4.4.2 Applied Knowledge

To train C-NET for this task, we leverage the same categorical knowledge used for the word similarity task, as described in Section 4.3.2. To obtain relational knowledge for training R-NET and RC-NET, we simply transform the categorical knowledge dataset into a new format representing relational knowledge. Specifically, the relational knowledge takes the triplet form (h, r, t), where h is a specific word under a certain topic, r is the corresponding topic relation, and t is the name of the topic or concept.

Table 3: Results obtained by the compared methods on the topic prediction task.

  Relation   | Error Rate [%]
             | Skip-gram  C-NET  R-NET  RC-NET
  astronomy  | 2.00       2.00   8.00   2.00
  biology    | 6.29       4.91   4.91   4.32
  boats      | 11.76      5.88   5.88   7.84
  chemistry  | 6.67       5.71   9.52   15.24
  computer   | 19.54      8.62   6.32   4.02
  fashion    | 22.08      24.68  22.08  22.08
  food       | 17.98      13.60  11.40  7.89
  geology    | 0.00       0.00   11.54  7.69
  interests  | 6.80       8.16   6.12   6.12
  language   | 33.33      33.33  18.52  7.41
  Total      | 12.21      9.37   8.57   7.15


4.4.3 Experimental Results

In the following experiments, we split the generated knowledge data into a training set and a test set with a ratio of 1:1 for each relation. Note that there is no vocabulary overlap between the training set and the test set except for the topic word t. The dimension of the word representations is set to 100.

Table 3 reports the error rates of the different word embedding models on the topic prediction task. From this table, we can see that the knowledge powered models achieve lower error rates than Skip-gram on most of the relations. Furthermore, RC-NET reaches better performance than R-NET and C-NET, which indicates that both relational and categorical knowledge are important for predicting the topic of a word.

From the table, we also observe that there are some relations on which the knowledge powered models do not yield better performance. Further analysis reveals that these relations fall into two types. One type includes relations that have too few training pairs for the relation embedding to be trained sufficiently; for example, it is quite difficult to train high-quality embeddings for the relations "astronomy" and "geology", which have merely 25 and 13 training pairs, respectively. The other type includes relations that contain so many rare words that they also receive little training; for example, as there are many uncommon words in the relation "chemistry", it is not easy to collect enough training samples for this relation.

5. CONCLUSIONS AND FUTURE WORK

Learning high-quality word embeddings is quite valuable for many text mining and NLP tasks. To address the limitation of state-of-the-art methods, namely their inability to encode the properties of words and the complex relationships among words very well, this paper proposes to incorporate knowledge graphs into the learning process, since they contain invaluable relational knowledge that encodes the relationships between entities as well as categorical knowledge that encodes the attributes or properties of entities. We introduce a new knowledge powered method, called RC-NET, that leverages both relational and categorical knowledge to obtain word representations. Experiments on three popular text mining and NLP tasks have shown that the knowledge powered method can significantly improve the quality of word representations.

In future work, we will explore how to incorporate more types of knowledge, such as the morphological knowledge of words, into the learning process to obtain more powerful word representations. Meanwhile, we will study how to define more general regularization functions to represent the effect of various types of knowledge.

6. REFERENCES

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
[2] J. Bian, B. Gao, and T.-Y. Liu. Knowledge-powered deep learning for word embedding. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2014.
[3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
[4] A. Bordes, X. Glorot, J. Weston, and Y. Bengio. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233–259, 2014.
[5] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
[6] A. Bordes, J. Weston, R. Collobert, Y. Bengio, et al. Learning structured embeddings of knowledge bases. In AAAI, 2011.
[7] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
[8] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.
[9] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414. ACM, 2001.
[10] Freebase. http://www.freebase.com.
[11] B. Gao, J. Bian, and T.-Y. Liu. WordRep: A benchmark for research on learning word representations. In ICML 2014 Workshop on Knowledge-Powered Deep Learning for Text Mining, 2014.
[12] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, ICML, 2011.
[13] M. U. Gutmann and A. Hyvarinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13:307–361, 2012.
[14] G. E. Hinton. Distributed representations. 1984.
[15] R. Jenatton, N. Le Roux, A. Bordes, G. Obozinski, et al. A latent factor model for highly multi-relational data. In NIPS, pages 3176–3184, 2012.
[16] M.-T. Luong, R. Socher, and C. D. Manning. Better word representations with recursive neural networks for morphology. CoNLL-2013, 104, 2013.
[17] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[19] A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273, 2013.
[20] A. Mnih and Y. W. Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.
[21] R. Socher, D. Chen, C. D. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934, 2013.
[22] G. Tur, L. Deng, D. Hakkani-Tur, and X. He. Towards deeper understanding: Deep convex networks for semantic utterance classification. In ICASSP, pages 5045–5048, 2012.
[23] J. Weston, A. Bordes, O. Yakhnenko, and N. Usunier. Connecting language and knowledge bases with embedding models for relation extraction. arXiv preprint arXiv:1307.7973, 2013.
[24] WordNet. "About WordNet", Princeton University. http://wordnet.princeton.edu, 2010.
[25] M. Yu and M. Dredze. Improving lexical embeddings with semantic knowledge. In Association for Computational Linguistics (ACL), 2014.