
Bridging Domains Using World Wide Knowledge for Transfer Learning

Evan Wei Xiang, Bin Cao, Derek Hao Hu, and Qiang Yang, Fellow, IEEE

Abstract—A major problem of classification learning is the lack of ground-truth labeled data. It is usually expensive to label new data instances for training a model. To solve this problem, domain adaptation in transfer learning has been proposed to classify target-domain data by using some other source-domain data, even when the data may have different distributions. However, domain adaptation may not work well when the differences between the source and target domains are large. In this paper, we design a novel transfer learning approach, called BIG (Bridging Information Gap), to effectively extract useful knowledge from a worldwide knowledge base, which is then used to link the source and target domains for improving classification performance. BIG works when the source and target domains share the same feature space but have different underlying data distributions. Using the auxiliary source data, we can extract a "bridge" that allows cross-domain text classification problems to be solved using standard semisupervised learning algorithms. A major contribution of our work is that with BIG, a large amount of worldwide knowledge can be easily adapted and used for learning in the target domain. We conduct experiments on several real-world cross-domain text classification tasks and demonstrate that our proposed approach can significantly outperform several existing domain adaptation approaches.

Index Terms—Data mining, transfer learning, cross-domain, text classification, Wikipedia.


1 INTRODUCTION

TEXT classification, which aims to assign a document to one or more categories based on its content, is a fundamental task for Web and document data mining applications, ranging from information retrieval and spam detection to online advertisement and Web search. Traditional supervised learning approaches for text classification require sufficient labeled instances in a problem domain in order to train a high-quality model. However, it is not always easy or feasible to obtain new labeled data in a domain of interest (hereafter referred to as the target domain). The lack of labeled data can seriously hurt classification performance in many real-world applications.

To solve this problem, transfer learning techniques, in particular domain adaptation techniques, have been introduced: they capture the shared knowledge from some related domains where labeled data are available, and use that knowledge to improve the performance of data mining tasks in a target domain. In transfer learning terminology, one or more auxiliary domains are identified as the source of knowledge transfer, and the domain of interest is known as the target domain. Much effort has been dedicated to this problem in recent years in machine learning, data mining, and information retrieval [1], [2], [3], [4], [5].

However, transfer learning may not work well when the difference between the source and target domains is large. In particular, when the distribution gap between the source and target domains is large, transfer learning can hardly be used to benefit learning in the target domain [6]. For example, when we use some financial documents as the source domain and information technology documents as the target domain, the differences are so large that the performance in the target domain may decrease. Another problem arises when the source and target domains have a large divergence in feature space; for example, the source data might be written for one audience and the target data for another. In these situations, traditional transfer learning might not work well.

We note that the above difficulty is caused by a so-called information gap in domain adaptation tasks, which is essentially a combination of feature space and data distribution differences between the training and test data. Since previous domain adaptation methods focused only on the data in the source and target domains, they may fail to connect the related but indirectly shared parts of the domains. Our observation is that such a gap can potentially be found and bridged using knowledge from other domains. To solve this problem, we introduce a bridge between the two different domains by leveraging additional knowledge sources that are readily available and have wide coverage in scope. This knowledge source can be a third domain, such as Wikipedia or the Open Directory Project (ODP). For example, the connection between commutative algebra and geometry can be found through a large knowledge base on algebraic geometry topics. Once we find such a knowledge bridge, we can use the auxiliary data and semisupervised learning methods to fill in the information gap.

In this paper, we apply semisupervised learning (SSL) to domain adaptation problems based on the use of auxiliary data. We take the labeled data from the source domain and the unlabeled data from the target domain, as well as an auxiliary data source such as Wikipedia. We then apply

770 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 6, JUNE 2010

. The authors are with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong. E-mail: {wxiang, caobin, derekhh, qyang}@cse.ust.hk.

Manuscript received 31 Mar. 2009; revised 22 Sept. 2009; accepted 30 Oct. 2009; published online 4 Feb. 2010. Recommended for acceptance by C. Zhang, P.S. Yu, and D. Bell. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDESI-2009-03-0256. Digital Object Identifier no. 10.1109/TKDE.2010.31.

1041-4347/10/$26.00 © 2010 IEEE Published by the IEEE Computer Society

Authorized licensed use limited to: Gnanamani College of Technology. Downloaded on June 12,2010 at 18:23:20 UTC from IEEE Xplore. Restrictions apply.

SSL to utilize the information contained in the unlabeled data to help the classification task on the target data. Although domain adaptation (DA) and SSL share similar problem settings, directly using SSL to solve DA problems may result in poor performance, as validated in [3], because SSL assumes that the distribution of the unlabeled data is similar to that of the labeled data. However, the existence of the information gap can make this assumption invalid. Using our approach, we show that with the extracted bridge filling in the information gap, SSL-based algorithms can be applied successfully to these classification problems.

This paper introduces a novel domain adaptation algorithm called BIG (Bridging Information Gap). Our BIG algorithm requires that the source domain and the target domain share the same feature space, but the distributions of the domains can be highly different. A major contribution of our work is that we make use of a large amount of worldwide knowledge to build a bridge linking the source and target domains, even when their distribution differences are large. We conduct thorough experiments on several real-world cross-domain text classification tasks and demonstrate that our proposed approach can outperform a number of existing approaches for classification, including nontransfer learning approaches, traditional domain adaptation approaches that do not use the auxiliary data, and approaches that use the auxiliary data in a naive manner. We show that, especially in situations where the source and target domains are far away from each other, our approach can outperform the baseline methods significantly.

The remainder of the paper is organized as follows: In Section 2, we first survey related work in domain adaptation. In Section 3, we define the concept of information gap and propose our algorithm BIG for filling in the information gap. In Section 4, we demonstrate its effectiveness through several cross-domain text classification tasks. In Section 5, we conclude the paper and discuss some directions for future research.

2 RELATED WORK

In this section, we briefly review some previously proposed methods for solving the task of domain adaptation. Since Wikipedia is used as our auxiliary data for building the information bridge, we also review some methods that extract useful knowledge from Wikipedia and other similar knowledge bases such as ODP. Finally, since our approach to the domain adaptation problem is highly related to semisupervised learning, we also briefly review this topic.

2.1 Domain Adaptation

Domain adaptation has attracted more and more attention in recent years. In general, previous domain adaptation approaches can be classified into two categories [7]: instance-based approaches [1], [2] and feature-based approaches [4], [5], [8], [9].

Instance-based methods seek reweighting strategies on the source data such that the source distribution can match the target distribution. Feature-based methods try to discover a shared feature space on which the distributions of different domains are pulled closer. Both types try to discover the relation between the source and target domains within the scope of the two domains. For example, instance-based transfer learning models assume that there is a subset of instances sharing similar distributions in different domains, and then they emphasize the impact of these data in the models since they are more "similar." Feature-based domain adaptation models assume that different domains may share some features, for instance, a subset of explicit features or implicit features.
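To make the instance-based idea concrete, the sketch below weights each source instance by an estimated density ratio p_t(x)/p_s(x). This is not the scheme of [1] or [2]; it is a deliberately minimal illustration that models both domains as 1D Gaussians fitted by moment matching, with invented toy data.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def importance_weights(source, target):
    """Weight each source instance by the estimated ratio p_t(x)/p_s(x).

    Both domains are modeled as 1D Gaussians -- a crude stand-in for the
    reweighting strategies discussed above."""
    def fit(xs):
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        return mu, math.sqrt(var) or 1.0
    mu_s, sd_s = fit(source)
    mu_t, sd_t = fit(target)
    return [gaussian_pdf(x, mu_t, sd_t) / gaussian_pdf(x, mu_s, sd_s)
            for x in source]

source = [0.0, 0.5, 1.0, 1.5, 2.0]   # labeled source instances (toy 1D values)
target = [1.5, 2.0, 2.5]             # unlabeled target instances
weights = importance_weights(source, target)
# Source instances that fall where the target density is high get the
# largest weights, so the reweighted source better matches the target.
```

Source instances near the target distribution dominate the reweighted training set, which is the shared intuition behind the instance-based methods surveyed next.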

Here, we consider some well-known instance-based domain adaptation methods. Jiang et al. [2] use an instance weighting method for natural language processing, a type of importance sampling for solving sample selection bias problems [10]. Dai et al. [1] propose a boosting-style reweighting method and provide different weighting schemes for data in different domains. Other feature-based methods have been developed and compared to instance-based methods. Daume et al. [11] propose a simple feature augmentation method for NLP tasks. Blitzer et al. [12] use the Structural Correspondence Learning (SCL) model to identify correspondences among features from different domains by modeling their correlations with pivot features that behave in the same way for discriminative learning; they choose the pivot features that are used to bridge two domains. Lee et al. [13] use transfer learning on an ensemble of related tasks to construct an informative prior on feature relevance. They assume that features themselves have metafeatures that are predictive of their relevance to the prediction task, and model their relevance as a function of the metafeatures. Raina et al. [5] describe an approach to self-taught learning that uses sparse coding to construct high-level features from the unlabeled data. They first express each unlabeled data instance as a sparse weighted linear combination of basis vectors, emphasizing the L1 norm, and then use these features as input to standard supervised classification algorithms. In [9], a co-clustering-based classification algorithm called CoCC is proposed to classify out-of-domain documents. The class structure is passed through word clusters from the in-domain data to the out-of-domain data, and additional class-label information given by the in-domain data is extracted and used to label the word clusters for out-of-domain documents. However, one drawback of many previous works is that the knowledge shared between the source and target domains may be quite limited, so the relation between the two domains cannot be fully exploited.

2.2 Data Mining with Online Knowledge Repository

A major component of our approach is the use of online knowledge repositories as auxiliary information sources to help bridge the gap between the source domain and the target domain. Therefore, we review some of the latest approaches to data mining with online knowledge repositories.

In recent years, understanding and using online knowledge repositories to aid real-world data mining tasks has become a hot research topic. More and more works try to use Wikipedia for feature enrichment. Gabrilovich and Markovitch [14], [15] use the Open Directory Project (ODP) for feature enrichment in the text



classification problem. They also show that using Wikipedia as the external Web knowledge resource for feature enrichment performs better than using ODP [16]. Gabrilovich and Markovitch [18] explicitly represent the meaning of texts as a weighted vector of Wikipedia-based concepts. Their semantic analysis is explicit in the sense that it manipulates manifest concepts grounded in human cognition, rather than the latent concepts used by LSA. In [19], a general framework for building classifiers with hidden topics discovered from large-scale data collections is proposed. The framework is mainly based on latent topic analysis models like PLSA [20] and LDA [21] and machine learning methods like maximum entropy and SVMs. The underlying idea of the framework is that for each classification task, a very large external data collection, called a "universal data set," is gathered; a classification model is then built on both a small set of labeled training data and a rich set of hidden topics discovered from that data collection. So far, few approaches have used auxiliary knowledge such as online knowledge bases for transfer learning or domain adaptation. Wang et al. [22], [23] extend feature-based transfer learning models by incorporating a semantic kernel [17], [24] learned from Wikipedia. However, building a semantic kernel from the whole knowledge base is costly; a huge cost could be saved by considering only the "most useful" concepts to bridge the information gap. Moreover, instance-based transfer, especially our method, adds more interpretability to the transfer scheme: it is easy to study what kind of instances are useful for bridging the gap, in contrast to the more compact and abstract semantic kernel. In this paper, we propose to incorporate the background knowledge efficiently from the instance-based transfer perspective, which also establishes a connection between transfer learning problems and traditional semisupervised learning problems.
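To illustrate the flavor of mapping text into a Wikipedia-style concept space [18], the sketch below scores a document against a handful of invented "concept articles" using raw word overlap. The concepts, texts, and weighting are illustrative stand-ins for the TF-IDF-weighted concept vectors that explicit semantic analysis actually builds.

```python
from collections import Counter
import math

# Toy "concept space": each entry plays the role of a Wikipedia article.
# The concepts and texts are invented for illustration only.
concepts = {
    "Machine learning": "learning algorithm model training data classifier",
    "Finance": "stock market price investment bank money",
    "Geometry": "point line angle curve surface space",
}

def concept_vector(doc, concepts):
    """Represent `doc` as a weighted, L2-normalized vector over concepts.
    The weight for a concept is the word-overlap count between the
    document and the concept article (a crude stand-in for TF-IDF)."""
    doc_counts = Counter(doc.lower().split())
    vec = {}
    for name, article in concepts.items():
        article_words = set(article.split())
        vec[name] = sum(c for w, c in doc_counts.items() if w in article_words)
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {k: v / norm for k, v in vec.items()}

v = concept_vector("training a classifier on labeled data", concepts)
# The document lands almost entirely on the "Machine learning" concept.
```

Two documents from different domains can then be compared through their concept vectors, which is exactly the kind of background-knowledge similarity that the bridge in this paper exploits.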

Another line of work related to our research is by Zelikovitz and Hirsh [25], [26]. These works use some unlabeled data from a background knowledge document collection to enhance document similarities [10] or dimensionality reduction [26]. However, their target is to reduce data sparsity, while we focus on connecting the different domains in the problem of transfer learning.

2.3 Semisupervised Learning

Domain adaptation could also be viewed as transductive transfer learning if the source domain and the target domain had no information gap. In this case, the problem can be reduced to a semisupervised learning problem. However, when there is an information gap, how to exploit semisupervised learning is not clear. In this section, we first review some semisupervised learning research.

Semisupervised learning addresses the problem in which the labeled data are too few to build a good classifier; it makes use of a large amount of unlabeled data, together with a small amount of labeled data, to enhance the classifiers.

Many semisupervised learning algorithms have been developed in the past 10 years. Some notable models include co-training, transductive SVM, and graph-based regularization methods like mincut [27].

Co-training [28] assumes that the features can be split into two sets, each of which is sufficient to train a good classifier. Unlabeled data in co-training help to reduce the size of the version space. Transductive SVM [29] builds the connection between class distribution and decision boundary by putting the boundary in low-density regions. The goal is to find a labeling of the unlabeled data such that a linear boundary has the maximum margin on both the original labeled data and the unlabeled data. It can be viewed as an SVM with an additional regularization term on unlabeled data. Graph-based semisupervised methods define a graph whose nodes are labeled and unlabeled examples and whose edges reflect the similarities between examples. Blum and Chawla [27] view semisupervised learning as a graph mincut problem, where positive labels act as sources and negative labels act as sinks, and the mincut problem aims to find a minimum set of edges whose removal would block all flow from sources to sinks. Nodes connected to the sources are labeled positive and those connected to the sinks are labeled negative. In this work, we exploit the ability of semisupervised learning to aid the problem of domain adaptation.
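The graph-based family described above can be sketched with a few lines of iterative label propagation: labels spread from labeled nodes to unlabeled ones along similarity edges, with labeled nodes clamped. This is a soft cousin of the mincut formulation, not Blum and Chawla's mincut solver, and the graph below is a toy example.

```python
# Minimal graph-based semisupervised learning: iterative label propagation
# on a similarity graph. Labeled nodes are clamped; unlabeled nodes take
# the weighted average of their neighbors until the scores settle.

def propagate(edges, labels, n, iters=100):
    """edges: list of (i, j, weight); labels: {node: +1 or -1} for the
    labeled nodes; n: number of nodes. Returns a score per node."""
    adj = [[] for _ in range(n)]
    for i, j, w in edges:
        adj[i].append((j, w))
        adj[j].append((i, w))
    f = [labels.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        for i in range(n):
            if i in labels:          # clamp labeled nodes
                continue
            total = sum(w for _, w in adj[i])
            if total > 0:
                f[i] = sum(w * f[j] for j, w in adj[i]) / total
    return f

# A chain 0-1-2-3-4 with node 0 labeled +1 and node 4 labeled -1:
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0)]
scores = propagate(edges, {0: +1, 4: -1}, n=5)
# Scores converge to [1, 0.5, 0, -0.5, -1]: labels decay smoothly along
# the chain, and the implied boundary sits in the middle of the graph.
```

Note how the decision boundary ends up between nodes 1 and 3, the graph analogue of the low-density assumption: exactly the assumption that an information gap between domains violates.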

3 METHODOLOGY

3.1 Preliminaries

In this section, we provide several definitions to clarify our terminology and present our analysis of the problem of domain adaptation.

Definition 1 (Feature Space). A feature space is an abstract space where each data instance is represented as a point in an n-dimensional space. Its dimension is determined by the number of features used to describe the instances.

In the problem of text classification, the most commonly used feature space is the term space of the vector space model [30]. The term space is usually of very high dimension; for example, in the 20Newsgroups data set,1 the dimension of the vector space model is over 40,000. To address the high dimensionality, topic models have been introduced for text modeling [20], so that documents can be represented in a low-dimensional topic space. However, one disadvantage of the topic space is that the topics are hidden and it is not easy to obtain their semantic meanings. Therefore, the space of Wikipedia concepts has recently been introduced as a feature space for text modeling [18]. Due to the rich background knowledge in the Wikipedia concept space, even short texts can be accurately modeled [18].
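For readers less familiar with the term space, a minimal sketch of the vector space model follows: each document becomes a term-count vector over the shared vocabulary, so the dimension equals the vocabulary size. The two-document corpus is invented for illustration.

```python
# Minimal term-space representation in the spirit of the vector space
# model: one count vector per document over a shared vocabulary.

def term_space(docs):
    """Build the vocabulary and return one count vector per document."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for w in d.lower().split():
            v[index[w]] += 1
        vectors.append(v)
    return vocab, vectors

docs = ["the market fell", "the classifier learned the data"]
vocab, vecs = term_space(docs)
# Every vector has one entry per vocabulary word; on a real corpus such as
# 20Newsgroups the same construction yields tens of thousands of dimensions.
```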

Definition 2 (Domain). A domain D is a probability distribution P_D over the data instances in a feature space.

Let x represent the feature vector of one data instance in the feature space. When we say a data instance x is in the domain, x ∈ D, we mean that it is sampled from the distribution P_D. A data set in the domain D is a set of data instances sampled from P_D. We slightly overload the notation D to represent both the domain and a data set in the domain.


1. http://people.csail.mit.edu/jrennie/20Newsgroups/.


In traditional supervised learning problems, we are given a set of labeled instances from a specific domain as a training set. A learning machine is trained on the training set and applied to newly incoming instances from the same domain to obtain their labels. The condition that the training and test sets are drawn from the same domain guarantees the consistency and generalization ability of the learning machine [31]. However, in practice we may not be able to ensure that the training and test sets are from the same domain.

Definition 3 (Domain Adaptation). In this paper, we refer to domain adaptation as the following problem: we are given a set of labeled instances D_sl = {(x_i, y_i)}_{i=1}^{N} ∈ D_s × {±1} from a source domain, and we need to make predictions for some unlabeled data D_tu = {x_j}_{j=1}^{M} ∈ D_t from a target domain. The source domain and the target domain are different domains in the same feature space.

Such a problem setting for domain adaptation is quite common in a variety of applications, such as text categorization [1], image classification [5], sentiment analysis [12], localization [32], [33], activity recognition [34], and so on. Since all data instances in the source domain are labeled and all data instances in the target domain are unlabeled, we can drop the subscripts in D_sl and D_tu and use D_s and D_t without introducing any ambiguity.

The problem of domain adaptation is likely to be encountered in many real-world applications. For example, we may have trained a sentiment classifier for reviews on movies, but we want to use it to classify reviews from other domains such as books or music [12]. Another example: we may have trained a classifier to assign news articles to topical categories, but we want to use it on blogs as well. In these cases, we do not want to relabel the data in the new domains but hope to borrow the knowledge from the old domains.

When the differences between the source and target domains are large, a model trained on the source domain cannot generalize well to the target domain data [2], [9]. A natural approach is to consider transductive (or semisupervised) learning, since unlabeled data from the target domain are available. However, some previous works have found that even after introducing some unlabeled data in the target domain, transductive learning is still not sufficient to improve performance [1]. The reason may be that transductive or semisupervised learning generally assumes that the decision boundary lies in a low-density region of the feature space.2 When the distributions of the source and target domains are different, there may exist a low-density region between the domains, a gap that disconnects the same-class data in different domains. We refer to this gap as the information gap in domain adaptation. For example, Fig. 1 illustrates a case where the feature space is the Wikipedia concept space; we can see that there exist information gaps between the source and target domains.

To solve the problem of domain adaptation under large information gaps, an intuitive idea is to find the knowledge shared between the domains and ignore the differences. One instantiation of this idea is to make use of the abundant and potentially useful information sources that are readily available, and use them to connect the information separated by the gap. This intuition motivates us to consider a different way of solving the domain adaptation problem, i.e., through finding an information bridge.

3.2 Margin as Information Gap

An intuitive way to understand the concept of information gap is to consider the separability of the source and target domains. Consider the simplest case, where we want to transfer knowledge from a single source domain to a target domain. Intuitively, the difficulty of separating these domains shows how large the information gap between them is. If the two domains can be easily separated, then there exists a large information gap between them, which may prevent us from adapting the model learned on the source domain to the target domain. On the contrary, if the two domains cannot be separated from each other easily, then the information gap is small, in which case we can treat the two domains as data sampled from a single underlying distribution. In other words, the original "domain adaptation problem" is transformed into a classification problem under a supervised or semisupervised (transductive) setting. A similar idea is used in [6], where a classifier is trained to distinguish the source and target domains and the classification error is used as an empirical estimate of the domain distance. Although this idea is useful, it does not consider the existence of auxiliary information sources that can be used to bridge the two domains.

Following the above intuition, we use G(D_s, D_t; K) to denote the information gap between source domain D_s and target domain D_t when a knowledge base K is available. In the case where no other knowledge is obtainable besides the two domains, we use the notation G(D_s, D_t). Previous works on domain adaptation focused only on modeling G(D_s, D_t) and ignored K. In contrast, we model G(D_s, D_t; K) directly in this work.

We first consider the case where K is not available. Since the concept of margin in SVMs can be used to measure the separability of two classes in classification problems, we can define one form of the information gap between two domains, G(D_s, D_t), as the margin between the source domain and the target domain when treating them as two classes.


Fig. 1. Measuring similarities between documents based on the knowledge base.

2. Some semisupervised learning algorithms have slightly different assumptions.


Definition 4 (Information Gap with No Background Knowledge Available). The information gap between D_s and D_t without background knowledge is given by

    G(D_s, D_t) = 2 / ||w||,                                            (1)

    w = arg min ||w||^2  s.t.  y_i (w^T x_i − b) ≥ 1 for all i,         (2)

where x_i ∈ D_s ∪ D_t, y_i = 1 if x_i ∈ D_s and y_i = −1 if x_i ∈ D_t.
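In one dimension, the hard-margin problem in (1)-(2) has a closed form: when the two domains are separable on the line, the maximal margin 2/||w|| is simply the distance between the closest pair of points from opposite domains. The sketch below computes G(D_s, D_t) under that 1D assumption, with invented toy projections.

```python
# Definition 4 in one dimension: for separable domains on a line, the
# maximal SVM margin 2/||w|| equals the width of the empty interval
# between the closest opposite-domain points.

def information_gap_1d(source, target):
    """G(D_s, D_t) for 1D domains; 0.0 if the domains overlap."""
    if max(source) < min(target):
        return min(target) - max(source)
    if max(target) < min(source):
        return min(source) - max(target)
    return 0.0  # overlapping domains: no separating margin

source = [0.0, 0.4, 1.0]   # e.g., financial documents projected to 1D
target = [3.0, 3.5, 4.2]   # e.g., IT documents projected to 1D
gap = information_gap_1d(source, target)
# gap == 2.0: the domains are easy to separate, i.e., the information
# gap is large and direct transfer is likely to fail.
```

A large value of G means the domains are trivially separable, which by the argument above is precisely the bad case for transfer; a value near zero means the domains are effectively one distribution.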

Given a knowledge base in the form of a set of auxiliary data K = {x'_i} from other domains, the margin between the two domains depends not only on the data from the source and target domains, which are treated as labeled data, but also on the auxiliary data K = {x'_i}, which are treated as unlabeled data. The information gap can then be defined as the maximum margin under the transductive learning setting.

Definition 5 (Information Gap with Background Knowledge). The information gap between D_s and D_t with background knowledge K is given by

    G(D_s, D_t; K) = 2 / ||w||,                                         (3)

where

    w = arg min ||w||^2  s.t.  y_i (w^T x_i − b) ≥ 1,  |w^T x'_i − b| ≥ 1 for all i.  (4)

Given the above definition, reducing the information gap can be expressed as selecting the set of unlabeled data {x'_i} from K so as to minimize the margin. Therefore, it can be formulated as

max_{{x'_i} ⊆ K}  [ min_w ||w||^2 ]
subject to   y_i (w^T x_i - b) >= 1,  |w^T x'_i - b| >= 1 for all i.   (5)

We can also extend the problem by adding slack variables for inseparable problems, where the optimal boundary is obtained by

w = arg min_f [ ||w||^2 + C ( Σ_{i=1}^{l} (1 - y_i f(x_i))_+ + Σ_{i=l+1}^{n} (1 - |f(x'_i)|)_+ ) ],   (6)

where (z)_+ = max(z, 0). However, the above max-min problem is difficult to solve directly: finding the optimal solution is NP-hard and thus impractical. In the next section, we propose an algorithm that optimizes (5) in a greedy manner.
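The slack-variable objective of (6) is cheap to evaluate for a given boundary, which is what the greedy algorithm below repeatedly exploits. The sketch evaluates it in numpy, with f(x) = w^T x - b and a single C weighting both hinge sums (one common variant; implementations often weight the labeled and unlabeled terms separately). The function name and data are illustrative.

```python
import numpy as np

def tsvm_objective(w, b, X_lab, y_lab, X_unl, C=1.0):
    """Evaluate the slack-variable objective of Eq. (6):
    ||w||^2 + C * (sum of labeled hinges + sum of unlabeled hinges),
    with f(x) = w^T x - b and (z)_+ = max(z, 0)."""
    f_lab = X_lab @ w - b
    f_unl = X_unl @ w - b
    lab_loss = np.maximum(0.0, 1.0 - y_lab * f_lab).sum()  # labeled hinge terms
    unl_loss = np.maximum(0.0, 1.0 - np.abs(f_unl)).sum()  # unlabeled hinge terms
    return float(w @ w + C * (lab_loss + unl_loss))

w, b = np.array([1.0, 0.0]), 0.0
X_lab = np.array([[2.0, 0.0], [-2.0, 0.0]])  # both safely outside the margin
y_lab = np.array([1.0, -1.0])
X_unl = np.array([[0.0, 0.0]])               # sits exactly on the boundary
obj = tsvm_objective(w, b, X_lab, y_lab, X_unl)
```

Here the labeled hinges vanish, while the unlabeled point on the boundary pays the full penalty of 1, so the objective is ||w||^2 + C = 2; moving the unlabeled point outside the margin would drop the objective to 1.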

3.3 BIG: A Min-Margin Algorithm to Reduce Information Gap

Algorithm 1 shows our proposed solution. Its inputs include the source domain data set D_s, the target domain data set D_t, and the auxiliary domain data K. The output of the algorithm consists of some unlabeled data that are chosen so that we can apply semisupervised learning algorithms for training a classifier. These unlabeled data carry important information about the distribution between the two domains.

Algorithm 1. BIG: A Greedy Algorithm for Min-Margin Domain Adaptation

Input: Source domain data set D_s, target domain data set D_t, a knowledge base K, terminating threshold t, relevance threshold δ
Output: A subset D from K
Initialize: Set D = ∅.
Preprocess: Set D' = CandSelect(D_s, D_t, K, δ).
Train an SVM model using (x_i, y_i), where x_i ∈ D_s ∪ D_t, y_i = 1 if x_i ∈ D_s and y_i = -1 if x_i ∈ D_t.
while D' ≠ ∅ do
    for j = 1 to k do
        Select x_i from D' satisfying |w^T x_i - b| < 1 by
            x_i = arg min_{x_i} |w^T x_i - b|.
        Let D = D ∪ {x_i}, D' = D' \ {x_i}.
    end for
    w_old = w
    Train a TSVM model using (x_i, y_i) and D.
    ΔG = (2/||w|| - 2/||w_old||) / (2/||w_old||)
    if |ΔG| <= t then
        Output D and Exit.
    end if
end while
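The inner selection step of Algorithm 1 can be sketched in a few lines: among the auxiliary candidates that fall strictly inside the current margin, take the k closest to the decision boundary. The function name, weight vector, and candidate data below are hypothetical; the TSVM retraining step is assumed to be provided elsewhere.

```python
import numpy as np

def select_in_margin(w, b, candidates, k):
    """One greedy step of Algorithm 1 (a sketch): return the indices of
    the k candidates that violate the margin (|w^T x - b| < 1) and are
    closest to the decision boundary, in increasing order of distance."""
    d = np.abs(candidates @ w - b)        # |w^T x - b| for every candidate
    inside = np.flatnonzero(d < 1.0)      # keep only margin violators
    return inside[np.argsort(d[inside])][:k]

w, b = np.array([1.0, 0.0]), 0.0
cands = np.array([[0.1, 0.0],    # inside margin, closest to boundary
                  [0.5, 0.0],    # inside margin
                  [2.0, 0.0],    # outside margin: never selected
                  [-0.3, 0.0]])  # inside margin
picked = select_in_margin(w, b, cands, k=2)
```

With k = 2 this picks candidates 0 and 3 (distances 0.1 and 0.3); candidate 2 lies outside the margin, so by Lemma 2 adding it could not shrink the margin and it is skipped.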

A comprehensive example is illustrated in Fig. 2, where three domains are shown. Domain A is the source domain, domain B is the target domain, and domain C is an auxiliary domain used to bridge domains A and B. Without any data from domain C, the problem cannot be solved directly using a transductive SVM method, since the information gap is large. In applying the algorithm, some data instances from the auxiliary domain C are selected into the unlabeled data set, which reduces the information gap between the two domains. Finally, when the algorithm converges, some data have been selected from domain C. These data can be used to fill the gap between domains A and B.

3.4 Algorithm Convergence

In this section, we show that the BIG algorithm is guaranteed to converge. We first prove several properties of the margin-based information gap. Given Definitions 4 and 5, we show the following properties of the information gap.

Lemma 1 (Nonincreasing Margin Lemma). Given the definition of information gap, we have

G(D_s, D_t) >= G(D_s, D_t, K).

Furthermore, if we have K_1 ⊆ K_2, then

G(D_s, D_t, K_1) >= G(D_s, D_t, K_2).

Proof. We can show that if the conclusion does not hold, which means G(D_s, D_t) < G(D_s, D_t, K), then G(D_s, D_t) is not the maximum margin. This conflicts with the definition of G(D_s, D_t). □

774 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 6, JUNE 2010

Authorized licensed use limited to: Gnanamani College of Technology. Downloaded on June 12,2010 at 18:23:20 UTC from IEEE Xplore. Restrictions apply.

The above lemma indicates that the information gap between two domains is nonincreasing as more and more knowledge becomes available. This property is consistent with our intuition. We can further obtain a stronger conclusion, as shown in the following lemma.

Lemma 2 (Decreasing Margin Lemma). We have

G(D_s, D_t) > G(D_s, D_t, K)

if there exists x_i ∈ K satisfying |w^T x_i - b| < 1. Furthermore, if we have K_1 ⊆ K_2, then

G(D_s, D_t, K_1) > G(D_s, D_t, K_2)

if there exists x_i ∈ K_2 with x_i ∉ K_1 satisfying |w^T x_i - b| < 1.

Proof. We prove the contrapositive of this proposition. Suppose we have

G(D_s, D_t, K_1) <= G(D_s, D_t, K_2).

Let w_1 and w_2 be the separating hyperplanes for G(D_s, D_t, K_1) and G(D_s, D_t, K_2), respectively. It is easy to verify that w_2 is also a separating hyperplane with margin G(D_s, D_t, K_2) for K_1. Therefore, no x_i exists that satisfies x_i ∉ K_1 and x_i ∈ K_2 but |w^T x_i - b| < 1. □

This lemma shows that selecting the data that fall inside the margin can reduce the size of the margin. Algorithm BIG is motivated by this lemma. We can further guarantee its convergence, as shown in the following theorem:

Theorem 1 (Convergence Theorem). The iteration process in Algorithm 1 always increases the objective in (5) until convergence.

The proof of the above theorem can be easily obtained from Lemma 2.

3.5 Selecting Candidates from the Auxiliary Knowledge Base

In this section, we make our algorithm more concrete by considering the data as text documents.

Based on the algorithm proposed in the previous section, we are theoretically able to directly retrieve unlabeled documents from the knowledge base to bridge domains. In practice, however, filling in the information gap requires a large auxiliary knowledge base for better coverage, and it is very costly to run the TSVM algorithm to search for bridging documents over such a large knowledge base. Therefore, it is necessary to first identify some possibly related subsets as candidates. For example, in Fig. 2, we are only interested in retrieving those relevant unlabeled documents which are "helpful" for connecting a domain A with another domain B. An intuitive way to estimate this helpfulness is to check whether an unlabeled document stays "close" to both of the domains. This can be done by calculating the similarity between these documents. The most intuitive way to identify the related domain data is to measure the similarity between the domain documents and each article in the knowledge base [18].

However, such a method is computationally expensive, since it requires pairing up these documents with each of the more than one million articles in the knowledge base. To address this problem, we introduce a novel method that unifies the idea of topic models [21] with the knowledge-base document space projection [18]. To this end, we apply Latent Dirichlet Allocation (LDA) [21] to construct a topic space over the knowledge base documents. LDA is a generative graphical model which can be used to model and discover the underlying topic structures of any discrete data such as text. In LDA, a document d_m = {w_{m,n}}_{n=1}^{N_m} is generated by first picking a distribution over topics θ_m from a Dirichlet distribution Dir(α), which determines the topic assignments of the document d_m. Then, the topic assignment for each word placeholder [m, n] is performed by sampling a particular topic z_{m,n} from the multinomial distribution Mult(θ_m). Finally, a particular word w_{m,n} is generated for the word placeholder [m, n] by sampling from the multinomial distribution Mult(φ_{z_{m,n}}). In order to estimate the parameters α and β for LDA, we need to maximize the likelihood of the whole data collection, i.e., the entire knowledge base K:

p(K | α, β) = Π_{d_m ∈ K} p(d_m | α, β)
            = ∫∫ p(φ | β) Π_{n=1}^{N_m} p(w_{m,n} | φ_{z_{m,n}}) p(z_{m,n} | θ_m) p(θ_m | α) dφ dθ_m.   (7)


Fig. 2. An illustration of the BIG algorithm. Domain A is the source domain; domain B is the target domain. After adding the new auxiliary data during each iteration, the information gap between domains A and B is reduced. After reducing the information gap, TSVM can successfully solve the domain adaptation problem.


However, the two integrations in (7) are intractable. We use Gibbs sampling for approximation [35] here (due to space limitations, the details of the Gibbs sampling algorithm for LDA are omitted). After learning the parameters α and β, we can obtain the topic association of each word in the documents in our knowledge base. Thus, we are able to obtain the document-topic association and the word-topic association, represented as p(z_j | d_i) and p(w | z_j), respectively. In effect, LDA summarizes the documents {d_i}_{i=1}^{N} in the whole knowledge base into a few latent topics {z_j}_{j=1}^{k}. It can be viewed as a soft clustering over the documents in the knowledge base, where each hidden topic z_j can be viewed as a cluster of documents. We can also obtain the conditional probability of generating document d_i from a topic z_j with Bayes' rule:

p(d_i | z_j) = p(z_j | d_i) p(d_i) / Σ_{i=1}^{N} p(z_j | d_i) p(d_i),   (8)

where p(z_j | d_i) is the topic proportion provided by the topic model. We assume a uniform prior distribution for p(d_i). Using the word-topic association p(w | z_j), we can also infer the hidden topics of a newly incoming document d':

p(z_j | d') = p(d' | z_j) p(z_j) / Σ_{j=1}^{k} p(d' | z_j) p(z_j) = p(z_j) Π_{w ∈ d'} p(w | z_j) / Z,   (9)

where p(z_j) is the prior of the hidden topic z_j, and Z is the normalization factor. We can then obtain the similarity between a new document d' and a document d_i in the knowledge base:

p(d_i | d') = Σ_{j=1}^{k} p(d_i | z_j) p(z_j | d').   (10)
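Equations (8) and (10) are simple matrix operations once the topic model has been fitted. The sketch below inverts the document-topic matrix with Bayes' rule under the uniform prior p(d_i) (which then cancels) and scores a new document against the whole knowledge base; the matrices here are small hypothetical values, not learned LDA output.

```python
import numpy as np

def doc_given_topic(topic_given_doc):
    """Eq. (8) with a uniform prior p(d_i): the prior cancels, so
    p(d_i | z_j) is just p(z_j | d_i) normalized over documents.
    Rows index documents, columns index topics."""
    return topic_given_doc / topic_given_doc.sum(axis=0)

def kb_similarity(doc_given_z, topic_given_new):
    """Eq. (10): p(d_i | d') = sum_j p(d_i | z_j) p(z_j | d')."""
    return doc_given_z @ topic_given_new

# Hypothetical p(z_j | d_i) for three knowledge-base documents, two topics.
T = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
D = doc_given_topic(T)                        # p(d_i | z_j) via Eq. (8)
sim = kb_similarity(D, np.array([1.0, 0.0]))  # new document fully on topic 0
```

The resulting scores form a proper distribution over the knowledge-base documents: the document devoted to topic 0 gets 2/3, the mixed one gets 1/3, and the off-topic one gets 0.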

Consider an example illustrating how to calculate the similarity between a Wikipedia article and a newly incoming document. Suppose the latent topics refer to the following four aspects: <"Wearing", "Outdoor Sports", "Game Hardware", "Maths">. Consider one article with the title "Xbox NBA Live 2008," which refers to a video game. During the learning process of the topic model, each word in this article is assigned to one of the four latent topics. After normalization, the topic distribution of this document can be represented by the vector <0.3, 0.5, 0.7, 0.2>, which denotes the strength of the soft association between this article and the four latent topics. Similarly, when a newly incoming document talking about "Jordan Shoes" arrives, using (9), it can also be represented in the latent topic space as <0.8, 0.6, 0.1, 0.1>. Finally, we can calculate the similarity between "Xbox NBA Live 2008" and "Jordan Shoes" in the latent topic space using (10): 0.3 × 0.8 + 0.5 × 0.6 + 0.7 × 0.1 + 0.2 × 0.1 = 0.63.
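The arithmetic of the example above reduces to an inner product in the k-dimensional topic space, which is what makes the projection cheap. The topic vectors below are the illustrative values from the example, not output of an actual topic model.

```python
import numpy as np

# Topic-strength vectors over the four hypothetical latent topics
# <"Wearing", "Outdoor Sports", "Game Hardware", "Maths">.
xbox_nba = np.array([0.3, 0.5, 0.7, 0.2])  # article "Xbox NBA Live 2008"
jordan   = np.array([0.8, 0.6, 0.1, 0.1])  # new document "Jordan Shoes"

# The similarity of Eq. (10) becomes a k-dimensional inner product,
# far cheaper than comparing the documents in the full term space.
similarity = float(xbox_nba @ jordan)
```

Evaluating the inner product gives 0.24 + 0.30 + 0.07 + 0.02 = 0.63, matching the example; a term-space comparison would instead iterate over vocabularies of tens of thousands of dimensions.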

Our method differs from the method used in [18] in that the inner product is calculated in the topic space instead of the term space. Therefore, the computational cost of document projection can be greatly reduced. Finally, in order to evaluate the relatedness of a document to a domain, we aggregate the conditional distributions over all the data in the same domain, D_t or D_s:

p(d_i | D_s) = (1/Z) Σ_{x_j^s ∈ D_s} p(d_i | x_j^s),   (11)

and we form the related domain data set C by collecting the top M ranked candidates that are related to both D_t and D_s according to their relevance score:

Score(d_i | D_s, D_t) = p(d_i | D_s) p(d_i | D_t).   (12)

We can now select the most relevant documents based on their scores. Algorithm 2 shows how to preprocess the entire knowledge base to obtain a set of candidate documents.

Algorithm 2. CandSelect: Algorithm for Candidate Selection

Input: Source domain data set D_s, target domain data set D_t, entire knowledge base K, relevance threshold δ
Output: A candidate set D' from K
Initialize: Set D' = ∅.
for each d_i in K do
    if Score(d_i | D_s, D_t) >= δ then
        Let D' = D' ∪ {d_i}.
    end if
end for
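Algorithm 2 is a single vectorized pass once the per-domain relatednesses of (11) have been computed. The sketch below assumes those relatednesses are given; the function name, the threshold value, and the five scores are hypothetical.

```python
import numpy as np

def cand_select(p_src, p_tgt, threshold):
    """Sketch of Algorithm 2 (CandSelect). p_src[i] = p(d_i | Ds) and
    p_tgt[i] = p(d_i | Dt) are the aggregated relatednesses of Eq. (11);
    Eq. (12) scores each article by their product, and articles scoring
    at least `threshold` are kept as candidates."""
    scores = np.asarray(p_src) * np.asarray(p_tgt)  # Eq. (12)
    return np.flatnonzero(scores >= threshold)

# Hypothetical relatedness values for five knowledge-base articles.
p_src = [0.90, 0.10, 0.60, 0.05, 0.70]
p_tgt = [0.80, 0.90, 0.50, 0.90, 0.10]
keep = cand_select(p_src, p_tgt, threshold=0.25)
```

Articles 0 and 2 survive (scores 0.72 and 0.30): the product form of (12) demands closeness to both domains, so an article related to only one domain (such as article 4, score 0.07) is filtered out.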

4 EXPERIMENTS

In this section, we demonstrate the effectiveness of our semisupervised learning approach on several real-world domain adaptation tasks. To examine our approach in more detail, we analyze the basic steps of our algorithm to illustrate that we can successfully discover the information gap. Furthermore, we demonstrate that our proposed algorithm can reduce the information gap within a small number of iterations and converge efficiently to a local optimum.

4.1 Tasks and Data Sets

We conduct our experiments with three different domain adaptation tasks. The first task is cross-domain text classification on the 20 Newsgroups data set. The second task is a domain adaptation problem in sentiment classification, and the third is a cross-domain short text classification task.

4.1.1 20 Newsgroups

The 20 Newsgroups data set3 [36] is a text collection of approximately 20,000 newsgroup documents partitioned nearly evenly across 20 different newsgroups. Since we compare our algorithm's performance with that of [9], we follow exactly the same experimental settings as given in [9]. We create six different data sets for evaluating cross-domain classification algorithms; for each data set, two top categories are chosen, one as positive and the other as negative. Data are split based on subcategories, and different subcategories are considered as different domains. The six data sets (shown in Table 1) are comp versus sci, rec versus talk, rec versus sci, sci versus talk, comp versus rec, and comp versus talk. Detailed descriptions of these data sets


3. http://people.csail.mit.edu/jrennie/20Newsgroups/.


can also be found in [9]. For preprocessing, all letters are converted to lowercase and words are stemmed using the Porter stemmer. Stop words are also removed, and document frequency (DF) thresholding is used to cut down the number of features.

4.1.2 Sentiment Reviews

The data for sentiment domain adaptation [12] consist of Amazon product reviews for four different product types: books, DVDs, electronics, and kitchen appliances.4 Each review consists of a rating with a score ranging from 0 to 5, a reviewer name and location, a product name, a review title and date, and the review text. Reviews with ratings higher than three are labeled as positive, and reviews with ratings lower than three are labeled as negative; the rest are discarded, since the polarity of these reviews is ambiguous. The details of the data in the different domains are summarized in Table 2. These experimental settings are the same as those of [12]. To study the performance of our approach on this task, we construct 12 pairs of cross-domain sentiment classification tasks; e.g., we use the reviews from domain A as training data and then predict the sentiment of the reviews in domain B.

4.1.3 Query Classification

We also construct a set of tasks on cross-domain query classification for a search engine. We use a set of search snippets crawled from Google as our training data and some incoming unlabeled queries as the test data. A detailed description of the process can be found in [19]. We use the labeled queries from AOL provided by [37] (shown in Table 3) for evaluation.5 We consider queries from five classes: Business, Computer, Entertainment, Health, and Sports, which appear in both the training and test data sets. We form 10 binary classification tasks for query classification, and queries are enriched according to [38], [39], where we gather the top 50 search snippets for the query enrichment.

4.2 Evaluation Metrics

To compare the performance of the classification methods, we use classification accuracy, which is defined as the percentage of correct predictions among all test examples. In order to validate the robustness of our method, each time we randomly sample 90 percent of the training data for training the model. Each experiment is repeated ten times, and both the mean and variance of the accuracies are reported.
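The evaluation protocol above is straightforward to reproduce: an accuracy function plus mean and variance over the repeated 90 percent subsamples. The ten accuracy values below are hypothetical placeholders, not results from the paper.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Classification accuracy: the fraction of correct predictions
    among all test examples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

# Hypothetical accuracies from ten repeated 90 percent training subsamples;
# the protocol reports their mean and variance.
runs = [0.82, 0.85, 0.84, 0.83, 0.86, 0.84, 0.83, 0.85, 0.82, 0.86]
mean, var = float(np.mean(runs)), float(np.var(runs))
```

Reporting the variance alongside the mean is what makes the later significance tests (Student's t-test between a model and the best baseline) possible.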

For projecting the domain data onto the latent topic space shared with the documents in the knowledge base, we train a topic model on the knowledge base documents using the GibbsLDA++6 package. In [19], a similar idea of using a topic model learned from the knowledge base to reduce the data sparseness problem was adopted, and the effect of different parameter settings for learning a topic model from Wikipedia was also investigated. Following their conclusion, we set the topic dimension to 100 and use the default hyperparameter settings, that is, α = 50/K (where K is the number of topics) and β = 0.1. We follow these empirical parameters from [19] to simplify our parameter settings.

4.3 Experimental Results

In this section, we empirically answer the following two questions: 1) Can our min-margin-based semisupervised learning approach outperform traditional transfer learning approaches on the domain adaptation tasks? 2) Can our algorithm automatically identify the most important documents connecting the source domain to the target domain?

4.3.1 Question 1: Comparison with Traditional Transfer Learning Methods

To answer the first question, we compare our algorithm's performance against several baseline algorithms presented in previous transfer learning papers (shown in Tables 4, 5, and 6). To evaluate the effectiveness of our approach, we adopt three classification models for domain adaptation: the first model is the Support Vector Machine (SVM), which is usually used for supervised learning; the second is the Transductive SVM (TSVM) [29], which is a semisupervised learning model; and the last is Co-Cluster-based Classification (CoCC) [9], which is a transfer learning model designed for cross-domain classification. For all three baseline models, we only use the labeled documents in the source domain and unlabeled documents in the target domain for training, and we evaluate the models on the test documents in the target domain. Since our BIG algorithm is based on TSVM, in order to obtain the optimal


TABLE 1. Data Description—20 Newsgroups

TABLE 2. Data Description—Amazon Product Reviews

TABLE 3. Data Description—AOL Queries

4. http://www.seas.upenn.edu/mdredze/data-sets/sentiment/.
5. http://gregsadetsky.com/aol-data/.
6. http://gibbslda.sourceforge.net/.


parameter settings for SVM, TSVM, and BIG, we first tune the parameter C to achieve the optimal accuracy via 10-fold cross validation on each data set individually, and we report the optimal results of the different models on the different data sets in the following tables.

We first compare the experimental results on the 20 Newsgroups data, which are shown in Table 4.7 Each row in Table 4 represents a different cross-domain text classification task: comp versus sci, rec versus talk, rec versus sci, sci versus talk, comp versus rec, and comp versus talk, respectively. Each column gives the mean and variance for a different model. We use bold face to mark the best classifier for each data set. We also add a column on the right to denote our improvement over the best baseline result. For this and all subsequent experiments, Student's t-test was applied to assess statistical significance with respect to the best baseline method. In the improvement column of the tables, ** indicates significance with p < 0.01 and * indicates significance with p < 0.05; unmarked results indicate no significant difference.

Comparing the results of SVM and TSVM, we find that TSVM boosts the performance of SVM almost all the time. When we involve the top 500 important Wikipedia


TABLE 4. Accuracy on 20NG Data

TABLE 5. Accuracy on Sentiment Data

TABLE 6. Accuracy on Web Query Data

7. For all the tables, the best model for each data set is marked in bold face. We also measure the statistical significance of the improvement using Student's t-test. * and ** mean that the improvement hypothesis is accepted at the p = 0.05 and p = 0.01 significance levels, respectively.


articles as unlabeled data to fill in the information gap, where the articles are selected by BIG, we are able to boost the performance of TSVM even further. Furthermore, on all six tasks, our algorithm outperforms the CoCC algorithm.

Table 5 shows the experimental results on the sentiment classification tasks. Each row in the table corresponds to a different cross-domain review sentiment classification task, where D, B, E, and K correspond to reviews on DVDs, books, electronics, and kitchen appliances, respectively. A → B means we use reviews in domain A as training data and then use reviews in domain B as test data. Polarity classification is quite a different problem from traditional genre categorization. We find that, unlike on 20NG, when the information gap between the domains becomes larger, TSVM does not help much. This might be because simply pooling the data across different domains is not reasonable. Our BIG algorithm, after filling in the information gap, boosts the performance of TSVM. For almost all tasks, our BIG algorithm outperforms the TSVM method, again suggesting that we are adding valuable information to the unlabeled data set.

In Table 6, we show our results on the cross-domain query classification tasks. The results are similar to those of the previous tasks: BIG outperforms TSVM most of the time, while SVM and CoCC perform at the same level as TSVM.

4.3.2 Question 2: Comparison with Random Selection Methods

To answer the second question, we compare our BIG algorithm with some Random Selection baselines. We adopt two random selection methods for comparison. The "Random 1" method randomly selects 500 data instances from the entire knowledge base; the "Random 2" method first uses the CandSelect algorithm described in Section 3 to generate a candidate set, and then randomly selects 500 instances from the candidate set. Once we have selected the 500 instances through either Random method, they are added to the training set as unlabeled data. Then we apply the TSVM model for semisupervised learning. From Tables 4, 5, and 6, we can observe that the performance of our approach is much better than that of both Random baselines, indicating that we really found the most important nodes in the domain drift process, through which the distribution drifts from the source domain to the target domain. We can also observe that "Random 2" is better than "Random 1," since the topic-model-based CandSelect algorithm can filter out a huge amount of irrelevant data from the knowledge base.

All the experiments on the different tasks are consistent and validate the effectiveness of our proposed algorithm. In the following, we investigate how our proposed method finds where the bridging documents are located in the knowledge base, and verify that our BIG algorithm converges and that its performance is stable.

4.4 Information Gap

In this section, we further analyze the effectiveness of our algorithm by showing directly what "related domains" we have actually found in our experiments. Here, we provide the related domains, as the top-ranked Wikipedia concepts extracted by BIG, from the sentiment domain adaptation task. Detailed results are shown in Table 7.

From our results, we can see that the top-ranked Wikipedia concepts contain rich content that can be used to fill in the information gap between the two domains. Take the cross-domain text classification task "D → B" as an example: the top-ranked Wikipedia concepts the algorithm extracts include "James Bond" and "Batman," which are good examples of concepts that carry knowledge about both the DVD domain and the book domain. Therefore, the detailed results of the extracted related concepts validate, from another perspective, the effectiveness of our algorithm in filling in the information gap.

4.5 Convergence and Stability

We first demonstrate that our algorithm can reduce the information gap between domains while including the unlabeled data from the related domains. We randomly sample three tasks from each of the three data sets and display their performance (shown in Fig. 4), together with their corresponding margin sizes (shown in Fig. 3). In each iteration, we include the top 100 unlabeled data points that are closest to the decision boundary for TSVM. The x-axis is the iteration count. We find that our algorithm is able to reduce the information gap and converge quickly. We also provide another group of figures to demonstrate the tradeoff between CPU time and accuracy (shown in Fig. 5).

Since BIG is based on the semisupervised learning model TSVM, a very sensitive parameter is C, the error tolerance factor. The performance of BIG might differ under different settings of C. Another parameter in our algorithm is the margin threshold t. Here, we discuss how to set the two parameters together. Since our margin is closely related to the error tolerance factor C, we also investigate how stably the algorithm performs under different settings of C.

A very interesting observation is that our optimal margin threshold is relatively stable across different data sets. Fig. 6 shows the optimal threshold under different C values. This makes our method robust and reliable when applied to different tasks. Thus, for different domain adaptation tasks, we first use 10-fold cross validation on traditional TSVM to determine the optimal parameter C. Then, based on the parameter C, we use 10-fold cross validation again to seek the optimal threshold t for halting the BIG algorithm.

We also investigate the impact of the scope of the related domains. We randomly choose one task from each of the


TABLE 7. Related Concepts in the Sentiment Reviews Task



Fig. 4. The change of accuracy during iterations. We observe that the performance converges within a few iterations.

Fig. 3. The change of the information gap (margin) during iterations. We can observe that the information gap is reduced during the iterations. (a) shows the results on 20 Newsgroups; (b) shows the results on sentiment; (c) shows the results on the AOL query set. The positions of the data sets are the same in the following figures.

Fig. 5. The trade-off between accuracy and CPU time.

Fig. 6. The relation between optimal threshold and parameter C.

Fig. 7. The relation between the number of auxiliary data and the best performance.


three data sets. We vary the size of the candidate related document set from 4,000 to 10,000. We observe that the performance is quite stable (shown in Fig. 7). This may be due to the fact that we use the topic model to retrieve the data from the related domains as candidates. The data selection for filling in the information gap is performed by BIG, which means we can retrieve as much relevant data from the related domains as possible, since the computational cost is low. The BIG algorithm is able to discover the gap and choose the useful data automatically.

4.6 Choice of Different Auxiliary Knowledge Bases

Since our algorithm does not depend on any specific structure of the knowledge base, we are free to choose any common knowledge repository available online, such as Wikipedia8 or ODP.9 In this paper, we incorporate Wikipedia as our external data source. Wikipedia is currently the largest knowledge repository on the Web, and the quality of its article content is remarkable due to its open editing strategy. In our experiments, we use the snapshot of Wikipedia on 30 Nov. 2006. In total, there are 1,614,132 articles and 34,172,627 hyperlinks between them. We preprocess the articles by stemming and stop-word removal. We also filter out short articles with length less than 500; 53,803 articles finally remain as our knowledge base. For ODP, we download its snapshot on


TABLE 8. Different Knowledge Bases on 20NG Data

TABLE 9. Different Knowledge Bases on Sentiment Data

TABLE 10. Different Knowledge Bases on Web Query Data

8. http://en.wikipedia.org.
9. http://www.dmoz.org/.


18 Oct. 2006, which covers 1,733,500 pages. We also filter out short articles with length less than 1,000; 59,986 articles finally remain as our knowledge base.

From Tables 8, 9, and 10, we can observe that our BIG algorithm can always boost the performance of TSVM. We also compare our BIG algorithm with the Random baselines, and the results suggest that we are selecting useful unlabeled data in the semisupervised learning process. This further demonstrates that it is necessary to design an algorithm that carefully "picks out" the bridging documents.

5 CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a novel framework for tackling the problem of domain adaptation under large information gaps. We model the learning problem as a semisupervised learning problem, aided by a method for filling in the information gap between the source and target domains with the help of an auxiliary knowledge base (such as Wikipedia). By conducting experiments on several difficult domain adaptation tasks, we show that our algorithm can significantly outperform several existing domain adaptation approaches in situations where the source and target domains are far from each other. In each case, an auxiliary domain can be used to fill in the information gap efficiently.

We make three major contributions in this paper. 1) Instead of the traditional instance-based or feature-based perspective on the problem of domain adaptation, we view the problem from a new perspective; i.e., we consider the problem of transfer learning as one of filling in the information gap based on a large document corpus. We show that we can obtain useful information to bridge the source and target domains from auxiliary data sources. 2) Instead of devising new models for tackling domain adaptation problems, we show that we can successfully bridge the source and target domains using well-developed semisupervised learning algorithms. 3) We propose a min-margin algorithm that can effectively identify and reduce the information gap between two domains.

We plan to continue our research in this direction by pursuing several avenues. First, we plan to validate our approach with other semisupervised learning algorithms and other relational knowledge bases, to demonstrate its effectiveness more extensively. Second, in this paper, we only investigate the case where the source, target, and auxiliary data sources share the same feature space. We plan to extend our approach to consider heterogeneous transfer learning [40]. Finally, since our current approach is an iterative model based on TSVM, which is quite slow for large learning tasks, we will try to develop online TSVM methods for incremental cross-domain transductive learning.

ACKNOWLEDGMENTS

The authors thank the support from Hong Kong CERG grant 621307 and a grant from NEC China Lab.

REFERENCES

[1] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, “Boosting for Transfer Learning,” Proc. 24th Ann. Int’l Conf. Machine Learning (ICML ’07), pp. 193-200, June 2007.

[2] J. Jiang and C. Zhai, “Instance Weighting for Domain Adaptation in NLP,” Proc. 45th Ann. Meeting of the Assoc. for Computational Linguistics (ACL ’07), June 2007.

[3] G.-R. Xue, W. Dai, Q. Yang, and Y. Yu, “Topic-Bridged PLSA for Cross-Domain Text Classification,” Proc. 31st Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’08), pp. 627-634, July 2008.

[4] A. Argyriou, C.A. Micchelli, M. Pontil, and Y. Ying, “A Spectral Regularization Framework for Multi-Task Structure Learning,” Proc. 21st Ann. Conf. Neural Information Processing Systems (NIPS ’07), Dec. 2007.

[5] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng, “Self-Taught Learning: Transfer Learning from Unlabeled Data,” Proc. 24th Ann. Int’l Conf. Machine Learning (ICML ’07), pp. 759-766, June 2007.

[6] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman, “Learning Bounds for Domain Adaptation,” Proc. 21st Ann. Conf. Neural Information Processing Systems (NIPS ’07), Dec. 2007.

[7] S.J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Trans. Knowledge and Data Eng., preprint, 12 Oct. 2009, doi: 10.1109/TKDE.2009.191.

[8] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of Representations for Domain Adaptation,” Proc. 20th Ann. Conf. Neural Information Processing Systems (NIPS ’06), pp. 137-144, Dec. 2006.

[9] W. Dai, G.-R. Xue, Q. Yang, and Y. Yu, “Co-Clustering Based Classification for Out-of-Domain Documents,” Proc. 13th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD ’07), pp. 210-219, Aug. 2007.

[10] B. Zadrozny, “Learning and Evaluating Classifiers Under Sample Selection Bias,” Proc. 21st Ann. Int’l Conf. Machine Learning (ICML ’04), p. 114, July 2004.

[11] H. Daumé III, “Frustratingly Easy Domain Adaptation,” Proc. 45th Ann. Meeting of the Assoc. for Computational Linguistics (ACL ’07), June 2007.

[12] J. Blitzer, M. Dredze, and F. Pereira, “Biographies, Bollywood, Boom-Boxes and Blenders: Domain Adaptation for Sentiment Classification,” Proc. 45th Ann. Meeting of the Assoc. for Computational Linguistics (ACL ’07), pp. 440-447, June 2007.

[13] S.I. Lee, V. Chatalbashev, D. Vickrey, and D. Koller, “Learning a Meta-Level Prior for Feature Relevance from Multiple Related Tasks,” Proc. 24th Ann. Int’l Conf. Machine Learning (ICML ’07), pp. 489-496, June 2007.

[14] E. Gabrilovich and S. Markovitch, “Feature Generation for Text Categorization Using World Knowledge,” Proc. 19th Int’l Joint Conf. Artificial Intelligence (IJCAI ’05), pp. 1048-1053, July/Aug. 2005.

[15] E. Gabrilovich and S. Markovitch, “Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization,” J. Machine Learning Research, vol. 8, pp. 2297-2345, 2007.

[16] E. Gabrilovich and S. Markovitch, “Overcoming the Brittleness Bottleneck Using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge,” Proc. 21st Nat’l Conf. Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conf. (AAAI ’06), July 2006.

[17] P. Wang, J. Hu, H.-J. Zeng, L. Chen, and Z. Chen, “Improving Text Classification by Using Encyclopedia Knowledge,” Proc. Seventh IEEE Int’l Conf. Data Mining (ICDM ’07), Oct. 2007.

[18] E. Gabrilovich and S. Markovitch, “Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis,” Proc. 20th Int’l Joint Conf. Artificial Intelligence (IJCAI ’07), pp. 1606-1611, Jan. 2007.

[19] X.H. Phan, M.L. Nguyen, and S. Horiguchi, “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-Scale Data Collections,” Proc. 17th Int’l Conf. World Wide Web (WWW ’08), pp. 91-100, Apr. 2008.

[20] T. Hofmann, “Probabilistic Latent Semantic Analysis,” Proc. 15th Conf. Uncertainty in Artificial Intelligence (UAI ’99), pp. 289-296, July/Aug. 1999.

[21] D.M. Blei, A.Y. Ng, and M.I. Jordan, “Latent Dirichlet Allocation,” Proc. 14th Ann. Conf. Neural Information Processing Systems (NIPS ’01), pp. 601-608, Dec. 2001.

[22] P. Wang, C. Domeniconi, and J. Hu, “Using Wikipedia for Co-Clustering Based Cross-Domain Text Classification,” Proc. Eighth IEEE Int’l Conf. Data Mining (ICDM ’08), pp. 1085-1090, Dec. 2008.

782 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 6, JUNE 2010


[23] P. Wang, C. Domeniconi, and J. Hu, “Cross-Domain Text Classification Using Wikipedia,” The IEEE Intelligent Informatics Bull., vol. 9, pp. 5-17, Nov. 2008.

[24] P. Wang and C. Domeniconi, “Building Semantic Kernels for Text Classification Using Wikipedia,” Proc. 14th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD ’08), pp. 713-721, Aug. 2008.

[25] S. Zelikovitz and H. Hirsh, “Improving Short-Text Classification Using Unlabeled Background Knowledge to Assess Document Similarity,” Proc. 17th Int’l Conf. Machine Learning (ICML ’00), pp. 1183-1190, 2000.

[26] S. Zelikovitz and H. Hirsh, “Using LSI for Text Classification in the Presence of Background Text,” Proc. 10th ACM Int’l Conf. Information and Knowledge Management (CIKM ’01), pp. 113-118, Nov. 2001.

[27] A. Blum and S. Chawla, “Learning from Labeled and Unlabeled Data Using Graph Mincuts,” Proc. 18th Int’l Conf. Machine Learning (ICML ’01), pp. 19-26, June/July 2001.

[28] A. Blum and T. Mitchell, “Combining Labeled and Unlabeled Data with Co-Training,” Proc. 11th Ann. Conf. Computational Learning Theory (COLT ’98), pp. 92-100, 1998.

[29] T. Joachims, “Transductive Inference for Text Classification Using Support Vector Machines,” Proc. 16th Int’l Conf. Machine Learning (ICML ’99), pp. 200-209, June 1999.

[30] G. Salton, A. Wong, and C.S. Yang, “A Vector Space Model for Automatic Indexing,” Comm. ACM, vol. 18, no. 11, pp. 613-620, 1975.

[31] C. Bishop, Pattern Recognition and Machine Learning. Springer-Verlag, 2006.

[32] V.W. Zheng, S.J. Pan, Q. Yang, and J.J. Pan, “Transferring Multi-Device Localization Models Using Latent Multi-Task Learning,” Proc. 23rd Nat’l Conf. Artificial Intelligence (AAAI ’08), pp. 1427-1432, July 2008.

[33] V.W. Zheng, E.W. Xiang, Q. Yang, and D. Shen, “Transferring Localization Models Over Time,” Proc. 23rd Nat’l Conf. Artificial Intelligence (AAAI ’08), pp. 1421-1426, July 2008.

[34] V.W. Zheng, D.H. Hu, and Q. Yang, “Cross-Domain Activity Recognition,” Proc. 11th Int’l Conf. Ubiquitous Computing (Ubicomp ’09), pp. 61-70, 2009.

[35] T.L. Griffiths and M. Steyvers, “Finding Scientific Topics,” Proc. Nat’l Academy of Sciences USA, vol. 101, suppl. 1, pp. 5228-5235, http://dx.doi.org/10.1073/pnas.0307752101, Apr. 2004.

[36] K. Lang, “Newsweeder: Learning to Filter Netnews,” Proc. 12th Int’l Machine Learning Conf. (ICML ’95), pp. 331-339, 1995.

[37] S.M. Beitzel, E.C. Jensen, O. Frieder, D.D. Lewis, A. Chowdhury, and A. Kolcz, “Improving Automatic Query Classification via Semi-Supervised Learning,” Proc. Fifth IEEE Int’l Conf. Data Mining (ICDM ’05), pp. 42-49, Nov. 2005.

[38] D. Shen, J.-T. Sun, Q. Yang, and Z. Chen, “Building Bridges for Web Query Classification,” Proc. 29th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’06), pp. 131-138, Aug. 2006.

[39] A.Z. Broder, M. Fontoura, E. Gabrilovich, A. Joshi, V. Josifovski, and T. Zhang, “Robust Classification of Rare Queries Using Web Knowledge,” Proc. 30th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’07), pp. 231-238, July 2007.

[40] Q. Yang, Y. Chen, G.-R. Xue, W. Dai, and Y. Yu, “Heterogeneous Transfer Learning with Real-World Applications,” Proc. 47th Ann. Meeting of the Assoc. for Computational Linguistics (ACL ’09), Aug. 2009.

Evan Wei Xiang received the BE degree in software engineering from Nanjing University in 2006 and is working toward the PhD degree in the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology. His research interests include large-scale data mining, transfer learning, and their applications in Social Web mining. More details can be found at http://ihome.ust.hk/~wxiang.

Bin Cao received the BS and MS degrees in mathematics from Xi’an Jiaotong University and Peking University, in 2004 and 2007, respectively, and is currently working toward the PhD degree in the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology. His research interests include data mining, machine learning, and information retrieval. More details can be found at http://ihome.ust.hk/~caobin.

Derek Hao Hu received the BS degree in computer science from Nanjing University in 2007 and is working toward the PhD degree in the Department of Computer Science and Engineering, the Hong Kong University of Science and Technology. His research interests include probabilistic graphical models and their applications in sensor-based activity recognition and Web mining. More details can be found at http://www.cse.ust.hk/~derekhh.

Qiang Yang received the bachelor’s degree in astrophysics from Peking University and the PhD degree in computer science from the University of Maryland, College Park. He is a faculty member in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His research interests include data mining and machine learning, AI planning, and sensor-based activity recognition. He is a fellow of the IEEE, a member of the AAAI and the ACM, a former associate editor for the IEEE TKDE, and a current associate editor for IEEE Intelligent Systems. More details can be found at http://www.cse.ust.hk/~qyang.


