Co-clustering based Classification for Out-of-domain Documents
Authors: Wenyuan Dai, Gui-Rong Xue,
Qiang Yang, Yong Yu
KDD’07, August 12-15, 2007, San Jose, California, USA.
Presenter: Rei-Zhe Liu
Outline
Introduction
Problem formulation
Co-clustering based classification algorithm
Experiments
Data sets
Methods
Data preprocessing
Evaluation metrics
Experimental results
Conclusions and future works
Introduction(1/3)
In this paper, we focus on the problem of classifying
documents across different domains.
We have a labeled data set Di from one domain, called the
in-domain, and another, unlabeled data set Do from a related
but different domain, called the out-of-domain.
Di and Do are drawn from different distributions, since they
are from different domains.
Introduction(2/3)
We assume that the class labels in Di and the labels to be
predicted in Do are drawn from the same class-label set C.
Furthermore, we assume that even though the two domains
are different in distributions, they are similar in the sense that
similar words describe similar categories.
Introduction(3/3)
This assumption is often true, since Di and Do are related
text domains, although some words in one domain may be
missing in the other domain.
These missing words are the reason why the estimated
probabilities in the two domains can be quite different.
Under such circumstances, our objective is to accurately
classify the out-of-domain documents in Do, by making use
of the in-domain data Di and their labels.
Problem formulation(1/5)
The mutual information is a measure of the dependency
between random variables.
Let X and Y be random variables with a joint distribution
p(X, Y) and marginal distributions p(X) and p(Y). The
mutual information I(X; Y) is defined as

I(X; Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ]
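As a concrete check of this definition, here is a minimal Python sketch (not from the slides) that computes I(X; Y) from a joint distribution given as a dict; the two example distributions are illustrative.

```python
import math

def mutual_information(joint):
    """I(X; Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y))),
    with joint given as a dict {(x, y): probability} (natural log, nats)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal p(x)
        py[y] = py.get(y, 0.0) + p   # marginal p(y)
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent variables share no information: I(X; Y) = 0.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Perfectly dependent binary variables: I(X; Y) = log 2.
dependent = {(0, 0): 0.5, (1, 1): 0.5}
```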
Problem formulation(2/5)
The use of mutual information can also be motivated using
the KL divergence, defined for two probability mass functions
p(x) and q(x):

D(p ∥ q) = Σ_x p(x) log [ p(x) / q(x) ]
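A matching sketch of the KL divergence (again illustrative, not the paper's code), assuming q(x) > 0 wherever p(x) > 0:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)); p and q are dicts
    mapping outcomes to probabilities, with q(x) > 0 where p(x) > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
```

D(p ∥ q) is zero exactly when p = q, and positive otherwise; note it is not symmetric in its arguments.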
Problem formulation(3/5)
Co-clustering on out-of-domain data aims to simultaneously
cluster the out-of-domain documents Do and words W into
|C| document clusters and k word clusters.
Let ^Do denote the out-of-domain document clustering, and
^W denote the word clustering, where |^W| = k.
Problem formulation(4/5)
We aim to minimize the loss in mutual information between
documents and words before and after clustering:
I(Do; W) − I(^Do; ^W).
We also aim to minimize the loss in mutual information between
class labels C and words W before and after clustering, for
in-domain data: I(C; W) − I(C; ^W).
Problem formulation(5/5)
Integrating the two functions, the loss function for co-
clustering based classification can be obtained:

Θ = [ I(Do; W) − I(^Do; ^W) ] + λ [ I(C; W) − I(C; ^W) ]

where λ is a trade-off parameter that balances the effect on
the word clusters from co-clustering and from the class labels.
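The combined loss can be evaluated directly from the definitions above. A hedged sketch (helper names are mine, not the paper's): cluster a joint distribution by summing probabilities within clusters, then take the two mutual-information losses.

```python
import math

def mutual_information(joint):
    """I(X; Y) from a joint distribution {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def clustered_joint(joint, row_map, col_map):
    """Aggregate p(x, y) into p(x_hat, y_hat) under the cluster maps."""
    out = {}
    for (x, y), p in joint.items():
        key = (row_map[x], col_map[y])
        out[key] = out.get(key, 0.0) + p
    return out

def cocc_loss(p_dw, p_cw, doc_map, word_map, lam):
    """[I(Do; W) - I(^Do; ^W)] + lam * [I(C; W) - I(C; ^W)].
    Class labels are never clustered, so they map to themselves."""
    id_map = {c: c for (c, _) in p_cw}
    doc_term = mutual_information(p_dw) - mutual_information(
        clustered_joint(p_dw, doc_map, word_map))
    class_term = mutual_information(p_cw) - mutual_information(
        clustered_joint(p_cw, id_map, word_map))
    return doc_term + lam * class_term

# Toy distributions: two documents, two words, two classes.
p_dw = {("d1", "w1"): 0.5, ("d2", "w2"): 0.5}
p_cw = {("pos", "w1"): 0.5, ("neg", "w2"): 0.5}
docs = {"d1": "D1", "d2": "D2"}
singleton_words = {"w1": "W1", "w2": "W2"}   # lossless clustering
merged_words = {"w1": "W", "w2": "W"}        # collapses the word signal
```

With singleton clusters the loss is zero; merging the two informative words into one cluster destroys mutual information and makes the loss positive.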
Co-clustering based classification
algorithm(1/6)
Lemma 1: For a fixed co-clustering (^Do, ^W), we can write
the loss in mutual information as

D(p(Do, W) ∥ q(Do, W)) + λ D(p(C, W) ∥ q(C, W))

Definition: q(d, w) = p(^d, ^w) p(d | ^d) p(w | ^w), and
q(c, w) = p(c, ^w) p(w | ^w).
Co-clustering based classification
algorithm(2/6)
Lemma 2: The document part of the loss decomposes per document:

D(p(Do, W) ∥ q(Do, W)) = Σ_d p(d) D(p(W | d) ∥ q(W | ^d))

Note that q(W | ^d) depends only on the document cluster ^d,
not on the individual document d.
Co-clustering based classification
algorithm(3/6)
Lemma 3: Symmetrically, the loss decomposes per word:

Σ_w p(w) [ D(p(Do | w) ∥ q(Do | ^w)) + λ D(p(C | w) ∥ q(C | ^w)) ]

Lemma 2 tells us that minimizing the per-document term
D(p(W | d) ∥ q(W | ^d)) corresponding to a single document d
can reduce the global objective function value given in Lemma 1.
From Lemma 3, we can draw the analogous conclusion for
single words.
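The per-document criterion of Lemma 2 suggests a simple update step: reassign each document to the cluster whose word distribution it is closest to in KL divergence. Below is a simplified, illustrative sketch (cluster profiles as member averages, which only approximates the paper's q distributions; the eps smoothing is my addition):

```python
import math

def kl(p, q, eps=1e-12):
    """D(p || q) over word distributions; eps guards zero entries in q."""
    return sum(pi * math.log(pi / max(q.get(w, 0.0), eps))
               for w, pi in p.items() if pi > 0)

def reassign_documents(doc_word_dist, assignment):
    """One per-document update step: move each document d to the cluster
    whose aggregate word distribution is closest to p(W | d) in KL."""
    clusters = {}
    for d, c in assignment.items():
        clusters.setdefault(c, []).append(d)
    # Cluster profile: average word distribution of the cluster's members.
    profiles = {}
    for c, members in clusters.items():
        prof = {}
        for d in members:
            for w, p in doc_word_dist[d].items():
                prof[w] = prof.get(w, 0.0) + p / len(members)
        profiles[c] = prof
    return {d: min(profiles, key=lambda c: kl(doc_word_dist[d], profiles[c]))
            for d in doc_word_dist}

# Four documents over two topical words; d2 starts in the wrong cluster.
doc_word_dist = {
    "d1": {"a": 1.0}, "d2": {"a": 1.0},
    "d3": {"b": 1.0}, "d4": {"b": 1.0},
}
assignment = {"d1": "c1", "d2": "c2", "d3": "c2", "d4": "c2"}
new_assignment = reassign_documents(doc_word_dist, assignment)
```

One pass pulls d2 into the cluster of documents that use the same word, while d3 and d4 stay put.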
Co-clustering based classification
algorithm(4/6)
(Slide figure: pseudocode listing of the CoCC algorithm.)
Co-clustering based classification
algorithm(5/6)
(Slide figure: continuation of the algorithm listing.)
Co-clustering based classification
algorithm(6/6)
Regarding the computational complexity, suppose the total
number of document-word co-occurrences in Do is N and the
number of iterations is T; then the time complexity is
O(T (k + |C|) N).
Considering space complexity, our algorithm needs to store
all the document-word co-occurrences and their
corresponding probabilities. Thus the space complexity is O(N).
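The O(N) space bound follows from storing only the nonzero co-occurrences rather than a full |Do| × |W| matrix. A minimal illustration:

```python
from collections import defaultdict

def cooccurrence_counts(docs):
    """Sparse document-word counts: memory grows with the number N of
    nonzero (document, word) co-occurrences, not with |Do| x |W|."""
    counts = defaultdict(int)
    for i, words in enumerate(docs):
        for w in words:
            counts[(i, w)] += 1
    return dict(counts)

docs = [["car", "engine", "car"], ["plane", "wing"]]
counts = cooccurrence_counts(docs)
```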
Experiments
Data sets(1/6)
20 Newsgroups, SRAA, Reuters-21578
Data sets(2/6)
The 20 Newsgroups is a text collection of approximately
20,000 newsgroup documents, partitioned nearly evenly across
20 different newsgroups.
We generated 6 different data sets.
For each data set, 2 top categories are chosen,
one as positive and the other as negative.
Different subcategories can be considered as different domains.
Data sets(3/6)
(Slide table: composition of the generated data sets.)
Data sets(4/6)
(Slide table: continued.)
Data sets(5/6)
Figure 2 shows the document-word co-occurrence
distribution on the auto vs. aviation data set.
The documents are ordered first by their domains (documents
1 to 8000 are from Di, while 8001 to 16000 are from Do), and
second by their categories, positive or negative.
Data sets(6/6)
The words are sorted by n+(w) / n−(w), where n+(w) and n−(w)
represent the number of positions at which word w appears in
positive and negative documents, respectively.
From Figure 2, it can be seen that the distributions of the
in-domain and out-of-domain data are somewhat different, but
also that a large commonality exists between the two domains.
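The n+(w) / n−(w) ordering can be sketched as follows (the add-one smoothing is my addition, to avoid division by zero for words that never appear in negative documents):

```python
def word_polarity_order(docs, labels):
    """Order words by n+(w)/n-(w): the ratio of occurrence counts in
    positive vs. negative documents, with add-one smoothing."""
    pos, neg = {}, {}
    for words, label in zip(docs, labels):
        counts = pos if label == "+" else neg
        for w in words:
            counts[w] = counts.get(w, 0) + 1
    vocab = set(pos) | set(neg)
    return sorted(vocab,
                  key=lambda w: (pos.get(w, 0) + 1) / (neg.get(w, 0) + 1),
                  reverse=True)

docs = [["good", "car"], ["good", "good"], ["bad", "car"]]
labels = ["+", "+", "-"]
order = word_polarity_order(docs, labels)
```

Words that dominate positive documents sort first; words confined to negative documents sort last, reproducing the left-to-right ordering used in Figure 2.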
Data preprocessing
First, we converted all the letters in the text to lower case,
and stemmed the words using the Porter stemmer.
In addition, stop words were removed.
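A minimal illustration of this preprocessing pipeline. The toy suffix-stripper below merely stands in for the actual Porter stemmer used on the slides, and the stop-word list is a tiny sample:

```python
import re

# A tiny sample stop-word list (the real experiments use a full list).
STOP_WORDS = {"the", "a", "an", "and", "are", "of", "to", "in", "is"}

def crude_stem(word):
    """Toy suffix-stripper standing in for the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```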
Evaluation metrics
Let C be the function which maps a document d to its
true class label c = C(d), and F be the function which maps
a document d to the predicted label c = F(d) given by the
classifier.
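Under these definitions, a natural metric is the fraction of documents on which F and C disagree. A small sketch (the slide itself does not show which metric formula follows, so this error-rate form is an assumption):

```python
def error_rate(docs, true_label, predicted_label):
    """Fraction of documents d with F(d) != C(d), in the slide's
    notation: true_label plays C, predicted_label plays F."""
    wrong = sum(1 for d in docs if predicted_label(d) != true_label(d))
    return wrong / len(docs)

docs = ["d1", "d2", "d3", "d4"]
truth = {"d1": "+", "d2": "+", "d3": "-", "d4": "-"}.get   # C(d)
pred = {"d1": "+", "d2": "-", "d3": "-", "d4": "-"}.get    # F(d)
err = error_rate(docs, truth, pred)
```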
Experimental results
(Slides 26-30: result tables and figures only; not transcribed.)
Conclusions and future work
In our experiments, it is shown that CoCC greatly
outperforms traditional supervised and semi-supervised
classification algorithms when classifying out-of-domain
documents.
In CoCC, the number of word clusters must be quite large to
obtain good performance. Since the time complexity of
CoCC depends on the number of word clusters, it can be
inefficient.
In the future, we will try to speed up the algorithm to make
it more scalable for large data sets.