Co-clustering Based Classification for Out-of-Domain Documents

Authors: Wenyuan Dai, Gui-Rong Xue, Qiang Yang, Yong Yu
KDD'07, August 12-15, 2007, San Jose, California, USA.
Presenter: Rei-Zhe Liu



Transcript of Co-clustering Based Classification for Out-of-Domain Documents

Page 1: Co-clustering Based Classification for Out-of-Domain Documents

Authors: Wenyuan Dai, Gui-Rong Xue, Qiang Yang, Yong Yu

KDD'07, August 12-15, 2007, San Jose, California, USA.

Presenter: Rei-Zhe Liu

Page 2: Outline

- Introduction
- Problem formulation
- Co-clustering based classification algorithm
- Experiments
  - Data sets
  - Methods
  - Data preprocessing
  - Evaluation metrics
  - Experimental results
- Conclusions and future works

Page 3: Introduction (1/3)

In this paper, we focus on the problem of classifying documents across different domains.

We have a labeled data set Di from one domain, called the in-domain, and another, unlabeled data set Do from a related but different domain, called the out-of-domain.

Di and Do are drawn from different distributions, since they come from different domains.

Page 4: Introduction (2/3)

We assume that the class labels in Di and the labels to be predicted in Do are drawn from the same class-label set C.

Furthermore, we assume that even though the two domains differ in distribution, they are similar in the sense that similar words describe similar categories.

Page 5: Introduction (3/3)

This assumption often holds, since Di and Do are related text domains, although some words in one domain may be missing in the other. These missing words are what make the estimated probabilities in the two domains quite different.

Under such circumstances, our objective is to accurately classify the out-of-domain documents in Do by making use of the in-domain data Di and their labels.

Page 6: Problem formulation (1/5)

Mutual information is a measure of the dependency between random variables.

Let X and Y be random variables with a joint distribution p(X, Y) and marginal distributions p(X) and p(Y). The mutual information I(X; Y) is defined as

I(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) )
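The definition can be checked numerically. The sketch below (the helper name `mutual_information` is ours, not from the paper) computes I(X; Y) from a joint distribution table, using natural logarithms:

```python
import math

def mutual_information(joint):
    """I(X; Y) = sum over x, y of p(x,y) * log( p(x,y) / (p(x) p(y)) )."""
    px = [sum(row) for row in joint]            # marginal p(x): row sums
    py = [sum(col) for col in zip(*joint)]      # marginal p(y): column sums
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:                       # 0 * log(0) is taken as 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Independent variables carry no information about each other:
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
# Perfectly dependent binary variables:
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # log 2 ~ 0.693
```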

Page 7: Problem formulation (2/5)

The use of mutual information can also be motivated via the KL divergence, defined for two probability mass functions p(x) and q(x) as

D(p || q) = Σ_x p(x) log( p(x) / q(x) )
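The two quantities are linked by the standard identity I(X; Y) = D( p(x, y) || p(x)p(y) ): mutual information is the KL divergence between the joint distribution and the product of its marginals. A minimal sketch (the helper name `kl_divergence` is ours):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log( p(x) / q(x) ).
    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Flatten a 2x2 joint distribution and its product-of-marginals approximation:
joint = [0.4, 0.1, 0.1, 0.4]        # p(x, y); both marginals are (0.5, 0.5)
marg  = [0.25, 0.25, 0.25, 0.25]    # p(x) p(y)
# This KL divergence equals the mutual information I(X; Y):
print(kl_divergence(joint, marg))
```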

Page 8: Problem formulation (3/5)

Co-clustering on out-of-domain data aims to simultaneously cluster the out-of-domain documents Do and the words W into |C| document clusters and k word clusters.

Let ^Do denote the out-of-domain document clustering and ^W denote the word clustering, where |^W| = k.

Page 9: Problem formulation (4/5)

For the out-of-domain data, we aim to minimize the loss in mutual information between documents and words before and after clustering, i.e., I(Do; W) − I(^Do; ^W).

For the in-domain data, we aim to minimize the loss in mutual information between the class labels C and the words W before and after word clustering, i.e., I(C; W) − I(C; ^W).

Page 10: Problem formulation (5/5)

Integrating the two terms, the loss function for co-clustering based classification is obtained:

Θ(^Do, ^W) = [ I(Do; W) − I(^Do; ^W) ] + λ [ I(C; W) − I(C; ^W) ]

where λ is a trade-off parameter that balances the effect of co-clustering and of the class labels on the word clusters.
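Assuming a loss of the form Θ = [I(Do; W) − I(^Do; ^W)] + λ[I(C; W) − I(C; ^W)], it can be evaluated for a given pair of clusterings directly from raw co-occurrence counts. All function names below are ours, not the paper's:

```python
import math

def mutual_info(counts):
    """Mutual information between rows and columns of a count matrix."""
    total = float(sum(sum(r) for r in counts))
    prow = [sum(r) / total for r in counts]
    pcol = [sum(c) / total for c in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n > 0:
                p = n / total
                mi += p * math.log(p / (prow[i] * pcol[j]))
    return mi

def aggregate(counts, row_of, col_of, n_rc, n_cc):
    """Merge rows/columns of `counts` according to cluster assignments."""
    out = [[0.0] * n_cc for _ in range(n_rc)]
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            out[row_of[i]][col_of[j]] += n
    return out

def cocc_loss(out_counts, in_counts, doc_of, word_of, n_doc_c, n_word_c, lam):
    """Theta = [I(Do;W) - I(^Do;^W)] + lam * [I(C;W) - I(C;^W)]."""
    ident = list(range(len(in_counts)))      # class labels are not clustered
    out_term = mutual_info(out_counts) - mutual_info(
        aggregate(out_counts, doc_of, word_of, n_doc_c, n_word_c))
    in_term = mutual_info(in_counts) - mutual_info(
        aggregate(in_counts, ident, word_of, len(in_counts), n_word_c))
    return out_term + lam * in_term

# Toy data: 4 out-of-domain docs x 4 words, 2 classes x 4 words.
out_c = [[2, 2, 0, 0], [2, 2, 0, 0], [0, 0, 2, 2], [0, 0, 2, 2]]
in_c  = [[2, 2, 0, 0], [0, 0, 2, 2]]
# A clustering that matches the block structure loses no mutual information:
print(cocc_loss(out_c, in_c, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2, lam=0.25))
```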

Page 11: Co-clustering based classification algorithm (1/6)

Lemma 1: For a fixed co-clustering (^Do, ^W), we can write the loss in mutual information as

Definition:

Page 12: Co-clustering based classification algorithm (2/6)

Lemma 2:

Note that


Page 13: Co-clustering based classification algorithm (3/6)

Lemma 3:

Lemma 2 tells us that minimizing the loss corresponding to a single document d can reduce the global objective function value given in Lemma 1.

From Lemma 3, we obtain a conclusion similar to that in Lemma 2, but for individual words.

Page 14: Co-clustering based classification algorithm (4/6)


Page 15: Co-clustering based classification algorithm (5/6)

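The algorithm pseudo-code shown on these two slides is not captured in the transcript. As a rough illustration of the alternating idea behind Lemmas 2 and 3 — move one element at a time to the cluster that minimizes its per-element KL divergence — here is a simplified document-reassignment sketch. It is not the paper's exact update rule, and all names are ours:

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions, lightly smoothed."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def normalize(v):
    s = sum(v)
    return [x / s for x in v] if s else [1.0 / len(v)] * len(v)

def cluster_profiles(counts, assign, n_clusters):
    """Word-cluster distribution of each document cluster."""
    sums = [[0.0] * len(counts[0]) for _ in range(n_clusters)]
    for row, c in zip(counts, assign):
        for j, n in enumerate(row):
            sums[c][j] += n
    return [normalize(s) for s in sums]

def reassign_documents(counts, assign, n_clusters, iters=10):
    """Alternately rebuild cluster profiles and move each document to the
    cluster whose profile minimizes that document's KL divergence
    (the flavor of per-document update suggested by Lemma 2)."""
    for _ in range(iters):
        profiles = cluster_profiles(counts, assign, n_clusters)
        new = [min(range(n_clusters), key=lambda c: kl(normalize(row), profiles[c]))
               for row in counts]
        if new == assign:          # converged
            break
        assign = new
    return assign

# 4 documents described over 2 word clusters; a bad initial assignment
# is corrected so that the first two documents end up together:
print(reassign_documents([[9, 1], [8, 2], [1, 9], [2, 8]], [1, 0, 0, 1], 2))
# [1, 1, 0, 0]
```

In the full CoCC algorithm the analogous word-reassignment step (Lemma 3) alternates with this one until the objective stops decreasing.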

Page 16: Co-clustering based classification algorithm (6/6)

Regarding the computational complexity, suppose the total number of document-word co-occurrences in Do is N and the number of iterations is T; then the time complexity is O(T · (k + |C|) · N).

Considering space complexity, our algorithm needs to store all the document-word co-occurrences and their corresponding probabilities. Thus the space complexity is O(N).

Page 17: Experiments

Page 18: Data sets (1/6)

20 Newsgroups, SRAA, Reuters-21578

Page 19: Data sets (2/6)

The 20 Newsgroups corpus is a text collection of approximately 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups.

We generated 6 different data sets. For each data set, 2 top categories are chosen, one as positive and the other as negative. Different subcategories can be considered as different domains.

Page 20: Data sets (3/6)

Page 21: Data sets (4/6)

Page 22: Data sets (5/6)

Figure 2 shows the document-word co-occurrence distribution on the auto vs. aviation data set.

The documents are ordered first by their domains (documents 1 to 8000 are from Di, while 8001 to 16000 are from Do) and second by their categories, positive or negative.

Page 23: Data sets (6/6)

The words are sorted by n+(w) / n−(w), where n+(w) and n−(w) represent the number of word positions at which w appears in positive and negative documents, respectively.

From Figure 2, it can be seen that the distributions of the in-domain and out-of-domain data are somewhat different; however, the figure also shows that a large commonality exists between the two domains.

Page 24: Data preprocessing

First, we converted all the letters in the text to lower case and stemmed the words using the Porter stemmer. In addition, stop words were removed.
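As an illustration, the pipeline might look like the sketch below. The paper uses the Porter stemmer (commonly available via NLTK); the suffix-stripping rule here is only a crude stand-in for it, and the stop-word list is a hypothetical fragment:

```python
import re

# Hypothetical fragment of a stop-word list:
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "are"}

def crude_stem(word):
    """Crude suffix stripping; a stand-in for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())   # lower-case, keep letters
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The pilots are landing the planes"))
# ['pilot', 'land', 'plan']
```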

Page 25: Evaluation metrics

Let C be the function that maps a document d to its true class label c = C(d), and let F be the function that maps a document d to the predicted label c = F(d) given by the classifier.
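The metric equation on this slide is not in the transcript; one standard metric defined in terms of C and F is the classification error rate, sketched here with hypothetical toy labelings:

```python
def error_rate(docs, C, F):
    """Fraction of documents d with F(d) != C(d), given the true-label
    map C and the prediction map F from the slide's definitions."""
    wrong = sum(1 for d in docs if F(d) != C(d))
    return wrong / len(docs)

# Hypothetical toy labelings: one of four documents is misclassified.
truth = {"d1": "+", "d2": "+", "d3": "-", "d4": "-"}
pred  = {"d1": "+", "d2": "-", "d3": "-", "d4": "-"}
print(error_rate(list(truth), truth.get, pred.get))  # 0.25
```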

Page 26: Experimental results (1/5)

Page 27: Experimental results (2/5)

Page 28: Experimental results (3/5)

Page 29: Experimental results (4/5)

Page 30: Experimental results (5/5)

Page 31: Conclusions and future works

In our experiments, CoCC is shown to greatly outperform traditional supervised and semi-supervised classification algorithms when classifying out-of-domain documents.

In CoCC, the number of word clusters must be quite large to obtain good performance. Since the time complexity of CoCC depends on the number of word clusters, the algorithm can be inefficient.

In the future, we will try to speed up the algorithm to make it more scalable for large data sets.