Co-clustering based Classification for Out-of-domain Documents
Authors: Wenyuan Dai, Gui-Rong Xue,
Qiang Yang, Yong Yu
KDD’07, August 12-15, 2007, San Jose, California, USA.
Presenter: Rei-Zhe Liu
Outline
Introduction
Problem formulation
Co-clustering based classification algorithm
Experiments
Data sets
Methods
Data preprocessing
Evaluation metrics
Experimental results
Conclusions and future works
Introduction(1/3)
In this paper, we focus on the problem of classifying
documents across different domains.
We have a labeled data set Di from one domain, called the
in-domain, and another, unlabeled data set Do from a related
but different domain, called the out-of-domain.
Di and Do are drawn from different distributions, since they
are from different domains.
Introduction(2/3)
We assume that the class labels in Di and the labels to be
predicted in Do are drawn from the same class-label set C.
Furthermore, we assume that even though the two domains
are different in distributions, they are similar in the sense that
similar words describe similar categories.
Introduction(3/3)
This assumption is often true, since Di and Do are related
text domains, although some words in one domain may be
missing in the other domain.
These missing words are the reason why the estimated
probabilities in the two domains can be quite different.
Under such circumstances, our objective is to accurately
classify the out-of-domain documents in Do, by making use
of the in-domain data Di and their labels.
Problem formulation(1/5)
The mutual information is a measure of the dependency
between random variables.
Let X and Y be random variables with a joint distribution
p(X, Y) and marginal distributions p(X) and p(Y). The
mutual information I(X; Y) is defined as

I(X; Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ]
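As a concrete check of this definition, here is a minimal Python sketch (not from the slides) that computes I(X; Y) from a joint distribution given as a dict; the two example distributions are illustrative.

```python
import math

def mutual_information(joint):
    """I(X; Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y))),
    with joint given as a dict {(x, y): probability} (natural log, nats)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal p(x)
        py[y] = py.get(y, 0.0) + p   # marginal p(y)
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent variables share no information: I(X; Y) = 0.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Perfectly dependent binary variables: I(X; Y) = log 2.
dependent = {(0, 0): 0.5, (1, 1): 0.5}
```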
Problem formulation(2/5)
The use of mutual information can also be motivated using
the KL divergence, defined for two probability mass functions
p(x) and q(x):

D(p ∥ q) = Σ_x p(x) log [ p(x) / q(x) ]
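A matching sketch of the KL divergence (again illustrative, not the paper's code), assuming q(x) > 0 wherever p(x) > 0:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)); p and q are dicts
    mapping outcomes to probabilities, with q(x) > 0 where p(x) > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
```

D(p ∥ q) is zero exactly when p = q, and positive otherwise; note it is not symmetric in its arguments.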
Problem formulation(3/5)
Co-clustering on out-of-domain data aims to simultaneously
cluster the out-of-domain documents Do and words W into
|C| document clusters and k word clusters.
Let ^Do denote the out-of-domain document clustering, and
^W denote the word clustering, where |^W| = k.
Problem formulation(4/5)
We aim to minimize the loss in mutual information between
documents and words before and after clustering:
I(Do; W) − I(^Do; ^W).
We also aim to minimize the loss in mutual information between
class labels C and words W before and after clustering, for
in-domain data: I(C; W) − I(C; ^W).
Problem formulation(5/5)
Integrating the two functions, the loss function for co-
clustering based classification can be obtained:

Θ = [ I(Do; W) − I(^Do; ^W) ] + λ [ I(C; W) − I(C; ^W) ]

where λ is a trade-off parameter that balances the effect on
the word clusters from co-clustering and from the class labels.
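The combined loss can be evaluated directly from the definitions above. A hedged sketch (helper names are mine, not the paper's): cluster a joint distribution by summing probabilities within clusters, then take the two mutual-information losses.

```python
import math

def mutual_information(joint):
    """I(X; Y) from a joint distribution {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def clustered_joint(joint, row_map, col_map):
    """Aggregate p(x, y) into p(x_hat, y_hat) under the cluster maps."""
    out = {}
    for (x, y), p in joint.items():
        key = (row_map[x], col_map[y])
        out[key] = out.get(key, 0.0) + p
    return out

def cocc_loss(p_dw, p_cw, doc_map, word_map, lam):
    """[I(Do; W) - I(^Do; ^W)] + lam * [I(C; W) - I(C; ^W)].
    Class labels are never clustered, so they map to themselves."""
    id_map = {c: c for (c, _) in p_cw}
    doc_term = mutual_information(p_dw) - mutual_information(
        clustered_joint(p_dw, doc_map, word_map))
    class_term = mutual_information(p_cw) - mutual_information(
        clustered_joint(p_cw, id_map, word_map))
    return doc_term + lam * class_term

# Toy distributions: two documents, two words, two classes.
p_dw = {("d1", "w1"): 0.5, ("d2", "w2"): 0.5}
p_cw = {("pos", "w1"): 0.5, ("neg", "w2"): 0.5}
docs = {"d1": "D1", "d2": "D2"}
singleton_words = {"w1": "W1", "w2": "W2"}   # lossless clustering
merged_words = {"w1": "W", "w2": "W"}        # collapses the word signal
```

With singleton clusters the loss is zero; merging the two informative words into one cluster destroys mutual information and makes the loss positive.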
Co-clustering based classification
algorithm(1/6)
Lemma 1: For a fixed co-clustering (^Do, ^W), we can write
the loss in mutual information as

D(p(Do, W) ∥ q(Do, W)) + λ D(p(C, W) ∥ q(C, W))

Definition: q(d, w) = p(^d, ^w) p(d | ^d) p(w | ^w), and
q(c, w) = p(c, ^w) p(w | ^w).
Co-clustering based classification
algorithm(2/6)
Lemma 2: The document part of the loss decomposes per document:

D(p(Do, W) ∥ q(Do, W)) = Σ_d p(d) D(p(W | d) ∥ q(W | ^d))

Note that q(W | ^d) depends only on the document cluster ^d,
not on the individual document d.
Co-clustering based classification
algorithm(3/6)
Lemma 3: Symmetrically, the loss decomposes per word:

Σ_w p(w) [ D(p(Do | w) ∥ q(Do | ^w)) + λ D(p(C | w) ∥ q(C | ^w)) ]

Lemma 2 tells us that minimizing the per-document term
D(p(W | d) ∥ q(W | ^d)) corresponding to a single document d
can reduce the global objective function value given in Lemma 1.
From Lemma 3, we can draw the analogous conclusion for
single words.
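The per-document criterion of Lemma 2 suggests a simple update step: reassign each document to the cluster whose word distribution it is closest to in KL divergence. Below is a simplified, illustrative sketch (cluster profiles as member averages, which only approximates the paper's q distributions; the eps smoothing is my addition):

```python
import math

def kl(p, q, eps=1e-12):
    """D(p || q) over word distributions; eps guards zero entries in q."""
    return sum(pi * math.log(pi / max(q.get(w, 0.0), eps))
               for w, pi in p.items() if pi > 0)

def reassign_documents(doc_word_dist, assignment):
    """One per-document update step: move each document d to the cluster
    whose aggregate word distribution is closest to p(W | d) in KL."""
    clusters = {}
    for d, c in assignment.items():
        clusters.setdefault(c, []).append(d)
    # Cluster profile: average word distribution of the cluster's members.
    profiles = {}
    for c, members in clusters.items():
        prof = {}
        for d in members:
            for w, p in doc_word_dist[d].items():
                prof[w] = prof.get(w, 0.0) + p / len(members)
        profiles[c] = prof
    return {d: min(profiles, key=lambda c: kl(doc_word_dist[d], profiles[c]))
            for d in doc_word_dist}

# Four documents over two topical words; d2 starts in the wrong cluster.
doc_word_dist = {
    "d1": {"a": 1.0}, "d2": {"a": 1.0},
    "d3": {"b": 1.0}, "d4": {"b": 1.0},
}
assignment = {"d1": "c1", "d2": "c2", "d3": "c2", "d4": "c2"}
new_assignment = reassign_documents(doc_word_dist, assignment)
```

One pass pulls d2 into the cluster of documents that use the same word, while d3 and d4 stay put.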
Co-clustering based classification
algorithm(4/6)
(Slide figure: pseudocode listing of the CoCC algorithm.)
Co-clustering based classification
algorithm(5/6)
(Slide figure: continuation of the algorithm listing.)
Co-clustering based classification
algorithm(6/6)
Regarding the computational complexity, suppose the total
number of document-word co-occurrences in Do is N and the
number of iterations is T; then the time complexity is
O(T (k + |C|) N).
Considering space complexity, our algorithm needs to store
all the document-word co-occurrences and their
corresponding probabilities. Thus the space complexity is O(N).
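The O(N) space bound follows from storing only the nonzero co-occurrences rather than a full |Do| × |W| matrix. A minimal illustration:

```python
from collections import defaultdict

def cooccurrence_counts(docs):
    """Sparse document-word counts: memory grows with the number N of
    nonzero (document, word) co-occurrences, not with |Do| x |W|."""
    counts = defaultdict(int)
    for i, words in enumerate(docs):
        for w in words:
            counts[(i, w)] += 1
    return dict(counts)

docs = [["car", "engine", "car"], ["plane", "wing"]]
counts = cooccurrence_counts(docs)
```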
Experiments
Data sets(1/6)
20 Newsgroups, SRAA, Reuters-21578
Data sets(2/6)
The 20 Newsgroups is a text collection of approximately
20,000 newsgroup documents, partitioned nearly evenly across
20 different newsgroups.
We generated 6 different data sets.
For each data set, 2 top categories are chosen,
one as positive and the other as negative.
Different subcategories can be considered as different domains.
Data sets(3/6)
(Slide table: composition of the generated data sets.)
Data sets(4/6)
(Slide table: continued.)
Data sets(5/6)
Figure 2 shows the document-word co-occurrence
distribution on the auto vs. aviation data set.
The documents are ordered first by their domains (documents
1 to 8000 are from Di, while 8001 to 16000 are from Do), and
second by their categories, positive or negative.
Data sets(6/6)
The words are sorted by n+(w) / n−(w), where n+(w) and n−(w)
represent the number of positions at which word w appears in
positive and negative documents, respectively.
From Figure 2, it can be seen that the distributions of the
in-domain and out-of-domain data are somewhat different, but
also that a large commonality exists between the two domains.
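The n+(w) / n−(w) ordering can be sketched as follows (the add-one smoothing is my addition, to avoid division by zero for words that never appear in negative documents):

```python
def word_polarity_order(docs, labels):
    """Order words by n+(w)/n-(w): the ratio of occurrence counts in
    positive vs. negative documents, with add-one smoothing."""
    pos, neg = {}, {}
    for words, label in zip(docs, labels):
        counts = pos if label == "+" else neg
        for w in words:
            counts[w] = counts.get(w, 0) + 1
    vocab = set(pos) | set(neg)
    return sorted(vocab,
                  key=lambda w: (pos.get(w, 0) + 1) / (neg.get(w, 0) + 1),
                  reverse=True)

docs = [["good", "car"], ["good", "good"], ["bad", "car"]]
labels = ["+", "+", "-"]
order = word_polarity_order(docs, labels)
```

Words that dominate positive documents sort first; words confined to negative documents sort last, reproducing the left-to-right ordering used in Figure 2.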
Data preprocessing
First, we converted all the letters in the text to lower case,
and stemmed the words using the Porter stemmer.
In addition, stop words were removed.
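A minimal illustration of this preprocessing pipeline. The toy suffix-stripper below merely stands in for the actual Porter stemmer used on the slides, and the stop-word list is a tiny sample:

```python
import re

# A tiny sample stop-word list (the real experiments use a full list).
STOP_WORDS = {"the", "a", "an", "and", "are", "of", "to", "in", "is"}

def crude_stem(word):
    """Toy suffix-stripper standing in for the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```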
Evaluation metrics
Let C be the function which maps a document d to its
true class label c = C(d), and F be the function which maps
a document d to the predicted label c = F(d) given by the
classifier.
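Under these definitions, a natural metric is the fraction of documents on which F and C disagree. A small sketch (the slide itself does not show which metric formula follows, so this error-rate form is an assumption):

```python
def error_rate(docs, true_label, predicted_label):
    """Fraction of documents d with F(d) != C(d), in the slide's
    notation: true_label plays C, predicted_label plays F."""
    wrong = sum(1 for d in docs if predicted_label(d) != true_label(d))
    return wrong / len(docs)

docs = ["d1", "d2", "d3", "d4"]
truth = {"d1": "+", "d2": "+", "d3": "-", "d4": "-"}.get   # C(d)
pred = {"d1": "+", "d2": "-", "d3": "-", "d4": "-"}.get    # F(d)
err = error_rate(docs, truth, pred)
```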
Experimental results
(Slides 26-30: result tables and figures only; not transcribed.)
Conclusions and future work
In our experiments, it is shown that CoCC greatly
outperforms traditional supervised and semi-supervised
classification algorithms when classifying out-of-domain
documents.
In CoCC, the number of word clusters must be quite large to
obtain good performance. Since the time complexity of
CoCC depends on the number of word clusters, it can be
inefficient.
In the future, we will try to speed up the algorithm to make
it more scalable for large data sets.