INFORMATION-THEORETIC CO-CLUSTERING
Authors / Inderjit S. Dhillon, Subramanyam Mallela and Dharmendra S. Modha
Conference / ACM SIGKDD '03, August 24-27, 2003, Washington
Presenter / Meng-Lun, Wu
OUTLINE
- Introduction
- Problem Formulation
- Co-Clustering Algorithm
- Experimental Results
- Conclusions and Future Work
INTRODUCTION
Clustering is a fundamental tool in unsupervised learning. Most clustering algorithms focus on one-way clustering, i.e., they cluster only one dimension of the data matrix (e.g., documents, but not words).
Before clustering:

doc | Word1 | … | Wordn
50  | 12    | … | 10
52  | 13    | … | 0
53  | 10    | … | 20

After one-way (document) clustering:

doc | Word1 | … | Wordn | Cluster
50  | 12    | … | 10    | Cluster0
52  | 13    | … | 0     | Cluster1
53  | 10    | … | 20    | Cluster0
INTRODUCTION (CONT.)
It is often desirable to co-cluster, i.e., simultaneously cluster both dimensions of the data matrix.
The normalized non-negative contingency table is viewed as a joint probability distribution between two discrete random variables.
The optimal co-clustering is the one that preserves the largest mutual information between the clustered random variables.
INTRODUCTION (CONT.)
Equivalently, the optimal co-clustering minimizes the loss in mutual information.
The mutual information of two random variables measures their mutual dependence. Formally, it is defined as:

I(X;Y) = Σ_{x∈X} Σ_{y∈Y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
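As a quick illustration (my own sketch, not from the paper), the mutual information defined above can be computed directly from a joint distribution given as a table:

```python
import math

def mutual_information(p):
    """I(X;Y) = sum over x, y of p(x,y) * log(p(x,y) / (p(x) p(y)))."""
    px = [sum(row) for row in p]           # marginal p(x)
    py = [sum(col) for col in zip(*p)]     # marginal p(y)
    return sum(pxy * math.log(pxy / (px[i] * py[j]))
               for i, row in enumerate(p)
               for j, pxy in enumerate(row)
               if pxy > 0)                 # 0 log 0 is taken to be 0

# Independent variables carry no mutual information:
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
# Perfectly dependent variables: I(X;Y) = log 2 ≈ 0.693
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
```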
INTRODUCTION (CONT.)
The Kullback-Leibler (K-L) divergence measures the difference between two probability distributions. Given the true distribution p(x,y) and another distribution q(x,y), it is defined as:

D_KL( p(x,y) || q(x,y) ) = Σ_x Σ_y p(x,y) log [ p(x,y) / q(x,y) ]
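The same kind of table-based sketch works for the K-L divergence; the two joint distributions below are made-up examples:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum over x, y of p(x,y) * log(p(x,y) / q(x,y))."""
    return sum(pxy * math.log(pxy / qxy)
               for prow, qrow in zip(p, q)
               for pxy, qxy in zip(prow, qrow)
               if pxy > 0)

p = [[0.4, 0.1], [0.1, 0.4]]
q = [[0.25, 0.25], [0.25, 0.25]]
print(kl_divergence(p, p))       # 0.0 -- the divergence vanishes when the distributions match
print(kl_divergence(p, q) > 0)   # True -- and is strictly positive otherwise
```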
PROBLEM FORMULATION
Let X and Y be discrete random variables taking values in {x1,…,xm} and {y1,…,yn} respectively, and let p(X,Y) denote their joint probability distribution.
Let C_X map the values of X into k row clusters, and C_Y map the values of Y into l column clusters:

C_X : {x1, x2, …, xm} → {x̂1, x̂2, …, x̂k}
C_Y : {y1, y2, …, yn} → {ŷ1, ŷ2, …, ŷl}
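As a sketch of how the two mappings act on the joint distribution: representing C_X and C_Y as lookup lists, the clustered joint p(x̂,ŷ) is obtained by summing p(x,y) over each block. The 4×4 distribution below is a made-up example, not the paper's:

```python
def clustered_joint(p, C_X, C_Y, k, l):
    """p(x_hat, y_hat): sum of p(x,y) over all x in cluster x_hat, y in y_hat."""
    phat = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(p):
        for j, pxy in enumerate(row):
            phat[C_X[i]][C_Y[j]] += pxy
    return phat

# Hypothetical 4x4 joint distribution with an obvious 2x2 block structure.
p = [[0.10, 0.10, 0.02, 0.03],
     [0.10, 0.10, 0.03, 0.02],
     [0.02, 0.03, 0.10, 0.10],
     [0.03, 0.02, 0.10, 0.10]]
C_X = [0, 0, 1, 1]   # x1, x2 -> first row cluster; x3, x4 -> second
C_Y = [0, 0, 1, 1]
print(clustered_joint(p, C_X, C_Y, 2, 2))  # approximately [[0.4, 0.1], [0.1, 0.4]]
```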
PROBLEM FORMULATION (CONT.)
Definition: an optimal co-clustering minimizes the loss in mutual information

I(X;Y) − I(X̂;Ŷ)

subject to constraints on the number of row and column clusters.
For a fixed co-clustering (C_X, C_Y), the loss in mutual information can be written as:

I(X;Y) − I(X̂;Ŷ) = D_KL( p(x,y) || q(x,y) )
PROBLEM FORMULATION (CONT.)
Proof sketch, where q(x,y) = p(x̂,ŷ) p(x|x̂) p(y|ŷ) and the inner sums run over x ∈ x̂ and y ∈ ŷ:

I(X;Y) − I(X̂;Ŷ)
  = Σ_x̂ Σ_ŷ Σ_{x∈x̂} Σ_{y∈ŷ} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
    − Σ_x̂ Σ_ŷ p(x̂,ŷ) log [ p(x̂,ŷ) / (p(x̂) p(ŷ)) ]
  = Σ_x̂ Σ_ŷ Σ_{x∈x̂} Σ_{y∈ŷ} p(x,y) log [ p(x,y) p(x̂) p(ŷ) / (p(x) p(y) p(x̂,ŷ)) ]
  = Σ_x̂ Σ_ŷ Σ_{x∈x̂} Σ_{y∈ŷ} p(x,y) log [ p(x,y) / ( p(x̂,ŷ) p(x|x̂) p(y|ŷ) ) ]
  = Σ_x Σ_y p(x,y) log [ p(x,y) / q(x,y) ]
  = D_KL( p(x,y) || q(x,y) )
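The identity can also be checked numerically. The sketch below uses a made-up 4×4 distribution and a 2×2 block co-clustering; the helper names and the example data are my own, not the authors':

```python
import math

# Hypothetical 4x4 joint distribution and a 2x2 block co-clustering.
p = [[0.10, 0.10, 0.02, 0.03],
     [0.10, 0.10, 0.03, 0.02],
     [0.02, 0.03, 0.10, 0.10],
     [0.03, 0.02, 0.10, 0.10]]
C_X, C_Y = [0, 0, 1, 1], [0, 0, 1, 1]
k = l = 2

px = [sum(row) for row in p]              # p(x)
py = [sum(col) for col in zip(*p)]        # p(y)

phat = [[0.0] * l for _ in range(k)]      # clustered joint p(x_hat, y_hat)
for i in range(4):
    for j in range(4):
        phat[C_X[i]][C_Y[j]] += p[i][j]
pxh = [sum(row) for row in phat]          # p(x_hat)
pyh = [sum(col) for col in zip(*phat)]    # p(y_hat)

def mi(joint, mx, my):
    """Mutual information of a joint table with the given marginals."""
    return sum(v * math.log(v / (mx[i] * my[j]))
               for i, row in enumerate(joint)
               for j, v in enumerate(row) if v > 0)

# q(x,y) = p(x_hat, y_hat) * p(x | x_hat) * p(y | y_hat)
q = [[phat[C_X[i]][C_Y[j]] * (px[i] / pxh[C_X[i]]) * (py[j] / pyh[C_Y[j]])
      for j in range(4)] for i in range(4)]

loss = mi(p, px, py) - mi(phat, pxh, pyh)   # I(X;Y) - I(X_hat;Y_hat)
dkl = sum(p[i][j] * math.log(p[i][j] / q[i][j])
          for i in range(4) for j in range(4) if p[i][j] > 0)
print(abs(loss - dkl) < 1e-12)  # True: the loss in MI equals D_KL(p || q)
```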
PROBLEM FORMULATION (CONT.)
q(X,Y) is a distribution of the form

q(x,y) = p(x̂,ŷ) p(x|x̂) p(y|ŷ)

Suppose p(x) = (0.15, 0.15, 0.15, 0.15, 0.2, 0.2), p(y) = (0.18, 0.18, 0.14, 0.14, 0.18, 0.18), p(x̂) = (0.3, 0.3, 0.4) and p(ŷ) = (0.5, 0.5). Then, for example,

q(x1,y1) = p(x̂1,ŷ1) p(x1|x̂1) p(y1|ŷ1) = 0.3 × (0.15/0.3) × (0.18/0.5) = 0.054
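Plugging the slide's numbers into the factored form gives a one-line check (assuming, as the numbers suggest, p(x̂1,ŷ1) = 0.3, p(x1) = 0.15, p(x̂1) = 0.3, p(y1) = 0.18 and p(ŷ1) = 0.5):

```python
# q(x1,y1) = p(x_hat_1, y_hat_1) * p(x1 | x_hat_1) * p(y1 | y_hat_1)
#          = p(x_hat_1, y_hat_1) * (p(x1)/p(x_hat_1)) * (p(y1)/p(y_hat_1))
q_x1_y1 = 0.3 * (0.15 / 0.3) * (0.18 / 0.5)
print(round(q_x1_y1, 6))  # 0.054
```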
CO-CLUSTERING ALGORITHM
Input: the joint probability distribution p(X,Y), the desired number of row clusters k, and the desired number of column clusters l.
Output: the partition functions C†_X and C†_Y.
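The alternating minimization behind the algorithm can be sketched as follows. The function names, the toy 4×4 distribution, and the deliberately imperfect initial partitions are my own illustration, not the authors' code; the sketch also assumes no cluster becomes empty during the run:

```python
import math

def kl(a, b):
    """D_KL between two discrete distributions given as lists."""
    return sum(u * math.log(u / max(v, 1e-12)) for u, v in zip(a, b) if u > 0)

def itcc(p, k, l, C_X, C_Y, n_iter=10):
    """Sketch of information-theoretic co-clustering: alternately move each
    row x to the cluster x_hat minimizing D(p(Y|x) || q(Y|x_hat)), then each
    column y to the cluster y_hat minimizing D(p(X|y) || q(X|y_hat)).
    Each step can only decrease D(p || q), so the loop reaches a local minimum."""
    m, n = len(p), len(p[0])
    px = [sum(row) for row in p]
    py = [sum(col) for col in zip(*p)]

    def stats():
        phat = [[0.0] * l for _ in range(k)]
        for i in range(m):
            for j in range(n):
                phat[C_X[i]][C_Y[j]] += p[i][j]
        return phat, [sum(r) for r in phat], [sum(c) for c in zip(*phat)]

    for _ in range(n_iter):
        phat, pxh, pyh = stats()
        # Row step: q(y | x_hat) = p(y_hat | x_hat) * p(y | y_hat)
        qy = [[phat[a][C_Y[j]] / pxh[a] * py[j] / pyh[C_Y[j]]
               for j in range(n)] for a in range(k)]
        for i in range(m):
            prow = [p[i][j] / px[i] for j in range(n)]       # p(Y | x)
            C_X[i] = min(range(k), key=lambda a: kl(prow, qy[a]))
        phat, pxh, pyh = stats()
        # Column step: q(x | y_hat) = p(x_hat | y_hat) * p(x | x_hat)
        qx = [[phat[C_X[i]][b] / pyh[b] * px[i] / pxh[C_X[i]]
               for i in range(m)] for b in range(l)]
        for j in range(n):
            pcol = [p[i][j] / py[j] for i in range(m)]       # p(X | y)
            C_Y[j] = min(range(l), key=lambda b: kl(pcol, qx[b]))
    return C_X, C_Y

p = [[0.10, 0.10, 0.02, 0.03],
     [0.10, 0.10, 0.03, 0.02],
     [0.02, 0.03, 0.10, 0.10],
     [0.03, 0.02, 0.10, 0.10]]
# Start from a deliberately wrong row partition; the algorithm recovers the blocks.
print(itcc(p, 2, 2, [0, 0, 0, 1], [0, 0, 1, 1]))  # ([0, 0, 1, 1], [0, 0, 1, 1])
```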
CO-CLUSTERING ALGORITHM (CONT.)
Row re-assignment step: each row x is moved to the row cluster x̂ that minimizes D( p(Y|x) || q(Y|x̂) ).
[Worked-example tables of D(p||q) values for the candidate row clusters x̂1, x̂2, x̂3: 0.041909, 0.041909, 0.05696, 0.05696, 0.0376, 0.049641 and 0.05696, 0.05696, 0.04191, 0.04191, 0.049641, 0.0376.]
CO-CLUSTERING ALGORITHM (CONT.)
Column re-assignment step: each column y is moved to the column cluster ŷ that minimizes D( p(X|y) || q(X|ŷ) ).

D(p||q) for y1,…,y6 against ŷ1: 0.02118, 0.02118, 0.02243, 0.040765, 0.04893, 0.04893
D(p||q) for y1,…,y6 against ŷ2: 0.048138, 0.048138, 0.041942, 0.02295, 0.02052, 0.02052

Columns y1, y2, y3 are closer to ŷ1, and columns y4, y5, y6 to ŷ2.
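The selection behind each such table is just an argmin over the candidate clusters; sketched with the first column's two distances (read off the values above):

```python
# Distances D(p(X|y) || q(X|y_hat)) of column y1 to each column cluster.
distances = {"y_hat_1": 0.02118, "y_hat_2": 0.048138}
new_cluster = min(distances, key=distances.get)
print(new_cluster)  # y_hat_1 -- the column joins the closest cluster
```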
CO-CLUSTERING ALGORITHM (CONT.)
The resulting objective value for the example is D(p||q) = 0.02881.
EXPERIMENTAL RESULTS
The experiments use various subsets of the 20-Newsgroups data set (NG20).
1D-clustering denotes document clustering without any word clustering.
Evaluation measures: micro-averaged precision and micro-averaged recall.
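One common way to compute micro-averaged precision for a clustering is to label each cluster with its majority class and count the overall fraction of correctly classified documents; the majority-vote mapping is an assumption of this sketch, and the assignments and labels below are made up:

```python
from collections import Counter

def micro_precision(clusters, labels):
    """Label each cluster with its majority class, then return the
    fraction of documents whose true label matches that class."""
    correct = 0
    for c in set(clusters):
        members = [labels[i] for i, ci in enumerate(clusters) if ci == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(labels)

# Hypothetical assignments for six documents from two newsgroups.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["talk.politics", "talk.politics", "sci.space",
          "sci.space", "sci.space", "sci.space"]
print(micro_precision(clusters, labels))  # 5 of 6 documents -> 0.8333...
```

When every document is assigned to exactly one cluster, micro-averaged recall works out to the same value.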
CONCLUSIONS AND FUTURE WORK
The information-theoretic formulation of co-clustering is guaranteed to reach a local minimum of the objective in a finite number of steps.
The method co-clusters the joint distribution of two discrete random variables.
In this paper, the numbers of row and column clusters are pre-specified. We hope that an information-theoretic regularization procedure may allow the number of clusters to be selected automatically.