Idea of Co-Clustering

Transcript of Idea of Co-Clustering

Page 1: Idea of Co-Clustering

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Idea of Co-Clustering

• Co-clustering: combine the row and column clustering of the co-occurrence matrix so that the two bootstrap each other.

• Simultaneously cluster the rows X and columns Y of the co-occurrence matrix.

Page 2: Idea of Co-Clustering

Hierarchical Co-Clustering Based on Entropy Splitting

• View the (scaled) co-occurrence matrix as a joint probability distribution between row and column random variables.

• Objective: find a hierarchical co-clustering containing a given number of clusters while retaining as much mutual information between the row and column clusters as possible.

The joint probability is estimated from co-occurrence counts:

  p(x, y) = #co-occurrence(x, y) / Σ_{x', y'} #co-occurrence(x', y')

Example joint distribution p(X, Y):

       c1    c2    c3    c4
  r1   0.1   0     0.2   0
  r2   0     0.1   0.1   0
  r3   0.2   0.1   0.1   0
  r4   0     0     0     0.1
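The scaling step can be checked with a short sketch (hypothetical code; the count matrix here is the joint distribution above multiplied by a total count of 10):

```python
import numpy as np

# Co-occurrence counts for rows r1..r4 and columns c1..c4
# (the joint probabilities above times the total count of 10).
counts = np.array([
    [1, 0, 2, 0],
    [0, 1, 1, 0],
    [2, 1, 1, 0],
    [0, 0, 0, 1],
], dtype=float)

# p(x, y) = #co-occurrence(x, y) / sum of all co-occurrences
p_xy = counts / counts.sum()

print(p_xy)
```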

Page 3: Idea of Co-Clustering

Hierarchical Co-Clustering Based on Entropy Splitting

Co-occurrence matrix (counts over X × Y):

  1  0  2  0
  0  1  1  0
  2  1  1  0
  0  0  0  1

Scaled joint probability distribution p(X, Y) (counts divided by the total, 10):

  0.1   0     0.2   0
  0     0.1   0.1   0
  0.2   0.1   0.1   0
  0     0     0     0.1

Joint probability distribution between the row and column cluster random variables (row clusters {r1}, {r2, r3}, {r4}; column clusters {c1}, {c2, c3}, {c4}):

  0.1   0.2   0
  0.2   0.4   0
  0     0     0.1

Mutual information values annotated on the slide: 0, 0.4691, 0.7751.
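These two distributions and the annotated mutual-information values can be reproduced approximately with a short sketch (hypothetical code; `mutual_information` is an illustrative helper, and small differences from the slide's numbers may be rounding):

```python
import numpy as np

def mutual_information(p):
    """I(X; Y) in bits for a joint distribution given as a 2-D array."""
    px = p.sum(axis=1, keepdims=True)   # marginal p(x)
    py = p.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p > 0                        # skip 0 * log 0 terms
    return float((p[mask] * np.log2(p[mask] / (px * py)[mask])).sum())

# Joint distribution of the original 4x4 matrix (counts / 10).
p_full = np.array([
    [0.1, 0.0, 0.2, 0.0],
    [0.0, 0.1, 0.1, 0.0],
    [0.2, 0.1, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.1],
])

# Joint distribution over the 3x3 row/column cluster pairs.
p_clustered = np.array([
    [0.1, 0.2, 0.0],
    [0.2, 0.4, 0.0],
    [0.0, 0.0, 0.1],
])

print(mutual_information(p_full))       # ≈ 0.77 bits
print(mutual_information(p_clustered))  # ≈ 0.47 bits
```

Clustering loses some mutual information; the goal of the split search is to lose as little as possible.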

Page 4: Idea of Co-Clustering

Hierarchical Co-Clustering Based on Entropy Splitting

Pipeline (recursive splitting):

  While the termination condition is not met:
    Find the optimal row/column cluster split that achieves maximal I(X̂, Ŷ)
    Update cluster indicators

Termination condition: the retained fraction I(X̂, Ŷ) / I(X, Y) reaches its threshold, or |R| ≥ max_r, or |C| ≥ max_c.

Page 5: Idea of Co-Clustering

Hierarchical Co-Clustering Based on Entropy Splitting

How to find an optimal split at each step?

An entropy-based splitting algorithm:

  Input: cluster S
  Randomly split cluster S into S1 and S2
  Until convergence:
    For each element x in S, re-assign it to the cluster S_j, j ∈ {1, 2}, that minimizes D(p(Ŷ | x) || p(Ŷ | S_j))
    Update cluster indicators and probability values

The algorithm converges to a local optimum.
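A minimal sketch of this splitting step (hypothetical code; the fixed iteration cap and the `init_assign` parameter are implementation choices, not from the slides):

```python
import numpy as np

def kl_bits(p, q):
    """D(p || q) in bits, with 0 * log(0/q) taken as 0."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

def entropy_split(p_xy, rows, init_assign=None, max_iter=20, seed=0):
    """Split the row cluster `rows` into S1/S2 by KL re-assignment (sketch)."""
    rows = np.asarray(rows)
    rng = np.random.default_rng(seed)
    # Conditional distributions p(Y | x) for each row in the cluster.
    cond = p_xy[rows] / p_xy[rows].sum(axis=1, keepdims=True)
    # Randomly split S into S1 and S2 (0/1 labels), both non-empty.
    assign = (np.asarray(init_assign) if init_assign is not None
              else rng.integers(0, 2, size=len(rows)))
    if assign.min() == assign.max():
        assign = assign.copy()
        assign[0] = 1 - assign[0]
    for _ in range(max_iter):                     # "until convergence"
        # p(Y | S_j): each half's total mass, renormalized.
        cents = []
        for j in (0, 1):
            mass = p_xy[rows[assign == j]].sum(axis=0)
            cents.append(mass / mass.sum())
        # Re-assign each x to the S_j minimizing D(p(Y|x) || p(Y|S_j)).
        new = np.array([int(kl_bits(c, cents[1]) < kl_bits(c, cents[0]))
                        for c in cond])
        if new.min() == new.max() or (new == assign).all():
            break                                 # degenerate split or converged
        assign = new
    return assign
```

Each pass over S is linear in |S|, which is what makes this cheaper than trying every split.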

Page 6: Idea of Co-Clustering

Hierarchical Co-Clustering Based on Entropy Splitting

• Example:

       Y1    Y2    Y3    Y4
  X1   0.1   0     0     0
  X2   0     0.2   0.2   0
  X3   0     0.2   0.2   0
  X4   0.1   0     0     0

  S = {X1, X2, X3, X4}

A naïve method needs to try 7 splits; the number of splits is exponential in the size of S.

  Randomly split: S1 = {X1}, S2 = {X2, X3, X4}
  Re-assign X4 to S1: S1 = {X1, X4}, S2 = {X2, X3}
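The re-assignment of X4 can be verified numerically (a small sketch; the conditional distributions are read off the example matrix above):

```python
import numpy as np

# p(Y | x) for row X4, and p(Y | S_j) for the initial random split
# S1 = {X1}, S2 = {X2, X3, X4}.
p_Y_given_X4 = np.array([1.0, 0.0, 0.0, 0.0])        # X4's row, normalized
p_Y_given_S1 = np.array([1.0, 0.0, 0.0, 0.0])        # X1's mass, normalized
p_Y_given_S2 = np.array([0.1, 0.4, 0.4, 0.0]) / 0.9  # X2+X3+X4 mass, normalized

def kl_bits(p, q):
    """D(p || q) in bits over coordinates where p > 0."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

# X4 is re-assigned to whichever side gives the smaller divergence:
d_to_S1 = kl_bits(p_Y_given_X4, p_Y_given_S1)  # 0.0 -> X4 joins S1
d_to_S2 = kl_bits(p_Y_given_X4, p_Y_given_S2)  # log2(9) ≈ 3.17
print(d_to_S1, d_to_S2)
```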

Page 7: Idea of Co-Clustering

Experiments

• Data sets:
  – Synthetic data
  – 20 Newsgroups data: 20 classes, 20,000 documents

Page 8: Idea of Co-Clustering

Results-Synthetic Data

Figure panels:
  (a) 1000×1000 matrix
  (b) Add noise to (a) by flipping values with probability 0.3
  (c) Randomly permute the rows and columns of (b)
  (d) Clustering result, with hierarchical structure
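The construction of panels (a)–(c) can be sketched as (hypothetical code; the planted two-block layout is illustrative, since the slides don't specify it):

```python
import numpy as np

rng = np.random.default_rng(0)

# (a) A block-diagonal 0/1 matrix with planted row/column co-clusters.
#     The slide uses 1000x1000; the two-block layout here is an assumption.
n = 1000
a = np.zeros((n, n), dtype=int)
a[:n // 2, :n // 2] = 1
a[n // 2:, n // 2:] = 1

# (b) Add noise: flip each value independently with probability 0.3.
flip = rng.random((n, n)) < 0.3
b = np.where(flip, 1 - a, a)

# (c) Randomly permute the rows and columns of (b).
c = b[rng.permutation(n)][:, rng.permutation(n)]
```

Recovering the planted block structure from (c) is then the co-clustering task the algorithm is evaluated on.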

Page 9: Idea of Co-Clustering

Results-20 Newsgroups Data

Compare with baselines (m-pre = micro-averaged precision, #cl = number of clusters):

  Dataset         | HICC        | NVBD        | ICC         | HCC
                  | m-pre  #cl  | m-pre  #cl  | m-pre  #cl  | m-pre  #cl
  Multi5subject   | 0.95   5    | 0.93   5    | 0.89   5    | 0.72   5
  Multi5          | 0.93   5    | N/A         | 0.87   5    | 0.71   5
  Multi10subject  | 0.69   10   | 0.67   10   | 0.54   10   | 0.44   10
  Multi10         | 0.67   10   | N/A         | 0.56   10   | 0.61   10

  Dataset         | HICC(merged) | Single-Link | UPGMA       | WPGMA       | Complete-Link
                  | m-pre  #cl   | m-pre  #cl  | m-pre  #cl  | m-pre  #cl  | m-pre  #cl
  Multi5subject   | 0.96   30    | 0.27   30   | 0.73   30   | 0.65   30   | 0.89   30
  Multi5          | 0.96   30    | 0.29   30   | 0.59   30   | 0.71   30   | 0.85   30
  Multi10subject  | 0.74   60    | 0.24   60   | 0.60   60   | 0.58   60   | 0.67   60
  Multi10         | 0.74   60    | 0.24   60   | 0.61   60   | 0.62   60   | 0.60   60

Micro-averaged precision = M / N, where M is the number of documents correctly clustered and N is the total number of documents.
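The metric itself is just a ratio (toy sketch with made-up labels, assuming each cluster has already been matched to its class):

```python
# Micro-averaged precision: M / N, where M is the number of documents
# assigned to the correct cluster and N the total number of documents.
# (Toy labels for illustration only.)
true_labels = ["a", "a", "b", "b", "b", "c"]
pred_labels = ["a", "b", "b", "b", "b", "c"]

M = sum(t == p for t, p in zip(true_labels, pred_labels))
N = len(true_labels)
print(M / N)  # 5/6 ≈ 0.83
```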

Page 10: Idea of Co-Clustering

Thank You !

Questions?