Mining 3-Clusters in Vertically Partitioned Data

Faris Alqadah & Raj Bhatnagar, University of Cincinnati

Outline

• Introduction to 3-clustering in binary (categorical), vertically partitioned data

• Proposed cluster quality measure

• 3-Clu: algorithm for enumerating 3-clusters from two datasets

Introduction

Traditional clustering

Bi-Clustering

3-Clustering

Why 3-clusters?

• Find correspondence between bi-clusters of two different datasets

• Sharpen local clusters with outside knowledge

• Alternative? “Join datasets then search”

– Does not capture underlying interactions

– Inefficient

– Not always possible

Why 3-clusters?

[Figure: example clusters, written as <items, objects>: <A,1234>, <AB,134>, <AWB,13>, <AY,12>, <AX,24>, <AWBCYZ,1>, <ABDX,4>]

Formal Definitions

Bi-cluster in Di

3-Cluster across D1 and D2

Pattern in Di

Defining 3-clusters

• D1 is the “learner”

• Maximal rectangle of 1's under suitable permutation in learner

• Best correspondence to rectangle of 1's in D2

[Figure: panels labeled D1 and D2]
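A minimal sketch of the definition above, on made-up toy data (not the data from the slides). It assumes each dataset is stored as a map from object id to the set of attributes where that object has a 1, and it reads "correspondence" in the simplest way: the attributes of D2 shared by the same objects. The helper names (common_items, supporting_objects, bicluster, three_cluster) are hypothetical and are not taken from the 3-Clu implementation.

```python
# Toy data: each dataset maps an object id to the set of attributes it has (its 1's).
# Both datasets describe the same objects with disjoint attributes (vertical partitioning).
D1 = {1: {"A", "B", "W"}, 2: {"A"}, 3: {"A", "B", "W"}, 4: {"A", "B"}}
D2 = {1: {"Y", "Z"}, 2: {"X", "Y"}, 3: {"Z"}, 4: {"X"}}


def common_items(data, objects):
    """Attributes shared by every object in `objects`."""
    rows = [data[o] for o in objects]
    return set.intersection(*rows) if rows else set()


def supporting_objects(data, items):
    """Objects whose rows contain every attribute in `items`."""
    return {o for o, row in data.items() if items <= row}


def bicluster(data, objects):
    """Grow an object set into a maximal rectangle of 1's in one dataset."""
    items = common_items(data, objects)
    return supporting_objects(data, items), items


def three_cluster(learner, other, objects):
    """Maximal rectangle in the learner, plus the items the same objects share in the other dataset."""
    objs, learner_items = bicluster(learner, objects)
    return objs, learner_items, common_items(other, objs)


print(three_cluster(D1, D2, {1}))
```

Starting from object 1, the rectangle in the learner grows to every object that shares all of object 1's attributes, and the D2 side keeps only the attributes those objects have in common.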

Cluster Quality Measure

• Intuition: Maximize number of 1's while also maximizing number of items and objects

• Trade-off between objects and items

– More items, fewer objects

– More objects, fewer items
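The trade-off can be seen on a small made-up example: as the item set grows, the set of objects containing all of those items can only stay the same or shrink. The data and the helper name support are illustrative assumptions.

```python
# Toy binary dataset: object id -> set of attributes with value 1.
D = {1: {"A", "B", "C"}, 2: {"A", "B"}, 3: {"A"}, 4: {"A", "B", "C"}}


def support(data, items):
    """Objects that contain every attribute in `items`."""
    return {o for o, row in data.items() if items <= row}


# Widen the item set step by step and record the height (number of objects).
item_sets = [{"A"}, {"A", "B"}, {"A", "B", "C"}]
heights = [len(support(D, items)) for items in item_sets]

for items, height in zip(item_sets, heights):
    print(f"width={len(items)} height={height}")
```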

Quality Measure

• Consider bi-clusters in learner alone

[Figure: two bi-clusters, C1 and C2, in the learner; labels O and I1]

• Which is preferable?

• User decides

Quality Measure

• Quality measure:

– Monotonic in both width and height (reflects intuition)

– Balances width and height according to user-defined parameter

• Introduce β

• β: amount of width (attributes) the user is willing to trade for a single unit of height (objects)

Quality Measure

Extending to 3-clusters

• Utilize same intuition

• Width of 3-cluster is sum of individual widths

Selecting β

• Larger values yield 3-clusters that are “wide” and “short” in both D1 and D2

– Cluster key websites popular with large number of democrats and republicans

• Smaller values produce 3-clusters that are “narrow” and “long”

– Discover long list of websites utilized by few select democrats and republicans

3-Clu: Our Algorithm

• Search for 3-clusters similar to search for closed itemsets

• How to formulate the search space?

– Assumption that objects outnumber attributes may not hold

– Several possible orderings of the search space

Algorithm

• Define search space with primacy to objects

• Only need to maintain one search tree

• Mimic closed itemset algorithm with simultaneous pruning of search space

• Prune with quality measure
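A rough sketch of an object-first search, assuming the same object-to-attribute-set representation on toy data. It walks a single tree over object sets, closes each set in the learner, and reads off the shared items in both datasets. This is a plain closed-set enumeration with duplicate detection, not the 3-Clu algorithm itself: the quality-based pruning is omitted because its formula is not reproduced in this transcript, and the function name enumerate_three_clusters is hypothetical.

```python
def enumerate_three_clusters(learner, other):
    """Enumerate (objects, learner_items, other_items) for every closed object set of the learner."""
    all_objects = sorted(learner)
    seen = set()      # closed object sets already visited
    results = []

    def items_of(data, objs):
        # Attributes shared by all objects in a non-empty object set.
        return set.intersection(*(data[o] for o in objs))

    def close(objs):
        # Largest object set in the learner sharing the same common items.
        items = items_of(learner, objs)
        return frozenset(o for o in all_objects if items <= learner[o])

    def expand(objs):
        # Extend the current object set by one object at a time, then close it.
        for o in all_objects:
            if o in objs:
                continue
            child = close(objs | {o})
            if child not in seen:
                seen.add(child)
                results.append((child, items_of(learner, child), items_of(other, child)))
                expand(child)

    expand(frozenset())
    return results


# Toy data: same objects, disjoint attributes.
D1 = {1: {"A", "B"}, 2: {"A"}, 3: {"B"}}
D2 = {1: {"X"}, 2: {"X", "Y"}, 3: {"Y"}}
clusters = enumerate_three_clusters(D1, D2)
for objs, items1, items2 in clusters:
    print(sorted(objs), sorted(items1), sorted(items2))
```

Because the tree is keyed on objects, one traversal serves both datasets. A pruning test based on the quality measure would sit just before the recursive call.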

Algorithm

• Cluster quality measure is neither monotone nor anti-monotone in the search space

• Pruning is still possible

Is C2 of higher quality?

Algorithm

• Pruning rule is very optimistic

• Can be adjusted with some a-priori information

• Example: β = 0.5

• x = 2.73: can't prune

– This assumes w will stay at 15 for 3 more levels

Algorithm Analysis

• Computational cost: O(|O|*i*N)

– Only as expensive as enumerating bi-clusters in single dataset

• Communication cost: O(N)

• Correctness guaranteed by FCA theory

Experimental Results

• Performance tests

• Randomly split benchmark datasets CHESS and CONNECT

• Genetic dataset: Genes, GO terms, Phenotypes

• Compared to LCM and CHARM

[Figures: performance plots for Chess, Connect, and GO-Pheno]

Experimental Results

• Test validity of 3-clusters

• Randomly partitioned Mushrooms dataset by attributes
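One way such a random split by attributes could be set up is sketched below on a made-up toy dataset. The even half-and-half split, the fixed seed, and the function name split_by_attributes are assumptions for illustration. The slides do not say how the attributes were divided.

```python
import random


def split_by_attributes(data, seed=0):
    """Randomly divide the attribute set in two and project every object onto each half."""
    rng = random.Random(seed)
    attributes = sorted(set().union(*data.values()))
    rng.shuffle(attributes)
    half = len(attributes) // 2  # assumption: an even split
    left, right = set(attributes[:half]), set(attributes[half:])
    d1 = {o: row & left for o, row in data.items()}
    d2 = {o: row & right for o, row in data.items()}
    return d1, d2


# Toy dataset: object id -> set of attributes with value 1.
toy = {1: {"a", "b", "c"}, 2: {"b", "d"}, 3: {"a", "d"}}
d1, d2 = split_by_attributes(toy)
print(d1)
print(d2)
```

Every object appears in both halves, so the two datasets stay aligned on object ids while sharing no attributes.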

Conclusion

• Novel concept of 3-clusters in vertically partitioned data

• Introduced quality measure framework for 3-clusters

• Presented efficient algorithm based on closed itemset mining algorithms, with adaptations:

– Defined search space to enable simultaneous pruning

– Incorporated novel pruning method based on cluster quality measure
