
CS570: Introduction to Data Mining

Scalable Clustering Methods: BIRCH and Others

Reading: Chapter 10.3 Han, Chapter 9.5 Tan

Cengiz Gunay, Ph.D.

Slides courtesy of Li Xiong, Ph.D.,

©2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann, and

©2006 Tan, Steinbach & Kumar. Introd. Data Mining., Pearson. Addison Wesley.

October 7, 2013


Previously: Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree

Can be visualized as a dendrogram, a tree-like diagram

Clustering obtained by cutting at desired level

Do not have to assume any particular number of clusters

May correspond to meaningful taxonomies

[Figure: example dendrogram over six points (merge heights from 0 to 0.2) and the corresponding nested clusters]


Previously: Major Weaknesses of Hierarchical Clustering

Do not scale well (N: number of points)

Space complexity: $O(N^2)$

Time complexity: $O(N^3)$, or $O(N^2 \log N)$ for some cases/approaches

Cannot undo what was done previously

Quality varies in terms of distance measures

MIN (single link): susceptible to noise/outliers

MAX/GROUP AVERAGE: may not work well with non-globular clusters

How to improve?


Scalable Hierarchical Clustering Methods

Combines hierarchical and partitioning approaches

Recent methods:

BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters

CURE (1998): uses representative points for inter-cluster distance

ROCK (1999): clustering categorical data by neighbor and link analysis

CHAMELEON (1999): hierarchical clustering using dynamic modeling on graphs


BIRCH – A Tree-based Approach


BIRCH

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD’96)

SIGMOD 10-year “test of time” award

Main ideas:

Incremental (does not need the whole dataset)

Summarizes using in-memory clustering feature

Combines hierarchical clustering for microclustering and partitioning for macroclustering

Features:

Scales linearly: single scan and improves the quality with a few additional scans

Weakness: handles only numeric data, and sensitive to the order of the data records.


Cluster Statistics

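Given a cluster of $n$ instances $\{x_1, \dots, x_n\}$

Centroid: $x_0 = \frac{1}{n}\sum_{i=1}^{n} x_i$

Radius: average distance from member points to centroid, $R = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \lVert x_i - x_0 \rVert^2}$

Diameter: average pair-wise distance within a cluster, $D = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1}^{n} \lVert x_i - x_j \rVert^2}$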


Intra-Cluster Distance

Given two clusters:

Centroid Euclidean distance

Centroid Manhattan distance

Average distance
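As a rough illustration of the three measures for two clusters, the sketch below computes them directly from raw points. The function names and sample points are made up for this example, and "average distance" is taken here to be the root-mean-square distance over all cross-cluster pairs, which is an assumption about the formula the slide showed.

```python
import math

def centroid(points):
    """Mean of the points, one coordinate per dimension."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def centroid_euclidean(a, b):
    """Euclidean distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))

def centroid_manhattan(a, b):
    """Manhattan (L1) distance between the two cluster centroids."""
    return sum(abs(u - v) for u, v in zip(centroid(a), centroid(b)))

def average_distance(a, b):
    # Assumed definition: root-mean-square distance over all cross-cluster pairs.
    total = sum(math.dist(p, q) ** 2 for p in a for q in b)
    return math.sqrt(total / (len(a) * len(b)))

cluster_a = [(0, 0), (2, 0)]
cluster_b = [(4, 3), (6, 3)]
print(centroid_euclidean(cluster_a, cluster_b))  # centroids (1, 0) and (5, 3)
print(centroid_manhattan(cluster_a, cluster_b))
print(average_distance(cluster_a, cluster_b))
```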


Clustering Feature (CF) in BIRCH

[Figure: scatter plot of five points on a 0 to 10 grid]

Points: (3,4), (2,6), (4,5), (4,7), (3,8)

CF = (5, (16,30), (54,190))

What is it good for?
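The CF in the example can be reproduced with a few lines of code. This is a minimal sketch that reads the triple as (n, LS, SS): the number of points, their per-dimension linear sum, and their per-dimension sum of squares. The function name `clustering_feature` is a hypothetical helper, not part of any library.

```python
def clustering_feature(points):
    """Return (n, LS, SS), with LS and SS kept per dimension."""
    n = len(points)
    dims = range(len(points[0]))
    ls = tuple(sum(p[d] for p in points) for d in dims)
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)
    return n, ls, ss

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)  # (5, (16, 30), (54, 190))
```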


Properties of Clustering Feature

CF entry is more compact

Stores significantly less than all of the data points in the sub-cluster

A CF entry has sufficient information to calculate statistics about the cluster and intra-cluster distances

Additivity theorem allows us to merge sub-clusters incrementally & consistently
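Additivity can be checked on a small example: the CF of a merged cluster is the component-wise sum of the CFs of its disjoint parts. The sketch below assumes the (n, LS, SS) layout with per-dimension sums; `cf_of` and `merge_cf` are illustrative names.

```python
def cf_of(points):
    """Build (n, LS, SS) from raw points."""
    n = len(points)
    dims = range(len(points[0]))
    ls = tuple(sum(p[d] for p in points) for d in dims)
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)
    return n, ls, ss

def merge_cf(a, b):
    """CF of the union of two disjoint sub-clusters: add component-wise."""
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

left = [(3, 4), (2, 6)]
right = [(4, 5), (4, 7), (3, 8)]
merged = merge_cf(cf_of(left), cf_of(right))
print(merged)                          # (5, (16, 30), (54, 190))
print(merged == cf_of(left + right))   # True
```

No raw points are needed for the merge, which is what lets sub-clusters be combined incrementally.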


Cluster Statistics from CF

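CF: $(n, LS, SS)$

Radius (average distance to centroid)

$R = \sqrt{\frac{\sum_{i=1}^{n} \lVert x_i - x_0 \rVert^2}{n}} = \sqrt{\frac{n\,SS - \lVert LS \rVert^2}{n^2}}$

Diameter (average pairwise distance)

$D = \sqrt{\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} \lVert x_i - x_j \rVert^2}{n(n-1)}} = \sqrt{\frac{2n\,SS - 2\lVert LS \rVert^2}{n(n-1)}}$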
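A short sketch of how centroid, radius, and diameter might be computed from a CF alone, using $R^2 = (n\,SS - \lVert LS \rVert^2)/n^2$ and $D^2 = (2n\,SS - 2\lVert LS \rVert^2)/(n(n-1))$. It assumes SS is stored per dimension, so the code sums it to a scalar first; the function names are hypothetical.

```python
import math

def centroid(cf):
    n, ls, _ = cf
    return tuple(v / n for v in ls)

def radius(cf):
    n, ls, ss = cf
    ss_total = sum(ss)                # scalar sum of squared norms
    ls_sq = sum(v * v for v in ls)    # squared norm of the linear sum
    return math.sqrt(max(0.0, (n * ss_total - ls_sq) / (n * n)))

def diameter(cf):
    n, ls, ss = cf
    if n < 2:
        return 0.0                    # a single point has no pairwise distance
    ss_total = sum(ss)
    ls_sq = sum(v * v for v in ls)
    return math.sqrt(max(0.0, (2 * n * ss_total - 2 * ls_sq) / (n * (n - 1))))

cf = (5, (16, 30), (54, 190))         # the five-point example CF
print(centroid(cf), radius(cf), diameter(cf))
```

For the five-point example this gives a centroid of (3.2, 6.0), a radius of about 1.6, and a diameter of about 2.53.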


Hierarchical CF-Tree

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering

A nonleaf node in a tree has descendants or “children”

The nonleaf nodes store sums of the CFs of their children

A CF tree has two parameters

Branching factor: maximum number of children.

Threshold: max diameter of sub-clusters stored at the leaf nodes

Why a tree?


The CF Tree Structure

[Figure: CF tree with a root node (entries CF1 to CF6, each with a child pointer), a non-leaf node (entries CF1 to CF5, each with a child pointer), and leaf nodes holding CF entries, chained together by prev/next pointers]


CF-Tree Insertion

Traverse down from root (top-down), find the appropriate leaf

Follow the "closest"-CF path, w.r.t. intra-cluster distance measures

Modify the leaf

If the closest CF entry in the leaf cannot absorb the point without exceeding the threshold, make a new CF entry.

If the leaf node has no room for the new entry, split the node

Traverse back & up

Updating CFs on the path or splitting nodes
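The absorb-or-create decision at a leaf can be sketched as follows. This is a deliberately simplified illustration, not BIRCH itself: it keeps a flat list of leaf entries, so the descent from the root, the branching-factor limit, and node splitting are all omitted. The helper names and the threshold value are assumptions for the example.

```python
import math

def cf_of_point(p):
    return (1, tuple(p), tuple(v * v for v in p))

def merge(a, b):
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

def diameter(cf):
    n, ls, ss = cf
    if n < 2:
        return 0.0
    ls_sq = sum(v * v for v in ls)
    return math.sqrt(max(0.0, (2 * n * sum(ss) - 2 * ls_sq) / (n * (n - 1))))

def centroid_distance(a, b):
    ca = [v / a[0] for v in a[1]]
    cb = [v / b[0] for v in b[1]]
    return math.dist(ca, cb)

def insert(entries, point, threshold):
    """Absorb the point into the closest entry if the diameter stays within threshold."""
    new = cf_of_point(point)
    if entries:
        i = min(range(len(entries)), key=lambda k: centroid_distance(entries[k], new))
        candidate = merge(entries[i], new)
        if diameter(candidate) <= threshold:
            entries[i] = candidate    # closest entry absorbs the point
            return
    entries.append(new)               # otherwise start a new sub-cluster

entries = []
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8), (20, 20), (21, 19)]:
    insert(entries, p, threshold=3.0)
print(entries)
```

With this threshold the first five points end up in one entry and the two distant points in a second one.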


BIRCH Overview


The Algorithm: BIRCH

Phase 1: Scan database to build an initial in-memory CF-tree

Subsequent phases become fast, accurate, less order sensitive

Phase 2: Condense data (optional)

Rebuild the CF-tree with a larger T (why?)

Phase 3: Global clustering

Use existing clustering algorithm on CF entries

Helps fix problem where natural clusters span nodes

Phase 4: Cluster refining (optional)

Do additional passes over the dataset & reassign data points to the closest centroid from phase 3


BIRCH Summary

Main ideas:

Incremental (does not need the whole dataset)

Use in-memory clustering feature to summarize

Use hierarchical clustering for microclustering and other clustering methods (e.g. partitioning) for macroclustering

Features:

Scales linearly: single scan and improves the quality with a few additional scans

Weakness:

handles only numeric data

sensitive to the order of the data records

unnatural clusters because of leaf node limit

spherical clusters because of diameter (how to solve?)


CURE

CURE: Clustering Using REpresentatives

CURE: An Efficient Clustering Algorithm for Large Databases (1998) S Guha, R Rastogi, K Shim

Addresses potential problems with min, max, centroid distance based hierarchical clustering

Main ideas:

Use representative points for inter-cluster distance

Random sampling and partitioning

Features:

More robust to outliers

Better for non-spherical shapes and non-uniform sizes

See Tan Ch 9.5.3, pp. 635-639


CURE: Cluster Points

A number of points (e.g., 10+) represents a cluster

Representative points:

Start with the point farthest from the centroid

Add the point farthest from all selected points, and so on until k points are chosen

“Shrink” them by a factor α toward the center of the cluster

Why shrink? Parallels to other methods?

Cluster similarity is the similarity of the closest pair of representative points from different clusters
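The selection and shrinking of representative points might look like the sketch below. It assumes "farthest from all selected" means the point whose minimum distance to the already chosen points is largest, and that shrinking moves each point a fraction `alpha` of the way toward the centroid. The function name and sample data are illustrative only.

```python
import math

def representatives(points, k, alpha):
    """Pick up to k well-scattered points, then shrink them toward the centroid."""
    dims = range(len(points[0]))
    center = tuple(sum(p[d] for p in points) / len(points) for d in dims)
    # Start with the point farthest from the centroid.
    chosen = [max(points, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(k, len(points)):
        remaining = [p for p in points if p not in chosen]
        if not remaining:
            break
        # Next: the point whose nearest chosen point is as far away as possible.
        chosen.append(max(remaining,
                          key=lambda p: min(math.dist(p, c) for c in chosen)))
    # Shrink each representative by a fraction alpha toward the centroid.
    return [tuple(p[d] + alpha * (center[d] - p[d]) for d in dims) for p in chosen]

points = [(0, 0), (2, 0), (0, 2), (2, 2), (6, 6)]
reps = representatives(points, k=2, alpha=0.5)
print(reps)  # [(4.0, 4.0), (1.0, 1.0)]
```

Here the outlying point (6, 6) is pulled in to (4, 4), which hints at why shrinking dampens the effect of outliers.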


Experimental Results: CURE

Picture from CURE, Guha, Rastogi, Shim.


Experimental Results: CURE (cont.)

[Figure from CURE, Guha, Rastogi, Shim; panels labeled (centroid) and (single link)]


CURE Cannot Handle Differing Densities

[Figure: two panels, "Original Points" and "CURE"]

So far only numerical data?


Clustering Categorical Data: The ROCK Algorithm

ROCK: RObust Clustering using linKs

S. Guha, R. Rastogi & K. Shim, Int. Conf. Data Eng. (ICDE) ’99

Major ideas

Use links to measure similarity/proximity

Sampling-based clustering

Features:

More meaningful clusters

Emphasizes interconnectivity but ignores proximity

Not in textbook, see paper above.


Similarity Measure in ROCK

Market basket data clustering

Jaccard coefficient-based similarity function:

$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$

Example: Two groups (clusters) of transactions

C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}

C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}

Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}

Jaccard coefficient may lead to wrong clustering result: T1 and T2 are both in C1, while T3 is in C2, yet T1 is more similar to T3 than to T2:

$Sim(T_1, T_2) = \frac{|\{c\}|}{|\{a, b, c, d, e\}|} = \frac{1}{5} = 0.2$

$Sim(T_1, T_3) = \frac{|\{a, b\}|}{|\{a, b, c, f\}|} = \frac{2}{4} = 0.5$
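The two similarity values for T1, T2, and T3 can be checked with Python sets. This is only a small sketch; `jaccard` is a helper defined here, not a library call.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

T1 = {"a", "b", "c"}
T2 = {"c", "d", "e"}
T3 = {"a", "b", "f"}

print(jaccard(T1, T2))  # 0.2
print(jaccard(T1, T3))  # 0.5
```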


Link Measure in ROCK

Neighbor: two points $p_i$ and $p_j$ are neighbors if $Sim(p_i, p_j) \ge \theta$

Links: # of common neighbors

Reminder:

Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}

Let C1: <a, b, c, d, e>, C2: <a, b, f, g>

Example:

link(T1, T2) = 4, since they have 4 common neighbors

{a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}

link(T1, T3) = 3, since they have 3 common neighbors

{a, b, d}, {a, b, e}, {a, b, g}
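The link counts in the example can be reproduced under two assumptions that the slide does not state explicitly: two transactions are neighbors when their Jaccard similarity is at least θ = 0.5, and a transaction is not counted as its own neighbor. The data set is the 14 three-item transactions of C1 and C2; function names are illustrative.

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

# All 3-item transactions drawn from C1 = <a,b,c,d,e> and C2 = <a,b,f,g>.
transactions = ([frozenset(c) for c in combinations("abcde", 3)] +
                [frozenset(c) for c in combinations("abfg", 3)])

def neighbors(t, data, theta):
    """Transactions (other than t itself) with similarity >= theta."""
    return {u for u in data if u != t and jaccard(t, u) >= theta}

def link(t1, t2, data, theta):
    """Number of common neighbors of t1 and t2."""
    return len(neighbors(t1, data, theta) & neighbors(t2, data, theta))

T1, T2, T3 = frozenset("abc"), frozenset("cde"), frozenset("abf")
print(link(T1, T2, transactions, theta=0.5))  # 4
print(link(T1, T3, transactions, theta=0.5))  # 3
```

Unlike the Jaccard coefficient alone, the link count ranks the same-cluster pair (T1, T2) above the cross-cluster pair (T1, T3).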


ROCK Algorithm

1. Obtain a sample of points from the data set

2. Compute the link value for each pair of points, with neighbors determined by the Jaccard coefficient

3. Perform an agglomerative (bottom-up) hierarchical clustering on the data using the “number of shared neighbors” as similarity measure

4. Assign the remaining points to the clusters that have been found


Cluster Merging: Limitations of Current Schemes

Existing schemes are static in nature:

MIN or CURE

merge two clusters based on their closeness (or minimum distance)

GROUP-AVERAGE or ROCK:

merge two clusters based on their average connectivity


Limitations of Current Merging Schemes

[Figure: four pairs of clusters, panels (a), (b), (c), (d)]

Closeness schemes will merge (a) and (b)

Average connectivity schemes will merge (c) and (d)

Solution?


Chameleon: Clustering Using Dynamic Modeling

Adapt to the characteristics of the data set to find the natural clusters

Use a dynamic model to measure the similarity between clusters

Main properties are the relative closeness and relative inter-connectivity of the clusters

Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters

The merging scheme preserves self-similarity


CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)

CHAMELEON: by G. Karypis, E.H. Han, and V. Kumar’99

Basic ideas:

A graph-based clustering approach

A two-phase algorithm:

Partitioning: cluster objects into a large number of small sub-clusters

Agglomerative hierarchical clustering: repeatedly combine sub-clusters

Measures the similarity based on a dynamic model: interconnectivity and closeness (proximity)

Features:

Handles clusters of arbitrary shapes, sizes, and density

Scales well

See Sections Han 10.3.4 and Tan 9.4.4


Graph-Based Clustering

Uses the proximity graph

Start with the proximity matrix

Consider each point as a node in a graph

Each edge between two nodes has a weight which is the proximity between the two points

Fully connected proximity graph

MIN (single-link) and MAX (complete-link)

Sparsification eliminates low-proximity edges, reducing the data that graph algorithms must process

kNN

Threshold based

Clusters are connected components in the graph
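A minimal sketch of the idea, assuming kNN sparsification followed by connected components. An undirected edge is kept whenever either endpoint lists the other among its k nearest neighbors, which is one of several possible conventions. The function names and points are made up for illustration.

```python
import math

def knn_graph(points, k):
    """Sparsified proximity graph as adjacency sets over point indices."""
    adj = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def connected_components(adj):
    """Depth-first search; each component is returned as a sorted index list."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.add(v)
            stack.extend(adj[v] - seen)
        components.append(sorted(comp))
    return components

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = connected_components(knn_graph(points, k=2))
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
```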


Overall Framework of CHAMELEON

[Figure: Data Set -> Construct Sparse Graph -> Partition the Graph -> Merge Partition -> Final Clusters]


Chameleon: Steps

Preprocessing Step: Represent the Data by a Graph

Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors

Concept of neighborhood is captured dynamically (even if region is sparse)

Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices

Each cluster should contain mostly points from one “true” cluster, i.e., is a sub-cluster of a “real” cluster


Chameleon: Steps (cont)

Phase 2: Use Hierarchical Agglomerative (bottom-up) Clustering to merge sub-clusters

Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters

Two key properties used to model cluster similarity:

Relative Interconnectivity: Absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters

Relative Closeness: Absolute closeness of two clusters normalized by the internal closeness of the clusters
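The two verbal definitions are usually written as the formulas below. This follows the notation commonly associated with Karypis, Han & Kumar and is an assumption here, since the slides give the definitions only in words: $EC_{\{C_i, C_j\}}$ is the set of edges connecting the two clusters (with $|\cdot|$ denoting their total weight), $EC_{C_i}$ is the min-cut bisector of cluster $C_i$, and $\bar{S}$ is the average weight of the edges in the given set.

```latex
RI(C_i, C_j) =
  \frac{\lvert EC_{\{C_i, C_j\}} \rvert}
       {\tfrac{1}{2}\left( \lvert EC_{C_i} \rvert + \lvert EC_{C_j} \rvert \right)}

RC(C_i, C_j) =
  \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}
       {\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}}
        + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}
```

Both ratios are near or above 1 when the connection between the two clusters looks like the connections inside each of them, which is the sense in which a merge "preserves self-similarity".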


CHAMELEON (Clustering Complex Objects)


Chapter 7. Cluster Analysis

Overview

Partitioning methods

Hierarchical methods

Density-based methods

Other methods

Cluster evaluation

Outlier analysis

Summary