Jorge Silva, Sr. Research Statistician Developer, SAS at MLconf ATL - 9/18/15

Copyright © 2012, SAS Institute Inc. All rights reserved. DETERMINING THE NUMBER OF CLUSTERS IN A DATASET USING ABC I. KABUL, P. HALL, J. SILVA, W. SARLE ENTERPRISE MINER R&D SAS INSTITUTE

Page 1:

DETERMINING THE NUMBER OF CLUSTERS IN A DATASET USING ABC

I. KABUL, P. HALL, J. SILVA, W. SARLE

ENTERPRISE MINER R&D, SAS INSTITUTE

Page 2:

CLUSTERING

Objects within a cluster are as similar as possible

Objects from different clusters are as dissimilar as possible

Hossein Parsaei

Page 3:

CHALLENGES IN CLUSTERING

• No prior knowledge

• Which similarity measure?

• Which clustering algorithm?

• How to evaluate the results?

• How many clusters?

The Aligned Box Criterion (ABC) addresses the unsolved, important problem of determining the number of clusters in a data set.

ABC can be applied in Market Segmentation and many other types of statistical, data mining and machine learning analyses.

Page 4:

CONTENTS

• Background

• Aligned Box Criterion (ABC) Method

• Results

• ABC Method in Parallel and Distributed Architecture

• Conclusions

Page 5:

BACKGROUND

Page 6:

FINDING THE RIGHT NUMBER OF CLUSTERS

• Many methods have been proposed:

• Calinski-Harabasz index [Calinski 1974]

• Cubic clustering criterion (CCC) [Sarle 1983]

• Silhouette statistic [Rousseeuw 1987]

• Gap statistic [Tibshirani 2001]

• Jump method [Sugar 2003]

• Prediction strength [Tibshirani 2005]

• Dirichlet process [Teh 2006]

Page 7:

WITHIN CLUSTER SUM OF SQUARES

• A good clustering yields clusters where observations have small within-cluster sum-of-squares (and high between-cluster sum-of-squares).

• Low values when the partition is good, BUT these are by construction monotone nonincreasing (within cluster dissimilarity always decreases with more clusters)

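Within-cluster SSE:

$$D_r = \sum_{i \in C_r} \sum_{j \in C_r} \lVert x_i - x_j \rVert^2 = 2 n_r \sum_{i \in C_r} \lVert x_i - \bar{x}_r \rVert^2$$

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$$

Measure of compactness of clusters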
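The within-cluster sum of squares can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenters' code: the function names and the toy data are assumptions made for the example. It also checks numerically that the pairwise form D_r / (2 n_r) matches the sum of squared distances to the cluster mean.

```python
import numpy as np

def within_cluster_ss(X, labels):
    """W_k computed as the sum of squared distances to each cluster mean."""
    total = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        # D_r / (2 n_r) equals the squared deviations from the cluster mean.
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def pairwise_d(points):
    """D_r: squared distances summed over all ordered pairs in one cluster."""
    diffs = points[:, None, :] - points[None, :, :]
    return (diffs ** 2).sum()

# Toy data (hypothetical): 30 points in 2-D, three clusters of 10 points each.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
labels = np.repeat([0, 1, 2], 10)

w_direct = within_cluster_ss(X, labels)
w_pairs = sum(pairwise_d(X[labels == r]) / (2 * 10) for r in range(3))
```

Both `w_direct` and `w_pairs` should agree up to floating-point error, which is the identity stated in the D_r formula.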

Page 8:

BACKGROUND: USING Wk TO DETERMINE # OF CLUSTERS

Elbow method (L-curve method)

Idea: use the k corresponding to the “elbow”

Problem: there is no reference clustering to compare against, and the differences W_k − W_{k−1} are not normalized for comparison

Page 9:

BACKGROUND: REFERENCE DISTRIBUTIONS

• Cubic Clustering Criterion (CCC), Gap Statistic and ABC amplify the elbow phenomenon by using differences between the within-cluster sum of squares of a clustering solution in the training data (Wk) and that of a clustering solution in a reference distribution (Wk*).

Reference distribution complexity:

• Aligned box criterion (ABC)

• Gap statistic

• Cubic clustering criterion (CCC)

Cubic Clustering Criterion (CCC): SAS Technical Report A-108, 1983

Gap Statistic: Tibshirani et al., J. R. Statist. Soc., 2001

Page 10:

CCC METHOD

Instead of using Wk directly, CCC uses R².

For the CCC calculation, R² and E(R²) are approximated by heuristic formulas.

Cubic Clustering Criterion (CCC): SAS Technical Report A-108, 1983

The formulas are derived from numerous Monte Carlo simulations that generate one hypercube reference distribution, based on the dimensions of the given training dataset, to test all k of interest.

Page 11:

GAP STATISTIC METHOD

The Gap Statistic computes the (log) ratio Wk* / Wk.

Wk* is calculated from a clustering solution in the reference distribution.

Finds k that maximizes Gap(k) (within some tolerance)

Page 12:

TWO TYPES OF UNIFORM DISTRIBUTIONS

1. Aligned with feature axes (data-geometry independent)

Figure panels: Observations; Bounding Box (aligned with feature axes); Monte Carlo Simulations

Page 13:

TWO TYPES OF UNIFORM DISTRIBUTIONS

2. Aligned with principal axes (data-geometry dependent)

Figure panels: Observations; Bounding Box (aligned with principal axes); Monte Carlo Simulations
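The two kinds of uniform reference distribution can be sketched as follows. This is an illustrative sketch only: the function names are made up for the example, and the principal axes are taken from an SVD of the centered data, which is one common way to obtain them and is an assumption about the details.

```python
import numpy as np

def sample_axis_aligned(X, rng):
    """Uniform sample in the bounding box aligned with the feature axes."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return rng.uniform(lo, hi, size=X.shape)

def sample_pca_aligned(X, rng):
    """Uniform sample in the bounding box aligned with the principal axes."""
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    Z = centered @ vt.T                      # rotate into the principal-axis frame
    lo, hi = Z.min(axis=0), Z.max(axis=0)    # bounding box in that frame
    Zs = rng.uniform(lo, hi, size=Z.shape)
    return Zs @ vt + mean                    # rotate back to the original frame

# Toy data (hypothetical): points stretched along the diagonal.
rng = np.random.default_rng(1)
t = rng.normal(size=200)
X = np.column_stack([t, t + 0.1 * rng.normal(size=200)])

ref_axis = sample_axis_aligned(X, rng)
ref_pca = sample_pca_aligned(X, rng)
```

On elongated data like this, the axis-aligned box fills the whole square around the diagonal, while the principal-axis box stays close to the data's own shape.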

Page 14:

COMPUTATION OF THE GAP STATISTIC

for b = 1 to B
    Compute Monte Carlo sample X1b, X2b, …, Xnb (n is # obs.)

for k = 1 to K
    Cluster the observations into k groups and compute log Wk
    for b = 1 to B
        Cluster the M.C. sample into k groups and compute log Wkb
    Compute $\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb} - \log W_k$
    Compute sd(k), the standard deviation of {log Wkb}, b = 1, …, B
    Set the total s.e. $s_k = \sqrt{1 + 1/B}\,\mathrm{sd}(k)$

Find the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$
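The Gap computation above can be sketched end to end in NumPy. This is a rough illustration under several assumptions: a plain Lloyd's k-means with a few random restarts stands in for whatever clustering routine is actually used, the reference box is aligned with the feature axes, and all function names and the toy data are hypothetical.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's algorithm returning labels; illustrative, not robust."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for r in range(k):
            if np.any(labels == r):
                centers[r] = X[labels == r].mean(axis=0)
    return labels

def log_wk(X, k, rng, n_init=5):
    """log W_k for the best of a few random k-means restarts."""
    best = np.inf
    for _ in range(n_init):
        labels = kmeans(X, k, rng)
        w = sum(((X[labels == r] - X[labels == r].mean(axis=0)) ** 2).sum()
                for r in np.unique(labels))
        best = min(best, w)
    return np.log(best)

def gap_statistic(X, K, B, rng):
    lo, hi = X.min(axis=0), X.max(axis=0)
    # B Monte Carlo samples from the axis-aligned bounding box.
    refs = [rng.uniform(lo, hi, size=X.shape) for _ in range(B)]
    gap, s = np.zeros(K + 1), np.zeros(K + 1)   # indexed by k; entry 0 unused
    for k in range(1, K + 1):
        log_w = log_wk(X, k, rng)
        log_wb = np.array([log_wk(ref, k, rng) for ref in refs])
        gap[k] = log_wb.mean() - log_w
        s[k] = np.sqrt(1 + 1 / B) * log_wb.std()
    # Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; fall back to K.
    for k in range(1, K):
        if gap[k] >= gap[k + 1] - s[k + 1]:
            return k, gap, s
    return K, gap, s

# Toy data (hypothetical): three well-separated blobs in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in centers])
k_hat, gap, s = gap_statistic(X, K=5, B=5, rng=rng)
```

Because k-means depends on its random starts, `k_hat` can vary with the seed; more restarts and a larger B make the estimate steadier at the cost of more clustering runs.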

Page 15:

GAP STATISTIC

Page 16:

NO-CLUSTER EXAMPLE (JOURNAL VERSION)

Page 17:

ABC (ALIGNED BOX CRITERION)

Page 18:

ABC METHOD

ABC improves upon CCC and the Gap Statistic by generating better estimates for Wk*.

ABC uses k reference distributions, one for each tested k (k is the number of clusters).

• Data-driven Monte Carlo simulation of the reference distribution at each tested k.

• The reference distribution is k uniform hyper boxes aligned with the Principal Components from the clustering solution of the input data.

Figure panels: Gap Statistic Reference Distribution; ABC Reference Distribution

Page 19:

ABC METHOD

Why multiple reference distributions?

• The gap statistic performs hypothesis testing between k clusters/no-clusters for the whole input space

• ABC is similar to recursive hypothesis testing between 1 cluster/2 clusters for each of the k candidate clusters

• More stringent test. It is harder for larger k to pass this test. This is desirable.

Figure panels: Gap Statistic Reference Distribution; ABC Reference Distribution

Page 20:

ESTIMATING k REFERENCE DISTRIBUTIONS

Sample Data

Page 21:

ESTIMATING k REFERENCE DISTRIBUTIONS

Aligned Box Criterion

Pages 22-30:

ESTIMATING k REFERENCE DISTRIBUTIONS

Aligned Box Criterion (figure-only slides)

Page 31:

ALIGNED BOX CRITERION (ABC)

for k = 1 to K
    Cluster the observations into k groups and compute log Wk
    for b = 1 to B
        Considering each of the k clusters separately, compute Monte Carlo sample X1b, X2b, …, Xnb (n is # obs.)
        Cluster the M.C. sample into k groups and compute log Wkb
    Compute $\mathrm{ABC}(k) = \log W_k^{*} - \log W_k$
    Compute sd(k), the s.d. of {log Wkb}, b = 1, …, B
    Set the total s.e. $s_k = \sqrt{1 + 1/B}\,\mathrm{sd}(k)$

Find the smallest k such that $\mathrm{ABC}(k) \ge \mathrm{ABC}(k+1) - s_{k+1}$
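The step that sets ABC apart from the Gap statistic is the per-cluster Monte Carlo sample. A minimal sketch of that step follows, assuming each cluster's box is filled with as many uniform points as the cluster has observations (the slides do not state the per-box count, so this is an assumption) and that the principal components come from an SVD of each centered cluster. The function name and toy data are hypothetical.

```python
import numpy as np

def abc_reference_sample(X, labels, rng):
    """One reference sample: a uniform box per cluster, aligned with that
    cluster's principal components."""
    X = np.asarray(X, dtype=float)
    ref = np.empty_like(X)
    for r in np.unique(labels):
        idx = np.flatnonzero(labels == r)
        pts = X[idx]
        mean = pts.mean(axis=0)
        centered = pts - mean
        # Principal directions of this cluster only.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        Z = centered @ vt.T
        lo, hi = Z.min(axis=0), Z.max(axis=0)
        Zs = rng.uniform(lo, hi, size=Z.shape)   # assumption: n_r points per box
        ref[idx] = Zs @ vt + mean
    return ref

# Toy data (hypothetical): two blobs far apart, with known cluster labels.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(100, 2)),
               rng.normal(size=(100, 2)) + [100.0, 0.0]])
labels = np.repeat([0, 1], 100)
ref = abc_reference_sample(X, labels, rng)
```

Clustering `ref` into k groups gives one log Wkb value; repeating this for B samples supplies the quantities used in the ABC(k) and s_k formulas.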

Page 32:

ABC METHOD RESULTS

Page 33:

ESTIMATING k REFERENCE DISTRIBUTIONS

Wk* decreases faster.

Figure panels: Gap Statistic; Aligned Box Criterion

Page 34:

ESTIMATING k REFERENCE DISTRIBUTIONS

Figure panels: Gap Statistic; Aligned Box Criterion

Clearer maxima.

Page 35:

RESULTS SIMULATED: SEVEN OVERLAPPING CLUSTERS

Page 36:

RESULTS SIMULATED: SEVEN OVERLAPPING CLUSTERS

• Observations: 7,000

• Variables: 2

• Monte Carlo Replications: 20

Figure panels: CCC method; ABC method

Page 37:

RESULTS SIMULATED: SEVEN OVERLAPPING CLUSTERS

Page 38:

ESTIMATING k CLAIMS PREDICTION CHALLENGE DATA

• Anonymized customer data

• 32 customer and product features

• 13,184,290 customer records

Page 39:

ESTIMATING k EXECUTING CALCULATIONS

• Cubic clustering criterion: PROC FASTCLUS

• Gap statistic: R cluster package in the Open Source Integration Node in SAS Enterprise Miner

• Aligned box criterion: PROC HPCLUS

Page 40:

ESTIMATING k INTERPRETING RESULTS

Cubic Clustering Criterion

Page 41:

ESTIMATING k INTERPRETING RESULTS

Gap Statistic

Page 42:

ESTIMATING k INTERPRETING RESULTS

Aligned Box Criterion

Page 43:

REFERENCE DISTRIBUTION: EFFECT OF CHANGING NUMBER OF OBSERVATIONS

• How the number of observations in the reference distribution affects the result

• Given n observations in the input dataset, the reference distribution is generated with w*n observations, where w is between 0 and 1

Page 44:

RESULTS SIMPLE CASE

Page 45:

RESULTS DATA SET WITH MORE CLUSTERS

Page 46:

RESULTS DATA SET WITH MORE OBSERVATIONS

Page 47:

RESULTS REAL DATA

Kaggle Claims Prediction Challenge (n = 13,184,290, p = 35), 50 runs

Page 48:

RESULTS SCALABILITY

Page 49:

RESULTS STABILITY

Page 50:

ABC METHOD FOR PARALLEL AND DISTRIBUTED ARCHITECTURES

Page 51:

PARALLEL ABC PART 1-2

Diagram: Root, Node1, Node2, Node3, …, NodeN

Part 1:

1) Run k-means clustering (in parallel) for k clusters
2) Assign each observation to a cluster
3) Compute

Part 2:

1) Assign each cluster to a node
2) Collect the XX’ matrix for each cluster in the assigned node using a tree-based algorithm
3) Do PCA using the XX’ matrix
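The collect-then-PCA step for one cluster can be sketched on a single machine by treating array chunks as stand-ins for nodes. The sketch assumes observations are stored as rows, so the matrix being accumulated is the d-by-d product `chunk.T @ chunk`; the function names, the pairwise reduction scheme, and the toy data are illustrative assumptions, not the actual distributed implementation.

```python
import numpy as np

def local_stats(chunk):
    """Per-node summary for one cluster: count, column sums, cross-product matrix."""
    return len(chunk), chunk.sum(axis=0), chunk.T @ chunk

def tree_reduce(stats):
    """Pairwise (tree-style) summation of per-node summaries."""
    while len(stats) > 1:
        merged = []
        for i in range(0, len(stats) - 1, 2):
            a, b = stats[i], stats[i + 1]
            merged.append((a[0] + b[0], a[1] + b[1], a[2] + b[2]))
        if len(stats) % 2:
            merged.append(stats[-1])   # odd one out moves up a level unchanged
        stats = merged
    return stats[0]

def pca_from_stats(n, s, xtx):
    """PCA from the pooled summaries, without revisiting the raw data."""
    mean = s / n
    cov = xtx / n - np.outer(mean, mean)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]   # largest variance first
    return mean, eigvals[order], eigvecs[:, order]

# Toy data (hypothetical): one cluster's observations spread over 4 "nodes".
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3)) * [3.0, 1.0, 0.3]
chunks = np.array_split(X, 4)
n, s, xtx = tree_reduce([local_stats(c) for c in chunks])
mean, eigvals, eigvecs = pca_from_stats(n, s, xtx)
```

Only the small per-node summaries travel between nodes, so the communication cost depends on the number of variables rather than the number of observations.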

Page 52:

PARALLEL ABC PART 3-4

Diagram: Root, Node1, Node2, Node3, …, NodeN

Part 3:

1) Eigenvectors are broadcast to every node
2) Based on their assigned clusters, the observations in each node are projected into the new space

Part 4:

1) Bounding boxes are computed locally at each node for each cluster k
2) Bounding box information from each node is collected at the root and the root computes the bounding box coordinates for each cluster k
3) This information is distributed to each node and each node generates reference distributions
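The bounding-box exchange for a single cluster reduces to an elementwise minimum and maximum over the per-node corners. The sketch below simulates the nodes with array chunks; the function names, the number of nodes, and the choice to have each node generate as many reference points as it holds observations are assumptions made for illustration.

```python
import numpy as np

def local_box(Z):
    """Per-node min/max corners of the projected observations of one cluster."""
    return Z.min(axis=0), Z.max(axis=0)

def merge_boxes(boxes):
    """Root step: combine the per-node corners into one box for the cluster."""
    lows, highs = zip(*boxes)
    return np.min(lows, axis=0), np.max(highs, axis=0)

def sample_local_reference(lo, hi, n_local, rng):
    """Node step: generate this node's share of the uniform reference points."""
    return rng.uniform(lo, hi, size=(n_local, len(lo)))

# Toy data (hypothetical): projected observations of one cluster on 4 "nodes".
rng = np.random.default_rng(3)
Z = rng.normal(size=(120, 2))
node_parts = np.array_split(Z, 4)

lo, hi = merge_boxes([local_box(part) for part in node_parts])
reference = np.vstack([sample_local_reference(lo, hi, len(part), rng)
                       for part in node_parts])
```

The merged corners are identical to those computed from the full data, so no node ever needs to see another node's observations.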

Page 53:

PARALLEL ABC PART 5

Diagram: Root, Node1, Node2, Node3, …, NodeN

Run k-means clustering in parallel for the reference distribution and compute

Do this for B reference distributions

Compute ABC for cluster k

Page 54:

PARALLEL ABC PART 6

What about the O(n^3) complexity of SVD?

- Computation of XX’ is parallelized

- Or, use stochastic SVD
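The slide does not say which stochastic SVD variant is meant. One widely used option is a randomized range finder followed by an exact SVD of a much smaller matrix; the sketch below shows that variant as an assumption, with hypothetical parameter names (`oversample`, `n_power`) and toy data.

```python
import numpy as np

def randomized_svd(A, rank, rng, oversample=10, n_power=2):
    """Approximate top-`rank` SVD via a random projection of the column space."""
    width = min(rank + oversample, min(A.shape))
    # Random test matrix; A @ omega captures the dominant column space of A.
    omega = rng.normal(size=(A.shape[1], width))
    Q = np.linalg.qr(A @ omega)[0]
    # Power iterations sharpen the basis when singular values decay slowly.
    for _ in range(n_power):
        Q = np.linalg.qr(A.T @ Q)[0]
        Q = np.linalg.qr(A @ Q)[0]
    # Exact SVD of the small projected matrix, then map back.
    U_small, s, vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], vt[:rank]

# Toy data (hypothetical): a 200 x 20 matrix of exact rank 3.
rng = np.random.default_rng(5)
A = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 20))
U, s, vt = randomized_svd(A, rank=3, rng=rng)
```

For a matrix whose rank does not exceed the requested rank, the result matches the exact SVD up to floating-point error; for general matrices it is an approximation whose quality improves with `oversample` and `n_power`.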

Page 55:

ABC METHOD CONCLUSION

Page 56:

RESULTS

More accurate reference distributions lead to:

• Better defined maxima.

• Wk* values decreasing rapidly, especially for K > k.

• Exposure of possible alternative solutions.

Page 57:

CONCLUSION

For large, high-dimensional or noisy data, ABC is found to be:

• Stable

• Scalable

Moreover, it exhibits desirable properties:

• Clearer peaks

• More stringent hypothesis test promotes smaller k values

Page 58:

www.SAS.com

Q&A

THANK YOU