Towards Theoretical Foundations of Clustering

Margareta Ackerman

University of Waterloo


The Theory-Practice Gap

Clustering is one of the most widely used tools for exploratory data analysis.

Social sciences, biology, astronomy, computer science, …

All apply clustering to gain a first understanding of the structure of large data sets.


The Theory-Practice Gap

“While the interest in and application of cluster analysis has been rising rapidly, the abstract nature of the tool is still poorly understood” (Wright, 1973)

“There has been relatively little work aimed at reasoning about clustering independently of any particular algorithm, objective function, or generative data model” (Kleinberg, 2002)

Both statements still apply today.

Inherent Obstacles: Clustering is ill-defined

Clustering aims to assign data into groups of similar items.

Beyond that, there is very little consensus on the definition of clustering.

Inherent Obstacles

• Clustering is inherently ambiguous
– There may be multiple reasonable clusterings
– There is usually no ground truth

• There are many clustering algorithms with different (often implicit) objective functions

Outline

• Previous work
• Clustering algorithm selection
• Characterization of Linkage-Based clustering
– Sketch of proof
– Hierarchical algorithms that are not linkage-based
• Conclusions and future work

Previous Work Towards a General Theory: Axiomatizing clustering

• Clustering in the weighted setting (Wright, ‘73)

• Axioms of clustering distance functions (Meila, ACM ‘05)

• Impossibility result (Kleinberg, NIPS ‘02)

• Rebuttal to impossibility result (Ackerman & Ben-David, NIPS ‘08)

Previous Work Towards a General Theory: Clusterability

• Conditions for efficiently uncovering the target clustering (Balcan, Blum, and Vempala, STOC ‘08; Balcan, Blum, and Gupta, SODA ‘09)

• Theoretical study of clusterability (Ackerman & Ben-David, AISTATS ‘09)
– Notions of clusterability are pairwise distinct
– Data sets that are more clusterable are computationally easier to cluster well


Clustering Algorithm Selection

There is a wide variety of clustering algorithms, which often produce very different clusterings.

How should a user decide which algorithm to use for a given application?

Clustering Algorithm Selection

Users rely on cost-related considerations: running times, space usage, software purchasing costs, etc.

There is inadequate emphasis on input-output behaviour.

Radical Differences in Input/Output Behavior of Clustering Algorithms

Our Framework for Clustering Algorithm Selection

We propose a framework that lets a user utilize prior knowledge to select an algorithm.

• Identify properties that distinguish between the input-output behaviour of different clustering paradigms

• The properties should be:
1) Intuitive and “user-friendly”
2) Useful for distinguishing clustering algorithms

Our Framework for Clustering Algorithm Selection

• The long-term goal is to construct a large property-based classification for many useful clustering algorithms

• This would facilitate the application of prior knowledge

• It would enable users to identify a suitable algorithm without the overhead of executing many algorithms

• This framework helps understand the behaviour of existing and new algorithms

Taxonomy of Partitional Algorithms (Ackerman, Ben-David, Loker, NIPS 2010)

Properties (table columns): Local, Outer Consistent, Inner Consistent, Refinement Preserving, Order Invariant, Outer Rich, Scale Invariant, Isomorphism Invariant

Algorithms (table rows): Single linkage, Average linkage, Complete linkage, K-means, K-median, Min-Sum, Ratio-cut, Normalized-cut

Axioms vs. Properties

(The taxonomy table from the previous slide, with its columns grouped under the labels “Properties” and “Axioms”.)


Characterization of Linkage-Based Clustering (Ackerman, Ben-David, Loker, COLT 2010)

The 2010 characterization applies in the partitional setting, using the k-stopping criterion.

This characterization distinguishes linkage-based algorithms from other partitional algorithms.

(The taxonomy table, restricted to the rows for single linkage, average linkage, and complete linkage.)

Characterizing Linkage-Based Clustering in the Hierarchical Setting (Ackerman and Ben-David, IJCAI 2011)

• Propose two intuitive properties that uniquely identify hierarchical linkage-based clustering algorithms.

• Show that common hierarchical algorithms, including bisecting k-means, cannot be simulated by any linkage-based algorithm.


Formal Setup: Dendrograms and clusterings

C_i is a cluster in a dendrogram D if there exists a node in the dendrogram such that C_i is the set of its leaf descendants.

Formal Setup: Dendrograms and clusterings

C = {C_1, …, C_k} is a clustering in a dendrogram D if
– C_i is a cluster in D for all 1 ≤ i ≤ k, and
– the clusters are disjoint
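The two definitions above can be checked mechanically on a small example. The following is a minimal sketch, not taken from the talk: it assumes a hypothetical representation in which a leaf is a plain value and an internal node is a tuple of subtrees, and the function names are illustrative only.

```python
from itertools import combinations

# Assumed representation (not from the slides): a leaf is a data point,
# an internal node is a tuple of child subtrees.
def leaves(node):
    """Return the set of leaf descendants of a node."""
    if not isinstance(node, tuple):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def clusters(dendrogram):
    """Return every cluster of the dendrogram: the leaf set of each node."""
    found = {leaves(dendrogram)}
    if isinstance(dendrogram, tuple):
        for child in dendrogram:
            found |= clusters(child)
    return found

def is_clustering(candidate, dendrogram):
    """True if every set is a cluster in the dendrogram and the sets are disjoint."""
    all_clusters = clusters(dendrogram)
    sets = [frozenset(c) for c in candidate]
    if any(c not in all_clusters for c in sets):
        return False
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))

# Example dendrogram over the points a, b, c, d, e.
D = ((("a", "b"), "c"), ("d", "e"))
ok = is_clustering([{"a", "b"}, {"d", "e"}], D)            # both are clusters, disjoint
overlapping = is_clustering([{"a", "b"}, {"a", "b", "c"}], D)  # clusters, but not disjoint
not_a_cluster = is_clustering([{"b", "c"}], D)             # no node has leaf set {b, c}
```

Note that, as in the slide, a clustering in a dendrogram is not required to cover the whole data set.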

Formal Setup: Hierarchical clustering algorithm

A Hierarchical Clustering Algorithm A maps

Input: A data set X with a dissimilarity function d, denoted (X,d)

to

Output: A dendrogram of X


Linkage-Based Algorithm

• Create a leaf node for every element of X

• Repeat the following until a single tree remains:
– Consider the clusters represented by the remaining root nodes.
– Merge the closest pair of clusters by assigning them a common parent node.
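The loop described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the speaker's code: points are assumed to be distinct hashable values, the dendrogram is returned as nested pairs, single linkage is used as the default linkage function, and ties are broken by taking the first closest pair found.

```python
from itertools import combinations

def single_linkage(a, b, d):
    """Shortest distance between a point of a and a point of b."""
    return min(d(x, y) for x in a for y in b)

def linkage_based(points, d, linkage=single_linkage):
    """Build a dendrogram (nested pairs) by repeatedly merging the closest roots."""
    # Each root is a pair (subtree, set of leaves under it).
    roots = [(p, frozenset([p])) for p in points]
    while len(roots) > 1:
        # Find the pair of remaining roots with the smallest linkage value.
        i, j = min(
            combinations(range(len(roots)), 2),
            key=lambda ij: linkage(roots[ij[0]][1], roots[ij[1]][1], d),
        )
        (t1, s1), (t2, s2) = roots[i], roots[j]
        # Give the two roots a common parent node.
        roots = [r for k, r in enumerate(roots) if k not in (i, j)]
        roots.append(((t1, t2), s1 | s2))
    return roots[0][0]

def d(x, y):
    return abs(x - y)

tree = linkage_based([0.0, 1.0, 3.0, 7.0], d)
print(tree)  # (7.0, (3.0, (0.0, 1.0)))
```

Passing a different `linkage` argument changes only how "closest" is measured; the merging loop itself stays the same.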

Examples of Linkage-Based Algorithms

• The choice of linkage function distinguishes between different linkage-based algorithms.

• Examples of common linkage functions:
– Single-linkage: shortest between-cluster distance
– Average-linkage: average between-cluster distance
– Complete-linkage: maximum between-cluster distance

(Figure: two clusters, X1 and X2.)
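The three linkage functions listed above differ only in how they aggregate the between-cluster distances. A minimal sketch follows; the function names, the example clusters, and the distance function are assumptions made for illustration.

```python
def single_linkage(x1, x2, d):
    """Shortest between-cluster distance."""
    return min(d(a, b) for a in x1 for b in x2)

def average_linkage(x1, x2, d):
    """Average between-cluster distance."""
    return sum(d(a, b) for a in x1 for b in x2) / (len(x1) * len(x2))

def complete_linkage(x1, x2, d):
    """Maximum between-cluster distance."""
    return max(d(a, b) for a in x1 for b in x2)

def d(a, b):
    return abs(a - b)

X1, X2 = [0.0, 1.0], [4.0, 6.0]
# Between-cluster distances are 4, 6, 3 and 5.
print(single_linkage(X1, X2, d))    # 3.0
print(average_linkage(X1, X2, d))   # 4.5
print(complete_linkage(X1, X2, d))  # 6.0
```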

Locality: Informal Definition

If we select a set of disjoint clusters from a dendrogram, and run the algorithm on the union of these clusters, we obtain a result that is consistent with the original dendrogram.

Figure labels: D = A(X,d); X’ = {x1, …, x6}; D’ = A(X’,d)


Outer Consistency

Figure: the dendrogram A(X,d) with a clustering C, shown as C on dataset (X,d) and as C on dataset (X,d’) after an outer-consistent change.

If A is outer-consistent, then A(X,d’) will also include the clustering C.
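The slide does not spell out what an outer-consistent change is. The sketch below assumes the definition commonly used in this line of work: d’ is an outer-consistent change of d with respect to C if distances within each cluster of C are unchanged and distances between different clusters do not decrease. The function name and the example data are hypothetical.

```python
from itertools import combinations

def is_outer_consistent_change(points, clustering, d, d_new):
    """Check the assumed definition: within-cluster distances are unchanged,
    between-cluster distances do not shrink."""
    label = {x: i for i, cluster in enumerate(clustering) for x in cluster}
    for x, y in combinations(points, 2):
        if label[x] == label[y]:
            if d_new(x, y) != d(x, y):
                return False
        elif d_new(x, y) < d(x, y):
            return False
    return True

points = [0, 1, 5, 6]
C = [{0, 1}, {5, 6}]

def d(x, y):
    return abs(x - y)

def d_stretched(x, y):
    # Push the two clusters apart; leave within-cluster distances alone.
    same_cluster = (x < 3) == (y < 3)
    return abs(x - y) if same_cluster else abs(x - y) + 10

def d_shrunk(x, y):
    # Halves every distance, so it is not an outer-consistent change.
    return abs(x - y) / 2

stretched_ok = is_outer_consistent_change(points, C, d, d_stretched)
shrunk_ok = is_outer_consistent_change(points, C, d, d_shrunk)
```

Under this reading, an outer-consistent algorithm A whose dendrogram A(X,d) contains C must still contain C in A(X, d_stretched).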

Characterization of Linkage-Based Clustering

Theorem (Ackerman & Ben-David, IJCAI 2011): A hierarchical clustering algorithm is Linkage-Based if and only if it is Local and Outer-Consistent.


Easy Direction of Proof

Every Linkage-Based hierarchical clustering algorithm is Local and Outer-Consistent.

The proof is quite straightforward.

Interesting Direction of Proof

If A is Local and Outer-Consistent, then A is Linkage-Based.

To prove this direction, we first need to formalize Linkage-Based clustering by formally defining what a Linkage Function is.

What Do We Expect From Linkage Functions?

A Linkage Function is a function

l : {(X1, X2, d) : d is a distance function over X1 ∪ X2} → R+

that satisfies the following:

– Representation independence: l doesn’t change if we re-label the data
– Monotonicity: if we increase the edges that go between X1 and X2, then l(X1, X2, d) doesn’t decrease

(Figure: the data set (X1 ∪ X2, d), split into X1 and X2.)

Sketch of proof

Recall the direction: if A satisfies Outer-Consistency and Locality, then A is Linkage-Based.

Goal: Define a linkage function l so that the linkage-based clustering based on l outputs A(X,d) (for every X and d).

Sketch of proof

• Define an operator <A:
(X,Y,d1) <A (Z,W,d2) if, when we run A on (X ∪ Y ∪ Z ∪ W, d), where d extends d1 and d2, X and Y are merged before Z and W.

• Prove that <A can be extended to a partial ordering

• Use the ordering to define l

(Figure: a dendrogram A(X,d) over the sets Z, W, X, and Y.)

Sketch of proof (continued): Show that <A is a partial ordering

We show that <A is cycle-free.

Lemma: Given a hierarchical algorithm A that is Local and Outer-Consistent, there exists no finite sequence such that

(X1,Y1,d1) <A … <A (Xn,Yn,dn) <A (X1,Y1,d1).

Sketch of proof (continued)

• By the above Lemma, the transitive closure of <A is a partial ordering.

• This implies that there exists an order-preserving function l that maps pairs of data sets to R+.

• It can be shown that l satisfies the properties of a Linkage Function.
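The step from a cycle-free relation to an order-preserving function can be illustrated on a finite example. The sketch below is only an analogy for the argument above: the relation in the proof ranges over all triples (X, Y, d), whereas this code handles a finite set of hypothetical labels and uses a topological sort (Python's `graphlib`, available from Python 3.9) to assign positive ranks.

```python
from graphlib import TopologicalSorter, CycleError

def order_preserving_map(relation):
    """Given pairs (smaller, larger) of a cycle-free relation, return a dict
    mapping each element to a positive number that respects the relation."""
    predecessors = {}
    for smaller, larger in relation:
        predecessors.setdefault(larger, set()).add(smaller)
        predecessors.setdefault(smaller, set())
    # static_order lists every element after all of its predecessors;
    # it raises CycleError if the relation contains a cycle.
    order = TopologicalSorter(predecessors).static_order()
    return {element: float(rank) for rank, element in enumerate(order, start=1)}

# Hypothetical labels standing in for triples (X, Y, d).
relation = [("p", "q"), ("q", "r"), ("p", "s")]
l = order_preserving_map(relation)
respects_order = all(l[a] < l[b] for a, b in relation)
```

If the relation had a cycle, no such map could exist, which is why the Lemma on cycle-freeness is needed before l can be defined; in that case the sketch raises `CycleError`.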


Hierarchical but Not Linkage-Based

P-Divisive algorithms construct dendrograms top-down, using a partitional 2-clustering algorithm P to split nodes.

Figure caption: Apply partitional clustering P (ex. k-means for k=2).
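A P-divisive algorithm can be sketched as a short recursion. The code below is an illustration under assumptions, not the speaker's implementation: the points are distinct numbers on a line, and P is a brute-force search for the 2-clustering with the lowest k-means cost, which stands in for a practical k-means heuristic with k=2.

```python
from itertools import combinations

def two_means_split(points):
    """Assumed stand-in for P: the exact minimum-cost 2-means split,
    found by trying every bipartition (only feasible for small inputs)."""
    def cost(cluster):
        mean = sum(cluster) / len(cluster)
        return sum((x - mean) ** 2 for x in cluster)

    pts = list(points)
    best = None
    for size in range(1, len(pts) // 2 + 1):
        for left in combinations(pts, size):
            right = [x for x in pts if x not in left]
            total = cost(left) + cost(right)
            if best is None or total < best[0]:
                best = (total, list(left), right)
    return best[1], best[2]

def divisive(points, split=two_means_split):
    """Build a dendrogram top-down by splitting each node with P."""
    if len(points) == 1:
        return points[0]
    left, right = split(points)
    return (divisive(left, split), divisive(right, split))

tree = divisive([0.0, 1.0, 5.0, 6.0, 20.0])
print(tree)  # (20.0, ((0.0, 1.0), (5.0, 6.0)))
```

Any other partitional 2-clustering algorithm can be passed as `split`; the recursion does not depend on the choice of P.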

Hierarchical but Not Linkage-Based

A partitional 2-clustering algorithm P is Context Sensitive if there exist d ⊂ d’ such that

P({x,y,z},d) = {{x}, {y,z}} and P({x,y,z,w},d’) = {{x,y}, {z,w}}.

Ex. K-means, min-sum, min-diameter.

Theorem [Ackerman & Ben-David, IJCAI ’11]: If P is context-sensitive, then the P-divisive algorithm fails the locality property.

Hierarchical but Not Linkage-Based

• The input-output behaviour of some natural divisive algorithms is distinct from that of all linkage-based algorithms.

• The bisecting k-means algorithm, and other natural divisive algorithms, cannot be simulated by any linkage-based algorithm.

Conclusions

• We present a new framework for clustering algorithm selection

• Provide a property-based classification of common clustering algorithms

• Characterize linkage-based clustering in terms of two natural properties

• Show that no linkage-based algorithm can simulate some natural divisive algorithms

What’s Next?

• Our approach to selecting clustering algorithms can be applied to any clustering application (ex. phylogeny).

• Classify applications in terms of their clustering needs
– Target research on common clustering needs or specific applications
– Identify when results are relevant to specific applications

• Bridging the gap in other clustering settings (ex. clustering with a “noise cluster”)

• Axioms of clustering algorithms