Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a...

33
Clustering (2)

Transcript of Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a...

Page 1: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Clustering (2)

Page 2: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Hierarchical Clustering • Produces a set of nested clusters organized as a hierarchical tree

• Can be visualized as a dendrogram– A tree like diagram that records the sequences of merges or splits

1 3 2 5 4 60

0.05

0.1

0.15

0.2

1

2

3

4

5

6

1

23 4

5

Page 3: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Strengths of Hierarchical Clustering• Do not have to assume any particular number of clusters

– ‘cut’ the dendogram at the proper level

• They may correspond to meaningful taxonomies– Example in biological sciences e.g.,

• animal kingdom,

• phylogeny reconstruction,

• …

Page 4: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Hierarchical Clustering• Start with the points as individual clusters• At each step, merge the closest pair of clusters until only one

cluster left.

AlgorithmLet each data point be a clusterCompute the proximity matrixRepeat

Merge the two closest clustersUpdate the proximity matrix

Until only a single cluster remains

• Key operation is the computation of the proximity of two clusters.

Page 5: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Starting Situation • Start with clusters of individual points and a proximity matrix

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

. Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

Page 6: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Intermediate Situation• After some merging steps, we have some clusters

C1

C4

C2 C5

C3

C2C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

Page 7: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Intermediate Situation• We want to merge the two closest clusters (C2 and C5) and update the

proximity matrix.

C1

C4

C2 C5

C3

C2C1

C1

C3

C5

C4

C2

C3 C4 C5

Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

Page 8: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

After Merging• The question is “How do we update the proximity matrix?”

C1

C4

C2 U C5

C3? ? ? ?

?

?

?

C2 U C5C1

C1

C3

C4

C2 U C5

C3 C4

Proximity Matrix

...p1 p2 p3 p4 p9 p10 p11 p12

Page 9: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

.

Distance?

• MIN• MAX• Group Average

Proximity Matrix

Page 10: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

.Proximity Matrix

• MIN• MAX• Group Average

Page 11: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

.Proximity Matrix

• MIN• MAX• Group Average

Page 12: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

How to Define Inter-Cluster Similarity

p1

p3

p5

p4

p2

p1 p2 p3 p4 p5 . . .

.

.

.Proximity Matrix

• MIN• MAX• Group Average

Page 13: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Cluster Similarity: MIN• Similarity of two clusters is based on the two most similar

(closest) points in the different clusters– Determined by one pair of points

Page 14: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Hierarchical Clustering: MIN

Nested Clusters Dendrogram

1

2

3

4

5

6

1

2

3

4

5

3 6 2 5 4 10

0.05

0.1

0.15

0.2

Page 15: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Strength of MIN

Original Points Two Clusters

Can handle non-globular shapes

Page 16: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Limitations of MIN

Sensitive to noise and outliers

Original Points Four clusters Three clusters:

The yellow points got wrongly merged with the red ones, as opposed to the green one.

Page 17: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Cluster Similarity: MAX• Similarity of two clusters is based on the two least similar (most

distant) points in the different clusters– Determined by all pairs of points in the two clusters

Page 18: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Hierarchical Clustering: MAX

Nested Clusters Dendrogram

3 6 4 1 2 50

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

1

2

3

4

5

6

1

2 5

3

4

Page 19: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Strengths of MAX

Less susceptible respect to noise and outliers

Original Points Four clusters Three clusters:

The yellow points get now merged with the green one.

Page 20: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Limitations of MAX

Original Points Two Clusters

Tends to break large clusters

Page 21: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Cluster Similarity: Group Average• Proximity of two clusters is the average of pairwise proximity between points

in the two clusters.

||Cluster||Cluster

)p,pproximity(

)Cluster,Clusterproximity(ji

ClusterpClusterp

ji

jijjii

Page 22: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Hierarchical Clustering: Group Average

Nested Clusters Dendrogram

3 6 4 1 2 50

0.05

0.1

0.15

0.2

0.25

1

2

3

4

5

6

1

2

5

3

4

Page 23: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Hierarchical Clustering: Time and Space

• O(N2) space since it uses the proximity matrix. – N is the number of points.

• O(N3) time in many cases– There are N steps and at each step the size, N2, proximity matrix

must be updated and searched

– Complexity can be reduced to O(N2 log(N) ) time for some approaches

Page 24: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Hierarchical Clustering Example

Page 25: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Hierarchical Clustering Example

From“Indo-European languages tree by Levenshtein distance”by M. Serva1 and F. Petroni

Page 26: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

DBSCANDBSCAN is a density-based algorithm.

Locates regions of high density that are separated from one another by regions of low density.

• Density = number of points within a specified radius (Eps)

• A point is a core point if it has more than a specified number of points

(MinPts) within Eps – These are points that are at the interior of a cluster

• A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point

• A noise point is any point that is neither a core point nor a border point.

Page 27: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

DBSCAN: Core, Border, and Noise Points

Page 28: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

DBSCAN Algorithm• Any two core points that are close enough---within a distance

Eps of one another---are put in the same cluster.

• Likewise, any border point that is close enough to a core point is put in the same cluster as the core point.

• Ties may need to be resolved if a border point is close to core points from different clusters.

• Noise points are discarded.

Page 29: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

DBSCAN: Core, Border and Noise Points

Original Points Point types: core, border and noise

Eps = 10, MinPts = 4

Page 30: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

When DBSCAN Works Well

Original Points Clusters

• Resistant to Noise

• Can handle clusters of different shapes and sizes

Page 31: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

When DBSCAN Does NOT Work Well

Why DBSCAN doesn’t work well here?

Page 32: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

DBSCAN: Determining EPS and MinPts• Look at the behavior of the distance from a point to its k-th

nearest neighbor, called the k dist.

• For points that belong to some cluster, the value of k dist will be small [if k is not larger than the cluster size].

• However, for points that are not in a cluster, such as noise points, the k dist will be relatively large.

• So, if we compute the k dist for all the data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k dist that corresponds to a suitable value of Eps.

• If we select this distance as the Eps parameter and take the value of k as the MinPts parameter, then points for which k dist is less than Eps will be labeled as core points, while other points will be labeled as noise or border points.

Page 33: Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

DBSCAN: Determining EPS and MinPts

• Eps determined in this way depends on k, but does not change dramatically as k changes.

• If k is too small ? then even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters.

• If k is too large ? then small clusters (of size less than k) are likely to be labeled as noise.

• Original DBSCAN used k = 4, which appears to be a reasonable value for most data sets.