Hierarchical clustering and k-mean...

43
Clustering Hierarchical clustering and k-mean clustering Genome 373 Genomic Informatics Elhanan Borenstein

Transcript of Hierarchical clustering and k-mean...

Page 1: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Clustering Hierarchical clustering and k-mean clustering

Genome 373

Genomic Informatics

Elhanan Borenstein

Page 2: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

The clustering problem:

partition genes into distinct sets with high homogeneity and high separation

Many different representations

Many possible distance metrics Metric matters

Homogeneity vs separation

A quick review

Page 3: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

A good clustering solution should have two features:

1. High homogeneity: homogeneity measures the similarity between genes assigned to the same cluster.

2. High separation: separation measures the distance/dis-similarity between clusters. (If two clusters have similar expression patterns, then they should probably be merged into one cluster).

The clustering problem

Page 4: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

“Unsupervised learning” problem

No single solution is necessarily the true/correct!

There is usually a tradeoff between homogeneity and separation:

More clusters increased homogeneity but decreased separation

Less clusters Increased separation but reduced homogeneity

Method matters; metric matters; definitions matter;

There are many formulations of the clustering problem; most of them are NP-hard (why?).

In most cases, heuristic methods or approximations are used.

The “philosophy” of clustering

Page 5: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Many algorithms: Hierarchical clustering

k-means

self-organizing maps (SOM)

Knn

PCC

CAST

CLICK

The results (i.e., obtained clusters) can vary drastically depending on: Clustering method

Parameters specific to each clustering method (e.g. number of centers for the k-mean method, agglomeration rule for hierarchical clustering, etc.)

One problem, numerous solutions

Page 6: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Hierarchical clustering

Page 7: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

An agglomerative clustering method

Takes as input a distance matrix

Progressively regroups the closest objects/groups

The result is a tree - intermediate nodes represent clusters

Branch lengths represent distances between clusters

Hierarchical clustering

object 2

object 4

object 1

object 3

object 5

c1

c2

c3

c4

leaf

nodes

branch

node

root

Tree representation

ob

ject

1

ob

ject

2

ob

ject

3

ob

ject

4

ob

ject

5

object 1 0.00 4.00 6.00 3.50 1.00

object 2 4.00 0.00 6.00 2.00 4.50

object 3 6.00 6.00 0.00 5.50 6.50

object 4 3.50 2.00 5.50 0.00 4.00

object 5 1.00 4.50 6.50 4.00 0.00

Distance matrix

Page 8: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

mmm… Déjà vu anyone?

Page 9: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

1. Assign each object to a separate cluster.

2. Find the pair of clusters with the shortest distance, and regroup them into a single cluster.

3. Repeat 2 until there is a single cluster.

Hierarchical clustering algorithm

Page 10: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

1. Assign each object to a separate cluster.

2. Find the pair of clusters with the shortest distance, and regroup them into a single cluster.

3. Repeat 2 until there is a single cluster.

Hierarchical clustering algorithm

Page 11: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Hierarchical clustering

One needs to define a (dis)similarity metric between two groups. There are several possibilities

Average linkage: the average distance between objects from groups A and B

Single linkage: the distance between the closest objects from groups A and B

Complete linkage: the distance between the most distant objects from groups A and B

1. Assign each object to a separate cluster.

2. Find the pair of clusters with the shortest distance, and regroup them into a single cluster.

3. Repeat 2 until there is a single cluster.

Page 12: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Impact of the agglomeration rule These four trees were built from the same distance matrix,

using 4 different agglomeration rules.

Note: these trees were computed from a matrix of random numbers. The impression of structure is thus a complete artifact.

Single-linkage typically creates nesting clusters

Complete linkage create more balanced trees.

Page 13: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Hierarchical clustering result

13

Five clusters

Page 14: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering

Divisive

Non-hierarchical

Page 15: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center

Note that this is a somewhat strange definition:

Assignment of a point to a cluster is based on the proximity of the point to the cluster mean

But the cluster mean is calculated based on all the points assigned to the cluster.

K-mean clustering

cluster_2 mean cluster_1

mean

Page 16: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center

The chicken and egg problem: I do not know the means before I determine the partitioning into clusters I do not know the partitioning into clusters before I determine the means

Key principle - cluster around mobile centers:

Start with some random locations of means/centers, partition into clusters according to these centers, and then correct the centers according to the clusters (somewhat similar to expectation-maximization algorithm)

K-mean clustering: Chicken and egg

Page 17: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

The number of centers, k, has to be specified a-priori

Algorithm:

1. Arbitrarily select k initial centers

2. Assign each element to the closest center

3. Re-calculate centers (mean position of the assigned elements)

4. Repeat 2 and 3 until ….

K-mean clustering algorithm

Page 18: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

The number of centers, k, has to be specified a-priori

Algorithm:

1. Arbitrarily select k initial centers

2. Assign each element to the closest center

3. Re-calculate centers (mean position of the assigned elements)

4. Repeat 2 and 3 until one of the following termination conditions is reached:

i. The clusters are the same as in the previous iteration

ii. The difference between two iterations is smaller than a specified threshold

iii. The maximum number of iterations has been reached

K-mean clustering algorithm

How can we do this efficiently?

Page 19: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Assigning elements to the closest center

Partitioning the space

B

A

Page 20: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Assigning elements to the closest center

Partitioning the space

B

A

closer to A than to B

closer to B than to A

Page 21: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Assigning elements to the closest center

Partitioning the space

B

A

C

closer to A than to B

closer to B than to A

closer to B than to C

Page 22: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Assigning elements to the closest center

Partitioning the space

B

A

C

closest to A

closest to B

closest to C

Page 23: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Assigning elements to the closest center

Partitioning the space

B

A

C

Page 24: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Decomposition of a metric space determined by distances to a specified discrete set of “centers” in the space

Each colored cell represents the collection of all points in this space that are closer to a specific center s than to any other center

Several algorithms exist to find the Voronoi diagram.

Voronoi diagram

Page 25: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

The number of centers, k, has to be specified a priori

Algorithm:

1. Arbitrarily select k initial centers

2. Assign each element to the closest center (Voronoi)

3. Re-calculate centers (mean position of the assigned elements)

4. Repeat 2 and 3 until one of the following termination conditions is reached:

i. The clusters are the same as in the previous iteration

ii. The difference between two iterations is smaller than a specified threshold

iii. The maximum number of iterations has been reached

K-mean clustering algorithm

Page 26: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering example Two sets of points

randomly generated 200 centered on (0,0)

50 centered on (1,1)

Page 27: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering example Two points are

randomly chosen as centers (stars)

Page 28: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering example Each dot can now

be assigned to the cluster with the closest center

Page 29: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering example First partition into

clusters

Page 30: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Centers are re-calculated

K-mean clustering example

Page 31: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering example And are again used

to partition the points

Page 32: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering example Second partition into

clusters

Page 33: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering example Re-calculating centers

again

Page 34: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering example And we can again

partition the points

Page 35: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering example Third partition

into clusters

Page 36: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering example After 6 iterations:

The calculated centers remains stable

Page 37: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering: Summary The convergence of k-mean is usually quite fast

(sometimes 1 iteration results in a stable solution)

K-means is time- and memory-efficient

Strengths:

Simple to use

Fast

Can be used with very large data sets

Weaknesses:

The number of clusters has to be predetermined

The results may vary depending on the initial choice of centers

Page 38: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

K-mean clustering: Variations

Expectation-maximization (EM): maintains probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means.

k-means++: attempts to choose better starting points.

Some variations attempt to escape local optima by swapping points between clusters

Page 39: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

The take-home message

D’haeseleer, 2005

Hierarchical clustering

K-mean clustering

?

Page 40: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK
Page 41: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

What else are we missing?

Page 42: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

What if the clusters are not “linearly separable”?

What else are we missing?

Page 43: Hierarchical clustering and k-mean clusteringelbo.gs.washington.edu/courses/GS_373_15_sp/slides/... · Hierarchical clustering k-means self-organizing maps (SOM) Knn PCC CAST CLICK

Clustering in both dimensions We can cluster genes, conditions (samples), or both.