Ulf Schmitz, Pattern recognition - Clustering (Bioinformatics)
Date posted: 22-Dec-2015

Bioinformatics
Pattern recognition - Clustering

Ulf Schmitz
ulf.schmitz@informatik.uni-rostock.de

Bioinformatics and Systems Biology Group
www.sbi.informatik.uni-rostock.de

Outline

1. Introduction

2. Hierarchical clustering

3. Partitional clustering: k-means and derivatives

4. Fuzzy Clustering

Introduction to clustering algorithms

• Clustering is the classification of similar objects into separate groups, or the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait

• Machine learning typically regards clustering as a form of unsupervised learning

• We distinguish:
  – Hierarchical clustering (finds successive clusters using previously established clusters)
  – Partitional clustering (determines all clusters at once)

Introduction to clustering algorithms

Applications

• gene expression data analysis
• identification of regulatory binding sites
• phylogenetic tree clustering (for inference of horizontally transferred genes)
• protein domain identification
• identification of structural motifs

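Introduction to clustering algorithms

Data matrix

• the data matrix collects observations of n objects, each described by m measurements

• rows refer to objects, characterised by the values in the columns

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix}$$

• if the units of measurement associated with the columns of X differ, it is necessary to normalise:

$$x_j^{\text{new}} = \frac{x_j - \bar{x}_j}{\sigma(x_j)}$$

where $x_j$ is the j-th column vector, $\bar{x}_j$ its mean and $\sigma(x_j)$ its standard deviation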
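The column-wise normalisation above can be sketched in a few lines of plain Python. This is only an illustration: the function name `standardize_columns` and the sample matrix are made up for the example, and the slide does not say whether the standard deviation divides by n or n - 1 (n is assumed here).

```python
import math

def standardize_columns(X):
    """Return a copy of X with every column shifted to mean 0 and scaled to std 1."""
    n = len(X)        # number of objects (rows)
    m = len(X[0])     # number of measurements (columns)
    means = [sum(row[j] for row in X) / n for j in range(m)]
    # Assumption: population standard deviation (divide by n).
    # A constant column would give std 0 and is not handled here.
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n)
            for j in range(m)]
    return [[(row[j] - means[j]) / stds[j] for j in range(m)] for row in X]

# Hypothetical data: two columns measured in very different units.
X = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
Z = standardize_columns(X)
for row in Z:
    print([round(v, 3) for v in row])
```

After standardisation both columns are on the same scale, so neither dominates a distance computation merely because of its unit.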

Hierarchical clustering

produces a sequence of nested partitions; the steps are:

1. find the dis/similarity between every pair of objects in the data set by evaluating a distance measure

2. group the objects into a hierarchical cluster tree (dendrogram) by linking newly formed clusters

3. obtain a partition of the data set into clusters by selecting a suitable ‘cut-level’ of the cluster tree

Hierarchical clustering

Agglomerative hierarchical clustering

1. start with n clusters, each containing one object, and calculate the distance matrix D1

2. determine from D1 which of the objects are least distant (e.g. I and J)

3. merge these objects into one cluster and form a new distance matrix by deleting the entries for the clustered objects and adding the distances for the new cluster

4. repeat steps 2 and 3 a total of n-1 times until a single cluster is formed
   • record which clusters are merged at each step
   • record the distances between the clusters that are merged in that step
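The agglomerative steps above can be sketched as a naive loop. Two things here are assumptions rather than part of the slides: the cluster distance is taken to be the minimum pairwise Euclidean distance (single linkage), and distances are recomputed on every pass instead of updating a stored distance matrix. The function name `agglomerate` and the four sample points are hypothetical.

```python
import math

def agglomerate(points):
    """Naive agglomerative clustering that returns the merge history.

    Each history entry is (cluster_a, cluster_b, distance); clusters are
    tuples of object indices. Single linkage is an assumption of this sketch.
    """
    clusters = [(i,) for i in range(len(points))]   # step 1: n singleton clusters

    def cluster_distance(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    history = []
    while len(clusters) > 1:
        # step 2: find the least distant pair of clusters
        a, b = min(
            ((a, b) for idx, a in enumerate(clusters) for b in clusters[idx + 1:]),
            key=lambda pair: cluster_distance(*pair),
        )
        # step 4: record what was merged and at which distance
        history.append((a, b, cluster_distance(a, b)))
        # step 3: replace the two clusters by their union
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
for a, b, d in agglomerate(points):
    print(a, b, round(d, 3))
```

The history has one entry per merge, and cutting it at a chosen distance yields a partition of the data set.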

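Hierarchical clustering

calculating the distances

one treats the data matrix X as a set of n (row) vectors with m elements

Euclidean distance:

$$d_{rs}^2 = \sum_{j=1}^{m} (x_{rj} - x_{sj})^2$$

City block distance:

$$d_{rs} = \sum_{j=1}^{m} |x_{rj} - x_{sj}|$$

where $x_r$ and $x_s$ are row vectors of X

an example:

$$X = \begin{pmatrix} 1 & 2 \\ 2.5 & 4.5 \\ 2 & 2 \\ 4 & 1.5 \\ 4 & 2.5 \end{pmatrix}$$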

Hierarchical clustering

an example: for the rows $x_3 = (2, 2)$ and $x_5 = (4, 2.5)$ of X

Euclidean distance: $d_{35} = 2.06$

City block distance: $d_{35} = 2.5$
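As a quick check of the two distance measures, the following sketch recomputes the example values for the third and fifth rows of X, (2, 2) and (4, 2.5). The function names are chosen for this illustration only; note that `euclidean` returns the distance itself, i.e. the square root of the squared form given on the slide.

```python
import math

def euclidean(x_r, x_s):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_r, x_s)))

def city_block(x_r, x_s):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x_r, x_s))

x3 = (2.0, 2.0)
x5 = (4.0, 2.5)
print(round(euclidean(x3, x5), 2))   # 2.06
print(city_block(x3, x5))            # 2.5
```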

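Hierarchical clustering

distance matrix (Euclidean) for the example X with rows x1 = (1, 2), x2 = (2.5, 4.5), x3 = (2, 2), x4 = (4, 1.5), x5 = (4, 2.5):

      x1      x2      x3      x4      x5
x1    0       2.9155  1.0000  3.0414  3.0414
x2    2.9155  0       2.5495  3.3541  2.5000
x3    1.0000  2.5495  0       2.0616  2.0616
x4    3.0414  3.3541  2.0616  0       1.0000
x5    3.0414  2.5000  2.0616  1.0000  0

distance matrix after merging the least distant pairs (x1, x3) and (x4, x5):

          x1, x3  x2      x4, x5
x1, x3    0       2.9155  2.0616
x2        2.9155  0       2.5000
x4, x5    2.0616  2.5000  0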
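The full 5 x 5 distance matrix can be reproduced with a short sketch. The helper name `distance_matrix` is made up for this example, and Euclidean distance is assumed because it matches the tabulated values.

```python
import math

# Rows of the example matrix X (objects x1 .. x5).
X = [(1.0, 2.0), (2.5, 4.5), (2.0, 2.0), (4.0, 1.5), (4.0, 2.5)]

def distance_matrix(points):
    """Symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for s in range(r + 1, n):
            D[r][s] = D[s][r] = math.dist(points[r], points[s])
    return D

D = distance_matrix(X)
for row in D:
    print("  ".join(f"{d:.4f}" for d in row))
```

The smallest off-diagonal entries (1.0000 for x1/x3 and for x4/x5) identify the pairs that are merged first.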

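Hierarchical clustering

Methods to define a distance between clusters:

single linkage:

$$d_{IJ} = \min \{ d_{ij} : i \in I \text{ and } j \in J \}$$

complete linkage:

$$d_{IJ} = \max \{ d_{ij} : i \in I \text{ and } j \in J \}$$

group average:

$$d_{IJ} = \frac{\sum_{i \in I} \sum_{j \in J} d_{ij}}{N_I N_J}$$

N is the number of members in a cluster

centroid linkage:

$$d_{IJ} = d(\bar{x}_I, \bar{x}_J), \quad \text{where } \bar{x}_J = \frac{1}{n_J} \sum_{i=1}^{n_J} x_i$$

[Figure: cluster I = {1, 2} and cluster J = {3, 4, 5}, with the pairwise distances d13, d14, d15, d23, d24, d25 and the cluster distance dIJ]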
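The four cluster distances can be compared side by side with a small sketch. It uses the five example points and Euclidean distance as the underlying measure; the function names and the choice of the two clusters {x1, x3} and {x4, x5} are made for illustration only.

```python
import math

def pairwise(points, I, J):
    """All distances d_ij with i in cluster I and j in cluster J."""
    return [math.dist(points[i], points[j]) for i in I for j in J]

def single_linkage(points, I, J):
    return min(pairwise(points, I, J))

def complete_linkage(points, I, J):
    return max(pairwise(points, I, J))

def group_average(points, I, J):
    return sum(pairwise(points, I, J)) / (len(I) * len(J))

def centroid(points, I):
    dims = len(points[0])
    return tuple(sum(points[i][d] for i in I) / len(I) for d in range(dims))

def centroid_linkage(points, I, J):
    return math.dist(centroid(points, I), centroid(points, J))

X = [(1.0, 2.0), (2.5, 4.5), (2.0, 2.0), (4.0, 1.5), (4.0, 2.5)]
I, J = [0, 2], [3, 4]   # clusters {x1, x3} and {x4, x5}, zero-based indices

for f in (single_linkage, complete_linkage, group_average, centroid_linkage):
    print(f.__name__, round(f(X, I, J), 4))
```

For the same pair of clusters the four definitions give different values, which is why the choice of linkage changes the resulting tree.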

Hierarchical clustering

Limits of hierarchical clustering

1. the choice of distance measure is important

2. there is no provision for reassigning objects that have been incorrectly grouped

3. errors are not handled explicitly in the procedure

4. no method of calculating intercluster distances is universally the best
   • but single-linkage clustering is the least successful
   • and group average clustering tends to perform fairly well

Partitional clustering – k-means

• involves prior specification of the number of clusters, k
• no pairwise distance matrix is required
• the relevant distance is the distance from the object to the cluster center (centroid)

Partitional clustering – k-means

1. partition the objects into k clusters (can be done by random partitioning or by arbitrarily clustering around two or more objects)

2. calculate the centroids of the clusters

3. assign or reassign each object to the cluster whose centroid is closest (distance is calculated as Euclidean distance)

4. recalculate the centroids of the new clusters formed after the gain or loss of objects to or from the previous clusters

5. repeat steps 3 and 4 for a predetermined number of iterations or until membership of the groups no longer changes
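The five steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `k_means` and `centroid` are chosen for the example, the initial partition is passed in as a list of cluster indices, and the sketch assumes that no cluster ever becomes empty. The sample points are the five objects A to E of the worked example in this deck.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def k_means(points, labels, max_iter=100):
    """k-means starting from an initial partition (step 1).

    labels[i] is the initial cluster index of points[i].
    Assumption: no cluster ever becomes empty.
    """
    labels = list(labels)
    k = max(labels) + 1
    for _ in range(max_iter):
        # steps 2 and 4: (re)calculate the centroids
        centroids = [
            centroid([p for p, l in zip(points, labels) if l == c])
            for c in range(k)
        ]
        # step 3: assign each object to the cluster with the closest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # step 5: stop when membership no longer changes
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(1, 1), (3, 1), (4, 8), (8, 10), (9, 6)]   # objects A, B, C, D, E
labels, centroids = k_means(points, [0, 0, 0, 1, 1])
print(labels)
print(centroids)
```

Starting from a different initial partition can lead to a different final clustering, so in practice several starts are often compared.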

Partitional clustering – k-means

object  x1  x2
A       1   1
B       3   1
C       4   8
D       8   10
E       9   6

step 1: make an arbitrary partition of the objects into clusters, e.g. objects with x1 < 6 into Cluster 1, all others into Cluster 2: A, B and C in Cluster 1, and D and E in Cluster 2

step 2: calculate the centroids of the clusters
cluster 1: c1 = 2.67, c2 = 3.33
cluster 2: c1 = 8.50, c2 = 8.00

step 3: calculate the Euclidean distance between each object and each of the two cluster centroids:

object  d(X,1)  d(X,2)
A       2.87    10.26
B       2.35    8.90
C       4.86    4.50
D       8.54    2.06
E       6.87    2.06

[Figure: scatter plot of the objects A to E, x1 against x2, both axes from 2 to 10]
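Steps 2 and 3 of this example can be recomputed with a short sketch. The dictionary layout and variable names are chosen for illustration. Because the code uses unrounded centroids, a printed distance may differ from the table in the last digit (the table appears to have been computed from centroids rounded to two decimals).

```python
import math

objects = {"A": (1, 1), "B": (3, 1), "C": (4, 8), "D": (8, 10), "E": (9, 6)}
clusters = {1: ["A", "B", "C"], 2: ["D", "E"]}   # initial partition from step 1

def centroid(names):
    """Coordinate-wise mean of the named objects."""
    pts = [objects[n] for n in names]
    return tuple(sum(p[d] for p in pts) / len(pts) for d in range(2))

# step 2: centroids of the two clusters
centroids = {c: centroid(names) for c, names in clusters.items()}

# step 3: distance of every object to both centroids, and the closer cluster
closest = {}
for name, p in objects.items():
    d = {c: math.dist(p, centroids[c]) for c in centroids}
    closest[name] = min(d, key=d.get)
    print(name, round(d[1], 2), round(d[2], 2), "-> cluster", closest[name])
```

The output shows that C is nearer to the centroid of Cluster 2 than to that of its current Cluster 1, which triggers the reassignment in the next step.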

Partitional clustering – k-means

step 4: C turns out to be closer to Cluster 2 and has to be reassigned; repeat step 2 and step 3

cluster 1: $\bar{x}_1 = 2.00$, $\bar{x}_2 = 1.00$
cluster 2: $\bar{x}_1 = 7.00$, $\bar{x}_2 = 8.00$

object  d(X,1)  d(X,2)
A       1.00    9.22
B       1.00    8.06
C       7.28    3.00
D       10.82   2.24
E       8.60    2.83

no further reassignments are necessary

Fuzzy clustering

• is an extension of k-means clustering
  – an object belongs to a cluster to a certain degree

• for all objects, the degrees of membership in the k clusters add up to one:

$$\sum_{i=1}^{k} u_{ij} = 1, \quad 1 \le j \le n$$

• a fuzzy weight ω is introduced, which determines the fuzziness of the resulting clusters
  – for ω → 1, the clustering becomes a hard partition
  – for ω → ∞, the degree of membership approaches 1/k
  – typical values are ω = 1.25 and ω = 2

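Fuzzy clustering

fix k, 2 ≤ k < n, and choose a distance measure (Euclidean, city block, etc.), a termination tolerance δ > 0 (e.g. 0.01 or 0.001), and fix ω, 1 < ω < ∞ (the update in step 3 uses the exponent 1/(ω − 1), so ω must be greater than 1). Initialize the first cluster set randomly.

step 1: compute cluster centers

$$c_i^{(l)} = \frac{\sum_{j=1}^{n} \left( u_{ij}^{(l-1)} \right)^{\omega} x_j}{\sum_{j=1}^{n} \left( u_{ij}^{(l-1)} \right)^{\omega}}, \quad 1 \le i \le k$$

step 2: compute distances between objects and cluster centers

$$d^2(x_r, c_i) = \sum_{j=1}^{m} (x_{rj} - c_{ij})^2$$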

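Fuzzy clustering

step 3: update the partition matrix:

$$u_{ij}^{(l)} = \frac{1}{\sum_{r=1}^{k} \left( d^2(x_j, c_i) / d^2(x_j, c_r) \right)^{1/(\omega - 1)}}$$

until:

$$\left\| U^{(l)} - U^{(l-1)} \right\| < \delta$$

the algorithm is terminated if changes in the partition matrix are negligible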
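The three steps and the stopping rule can be put together in a compact sketch. Several details are assumptions made for this illustration and are not stated on the slides: the norm in the termination test is taken as the largest absolute change of any membership, squared distances are floored at a tiny value to avoid division by zero, and the function name, default parameters and seed are hypothetical.

```python
import math
import random

def fuzzy_c_means(points, k, omega=2.0, delta=0.001, max_iter=100, seed=0):
    """Return (U, centers); U[i][j] is the membership of object j in cluster i."""
    rng = random.Random(seed)
    n, m = len(points), len(points[0])
    # random initial partition matrix, each column normalised to sum to one
    U = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(k)]
    for j in range(n):
        col = sum(U[i][j] for i in range(k))
        for i in range(k):
            U[i][j] /= col
    for _ in range(max_iter):
        # step 1: cluster centers as membership-weighted means
        centers = []
        for i in range(k):
            w = [U[i][j] ** omega for j in range(n)]
            centers.append([sum(w[j] * points[j][d] for j in range(n)) / sum(w)
                            for d in range(m)])
        # step 2: squared Euclidean distances, floored to avoid division by zero
        d2 = [[max(sum((points[j][d] - centers[i][d]) ** 2 for d in range(m)), 1e-12)
               for j in range(n)] for i in range(k)]
        # step 3: update the partition matrix
        exponent = 1.0 / (omega - 1.0)
        new_U = [[1.0 / sum((d2[i][j] / d2[r][j]) ** exponent for r in range(k))
                  for j in range(n)] for i in range(k)]
        # termination: largest membership change below delta (assumed max norm)
        change = max(abs(new_U[i][j] - U[i][j]) for i in range(k) for j in range(n))
        U = new_U
        if change < delta:
            break
    return U, centers

points = [(1, 1), (3, 1), (4, 8), (8, 10), (9, 6)]
U, centers = fuzzy_c_means(points, k=2)
for i, row in enumerate(U):
    print("cluster", i, [round(u, 2) for u in row])
```

Each column of U sums to one, so every object distributes its membership across the k clusters instead of belonging to exactly one.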

Clustering Software

• Cluster 3.0 (for gene expression data analysis)

• PyCluster (Python Module)

• Algorithm::Cluster (Perl package)

• C clustering library

http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm

Outlook

• Bioperl

Thanx for your attention!!!