Ulf Schmitz, Pattern recognition - Clustering

Bioinformatics: Pattern recognition - Clustering
Bioinformatics and Systems Biology Group
www.sbi.informatik.uni-rostock.de
Outline
1. Introduction
2. Hierarchical clustering
3. Partitional clustering: k-means and derivatives
4. Fuzzy Clustering
Introduction into Clustering algorithms

• Clustering is the classification of similar objects into separate groups – or the partitioning of a data set into subsets (clusters) – so that the data in each subset (ideally) share some common trait.
• Machine learning typically regards clustering as a form of unsupervised learning.
• We distinguish:
  – Hierarchical clustering (finds successive clusters using previously established clusters)
  – Partitional clustering (determines all clusters at once)
Introduction into Clustering algorithms

Applications:
• gene expression data analysis
• identification of regulatory binding sites
• phylogenetic tree clustering (for inference of horizontally transferred genes)
• protein domain identification
• identification of structural motifs
Introduction into Clustering algorithms

Data matrix

• the data matrix X collects observations of n objects, described by m measurements
• rows refer to objects, which are characterised by the values in the columns

    X = | x_11  x_12  ...  x_1m |
        | x_21  x_22  ...  x_2m |
        |  ...   ...  ...   ... |
        | x_n1  x_n2  ...  x_nm |

If the units of measurement associated with the columns of X differ, it is necessary to normalise:

    x_j^new = (x_j - x̄_j) / σ(x_j)

where x_j is a column vector, x̄_j its mean, and σ(x_j) its standard deviation.
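The column-wise normalisation above can be sketched in plain Python (a minimal illustration without NumPy; real pipelines would normally use a numerics library):

```python
# Column-wise z-score normalisation of a data matrix X
# (rows = objects, columns = measurements).

def normalise(X):
    """Return X with each column transformed to (x - mean) / std."""
    n, m = len(X), len(X[0])
    result = [row[:] for row in X]
    for j in range(m):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        std = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        for i in range(n):
            result[i][j] = (X[i][j] - mean) / std
    return result

# hypothetical toy matrix: two measurements on very different scales
X = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
Xn = normalise(X)
```

After normalisation every column has zero mean and unit standard deviation, so measurements on different scales contribute comparably to the distances used below.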
Hierarchical clustering

Hierarchical clustering produces a sequence of nested partitions; the steps are:

1. find the dis/similarity between every pair of objects in the data set by evaluating a distance measure
2. group the objects into a hierarchical cluster tree (dendrogram) by linking newly formed clusters
3. obtain a partition of the data set into clusters by selecting a suitable ‘cut-level’ of the cluster tree
Hierarchical clustering

Agglomerative hierarchical clustering

1. start with n clusters, each containing one object, and calculate the distance matrix D1
2. determine from D1 which of the objects are least distant (e.g. I and J)
3. merge these objects into one cluster and form a new distance matrix by deleting the entries for the clustered objects and adding distances for the new cluster
4. repeat steps 2 and 3 a total of n-1 times, until a single cluster is formed
   • record which clusters are merged at each step
   • record the distance between the clusters merged in that step
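The four steps above can be sketched directly in plain Python. This is a minimal single-linkage version for illustration (the lecture defines several linkage rules later); a real analysis would typically use a library routine such as scipy.cluster.hierarchy.linkage instead:

```python
# Minimal agglomerative clustering sketch (single linkage).

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerate(X):
    """Merge clusters pairwise, recording (cluster_a, cluster_b, distance)."""
    clusters = [[i] for i in range(len(X))]   # step 1: one object per cluster
    merges = []
    while len(clusters) > 1:
        # step 2: find the pair of clusters with the smallest
        # single-linkage (minimum pairwise) distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclidean(X[i], X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # record merge and distance
        clusters[a] = clusters[a] + clusters[b]       # step 3: merge b into a
        del clusters[b]
    return merges                                     # step 4: n-1 merges in total

# the five 2-D objects used in the worked example below
X = [[1, 2], [2.5, 4.5], [2, 2], [4, 1.5], [4, 2.5]]
merges = agglomerate(X)
```

On this data the first two merges are (x1, x3) and (x4, x5), both at distance 1.0, matching the worked example on the following slides.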
Hierarchical clustering

Calculating the distances

One treats the data matrix X as a set of n (row) vectors with m elements.

Euclidean distance:

    d_rs^2 = sum_{j=1..m} (x_rj - x_sj)^2

where x_r, x_s are row vectors of X.

City block distance:

    d_rs = sum_{j=1..m} |x_rj - x_sj|

An example:

    X = | 1    2   |
        | 2.5  4.5 |
        | 2    2   |
        | 4    1.5 |
        | 4    2.5 |
Hierarchical clustering

An example: for x3 = (2, 2) and x5 = (4, 2.5),

Euclidean distance:

    d_35 = sqrt((2 - 4)^2 + (2 - 2.5)^2) = 2.06

City block distance:

    d_35 = |2 - 4| + |2 - 2.5| = 2.5
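Both distance measures are one-liners in Python; the sketch below reproduces the d_35 values from the example:

```python
# Euclidean and city-block distances between row vectors of X.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

X = [[1, 2], [2.5, 4.5], [2, 2], [4, 1.5], [4, 2.5]]  # x1..x5

d35_euc = euclidean(X[2], X[4])   # ~2.06, as in the example
d35_cb = city_block(X[2], X[4])   # 2.5, as in the example
```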
Hierarchical clustering

Distance matrix (Euclidean) for

    X = | 1    2   |
        | 2.5  4.5 |
        | 2    2   |
        | 4    1.5 |
        | 4    2.5 |

         x1      x2      x3      x4      x5
    x1   0       2.9155  1.0000  3.0414  3.0414
    x2   2.9155  0       2.5495  3.3541  2.5000
    x3   1.0000  2.5495  0       2.0616  2.0616
    x4   3.0414  3.3541  2.0616  0       1.0000
    x5   3.0414  2.5000  2.0616  1.0000  0

After merging the closest pairs (x1, x3) and (x4, x5):

            x1,x3   x2      x4,x5
    x1,x3   0       2.9155  2.0616
    x2      2.9155  0       2.5000
    x4,x5   2.0616  2.5000  0
Hierarchical clustering

Methods to define a distance between clusters I and J (N_I is the number of members in cluster I):

single linkage:

    d_IJ = min { d_ij : i in I and j in J }

complete linkage:

    d_IJ = max { d_ij : i in I and j in J }

group average:

    d_IJ = ( sum_{i in I} sum_{j in J} d_ij ) / (N_I * N_J)

centroid linkage:

    d_IJ = d(x̄_I, x̄_J),   where x̄_I = (1 / N_I) * sum_{i in I} x_i

[figure: illustration of the inter-cluster distance d_IJ between two clusters of objects]
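The four inter-cluster distance definitions can be written as small helper functions. This is an illustrative sketch: the helpers take index sets I, J and a pairwise object distance d(i, j) (or the raw data matrix X for the centroid variant):

```python
# The four linkage rules as functions of two clusters (index sets).

def single_linkage(I, J, d):
    return min(d(i, j) for i in I for j in J)

def complete_linkage(I, J, d):
    return max(d(i, j) for i in I for j in J)

def group_average(I, J, d):
    return sum(d(i, j) for i in I for j in J) / (len(I) * len(J))

def centroid_linkage(I, J, X):
    # distance between the mean vectors (centroids) of the two clusters
    def centroid(C):
        return [sum(X[i][k] for i in C) / len(C) for k in range(len(X[0]))]
    a, b = centroid(I), centroid(J)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# example: clusters {x1, x3} and {x4, x5} from the worked example
X = [[1, 2], [2.5, 4.5], [2, 2], [4, 1.5], [4, 2.5]]
d = lambda i, j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])) ** 0.5
```

For these two clusters single linkage gives 2.0616, complete linkage 3.0414, and centroid linkage 2.5, showing how the choice of rule changes the merge distances.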
Hierarchical clustering

Limits of hierarchical clustering

1. the choice of distance measure is important
2. there is no provision for reassigning objects that have been incorrectly grouped
3. errors are not handled explicitly in the procedure
4. no method of calculating inter-cluster distances is universally the best
   • single-linkage clustering tends to be the least successful
   • group-average clustering tends to perform fairly well
Partitional clustering – K-means

• involves prior specification of the number of clusters, k
• no pairwise distance matrix is required
• the relevant distance is the distance from each object to the cluster center (centroid)
Partitional clustering – K-means

1. partition the objects into k clusters (by random partitioning or by arbitrarily clustering around two or more objects)
2. calculate the centroids of the clusters
3. assign or reassign each object to the cluster whose centroid is closest (distance is calculated as Euclidean distance)
4. recalculate the centroids of the new clusters formed after the gain or loss of objects to or from the previous clusters
5. repeat steps 3 and 4 for a predetermined number of iterations or until membership of the groups no longer changes
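The five steps above map onto a short loop. A minimal plain-Python sketch, assuming an initial partition is given as a list of cluster labels (it does not handle the pathological case of a cluster losing all of its members):

```python
# k-means loop: recompute centroids, reassign, repeat until stable.

def kmeans(X, labels, k, max_iter=100):
    for _ in range(max_iter):
        # steps 2/4: recompute centroids from the current assignment
        centroids = []
        for c in range(k):
            members = [X[i] for i in range(len(X)) if labels[i] == c]
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # step 3: reassign each object to its nearest centroid
        # (squared Euclidean distance; the minimiser is the same)
        new_labels = []
        for x in X:
            dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:   # step 5: stop when membership is stable
            return new_labels, centroids
        labels = new_labels
    return labels, centroids

# the worked example below: objects A..E, initial clusters {A,B,C} and {D,E}
X = [[1, 1], [3, 1], [4, 8], [8, 10], [9, 6]]
labels, centroids = kmeans(X, [0, 0, 0, 1, 1], k=2)
```

Starting from the example's arbitrary partition, the loop reassigns C and converges to clusters {A, B} and {C, D, E} with centroids (2, 1) and (7, 8), as derived on the next slides.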
Partitional clustering – K-means

An example:

    object   x1   x2
    A        1    1
    B        3    1
    C        4    8
    D        8   10
    E        9    6

step 1: make an arbitrary partition of the objects into clusters:
        e.g. A, B and C into cluster 1, and D and E into cluster 2

step 2: calculate the centroids of the clusters:
        cluster 1: c1 = (2.67, 3.33)    cluster 2: c2 = (8.50, 8.00)

step 3: calculate the Euclidean distance between each object and each of the two cluster centroids:

    object   d(x, c1)   d(x, c2)
    A         2.87      10.26
    B         2.35       8.90
    C         4.86       4.50
    D         8.54       2.06
    E         6.87       2.06

[figure: scatter plot of objects A–E with the two cluster centroids]
Partitional clustering – K-means

step 4: C turns out to be closer to cluster 2 and has to be reassigned; repeat steps 2 and 3:

        cluster 1: c1 = (2.00, 1.00)    cluster 2: c2 = (7.00, 8.00)

    object   d(x, c1)   d(x, c2)
    A         1.00       9.22
    B         1.00       8.06
    C         7.28       3.00
    D        10.82       2.24
    E         8.60       2.83

No further reassignments are necessary.
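The final centroids and distance table above are easy to verify; this short sketch recomputes them for the reassigned clusters {A, B} and {C, D, E}:

```python
# Verify the final k-means iteration of the worked example.

X = {'A': (1, 1), 'B': (3, 1), 'C': (4, 8), 'D': (8, 10), 'E': (9, 6)}

def centroid(members):
    pts = [X[m] for m in members]
    return tuple(sum(c) / len(pts) for c in zip(*pts))

c1 = centroid(['A', 'B'])        # (2.0, 1.0)
c2 = centroid(['C', 'D', 'E'])   # (7.0, 8.0)

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# distance table, rounded to two decimals as on the slide
table = {m: (round(dist(p, c1), 2), round(dist(p, c2), 2))
         for m, p in X.items()}
```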
Fuzzy clustering

• is an extension of k-means clustering: an object belongs to a cluster to a certain degree
• for each object, the degrees of membership in the k clusters add up to one:

    sum_{i=1..k} u_ij = 1

• a fuzzy weight ω is introduced, which determines the fuzziness of the resulting clusters
  – for ω → 1, the clustering becomes a hard partition
  – for ω → ∞, the degrees of membership approach 1/k
  – typical values are ω = 1.25 and ω = 2
Fuzzy clustering

Fix k, 2 <= k < n, choose a distance measure (Euclidean, city block, etc.), a termination tolerance δ > 0 (e.g. 0.01 or 0.001), and fix ω, 1 < ω < ∞. Initialise the first partition matrix randomly.

step 1: compute the cluster centers:

    c_i^(l) = ( sum_{j=1..n} (u_ij^(l-1))^ω x_j ) / ( sum_{j=1..n} (u_ij^(l-1))^ω ),   1 <= i <= k

step 2: compute the distances between the objects and the cluster centers:

    d^2(x_r, c_i) = sum_{j=1..m} (x_rj - c_ij)^2
Fuzzy clustering

step 3: update the partition matrix:

    u_ij^(l) = 1 / ( sum_{r=1..k} ( d^2(x_j, c_i) / d^2(x_j, c_r) )^(1/(ω-1)) )

repeat steps 1–3 until:

    || U^(l) - U^(l-1) || < δ

i.e. the algorithm is terminated when changes in the partition matrix become negligible.
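The three steps above can be sketched as a short fuzzy c-means loop. A minimal plain-Python illustration with random initialisation; it indexes memberships as U[object][cluster] and assumes no object coincides exactly with a cluster center (which would make a distance zero):

```python
import random

# Fuzzy c-means: weighted centers, squared distances, membership update.

def fuzzy_cmeans(X, k, omega=2.0, delta=1e-4, max_iter=100, seed=0):
    rng = random.Random(seed)
    n = len(X)
    # random initial partition matrix, each row normalised to sum to 1
    U = []
    for _ in range(n):
        row = [rng.random() for _ in range(k)]
        s = sum(row)
        U.append([u / s for u in row])
    for _ in range(max_iter):
        # step 1: membership-weighted cluster centers
        centers = []
        for i in range(k):
            w = [U[j][i] ** omega for j in range(n)]
            centers.append([sum(w[j] * X[j][d] for j in range(n)) / sum(w)
                            for d in range(len(X[0]))])
        # step 2: squared distances from each object to each center
        d2 = [[sum((x - c) ** 2 for x, c in zip(X[j], centers[i]))
               for i in range(k)] for j in range(n)]
        # step 3: update the partition matrix
        newU = [[1.0 / sum((d2[j][i] / d2[j][r]) ** (1.0 / (omega - 1))
                           for r in range(k))
                 for i in range(k)] for j in range(n)]
        # terminate when the partition matrix barely changes
        change = max(abs(newU[j][i] - U[j][i])
                     for j in range(n) for i in range(k))
        U = newU
        if change < delta:
            break
    return U, centers

# hypothetical toy data: two well-separated groups
X = [[0, 0], [0, 1], [10, 10], [10, 11]]
U, centers = fuzzy_cmeans(X, k=2)
```

On such clearly separated data the memberships become nearly hard (close to 0 or 1), while each object's memberships still sum to one, as required by the constraint above.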
Clustering Software
• Cluster 3.0 (for gene expression data analysis)
• PyCluster (Python Module)
• Algorithm::Cluster (Perl package)
• C clustering library
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
Outlook
• Bioperl
Thanks for your attention!