Ulf Schmitz, Pattern recognition - Clustering

Bioinformatics: Pattern recognition - Clustering
Bioinformatics and Systems Biology Group
www.sbi.informatik.uni-rostock.de
Outline
1. Introduction
2. Hierarchical clustering
3. Partitional clustering: k-means and derivatives
4. Fuzzy Clustering
Introduction into Clustering algorithms

• Clustering is the classification of similar objects into separate groups – or the partitioning of a data set into subsets (clusters) – so that the data in each subset (ideally) share some common trait.
• Machine learning typically regards clustering as a form of unsupervised learning.
• We distinguish:
  – Hierarchical clustering (finds successive clusters using previously established clusters)
  – Partitional clustering (determines all clusters at once)
Introduction into Clustering algorithms

Applications:
• gene expression data analysis
• identification of regulatory binding sites
• phylogenetic tree clustering (for inference of horizontally transferred genes)
• protein domain identification
• identification of structural motifs
Introduction into Clustering algorithms

Data matrix

• the data matrix X collects observations of n objects, described by m measurements
• rows refer to objects, which are characterised by the values in the columns

    X = | x_11  x_12  ...  x_1m |
        | x_21  x_22  ...  x_2m |
        |  ...   ...  ...   ... |
        | x_n1  x_n2  ...  x_nm |

If the units of measurement associated with the columns of X differ, it is necessary to normalise:

    x_j^new = (x_j - x̄_j) / σ(x_j)

where x_j is a column vector, x̄_j its mean, and σ(x_j) its standard deviation.
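The column-wise normalisation above can be sketched in plain Python (a minimal illustration without NumPy; real pipelines would normally use a numerics library):

```python
# Column-wise z-score normalisation of a data matrix X
# (rows = objects, columns = measurements).

def normalise(X):
    """Return X with each column transformed to (x - mean) / std."""
    n, m = len(X), len(X[0])
    result = [row[:] for row in X]
    for j in range(m):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        std = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        for i in range(n):
            result[i][j] = (X[i][j] - mean) / std
    return result

# hypothetical toy matrix: two measurements on very different scales
X = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
Xn = normalise(X)
```

After normalisation every column has zero mean and unit standard deviation, so measurements on different scales contribute comparably to the distances used below.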
Hierarchical clustering

Hierarchical clustering produces a sequence of nested partitions; the steps are:

1. find the dis/similarity between every pair of objects in the data set by evaluating a distance measure
2. group the objects into a hierarchical cluster tree (dendrogram) by linking newly formed clusters
3. obtain a partition of the data set into clusters by selecting a suitable ‘cut-level’ of the cluster tree
Hierarchical clustering

Agglomerative hierarchical clustering

1. start with n clusters, each containing one object, and calculate the distance matrix D1
2. determine from D1 which of the objects are least distant (e.g. I and J)
3. merge these objects into one cluster and form a new distance matrix by deleting the entries for the clustered objects and adding distances for the new cluster
4. repeat steps 2 and 3 a total of n-1 times, until a single cluster is formed
   • record which clusters are merged at each step
   • record the distance between the clusters merged in that step
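The four steps above can be sketched directly in plain Python. This is a minimal single-linkage version for illustration (the lecture defines several linkage rules later); a real analysis would typically use a library routine such as scipy.cluster.hierarchy.linkage instead:

```python
# Minimal agglomerative clustering sketch (single linkage).

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerate(X):
    """Merge clusters pairwise, recording (cluster_a, cluster_b, distance)."""
    clusters = [[i] for i in range(len(X))]   # step 1: one object per cluster
    merges = []
    while len(clusters) > 1:
        # step 2: find the pair of clusters with the smallest
        # single-linkage (minimum pairwise) distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclidean(X[i], X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # record merge and distance
        clusters[a] = clusters[a] + clusters[b]       # step 3: merge b into a
        del clusters[b]
    return merges                                     # step 4: n-1 merges in total

# the five 2-D objects used in the worked example below
X = [[1, 2], [2.5, 4.5], [2, 2], [4, 1.5], [4, 2.5]]
merges = agglomerate(X)
```

On this data the first two merges are (x1, x3) and (x4, x5), both at distance 1.0, matching the worked example on the following slides.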
Hierarchical clustering

Calculating the distances

One treats the data matrix X as a set of n (row) vectors with m elements.

Euclidean distance:

    d_rs^2 = sum_{j=1..m} (x_rj - x_sj)^2

where x_r, x_s are row vectors of X.

City block distance:

    d_rs = sum_{j=1..m} |x_rj - x_sj|

An example:

    X = | 1    2   |
        | 2.5  4.5 |
        | 2    2   |
        | 4    1.5 |
        | 4    2.5 |
Hierarchical clustering

An example: for x3 = (2, 2) and x5 = (4, 2.5),

Euclidean distance:

    d_35 = sqrt((2 - 4)^2 + (2 - 2.5)^2) = 2.06

City block distance:

    d_35 = |2 - 4| + |2 - 2.5| = 2.5
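Both distance measures are one-liners in Python; the sketch below reproduces the d_35 values from the example:

```python
# Euclidean and city-block distances between row vectors of X.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

X = [[1, 2], [2.5, 4.5], [2, 2], [4, 1.5], [4, 2.5]]  # x1..x5

d35_euc = euclidean(X[2], X[4])   # ~2.06, as in the example
d35_cb = city_block(X[2], X[4])   # 2.5, as in the example
```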
Hierarchical clustering

Distance matrix (Euclidean) for

    X = | 1    2   |
        | 2.5  4.5 |
        | 2    2   |
        | 4    1.5 |
        | 4    2.5 |

         x1      x2      x3      x4      x5
    x1   0       2.9155  1.0000  3.0414  3.0414
    x2   2.9155  0       2.5495  3.3541  2.5000
    x3   1.0000  2.5495  0       2.0616  2.0616
    x4   3.0414  3.3541  2.0616  0       1.0000
    x5   3.0414  2.5000  2.0616  1.0000  0

After merging the closest pairs (x1, x3) and (x4, x5):

            x1,x3   x2      x4,x5
    x1,x3   0       2.9155  2.0616
    x2      2.9155  0       2.5000
    x4,x5   2.0616  2.5000  0
Hierarchical clustering

Methods to define a distance between clusters I and J (N_I is the number of members in cluster I):

single linkage:

    d_IJ = min { d_ij : i in I and j in J }

complete linkage:

    d_IJ = max { d_ij : i in I and j in J }

group average:

    d_IJ = ( sum_{i in I} sum_{j in J} d_ij ) / (N_I * N_J)

centroid linkage:

    d_IJ = d(x̄_I, x̄_J),   where x̄_I = (1 / N_I) * sum_{i in I} x_i

[figure: illustration of the inter-cluster distance d_IJ between two clusters of objects]
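The four inter-cluster distance definitions can be written as small helper functions. This is an illustrative sketch: the helpers take index sets I, J and a pairwise object distance d(i, j) (or the raw data matrix X for the centroid variant):

```python
# The four linkage rules as functions of two clusters (index sets).

def single_linkage(I, J, d):
    return min(d(i, j) for i in I for j in J)

def complete_linkage(I, J, d):
    return max(d(i, j) for i in I for j in J)

def group_average(I, J, d):
    return sum(d(i, j) for i in I for j in J) / (len(I) * len(J))

def centroid_linkage(I, J, X):
    # distance between the mean vectors (centroids) of the two clusters
    def centroid(C):
        return [sum(X[i][k] for i in C) / len(C) for k in range(len(X[0]))]
    a, b = centroid(I), centroid(J)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# example: clusters {x1, x3} and {x4, x5} from the worked example
X = [[1, 2], [2.5, 4.5], [2, 2], [4, 1.5], [4, 2.5]]
d = lambda i, j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])) ** 0.5
```

For these two clusters single linkage gives 2.0616, complete linkage 3.0414, and centroid linkage 2.5, showing how the choice of rule changes the merge distances.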
Hierarchical clustering

Limits of hierarchical clustering

1. the choice of distance measure is important
2. there is no provision for reassigning objects that have been incorrectly grouped
3. errors are not handled explicitly in the procedure
4. no method of calculating inter-cluster distances is universally the best
   • single-linkage clustering tends to be the least successful
   • group-average clustering tends to perform fairly well
Partitional clustering – K-means

• involves prior specification of the number of clusters, k
• no pairwise distance matrix is required
• the relevant distance is the distance from each object to the cluster center (centroid)
Partitional clustering – K-means

1. partition the objects into k clusters (by random partitioning or by arbitrarily clustering around two or more objects)
2. calculate the centroids of the clusters
3. assign or reassign each object to the cluster whose centroid is closest (distance is calculated as Euclidean distance)
4. recalculate the centroids of the new clusters formed after the gain or loss of objects to or from the previous clusters
5. repeat steps 3 and 4 for a predetermined number of iterations or until membership of the groups no longer changes
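The five steps above map onto a short loop. A minimal plain-Python sketch, assuming an initial partition is given as a list of cluster labels (it does not handle the pathological case of a cluster losing all of its members):

```python
# k-means loop: recompute centroids, reassign, repeat until stable.

def kmeans(X, labels, k, max_iter=100):
    for _ in range(max_iter):
        # steps 2/4: recompute centroids from the current assignment
        centroids = []
        for c in range(k):
            members = [X[i] for i in range(len(X)) if labels[i] == c]
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # step 3: reassign each object to its nearest centroid
        # (squared Euclidean distance; the minimiser is the same)
        new_labels = []
        for x in X:
            dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:   # step 5: stop when membership is stable
            return new_labels, centroids
        labels = new_labels
    return labels, centroids

# the worked example below: objects A..E, initial clusters {A,B,C} and {D,E}
X = [[1, 1], [3, 1], [4, 8], [8, 10], [9, 6]]
labels, centroids = kmeans(X, [0, 0, 0, 1, 1], k=2)
```

Starting from the example's arbitrary partition, the loop reassigns C and converges to clusters {A, B} and {C, D, E} with centroids (2, 1) and (7, 8), as derived on the next slides.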
Partitional clustering – K-means

An example:

    object   x1   x2
    A        1    1
    B        3    1
    C        4    8
    D        8   10
    E        9    6

step 1: make an arbitrary partition of the objects into clusters:
        e.g. A, B and C into cluster 1, and D and E into cluster 2

step 2: calculate the centroids of the clusters:
        cluster 1: c1 = (2.67, 3.33)    cluster 2: c2 = (8.50, 8.00)

step 3: calculate the Euclidean distance between each object and each of the two cluster centroids:

    object   d(x, c1)   d(x, c2)
    A         2.87      10.26
    B         2.35       8.90
    C         4.86       4.50
    D         8.54       2.06
    E         6.87       2.06

[figure: scatter plot of objects A–E with the two cluster centroids]
Partitional clustering – K-means

step 4: C turns out to be closer to cluster 2 and has to be reassigned; repeat steps 2 and 3:

        cluster 1: c1 = (2.00, 1.00)    cluster 2: c2 = (7.00, 8.00)

    object   d(x, c1)   d(x, c2)
    A         1.00       9.22
    B         1.00       8.06
    C         7.28       3.00
    D        10.82       2.24
    E         8.60       2.83

No further reassignments are necessary.
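The final centroids and distance table above are easy to verify; this short sketch recomputes them for the reassigned clusters {A, B} and {C, D, E}:

```python
# Verify the final k-means iteration of the worked example.

X = {'A': (1, 1), 'B': (3, 1), 'C': (4, 8), 'D': (8, 10), 'E': (9, 6)}

def centroid(members):
    pts = [X[m] for m in members]
    return tuple(sum(c) / len(pts) for c in zip(*pts))

c1 = centroid(['A', 'B'])        # (2.0, 1.0)
c2 = centroid(['C', 'D', 'E'])   # (7.0, 8.0)

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# distance table, rounded to two decimals as on the slide
table = {m: (round(dist(p, c1), 2), round(dist(p, c2), 2))
         for m, p in X.items()}
```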
Fuzzy clustering

• is an extension of k-means clustering: an object belongs to a cluster to a certain degree
• for each object, the degrees of membership in the k clusters add up to one:

    sum_{i=1..k} u_ij = 1

• a fuzzy weight ω is introduced, which determines the fuzziness of the resulting clusters
  – for ω → 1, the clustering becomes a hard partition
  – for ω → ∞, the degrees of membership approach 1/k
  – typical values are ω = 1.25 and ω = 2
Fuzzy clustering

Fix k, 2 <= k < n, choose a distance measure (Euclidean, city block, etc.), a termination tolerance δ > 0 (e.g. 0.01 or 0.001), and fix ω, 1 < ω < ∞. Initialise the first partition matrix randomly.

step 1: compute the cluster centers:

    c_i^(l) = ( sum_{j=1..n} (u_ij^(l-1))^ω x_j ) / ( sum_{j=1..n} (u_ij^(l-1))^ω ),   1 <= i <= k

step 2: compute the distances between the objects and the cluster centers:

    d^2(x_r, c_i) = sum_{j=1..m} (x_rj - c_ij)^2
Fuzzy clustering

step 3: update the partition matrix:

    u_ij^(l) = 1 / ( sum_{r=1..k} ( d^2(x_j, c_i) / d^2(x_j, c_r) )^(1/(ω-1)) )

repeat steps 1–3 until:

    || U^(l) - U^(l-1) || < δ

i.e. the algorithm is terminated when changes in the partition matrix become negligible.
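The three steps above can be sketched as a short fuzzy c-means loop. A minimal plain-Python illustration with random initialisation; it indexes memberships as U[object][cluster] and assumes no object coincides exactly with a cluster center (which would make a distance zero):

```python
import random

# Fuzzy c-means: weighted centers, squared distances, membership update.

def fuzzy_cmeans(X, k, omega=2.0, delta=1e-4, max_iter=100, seed=0):
    rng = random.Random(seed)
    n = len(X)
    # random initial partition matrix, each row normalised to sum to 1
    U = []
    for _ in range(n):
        row = [rng.random() for _ in range(k)]
        s = sum(row)
        U.append([u / s for u in row])
    for _ in range(max_iter):
        # step 1: membership-weighted cluster centers
        centers = []
        for i in range(k):
            w = [U[j][i] ** omega for j in range(n)]
            centers.append([sum(w[j] * X[j][d] for j in range(n)) / sum(w)
                            for d in range(len(X[0]))])
        # step 2: squared distances from each object to each center
        d2 = [[sum((x - c) ** 2 for x, c in zip(X[j], centers[i]))
               for i in range(k)] for j in range(n)]
        # step 3: update the partition matrix
        newU = [[1.0 / sum((d2[j][i] / d2[j][r]) ** (1.0 / (omega - 1))
                           for r in range(k))
                 for i in range(k)] for j in range(n)]
        # terminate when the partition matrix barely changes
        change = max(abs(newU[j][i] - U[j][i])
                     for j in range(n) for i in range(k))
        U = newU
        if change < delta:
            break
    return U, centers

# hypothetical toy data: two well-separated groups
X = [[0, 0], [0, 1], [10, 10], [10, 11]]
U, centers = fuzzy_cmeans(X, k=2)
```

On such clearly separated data the memberships become nearly hard (close to 0 or 1), while each object's memberships still sum to one, as required by the constraint above.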
Clustering Software
• Cluster 3.0 (for gene expression data analysis)
• PyCluster (Python Module)
• Algorithm::Cluster (Perl package)
• C clustering library
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
Outlook
• Bioperl
Thanks for your attention!