Clustering by soft-constraint affinity propagation: applications to gene-expression data
![Page 1: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/1.jpg)
Clustering by soft-constraint affinity propagation: applications to gene-expression data
Michele Leone, Sumedha and Martin Weigt
Bioinformatics, 2007
![Page 2: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/2.jpg)
Outline
• Introduction
• The Algorithm and Method Analysis
• Experimental results
• Discussion
![Page 3: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/3.jpg)
Introduction
• Affinity Propagation (AP) identifies each cluster by one of its elements, the exemplar.
– Each point in the cluster refers to this exemplar.
– Each exemplar is required to refer to itself as a self-exemplar.
• However, this forces clusters to appear as stars: there is only one central node, and all other nodes are directly connected to it.
![Page 4: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/4.jpg)
Introduction
• Some drawbacks of Affinity Propagation:
– The hard constraint in AP relies strongly on cluster-shape regularity.
– All information about the internal structure and the hierarchical merging/dissociation of clusters is lost.
– AP has robustness limitations.
– AP forces each exemplar to point to itself.
![Page 5: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/5.jpg)
Introduction
• How to improve it?
• The hard constraint: exemplars must be self-exemplars.
• We relax the hard constraint by introducing a finite penalty term for each constraint violation.
![Page 6: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/6.jpg)
The Algorithm and Method Analysis
• The Soft-Constraint Affinity Propagation (SCAP) equations.
• Efficient implementation of the algorithm.
• Extracting cluster signatures.
![Page 7: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/7.jpg)
The SCAP equations
• We write the constraint attached to a given data point as a function with two cases. The first case assigns a penalty if the data point is chosen as exemplar by some other data point without being a self-exemplar.
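A sketch of this constraint in assumed notation: $c_\mu$ stands for the exemplar chosen by data point $\mu$, and $p \in [0, 1]$ for the penalty factor. These symbols are illustrative assumptions and should be checked against the paper.

```latex
\chi_\mu^{(p)}[\mathbf{c}] =
\begin{cases}
p, & \text{if } c_\mu \neq \mu \text{ and } \exists\, \nu \neq \mu : c_\nu = \mu, \\
1, & \text{otherwise.}
\end{cases}
```

In this form, $p = 0$ forbids any violation (a hard constraint), while $p = 1$ removes the constraint entirely.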
![Page 8: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/8.jpg)
The SCAP equations
• The penalty represents a compromise between the minimization of the cost function and the search for compact clusters.
• We then introduce a positive real-valued parameter weighing the relative importance of the cost minimization with respect to the constraints.
![Page 9: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/9.jpg)
The SCAP equations
• With this parameter, we can define the probability of an arbitrary clustering.
• Original AP is recovered by taking the penalty factor to be zero, since any violated constraint then sets the probability to zero.
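A minimal sketch of such a probability, assuming the form exp(beta * total similarity) multiplied by one constraint factor per data point. The function names, the symbols `beta` and `p`, and the small similarity matrix are all assumptions made for illustration. It shows how a zero penalty factor turns a violated constraint into zero probability (log-probability of minus infinity).

```python
import math

def constraint_factor(c, mu, p):
    """Factor for point mu: p if another point chose mu as its exemplar
    while mu is not a self-exemplar, otherwise 1 (assumed form)."""
    chosen_by_other = any(c[nu] == mu for nu in range(len(c)) if nu != mu)
    return p if (c[mu] != mu and chosen_by_other) else 1.0

def log_probability(S, c, beta, p):
    """Unnormalized log-probability of the exemplar assignment c."""
    total = 0.0
    for mu in range(len(c)):
        factor = constraint_factor(c, mu, p)
        if factor == 0.0:
            return -math.inf  # a violated hard constraint has zero probability
        total += beta * S[mu][c[mu]] + math.log(factor)
    return total

# Made-up similarities for three points (diagonal = self-similarity).
S = [[-1.0, -0.5, -2.0],
     [-0.5, -1.0, -0.3],
     [-2.0, -0.3, -1.0]]
star = [1, 1, 1]    # every point chooses point 1; point 1 is a self-exemplar
chain = [1, 2, 2]   # 0 -> 1 -> 2: point 1 is chosen by 0 but is not a self-exemplar

print(log_probability(S, star, beta=1.0, p=0.5))
print(log_probability(S, chain, beta=1.0, p=0.5))
print(log_probability(S, chain, beta=1.0, p=0.0))
```

The chain assignment is allowed at a finite cost for any positive penalty factor, and is excluded only when the factor is zero.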
![Page 10: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/10.jpg)
The SCAP equations
• For general penalty values, the optimal clustering can be determined by maximizing the marginal probabilities for all data points.
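In assumed notation, with $P[\mathbf{c}]$ the probability of a full exemplar assignment and $c_\mu$ the exemplar of data point $\mu$, the marginal and the resulting exemplar choice can be sketched as:

```latex
P_\mu(c_\mu) = \sum_{\{c_\nu\}_{\nu \neq \mu}} P[\mathbf{c}],
\qquad
c_\mu^{\star} = \operatorname*{arg\,max}_{c_\mu} P_\mu(c_\mu).
```

The sum runs over the exemplar choices of all other data points, so each point is assigned the exemplar that is most probable once the rest of the clustering is averaged out.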
![Page 11: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/11.jpg)
The SCAP equations
• Under an additional assumption on the parameters, we find the SCAP equations.
• The exemplar of any data point can then be computed from the resulting availabilities and responsibilities.
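A minimal sketch of exemplar selection, assuming the rule used in standard affinity propagation: each data point picks the candidate that maximizes availability plus responsibility. The function name and the small matrices are hypothetical, and the exact rule used for SCAP should be checked against the paper.

```python
import numpy as np

def select_exemplars(A, R):
    """For each data point (row), return the column index maximizing a + r."""
    return np.argmax(np.asarray(A) + np.asarray(R), axis=1)

# Made-up availabilities A and responsibilities R for two data points.
A = np.array([[0.0, -1.0],
              [-0.5, 0.2]])
R = np.array([[0.3, -2.0],
              [0.4, 0.1]])
exemplars = select_exemplars(A, R)
print(exemplars)
```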
![Page 12: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/12.jpg)
The SCAP equations
• Compared to original AP, SCAP amounts to an additional threshold on the self-availabilities and the self-responsibilities.
• For small enough values of the penalty parameter, the threshold takes effect in many cases.
• The self-responsibility is replaced by its thresholded value.
• In the hard-constraint limit, the original AP equations are recovered.
![Page 13: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/13.jpg)
The SCAP equations
• This means that data points are discouraged from being self-exemplars beyond a given threshold, even when other points are already pointing at them.
![Page 14: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/14.jpg)
Efficient implementation
• The iterative solution:
![Page 15: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/15.jpg)
Efficient implementation
• Differences from the original AP:
– Step 3 is formulated as a sequential update.
– The original AP used a damped parallel update.
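For reference, a sketch of the damped parallel schedule attributed above to original AP, using the standard responsibility and availability updates. The optional `p_tilde` argument is a hypothetical way to threshold the self-responsibilities and self-availabilities; it is an assumption for illustration, not the paper's equations, and `p_tilde = inf` leaves plain AP. The sequential update of step 3 is not implemented here.

```python
import numpy as np

def ap_messages(S, p_tilde=np.inf, damping=0.5, n_iter=200):
    """Damped parallel AP message updates on a similarity matrix S
    (diagonal = self-similarity). p_tilde is a hypothetical threshold."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    A = np.zeros((n, n))
    R = np.zeros((n, n))
    rows = np.arange(n)
    for _ in range(n_iter):
        # Responsibilities: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        best = np.argmax(AS, axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new

        # Availabilities: positive responsibilities are summed per candidate;
        # the diagonal entry uses the (thresholded) self-responsibility.
        Rp = np.maximum(R, 0.0)
        Rp[rows, rows] = np.maximum(R[rows, rows], -p_tilde)
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = np.minimum(A_new[rows, rows], p_tilde)
        A_new = np.minimum(A_new, 0.0)
        A_new[rows, rows] = self_avail
        A = damping * A + (1 - damping) * A_new
    return A, R

# Made-up one-dimensional data: two groups of three points each.
x = np.array([0.0, 0.1, 0.25, 10.0, 10.1, 10.25])
S = -np.abs(x[:, None] - x[None, :])
np.fill_diagonal(S, -5.0)           # assumed self-similarity (preference)
A, R = ap_messages(S)
labels = np.argmax(A + R, axis=1)   # exemplar index chosen by each point
print(labels)
```

Damping mixes each new message with its previous value, which is the usual remedy for oscillations when all messages are updated at once.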
![Page 16: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/16.jpg)
Extracting cluster signatures
• Only a few components carry useful information about the cluster structure; they are called cluster signatures.
• We assume the similarity between two data points to be additive in single-gene contributions.
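One similarity with this additive property is the negative Manhattan distance, used here as an assumed example. In the sketch below, the function name, the array layout (genes first) and the expression values are all made up for illustration.

```python
import numpy as np

def gene_contributions(X):
    """s[i, mu, nu] = -|X[mu, i] - X[nu, i]|: contribution of gene i to the
    similarity of samples mu and nu (negative Manhattan distance)."""
    X = np.asarray(X, dtype=float)
    return -np.abs(X[:, None, :] - X[None, :, :]).transpose(2, 0, 1)

# 3 samples x 3 genes, made-up expression values.
X = np.array([[1.0, 2.0, 0.0],
              [1.5, 0.0, 0.0],
              [4.0, 2.0, 1.0]])
s = gene_contributions(X)   # shape: (genes, samples, samples)
S = s.sum(axis=0)           # total similarity = sum over single-gene terms
print(S)
```

Keeping the per-gene array `s` around, rather than only the total `S`, is what later allows individual genes to be scored.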
![Page 17: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/17.jpg)
Extracting cluster signatures
• Having found a clustering given by the exemplar selection, we can calculate the similarity of a cluster C, defined as a connected component of the directed graph of exemplar choices, as a sum over single-gene contributions.
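A sketch of both steps, assuming the directed graph has one edge from each data point to its chosen exemplar and that components are taken ignoring edge direction. The function names, the array layout `s[gene, point, exemplar]` and the numbers are hypothetical.

```python
import numpy as np

def clusters_from_exemplars(c):
    """Connected components of the graph mu -> c[mu], ignoring direction."""
    parent = list(range(len(c)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for mu, nu in enumerate(c):
        parent[find(mu)] = find(nu)
    groups = {}
    for mu in range(len(c)):
        groups.setdefault(find(mu), []).append(mu)
    return sorted(groups.values())

def cluster_gene_similarity(s, c, cluster):
    """Per-gene cluster similarity: sum over members mu of s[i, mu, c[mu]]."""
    members = np.asarray(cluster)
    return s[:, members, np.asarray(c)[members]].sum(axis=1)

c = [1, 2, 2, 4, 4]   # chain 0 -> 1 -> 2 (self-exemplar) and 3 -> 4 (self-exemplar)
clusters = clusters_from_exemplars(c)
s = np.arange(2 * 5 * 5, dtype=float).reshape(2, 5, 5)   # made-up contributions
per_gene = cluster_gene_similarity(s, c, clusters[1])
print(clusters, per_gene)
```

Because exemplars need not be self-exemplars under the soft constraint, a cluster can be a chain or tree of references rather than a star, which is why connected components are used.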
![Page 18: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/18.jpg)
Extracting cluster signatures
• Then, we compare this to random exemplar choices, which are characterized by their mean and variance.
![Page 19: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/19.jpg)
Extracting cluster signatures
• The relevance of a gene can be measured by the distance of its actual contribution from the distribution obtained under random exemplar mappings.
• Genes are ranked according to this relevance; the highest-ranking genes are considered a cluster signature.
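A sketch of such a ranking as a z-score per gene. It assumes that a random mapping draws each point's exemplar uniformly and independently from the members of its cluster; that null model, the function name and the data are assumptions for illustration, and the exact statistic should be checked against the paper.

```python
import numpy as np

def gene_relevance(s, c, cluster):
    """z-score per gene: actual cluster similarity minus its mean under
    random exemplar choices, divided by the standard deviation."""
    members = np.asarray(cluster)
    actual = s[:, members, np.asarray(c)[members]].sum(axis=1)
    block = s[:, members][:, :, members]     # (genes, |C|, |C|)
    mean = block.mean(axis=2).sum(axis=1)    # sum of per-point means
    var = block.var(axis=2).sum(axis=1)      # independent choices: variances add
    return (actual - mean) / np.sqrt(var)    # undefined if a gene has zero variance

# 4 samples x 2 genes, made-up values; all samples choose sample 1 as exemplar.
X = np.array([[0.0, 0.0],
              [1.0, 3.0],
              [2.0, 0.0],
              [1.0, 0.0]])
s = -np.abs(X[:, None, :] - X[None, :, :]).transpose(2, 0, 1)
c = [1, 1, 1, 1]
z = gene_relevance(s, c, [0, 1, 2, 3])
ranking = np.argsort(-z)   # genes from most to least relevant
print(z, ranking)
```

In this toy case the exemplar is central for the first gene and an outlier for the second, so the first gene scores higher.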
![Page 20: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/20.jpg)
Experimental results
• Iris data
• Brain cancer data
• Other benchmark cancer data
– Lymphoma cancer data
– SRBCT cancer data
– Leukemia
![Page 21: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/21.jpg)
Iris data
• Three clusters: setosa, versicolor, virginica.
• Four features for 150 flowers:
– sepal length
– sepal width
– petal length
– petal width
![Page 22: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/22.jpg)
Iris data
• Experimental results:
– Affinity Propagation: 16 errors.
– SCAP: 9 errors with the Manhattan distance measure for the similarity.
• On increasing the value of the control parameter, the clusters for versicolor and virginica merge with each other, reflecting the fact that they are closer to each other than to setosa.
![Page 23: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/23.jpg)
Brain cancer data
• Five diagnosis types for 42 patients:
– 10 medulloblastoma
– 10 malignant glioma
– 10 atypical teratoid/rhabdoid tumors
– 4 normal cerebella
– 8 primitive neuroectodermal tumors (PNET)
![Page 24: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/24.jpg)
Brain cancer data
• Clustering with AP:
There are three well-distinguishable clusters.
The lowest number of errors is obtained with five clusters.
![Page 25: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/25.jpg)
Brain cancer data
• Clustering with SCAP:
SCAP identifies four clusters with 8 errors.
![Page 26: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/26.jpg)
Brain cancer data
• The eight errors are due to misclassifications of the fifth diagnosis (PNET).
• We use the procedure to extract cluster signatures in the case of four clusters.
• Samples No. 34 to 41 belong to the fifth diagnosis.
![Page 27: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/27.jpg)
Other benchmark cancer data
• Lymphoma cancer data
– Three diagnoses for 62 patients.
• SRBCT cancer data
– Four expression diagnosis patterns for 63 samples.
• Leukemia
– Two diagnoses for 72 samples.
![Page 28: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/28.jpg)
Other benchmark cancer data
• Lymphoma cancer data
– AP: 3 errors with 3 clusters.
– SCAP: 1 error with 3 clusters.
• SRBCT cancer data
– AP: 22 errors with 5 clusters.
– SCAP: 7 errors with 4 clusters.
• Leukemia
– AP: 4 errors with 2 clusters.
– SCAP: 2 errors with 2 clusters.
![Page 29: Clustering by soft-constraint affinity propagation: applications to gene-expression data](https://reader033.fdocuments.in/reader033/viewer/2022051401/56814fff550346895dbdc83a/html5/thumbnails/29.jpg)
Discussion
• If clusters cannot be well represented by a single cluster exemplar, AP is bound to fail.
• SCAP is more efficient than AP, in particular for noisy, irregularly organized data, and thus for biological applications involving microarray data.
• The cluster structure can be efficiently probed.