
Penalized and weighted K-means for clustering with noise and prior information incorporation

George C. Tseng
Department of Biostatistics
Department of Human Genetics
University of Pittsburgh

Outline

- Intro of cluster analysis
  - Model-based clustering
  - Heuristic methods: hierarchical clustering, K-means & K-medoids, ...
- A motivating example (yeast cell cycle microarray data)
- Penalized weighted K-means
  - Penalty term and weights
  - Some properties
  - Estimating parameters (k and λ)
- Applications
  - Simulation
  - Yeast cell cycle microarray data
  - CID fragmentation patterns in MS/MS
- Discussion

Intro. of cluster analysis

Cluster analysis: Data X = {x_i, i = 1, ..., n}, each object x_i ∈ R^p.

Given a dissimilarity measure d(x_i, x_j), assign the n objects into k disjoint clusters, i.e. C = {C_1, ..., C_k} with ∪_{j=1}^k C_j = X.

[Scatter plot of the example data (x vs. y).]

Intro. of cluster analysis

Cluster analysis:

1. Estimate the number of clusters k.

2. Decide which clustering method to use.

3. Evaluate and re-validate the clustering results.

Long history in statistics, computer science and applied math literature.

Intro. of cluster analysis

Model-based clustering:

1. Mixture maximum likelihood (ML):

$$L_{ML}(\pi,\mu,\Sigma)=\sum_{i=1}^{n}\log\Big[\sum_{j=1}^{k}\pi_j\, f(x_i;\mu_j,\Sigma_j)\Big]$$

2. Classification maximum likelihood (CML):

$$L_{CML}(C,\mu,\Sigma)=\sum_{j=1}^{k}\sum_{x_i\in C_j}\log f(x_i;\mu_j,\Sigma_j),\quad C=\{C_1,\ldots,C_k\},\ \bigcup_{j=1}^{k}C_j=X$$
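To make the contrast concrete, here is a minimal Python sketch of the two criteria (the function names are ours, and f is taken as a multivariate Gaussian density):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_loglik(X, pi, mu, sigma):
    # L_ML: every point contributes through all k components, weighted by pi_j
    dens = np.column_stack([multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
                            for j in range(len(pi))])
    return np.sum(np.log(dens @ np.asarray(pi)))

def classification_loglik(X, labels, mu, sigma):
    # L_CML: every point contributes only through its assigned cluster C_j
    return sum(multivariate_normal.logpdf(X[labels == j], mean=mu[j], cov=sigma[j]).sum()
               for j in range(len(mu)))
```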

Intro. of cluster analysis

Model selection for model-based clustering: use the Bayesian Information Criterion (BIC) to determine k and Σ_j.

$$2\log p(x\mid M)+\text{const.}\approx 2\,l_M(x,\hat{\theta})-m_M\log(n)\equiv \text{BIC}$$

p(x|M) is the (integrated) likelihood of the data under model M.

l_M(x, θ̂) is the maximized mixture log-likelihood for the model.

m_M is the number of independent parameters to be estimated in the model.
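As a one-line sketch (our naming), the criterion computed from a fitted model is:

```python
import numpy as np

def bic(max_loglik, m_params, n_obs):
    # 2*l_M(x, theta_hat) - m_M*log(n); higher is better under this sign convention
    return 2.0 * max_loglik - m_params * np.log(n_obs)
```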

Intro. of cluster analysis

Hierarchical clustering:

Iteratively agglomerate the nearest nodes to form a bottom-up tree.

Single Linkage: shortest distance between points in the two nodes.

Complete Linkage: largest distance between points in the two nodes.

Note: Clusters can be obtained by cutting the hierarchical tree.
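For illustration, a minimal SciPy sketch of both linkages and tree cutting (the data here is a random placeholder):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 5)                    # 20 objects in R^5 (placeholder data)
Z_single = linkage(X, method='single')       # merge by shortest distance between nodes
Z_complete = linkage(X, method='complete')   # merge by largest distance between nodes
labels = fcluster(Z_complete, t=3, criterion='maxclust')  # cut the tree into 3 clusters
```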

Intro. of cluster analysis

K-means criterion: Minimize the within-group sum-squared dispersion to obtain C:

$$W_{K\text{-}means}(C;k)=\sum_{j=1}^{k}\sum_{x_i\in C_j}\big\|x_i-\bar{x}^{(j)}\big\|^2,$$

where $\bar{x}^{(j)}$ is the center of cluster $j$.
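A minimal Lloyd-style sketch (our implementation, not the speaker's code) that minimizes this criterion:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations minimizing W(C; k) = sum_j sum_{i in C_j} ||x_i - xbar_j||^2."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: recompute each center as its cluster mean
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```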

K-medoids criterion:

$$W_{K\text{-}medoids}(C;k)=\sum_{j=1}^{k}\sum_{x_i\in C_j}d\big(x_i,x^{(j)}\big),$$

where $x^{(j)} \in X$ is the median point (medoid) of cluster $j$.

Intro. of cluster analysis

Proposition: K-means is a special case of CML under a Gaussian model of identical spherical clusters.

K-means: $W_{K\text{-}means}(C;k)=\sum_{j=1}^{k}\sum_{x_i\in C_j}\|x_i-\bar{x}^{(j)}\|^2$

CML: $L_{CML}(C,\mu)=\sum_{j=1}^{k}\sum_{x_i\in C_j}\log f(x_i;\mu_j,\sigma^2 I)$

With $f$ Gaussian and a common covariance $\sigma^2 I$, maximizing $L_{CML}$ over $(C,\mu)$ is equivalent to minimizing $W_{K\text{-}means}$.

K-means cluster sizes:

Cluster:  C1   C2   C3   C4   C5
Size:     356  300  267  538  202

Clusters contain many false positives because the algorithm has to assign all genes into clusters.

Many genes are noise (scattered) genes!!

A motivating example: Yeast cell cycle microarray data

Traditional: Assign all genes into clusters.

[Two scatter plots of the data (x vs. y).]

Question: Can we allow a set of scattered (noise) genes?

A motivating example: Yeast cell cycle microarray data

K-means vs. penalized K-means:

Cluster:            C1   C2   C3   C4   C5   S
K-means:            356  300  267  538  202  --
Penalized K-means:  58   31   39   101  71   1276

Prior information: Six groups of validated cell cycle genes.

A motivating example: Yeast cell cycle microarray data

Goal 1: Allow a set of scattered genes without being clustered.

Goal 2: Incorporation of prior information in cluster formation.

A motivating example: Yeast cell cycle microarray data

PW-Kmeans formulation: Assign n objects into k clusters and a possible noise set, i.e. C = {C_1, ..., C_k, S}. Extend the K-means criterion to:

$$W_{PW}(C;k,\lambda)=\sum_{j=1}^{k}\sum_{x_i\in C_j} w(x_i;P)\, d(x_i,C_j)+\lambda\,|S|,$$

where
d(x_i, C_j): dispersion of point x_i in C_j;
|S|: number of objects in the noise set S;
w(x_i; P): weight function to incorporate prior information P;
λ: a tuning parameter.

PW-Kmeans

How does it work?

Penalty term λ: assigns outlying objects of a cluster to the noise set S.

Weighting term w: utilizes prior knowledge of preferred or prohibited patterns P (see the sketch below).
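Putting the two pieces together, here is a minimal sketch of one plausible iteration scheme under the criterion above (squared Euclidean dispersion, weights entering multiplicatively; the paper's actual algorithm may differ in details):

```python
import numpy as np

def pw_kmeans(X, k, lam, w=None, n_iter=100, seed=0):
    """Sketch of penalized weighted K-means:
    minimize sum_j sum_{i in C_j} w_i * ||x_i - xbar_j||^2 + lam * |S|.
    w: per-object weights from prior info P (w_i = 1 recovers penalized K-means)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    w = np.ones(n) if w is None else np.asarray(w)
    centers = X[rng.choice(n, size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        cost = w[:, None] * d2                 # weighted dispersion to each center
        labels = cost.argmin(axis=1)
        labels[cost.min(axis=1) > lam] = -1    # cheaper to pay the penalty: noise set S
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers                     # labels == -1 marks the noise set S
```

With w ≡ 1 and λ = ∞ no point ever enters S, and this reduces to plain K-means.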

Proposition: Denote by $C^*(\lambda)=(C_1^*(\lambda),\ldots,C_k^*(\lambda),S^*(\lambda))$ the minimizer given $k$ and $\lambda$. Then:

1. If $k_1 > k_2$ (with $\lambda$ fixed), $W(C^*;k_1,\lambda) < W(C^*;k_2,\lambda)$.

2. If $\lambda_1 > \lambda_2$ (with $k$ fixed), $|S^*(\lambda_1)| \le |S^*(\lambda_2)|$.

3. If $\lambda_1 > \lambda_2$ (with $k$ fixed), $W(C^*(\lambda_1);k,\lambda_1) > W(C^*(\lambda_2);k,\lambda_2)$.

PW-Kmeans

K-means and K-medoids are two special cases of the generalized PW-Kmeans criterion (take $w(\cdot\,;\cdot)\equiv 1$ and $\lambda=\infty$):

$$W_{K\text{-}means}(C;k)=\sum_{j=1}^{k}\sum_{x_i\in C_j}\big\|x_i-\bar{x}^{(j)}\big\|^2$$

$$W_{K\text{-}medoids}(C;k)=\sum_{j=1}^{k}\sum_{x_i\in C_j}d\big(x_i,x^{(j)}\big)$$

PW-Kmeans: Relation to classification likelihood

The K-means loss function corresponds to the classification likelihood under a Gaussian model.

The penalized K-means loss function corresponds to a classification likelihood (Gaussian model) in which the noise set is uniformly distributed over a space of volume V.
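Filling in the step the slides allude to (a sketch, assuming identical spherical Gaussian clusters and a uniform noise component on a region of volume $V$):

$$
L_{CML}(C,\mu)=\sum_{j=1}^{k}\sum_{x_i\in C_j}\log\phi(x_i;\mu_j,\sigma^2 I)+\sum_{x_i\in S}\log\frac{1}{V}
=-\frac{1}{2\sigma^2}\Big[\sum_{j=1}^{k}\sum_{x_i\in C_j}\|x_i-\mu_j\|^2+\lambda\,|S|\Big]+\text{const},
$$

with $\lambda = 2\sigma^2\big(\log V - \tfrac{p}{2}\log 2\pi\sigma^2\big)$, so maximizing $L_{CML}$ is equivalent to minimizing the penalized K-means loss.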

PW-Kmeans

Estimate k and λ

Tibshirani et al. 2001

Let $C(X,k)$ denote the clustering operation applied to $X$ with $k$ clusters. Let $D[C(X_{tr},k),X_{te}]$ be an $n_{te}\times n_{te}$ "co-membership" matrix whose $(i,i')$ entry is 1 if $x_i$ and $x_{i'}$ are assigned to the same cluster, and let $A_1,\ldots,A_k$ be the test-set clusters with sizes $n_{k1},\ldots,n_{kk}$. The prediction strength is

$$ps(k)=\min_{1\le j\le k}\ \frac{1}{n_{kj}(n_{kj}-1)}\sum_{i\ne i'\in A_j}D\big[C(X_{tr},k),X_{te}\big]_{ii'}.$$

For each test cluster, we compute the proportion of observation pairs in that cluster that are also assigned to the same cluster by the training-set centroids.

Estimate k and λ

PW-Kmeans

λ is inversely related to the number of noise genes |S|.
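A sketch of the prediction-strength computation (our code; cluster_fn stands for any clustering routine returning labels and centers, e.g. the kmeans sketch above):

```python
import numpy as np

def prediction_strength(X, k, cluster_fn, seed=0):
    """Split the data, cluster both halves, and check whether test-cluster
    co-memberships are reproduced by the training-half centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    tr, te = idx[: len(X) // 2], idx[len(X) // 2:]
    _, centers_tr = cluster_fn(X[tr], k)
    labels_te, _ = cluster_fn(X[te], k)
    # assign each test point to its nearest training centroid
    d2 = ((X[te][:, None, :] - centers_tr[None, :, :]) ** 2).sum(axis=2)
    cross = d2.argmin(axis=1)
    ps = []
    for j in range(k):
        members = np.where(labels_te == j)[0]
        n_j = len(members)
        if n_j < 2:
            continue
        # proportion of pairs in test cluster j co-assigned by training centroids
        same = sum(cross[a] == cross[b] for a in members for b in members if a != b)
        ps.append(same / (n_j * (n_j - 1)))
    return min(ps) if ps else float('nan')
```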

Simulation

Penalized K-means (no weight term)

Estimate k and λ

Simulation

Applications I: Yeast cell cycle microarray data

Prior information: Six groups of validated cell cycle genes; 8 histone genes are tightly co-regulated in S phase.

Penalized K-means cluster sizes:

Cluster:  C1  C2  C3  C4   C5  S
Size:     58  31  39  101  71  1276

The 8 histone genes are left in the noise set S without being clustered.

Applications I: Yeast cell cycle microarray data

Penalized K-means

Prior knowledge of p pathways

The weight is designed as a transformation of logistic function:
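The slide's exact formula is not shown in this transcript; as a hypothetical illustration of "a transformation of the logistic function", a weight could shrink the dispersion of objects matching a prior pathway pattern:

```python
import numpy as np

def weight(x, prior_patterns, beta=2.0, w_min=0.2, r=1.0):
    # Hypothetical illustration only -- beta, w_min, r are our own parameters.
    # Objects close to a validated pattern get weight near w_min (< 1), making
    # their assignment to that pattern's cluster cheap; far objects get weight ~1.
    d = min(np.sum((x - p) ** 2) for p in prior_patterns)
    return w_min + (1.0 - w_min) / (1.0 + np.exp(-beta * (d - r)))
```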

Applications I: Yeast cell cycle microarray data

Penalized weighted K-means

Design of weight function

Applications I: Yeast cell cycle microarray data

Take three randomly selected histone genes as prior information P, then perform penalized weighted K-means:

Cluster:  C1   C2   C3  C4   C5  S
Size:     112  158  88  139  57  1109

The 8 histone genes are now in cluster 3.

Applications I: Yeast cell cycle microarray data

[Cluster figure; legend: unannotated genes]

p-value calculation (null is hypergeometric distribution):

G: total of 1663 genes
D(F): # of genes in the functional category (23 + 5 = 28)
n(C): # of genes in the cluster (4 + 23 + 71 = 98)
d(F): # of genes in both the cluster and the functional category (23)
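With these counts, the enrichment p-value can be computed directly (a sketch using SciPy's hypergeometric survival function):

```python
from scipy.stats import hypergeom

G, D_F, n_C, d_F = 1663, 28, 98, 23
# P(at least d(F) category genes fall in a random cluster of size n(C))
p_value = hypergeom.sf(d_F - 1, G, D_F, n_C)   # sf(k-1) gives P(X >= k)
```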

Applications I: Yeast cell cycle microarray data

Annotation prediction from clusters

Given a p-value threshold δ (δ=0.01), we can compute:

Predictions made: 42 + 98 + 98 = 238. Accuracy: (10 + 4 + 23)/(42 + 98 + 98) = 37/238 = 15.55%.

Varying δ gives varying “Predictions made” and “Accuracy”

Evaluation of annotation prediction

Applications I: Yeast cell cycle microarray data

Accuracy of random guess

Applications I: Yeast cell cycle microarray data — Evaluation of annotation prediction, δ = 10^-4, ..., 10^-20

• P-Kmeans is generally better than K-means.

• P-Kmeans makes fewer predictions than K-means but produces much higher accuracy.

• Smaller λ results in smaller clusters and fewer predictions made, but with better accuracy.

Applications I: Yeast cell cycle microarray data

Conclusion: Evaluation of annotation prediction

[Workflow diagram: proteins (Protein #1, Protein #2) are digested by an enzyme into peptides (SIYDGK, FWSEFR, TLLHPYK); HPLC/MS records abundance vs. m/z; selected peptides undergo CID and MS/MS; a peptide sequencing algorithm identifies the peptide sequence.]

Applications II: CID fragmentation pattern in MS/MS

[Illustration: collision energy cleaves the peptide LITSHLVDTDPEVDSIIKDEIER at different backbone positions.]

Collision-Induced Dissociation (CID)

The abundance of such cleavages is recorded as intensities.

[Two MS/MS spectra of LITSHLVDTDPEVDSIIKDEIER: % relative abundance vs. m/z, with b10 and y13 fragment ions marked.]

Applications II: CID fragmentation pattern in MS/MS

One single peptide: AAAMDAQAEAK

Adjacent residue pairs (1st, 2nd) with counts and normalized intensities (excerpt of the 20×20 table):

1st  2nd  count  intensities
A    A    2      0, 0.381
A    E    1      0.031
A    M    1      0.514
A    Q    1      0.096
(all other A–X pairs and C–A: count 0)

Applications II: CID fragmentation pattern in MS/MS

All intensities normalized to [0,1]

Assume intensities measure the probability of dissociation.
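A minimal sketch (our code) of how such distributions can be accumulated per residue pair:

```python
from collections import defaultdict

def pair_intensities(peptide, intensities):
    """Collect cleavage intensities by the residue pair flanking each cleavage site.
    intensities[i] is the normalized ([0,1]) intensity of the cleavage between
    peptide[i] and peptide[i+1]."""
    dist = defaultdict(list)
    for i in range(len(peptide) - 1):
        dist[peptide[i] + peptide[i + 1]].append(intensities[i])
    return dist

# For the slide's single peptide, pair_intensities("AAAMDAQAEAK", obs) accumulates
# two observations for "AA" and one each for "AM", "MD", "DA", "AQ", "QA", "AE", "EA", "AK".
```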

For the 720-peptide set, the rows beginning with A (count, followed by the recorded intensities):

1st  2nd  count  intensities
A    A    188    0 0 0 0 0 0 0 0 0 0 0
A    C    1      0.105
A    D    91     0 0 0 0 0 0 0 0 0 0 0
A    E    94     0 0 0 0 0 0 0 0 0 0 0
A    F    41     0 0 0 0 0 0 0 0 0.008 0.008 0.01
A    G    129    0 0 0 0 0 0 0 0 0 0 0
A    H    0
A    I    69     0 0 0 0 0 0 0 0 0 0 0.019
A    K    5      0 0.01 0.013 0.122 0.139
A    L    137    0 0 0 0 0 0 0 0 0 0 0
A    M    28     0.004 0.02 0.023 0.034 0.034 0.06 0.062 0.068 0.093 0.101 0.122
A    N    52     0 0 0 0 0.006 0.022 0.025 0.025 0.033 0.039 0.046
A    P    152    0 0 0 0 0.012 0.029 0.041 0.042 0.049 0.059 0.068
A    Q    58     0 0 0 0 0 0 0 0 0 0 0
A    R    0
A    S    59     0 0 0 0 0 0.01 0.021 0.031 0.038 0.04 0.049
A    T    78     0 0 0 0 0 0 0 0 0 0.013 0.018
A    V    88     0 0 0 0 0 0 0 0 0 0 0
A    W    11     0 0 0 0 0.029 0.149 0.205 0.321 0.428 0.454 1
A    Y    26     0 0 0 0 0.008 0.011 0.047 0.063 0.065 0.078 0.079
C    A    0

For a specific set of peptides (720 peptides): 20×20 = 400 independent distributions, each with zero or multiple (up to hundreds of) observations.

Applications II: CID fragmentation pattern in MS/MS

Applications II: CID fragmentation pattern in MS/MS

Current protein identification algorithms assume a completely random dissociation probability pattern.

This, however, turns out not to be true: the dissociation pattern depends on the charge state and the peptide sequence motif.

Applications II: CID fragmentation pattern in MS/MS

1st  2nd  count  intensities
A    A    188    0 0 0 0 0 0 0 0 0 0
A    C    1      0.105
A    D    91     0 0 0 0 0 0 0 0 0 0
A    E    94     0 0 0 0 0 0 0 0 0 0
A    F    41     0 0 0 0 0 0 0 0 0.008 0.008
A    G    129    0 0 0 0 0 0 0 0 0 0
A    H    0
A    I    69     0 0 0 0 0 0 0 0 0 0
A    K    5      0 0.01 0.013 0.122 0.139
A    L    137    0 0 0 0 0 0 0 0 0 0
A    M    28     0.004 0.02 0.023 0.034 0.034 0.06 0.062 0.068 0.093 0.101
A    N    52     0 0 0 0 0.006 0.022 0.025 0.025 0.033 0.039
A    P    152    0 0 0 0 0.012 0.029 0.041 0.042 0.049 0.059
A    Q    58     0 0 0 0 0 0 0 0 0 0
A    R    0

D    A    73     0.018 0.026 0.032 0.046 0.046 0.093 0.108 0.146 0.155 0.173 0.193
D    C    1      0
D    D    38     0 0.012 0.056 0.097 0.097 0.118 0.135 0.142 0.156 0.182 0.197
D    E    53     0 0 0 0.021 0.026 0.035 0.05 0.073 0.08 0.085 0.095
D    F    38     0 0.003 0.044 0.1 0.119 0.214 0.232 0.283 0.41 0.468 0.507
D    G    44     0.024 0.128 0.128 0.226 0.239 0.247 0.383 0.395 0.491 0.529 0.693
D    H    0
D    I    38     0.029 0.054 0.063 0.128 0.15 0.173 0.229 0.233 0.257 0.268 0.284
D    K    0
D    L    76     0 0 0 0 0 0.01 0.057 0.064 0.113 0.114 0.126
D    M    15     0 0.147 0.212 0.376 0.419 0.709 0.806 0.841 0.885 0.887 0.947
D    N    18     0.047 0.108 0.232 0.442 0.458 0.506 0.508 0.575 0.585 0.844 0.904
D    P    63     0.122 0.444 0.481 0.61 0.631 0.675 0.753 0.882 0.883 1 1
D    Q    14     0 0.098 0.163 0.228 0.262 0.297 0.421 0.459 0.463 0.55 0.667
D    R    0

A–X rows: low intensities.
D–X rows: high intensities.

Applications II: CID fragmentation pattern in MS/MS

Visualization of a distribution: ten concentric donuts represent the 5%, 15%, ..., 95% percentiles; values are shown by a color gradient.

Applications II: CID fragmentation pattern in MS/MS

The dissociation pattern depends on the charge state and the peptide sequence motif.

Distances cannot be defined for most pairs of peptides (more than 95% missing values). Distance between a peptide and a set of peptides, however, can be defined. K-means and PW-Kmeans are therefore applicable while most other clustering methods fail.

Applications II: CID fragmentation pattern in MS/MS

[…P…R]+

[…P…R]2+

• Intensity data for each peptide contain >95% missing values; most clustering methods would not work.

• Dissimilarity between two peptides cannot be defined.

• Fortunately, dissimilarity between one peptide and a set of peptides can be calculated, so penalized K-means can be used (see the sketch below).
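A sketch of why this works (our code): a cluster profile is computable per residue pair from whichever peptides observe that pair, so each peptide can be compared on its own observed pairs:

```python
import numpy as np

def d_to_cluster(x, cluster_rows):
    """Dissimilarity between one peptide's intensity vector x (NaN for the >95%
    unobserved residue pairs) and a set of peptides (rows of the same layout)."""
    profile = np.nanmean(cluster_rows, axis=0)   # per-pair cluster pattern
    mask = ~np.isnan(x) & ~np.isnan(profile)     # pairs observed on both sides
    if not mask.any():
        return np.nan                            # still undefined with no overlap
    return np.mean((x[mask] - profile[mask]) ** 2)
```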

Discussion

Model-based vs. heuristic:

Model-based: Gaussian mixture model; Bayesian clustering.

Heuristic: K-medoids; K-means; PW-Kmeans; SOM; hierarchical clustering; CLICK.

Tight clustering (a re-evaluation machinery based on re-sampling techniques).

Conclusion

Penalized and weighted K-means (PW-Kmeans) provides a general and flexible way to cluster complex data.

The penalty term allows a noise set to remain unclustered, avoiding information dilution by noise.

The weights can be designed to incorporate prior biological pathway information; this is similar to a Bayesian approach but avoids explicit modelling.

With many missing values (the MS/MS example), most methods are hard to pursue, but P-Kmeans worked well.

Acknowledgement

MS/MS data: collaboration with Yingying Huang from Vicki Wysocki's lab at the University of Arizona.

Discussion and comments from Haiyan Huang, Eleanor Feingold, and Wing Wong.