Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances,...
Transcript of Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances,...
ClusteringUsing clusters for classification
Introduction to Bioalgorithms:Clustering
Vincent Voelz
Temple University
October 12, 2015
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
ClusteringExamples of clusteringDistance metrics
RMSD
Clustering algorithmsk-meansk-centers (k-medoids)Multidimensional Scaling (MDS)hierarchical clustering
Using clusters for classificationEigenfaces
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Protein conformations
conformations sampled by molecular dynamicssimulations and the transitions between them asthe network nodes and links, respectively. Thenetwork analysis allows us to identify the topo-logical properties that are common to both beta3s,which folds to a unique three-dimensional struc-ture,15,19 and a random heteropolymer which lacks
a single preferential conformation like the nativestate despite the fact that it has the same residuecomposition as beta3s. These properties include thepresence of several free-energy minima and highlyconnected conformations (hubs). On the otherhand, a hierarchical modularity20 in the proximityof the native state is peculiar of a folding sequence.
Figure 1. The beta3s conformation space network. The size and color coding of the nodes reflect the statistical weightwand average neighbor connectivity knn respectively. White, cyan, and red nodes have knn!30, 30%knn%70, and knnO70,respectively. Representative conformations are shown by a pipe colored according to secondary structure: white standsfor coil, red for a-helix, orange for bend, cyan for strand and the N terminus is in blue. The variable radius of the pipereflects structural variability within snapshots in a conformation. The yellow diamonds are folding TS conformations(TSE1, TSE2, see the text for details) characterized by a connectivity/weight ratio k=2 !wO0:3, a clustering coefficientC!0.3, and 60!knn!80. This Figure was made using visone (www.visone.de) and MOLMOL40 visualization tools.
300 The Protein Folding Network
Rao, F., & Caflisch, A. (2004). The Protein Folding Network. Journal ofMolecular Biology, 342(1), 299?306.
http://doi.org/10.1016/j.jmb.2004.06.063
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Gene Expression Microarray Data
-3.00
-2
.00
-1.00
0.
00
1.00
2.
00
3.00
Ovar
y
Testi
s
Hea
d (F)
Head
(M)
Integ
umen
t (F)
Integ
umen
t (M
)Fa
t bod
y (F)
Fat b
ody (
M)
Mid
gut (
F)M
idgu
t (M
)Ho
moc
yte (
F)Ho
moc
yte (
M)
Malp
ighi
an tu
bule
(F)
Malp
ighi
an tu
bule
(M)
A/M
SG (F
)
A/M
SG (M
)
PSG
(F)
PSG
(M)
Bmae46-sw07727 Bmae53-sw05012 Bmae10-sw09787 Bmae30-sw15265 Bmae38-sw19995 Bmae39-sw20808 Bmae3-sw15605 Bmae16-sw12821 Bmae21-sw11948 Bmae19-sw13684 Bmae18-sw07946 Bmae49-sw05060 Bmae22-sw11949 Bmae17-sw21466 Bmae2-sw12242 Bmae27-sw03697 Bmjhe2-sw00495 Bmae24-sw14575 Bmae4-sw09185 Bmae5-sw20245 Bmae48-sw05730 Bmae15-sw08854 Bmae40-sw09098 Bmae41-sw06296 Bmie1-sw20886 Bmae35-sw06237 Bmun1-sw00733 Bmie2-sw08442 Bmae36-sw16332 Bmae32-sw13499 Bmnlg6-sw12843 Bmbe2-sw01051 Bmae33-sw09501 Bmjhe1-sw14035 Bmae50-sw08757 Bmae43-sw07726 Bmae25-sw22280 Bmjhe4-sw03795 Bmae23-sw06448 Bmae52-sw13618 Bmae14-sw21154 Bmae11-sw15854 Bmae55-sw22224 Bmae13-sw12117 Bmnlg3-sw14720
Ⅲ
Ⅱ
Ⅰ
Yu, Q.-Y., Lu, C., Li, W.-L., Xiang, Z.-H., & Zhang, Z. (2009). Annotationand expression of carboxylesterases in the silkworm, Bombyx mori. BMC
Genomics, 10(1), 553?14. http://doi.org/10.1186/1471-2164-10-553
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Enzyme Function Superfamilies
Schnoes, A. M., Brown, S. D., Dodevski, I., & Babbitt, P. C. (2009).Annotation Error in Public Databases: Misannotation of Molecular Function in
Enzyme Superfamilies. PLoS Computational Biology, 5(12), e1000605?13.http://doi.org/10.1371/journal.pcbi.1000605
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Evolutionary Phylogenetic Timetrees
18
Fig 1
at Temple U
niversity Law School Library on O
ctober 12, 2015http://m
be.oxfordjournals.org/D
ownloaded from
Hedges, S. B., Marin, J., Suleski, M., Paymer, M., & Kumar, S. (2015). Treeof life reveals clock-like speciation and diversification. Molecular Biology and
Evolution, 32(4), 835?845. http://doi.org/10.1093/molbev/msv037
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Distance metrics
A distance metric is any function d(i , j) that is mathematicallywell-behaved (non-negative, symmetric, etc.) and satisfies thetriangle inequality.
I d(i , i) = 0 conincidence
I d(i , i) ≥ 0 non-negative
I d(i , j) = d(j , i) symmetric
I d(i , k) ≤ d(i , j) + d(j , k)triangle inequality
i
j
kQ: is sequence similarity a distance metric?
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Sequence similarity is not necessarily a distance metric
A: It depends.Some similarity measures are metrics. Using our distance(s1,s2, w penalty=5.0, sigma=0.0), we find
I d(xylidines, xylose) = 26.33
I d(xylidines, xylophone) = 4.77
I d(xylophone, xylose) = 19.5
We have a violation of the triangle inequality d(xylidines, xylose)> d(xylidines, xylophone) + d(xylophone, xylose). Thus,distance() is not a metric!
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Protein structure RMSDs
I A protein structure of N atoms can be described as an arrayof 3-element (x , y , z) vectors ~vi for i = 1...N.
I The RMSD (root-mean-squared deviation) of two proteinstructures ~v and ~w is
RMSD =
√√√√ 1
N
N∑i
||~vi − ~wi ||2
I where ||~v − ~wi ||2 = (vx − wx)2i + (vy − wy )2
i + (vz − wz)2i .
I Usually, the RMSD reported is the value after optimallysuperimposing and orienting the two structures.
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Protein structure RMSDs - example
150 molecular structures for GB1 hairpin(GEWTYDDATKTFTVTE)
Razavi, A. M., & Voelz, V. A. (2015). Kinetic network models of tryptophan mutations in β-hairpins reveal the
importance of non-native interactions. Journal of Chemical Theory and Computation, 11, 2801?2812.
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Protein structure RMSDs - example
Distribution of RMSD values for all non-zero pairwise distances:
0 2 4 6 8 10 12 14
RMSD ( )
P(R
MSD
)
These results are similar to larger proteins:I ”Structure prediction” is if we predict Cα positions to < 3− 4A.
I Simulation of crystal structures stabilize around to 1− 2A
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Two kinds of clustering algorithms
Partitioning
Dividing samples into k groups
Hierarchical
A series of models at differentresolutions
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
k-means clustering
I GOAL: Given N objects x1, ....xN and a distance metricd(xi , xj), assign the objects into k groups C1, ...,Ck such thatthe average distance to the center of each group is minimized.
I Minimize the objective function:
f =K∑
k=1
N∑i∈Ck
d2(xi − µi )
where µi is the mean position in each cluster i .
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
k-means clustering: Step 1
Step 1: Randomly pick k cluster centers
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
k-means clustering: Step 2
Step 2: Calculate distances to each cluster center, and assigngroup membership
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
k-means clustering: Step 3
Step 3: Re-calculate the locations of the centers of each cluster
**
*Step 4: Repeat until no re-assignments occur, or until somedesired tolerance.
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
k-centers (k-medoids)
I A variant of k-means: instead of means, use an exemplar datapoint as the center of each cluster
Example: ”mean” protein atom coordinates are usually unphysicalVincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Important issues with k-means/centers
Q: What if I don’t know the number of clusters ahead oftime?
I Cluster multiple times (there is stochasticity) for multiplevalues of k, and choose the best model based on statisticalcriteria
I Alternatively, covering algorithms exist (clustering variant)that use a distance threshold to decide cluster membership
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Important issues with k-means/centers
Q: What if I have a huge data set?
I Clustering requires calculating all pairwise distances, soO(N2). Data sets of N > 105 (typically) will not fit incomputer memory!
I One trick is to cluster on a subsample M � N of your data,then assign the rest. Clustering will be O(M2), butassignment is only O(N).
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Important issues with k-means/centers
Q: How can I visualize clusters if I only have distances?
I Example: objects whose distances are calculated from a set offeatures (genes, patients, metadata)
I Multidimensional scaling (MDS) methods can be used toperform dimensionality reduction, i.e. projecting data to somelow-dimensional subspace so that it can be more easilyvisualized.
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Multidimensional Scaling (MDS) via distance geometry
~x1
~x0
~x2
I Example: solving an 3-D protein structure using NOE distancerestraints
I Consider N data points in an m-dimensional space
I We only know their distance from each other, not theircoordinates or dimensionality m.
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
~x1
~x0
~x2d10d20
d12
Pick ~x0 as an arbitrary origin. Then, by the law of cosines,
d2ij = d2
i0 + d2j0 − 2|~xi ||~xj | cos θ (1)
= d2i0 + d2
j0 − 2~xi · ~xj (2)
Therefore,
~xi · ~xj =1
2(d2
i0 + d2j0 − d2
ij )
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
~x1
~x0
~x2d10d20
d12
To extract ~xi from the distances d2ij , consider the matrix
G = XTX
where X is an N ×m matrix with data points as columns, such thatGij = ~xi · ~xj . Diagonalization of G yields
G = VTΛV = (VTΛ1/2)(Λ1/2V) = XTX
The eigenvector columns of V are the principal components (PC’s) of the
space. We can project to a low-dimensional subspace by using only the
PC’s with the largest eigenvalues λ1/2i .
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
MDS on all the ’x’ words
50 40 30 20 10 0 10 20PC1
30
20
10
0
10
20
30PC
2 xanthate
xanthates
xanthein
xantheins
xanthene
xanthenes
xanthicxanthin
xanthine
xanthines
xanthinsxanthoma
xanthomas
xanthomata
xanthone
xanthones
xanthous
xebec
xebecs
xenia
xenial
xenias
xenic
xenogamies
xenogamy
xenogenies
xenogeny
xenolith
xenoliths
xenon
xenons
xenophobe
xenophobes
xenophobia
xerarch
xeric
xerosere
xeroseres
xerosesxerosis
xerotic
xerus
xeruses
xi
xiphoid
xiphoids
xis
xu
xylan
xylansxylem xylems
xylene
xylenes
xylic
xylidinxylidine
xylidines
xylidins
xylocarp
xylocarps
xyloid
xylol
xylols
xylophone
xylophones
xylophonistxylophonists
xylose
xyloses
xylotomies
xylotomy
xylyl
xylyls
xyst
xysterxysters
xysti
xystoi
xystos
xysts
xystus
I Two largest PC’s of all the ’x’ words (distance(w penalty=5.0))
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
MDS on all the ’q’ words
70 60 50 40 30 20 10 0 10PC1
20
10
0
10
20
30PC
2
qaid
qaids
qindarqindars
qintar
qintars
qiviut
qiviuts
qoph
qophsqua
quackquacked
quackeries
quackery
quackingquackish
quackism
quackisms
quacks
quad
quadded
quadding
quadrangle
quadrangles
quadrangular
quadransquadrantquadrantes
quadrants
quadrat
quadrate
quadrated
quadrates
quadrating
quadrats
quadric
quadrics
quadrigaquadrigae
quadrilateral
quadrilaterals
quadrille
quadrilles
quadroonquadroons
quadruped
quadrupedal
quadrupedsquadruple
quadrupled
quadruplesquadruplet
quadruplets
quadrupling
quadsquaere
quaeres
quaestor
quaestorsquaffquaffed
quaffer
quaffers
quaffing
quaffs
quagquaggaquaggas
quaggier
quaggiest
quaggy
quagmire
quagmires
quagmirierquagmiriest
quagmiry
quags
quahaug
quahaugs quahogquahogsquai
quaich
quaiches quaichs
quaigh
quaighsquail
quailed
quailing
quailsquaint
quainter
quaintest
quaintly
quaintness
quaintnesses
quais
quake
quaked
quaker
quakers
quakes
quakier
quakiest
quakily
quaking
quaky
quale
qualia
qualification
qualifications
qualified
qualifier
qualifiers
qualifies
qualify
qualifying
qualitative
qualities
quality
qualm
qualmierqualmiest
qualmish
qualms
qualmy
quamash
quamashesquandang
quandangs
quandaries
quandaryquandong
quandongs
quant
quanta
quantal
quantedquanticquantics
quantified
quantifies
quantify
quantifying
quanting
quantitative
quantities
quantity
quantizequantized
quantizes
quantizing
quantongquantongs
quants
quantum
quarantine
quarantined
quarantines
quarantining
quarequark
quarks
quarrel
quarreled
quarrelingquarrelled
quarrelling
quarrels
quarrelsome
quarried
quarrierquarriers quarries
quarry
quarrying
quartquartan
quartans
quarte
quarter
quarterback
quarterbacked
quarterbacking
quarterbacks
quartered
quartering
quarterlies
quarterly
quartermaster
quartermasters
quartern
quarterns
quarters
quartesquartet
quartets quarticquartics
quartile
quartiles
quarto
quartos
quarts
quartz
quartzes
quasar
quasars quash
quashed
quashes
quashing
quasi
quassquasses
quassiaquassias
quassin
quassinsquatequatorze
quatorzes
quatrain
quatrains
quatre
quatres
quaver
quavered
quaverer
quaverers
quavering
quaversquavery
quay
quayage
quayages
quaylike
quays
quayside
quaysides
quean
queans
queasierqueasiest
queasily
queasinessqueasinesses
queasy
queazier
queaziest
queazy
queenqueened
queeningqueenlier
queenliest
queenlyqueens
queerqueered
queerer
queerest
queeringqueerish
queerly
queerness
queernesses
queers
quell
quelled
queller
quellers
quelling
quellsquench
quenchable
quenched
quencher
quenchers
quenches
quenchingquenchless
quenellequenellesquercine querida
queridas
queried
querierqueriersqueries
querist
querists
quern
querns
querulous
querulously
querulousnessquerulousnesses
query
querying
quest
quested
quester
questers
questing
question
questionable
questioned
questioner
questioners
questioningquestionnaire
questionnaires
questionniare
questionniares
questions
questor
questors
quests
quetzal
quetzales
quetzals
queue
queued
queueing
queuer
queuers
queues
queuing
quey
queys
quezal
quezales
quezalsquibble
quibbled
quibbler
quibblers
quibbles
quibbling
quiche
quiches
quick
quicken
quickened
quickening
quickens
quicker
quickestquickie
quickies
quickly
quickness
quicknesses
quicks
quicksand
quicksands
quickset
quicksetsquicksilver
quicksilvers
quid
quiddities
quiddity
quidnuncquidnuncs
quids
quiescence
quiescences
quiescent
quietquieted
quieten
quietened
quietening
quietens
quieter
quietersquietest
quieting
quietism
quietismsquietist
quietists
quietly
quietness
quietnesses
quiets
quietude
quietudes
quietus
quietusesquiff
quiffs
quillquillai
quillais
quilled
quillet
quillets
quilling
quills
quilt
quilted
quilter
quilters
quilting
quiltings
quilts
quinaries
quinary
quinate
quince
quinces
quincunx
quincunxesquinellaquinellas
quinic
quiniela
quinielas
quinin
quinina
quininas
quinine
quinines quinins
quinnat
quinnats
quinoaquinoas
quinoidquinoids
quinol
quinolinquinolins
quinols
quinonequinones
quinsies
quinsy
quint
quintain
quintains
quintal
quintals
quintan
quintans
quintar
quintars
quintessence
quintessences
quintessential
quintet
quintetsquintic
quintics
quintile
quintiles
quintin
quintins
quints
quintuple
quintupled
quintuplesquintuplet
quintuplets
quintupling
quip
quipped
quippingquippish
quippu
quippus
quipsquipster
quipsters
quipu
quipus
quire
quired
quires
quiring
quirk
quirkedquirkier
quirkiest
quirkily
quirking
quirks
quirky
quirt
quirted
quirting
quirts
quisling
quislings
quitquitch
quitchesquite
quitrent
quitrentsquits
quitted
quitter
quitters
quitting
quittor
quittors
quiver
quivered
quiverer
quiverers
quivering
quiversquivery
quixotequixotes
quixotic
quixotries
quixotryquiz
quizmaster
quizmasters
quizzed
quizzer
quizzers
quizzes
quizzing
quod
quods
quoin
quoined
quoining
quoinsquoit
quoited
quoiting
quoits
quomodo
quomodos
quondam
quorumquorums
quota
quotable
quotably
quotas
quotation
quotations
quotequoted
quoter
quoters
quotes
quothquotha
quotient
quotients
quoting
qurshqurshes
qurush
qurushes
I Two largest PC’s of all the ’q’ words (distance(w penalty=5.0))
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
MDS on GB1 peptide conformations
I Principal components track overall shape similarities (but notkinetics!)
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Hierarchical Clustering
divisive agglomerative
I Most efficient algorithms are agglomerative
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Agglomeration
Various linkage criteria can be used to rank which states toagglomerate
I single linkage (nearest-neighbor)
I complete linkage (furthest-neighbor)
I average linkage
I centroid linkage
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Examples of clusteringDistance metricsClustering algorithms
Dendrograms
A dendrogram (”branch” plot) allows you to visualize the completehierarchy of clusterings
16661101311283410714189361337011912609122742013111714135738410410511243116140385832685357939922312172130971154849792912312659103108831141491244412981323564139102131981115610069374619134788750312580138154512096102540511459054674996117218255214252148869551097610626391471847332871421131467513763881012427611226530418194118012777713685143621440
20
40
60
80
100
120
140
160
RMSD Ward hierarchical clustering
from scipy import cluster
linkage = cluster.hierarchy.ward(distances)
plt.figure(figsize=(12,6))
plt.title(’RMSD Ward hierarchical clustering’)
cluster.hierarchy.dendrogram(linkage, no_labels=False, count_sort=’descendent’)
plt.savefig(’gb1_dendrogram.pdf’)
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Eigenfaces
Classification
GOAL: Given some previous learning on training data, can weclassify new data as belonging to the correct group?
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Eigenfaces
Distance classifiers
Distance classifier examples:
I face recognition
I character/handwriting recognition
Note that there are many other machine learning algorithms forclassification:
I Support Vector Machines (w/ kernel trick)
I Naive Bayes, Logistic Regression
I Neural Networks (e.g. Google Deep Dream, PSIPRED)
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Eigenfaces
Eigenfaces
Images are just arrays of numbers, i.e. m-dimensional vectors ~x .
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Eigenfaces
Eigenfaces
Steps:
1. Perform MDS (e.g.Principal ComponentAnalysis) on training data
2. Assign new data pointsusing the previously-learnedcluster centers
PC1
PC3
PC2
Vincent Voelz Introduction to Bioalgorithms: Clustering
ClusteringUsing clusters for classification
Eigenfaces
Character recognition
Linear Discriminant Analysis (LDA) of handwritten numerals
http://scikit-learn.org/stable/modules/manifold.html
Vincent Voelz Introduction to Bioalgorithms: Clustering