Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances,...

ClusteringUsing clusters for classification

Introduction to Bioalgorithms:Clustering

Vincent Voelz

Temple University

October 12, 2015

Vincent Voelz Introduction to Bioalgorithms: Clustering


ClusteringExamples of clusteringDistance metrics

RMSD

Clustering algorithmsk-meansk-centers (k-medoids)Multidimensional Scaling (MDS)hierarchical clustering

Using clusters for classificationEigenfaces



Examples of clusteringDistance metricsClustering algorithms

Protein conformations

conformations sampled by molecular dynamicssimulations and the transitions between them asthe network nodes and links, respectively. Thenetwork analysis allows us to identify the topo-logical properties that are common to both beta3s,which folds to a unique three-dimensional struc-ture,15,19 and a random heteropolymer which lacks

a single preferential conformation like the nativestate despite the fact that it has the same residuecomposition as beta3s. These properties include thepresence of several free-energy minima and highlyconnected conformations (hubs). On the otherhand, a hierarchical modularity20 in the proximityof the native state is peculiar of a folding sequence.

Figure 1. The beta3s conformation space network. The size and color coding of the nodes reflect the statistical weightwand average neighbor connectivity knn respectively. White, cyan, and red nodes have knn!30, 30%knn%70, and knnO70,respectively. Representative conformations are shown by a pipe colored according to secondary structure: white standsfor coil, red for a-helix, orange for bend, cyan for strand and the N terminus is in blue. The variable radius of the pipereflects structural variability within snapshots in a conformation. The yellow diamonds are folding TS conformations(TSE1, TSE2, see the text for details) characterized by a connectivity/weight ratio k=2 !wO0:3, a clustering coefficientC!0.3, and 60!knn!80. This Figure was made using visone (www.visone.de) and MOLMOL40 visualization tools.

300 The Protein Folding Network

Rao, F., & Caflisch, A. (2004). The Protein Folding Network. Journal ofMolecular Biology, 342(1), 299?306.

http://doi.org/10.1016/j.jmb.2004.06.063


http://doi.org/10.1016/j.jmb.2004.06.063



Gene Expression Microarray Data

-3.00

-2

.00

-1.00

0.

00

1.00

2.

00

3.00

Ovar

y

Testi

s

Hea

d (F)

Head

(M)

Integ

umen

t (F)

Integ

umen

t (M

)Fa

t bod

y (F)

Fat b

ody (

M)

Mid

gut (

F)M

idgu

t (M

)Ho

moc

yte (

F)Ho

moc

yte (

M)

Malp

ighi

an tu

bule

(F)

Malp

ighi

an tu

bule

(M)

A/M

SG (F

)

A/M

SG (M

)

PSG

(F)

PSG

(M)

Bmae46-sw07727 Bmae53-sw05012 Bmae10-sw09787 Bmae30-sw15265 Bmae38-sw19995 Bmae39-sw20808 Bmae3-sw15605 Bmae16-sw12821 Bmae21-sw11948 Bmae19-sw13684 Bmae18-sw07946 Bmae49-sw05060 Bmae22-sw11949 Bmae17-sw21466 Bmae2-sw12242 Bmae27-sw03697 Bmjhe2-sw00495 Bmae24-sw14575 Bmae4-sw09185 Bmae5-sw20245 Bmae48-sw05730 Bmae15-sw08854 Bmae40-sw09098 Bmae41-sw06296 Bmie1-sw20886 Bmae35-sw06237 Bmun1-sw00733 Bmie2-sw08442 Bmae36-sw16332 Bmae32-sw13499 Bmnlg6-sw12843 Bmbe2-sw01051 Bmae33-sw09501 Bmjhe1-sw14035 Bmae50-sw08757 Bmae43-sw07726 Bmae25-sw22280 Bmjhe4-sw03795 Bmae23-sw06448 Bmae52-sw13618 Bmae14-sw21154 Bmae11-sw15854 Bmae55-sw22224 Bmae13-sw12117 Bmnlg3-sw14720

Ⅲ

Ⅱ

Ⅰ

Yu, Q.-Y., Lu, C., Li, W.-L., Xiang, Z.-H., & Zhang, Z. (2009). Annotationand expression of carboxylesterases in the silkworm, Bombyx mori. BMC

Genomics, 10(1), 553?14. http://doi.org/10.1186/1471-2164-10-553


http://doi.org/10.1186/1471-2164-10-553



Enzyme Function Superfamilies

Schnoes, A. M., Brown, S. D., Dodevski, I., & Babbitt, P. C. (2009).Annotation Error in Public Databases: Misannotation of Molecular Function in

Enzyme Superfamilies. PLoS Computational Biology, 5(12), e1000605?13.http://doi.org/10.1371/journal.pcbi.1000605


http://doi.org/10.1371/journal.pcbi.1000605



Evolutionary Phylogenetic Timetrees

18

Fig 1

at Temple U

niversity Law School Library on O

ctober 12, 2015http://m

be.oxfordjournals.org/D

ownloaded from

Hedges, S. B., Marin, J., Suleski, M., Paymer, M., & Kumar, S. (2015). Treeof life reveals clock-like speciation and diversification. Molecular Biology and

Evolution, 32(4), 835?845. http://doi.org/10.1093/molbev/msv037


http://doi.org/10.1093/molbev/msv037



Distance metrics

A distance metric is any function d(i , j) that is mathematicallywell-behaved (non-negative, symmetric, etc.) and satisfies thetriangle inequality.

I d(i , i) = 0 conincidence

I d(i , i) ≥ 0 non-negative

I d(i , j) = d(j , i) symmetric

I d(i , k) ≤ d(i , j) + d(j , k)triangle inequality

i

j

kQ: is sequence similarity a distance metric?




Sequence similarity is not necessarily a distance metric

A: It depends.Some similarity measures are metrics. Using our distance(s1,s2, w penalty=5.0, sigma=0.0), we find

I d(xylidines, xylose) = 26.33

I d(xylidines, xylophone) = 4.77

I d(xylophone, xylose) = 19.5

We have a violation of the triangle inequality d(xylidines, xylose)> d(xylidines, xylophone) + d(xylophone, xylose). Thus,distance() is not a metric!




Protein structure RMSDs

I A protein structure of N atoms can be described as an arrayof 3-element (x , y , z) vectors ~vi for i = 1...N.

I The RMSD (root-mean-squared deviation) of two proteinstructures ~v and ~w is

RMSD =

√√√√ 1

N

N∑i

||~vi − ~wi ||2

I where ||~v − ~wi ||2 = (vx − wx)2i + (vy − wy )2

i + (vz − wz)2i .

I Usually, the RMSD reported is the value after optimallysuperimposing and orienting the two structures.




Protein structure RMSDs - example

150 molecular structures for GB1 hairpin(GEWTYDDATKTFTVTE)

Razavi, A. M., & Voelz, V. A. (2015). Kinetic network models of tryptophan mutations in β-hairpins reveal the

importance of non-native interactions. Journal of Chemical Theory and Computation, 11, 2801?2812.




Protein structure RMSDs - example

Distribution of RMSD values for all non-zero pairwise distances:

0 2 4 6 8 10 12 14

RMSD ( )

P(R

MSD

)

These results are similar to larger proteins:I ”Structure prediction” is if we predict Cα positions to < 3− 4A.

I Simulation of crystal structures stabilize around to 1− 2A




Two kinds of clustering algorithms

Partitioning

Dividing samples into k groups

Hierarchical

A series of models at differentresolutions




k-means clustering

I GOAL: Given N objects x1, ....xN and a distance metricd(xi , xj), assign the objects into k groups C1, ...,Ck such thatthe average distance to the center of each group is minimized.

I Minimize the objective function:

f =K∑

k=1

N∑i∈Ck

d2(xi − µi )

where µi is the mean position in each cluster i .




k-means clustering: Step 1

Step 1: Randomly pick k cluster centers





Step 2: Calculate distances to each cluster center, and assigngroup membership





Step 3: Re-calculate the locations of the centers of each cluster

**

*Step 4: Repeat until no re-assignments occur, or until somedesired tolerance.




k-centers (k-medoids)

I A variant of k-means: instead of means, use an exemplar datapoint as the center of each cluster

Example: ”mean” protein atom coordinates are usually unphysicalVincent Voelz Introduction to Bioalgorithms: Clustering



Important issues with k-means/centers

Q: What if I don’t know the number of clusters ahead oftime?

I Cluster multiple times (there is stochasticity) for multiplevalues of k, and choose the best model based on statisticalcriteria

I Alternatively, covering algorithms exist (clustering variant)that use a distance threshold to decide cluster membership





Q: What if I have a huge data set?

I Clustering requires calculating all pairwise distances, soO(N2). Data sets of N > 105 (typically) will not fit incomputer memory!

I One trick is to cluster on a subsample M � N of your data,then assign the rest. Clustering will be O(M2), butassignment is only O(N).





Q: How can I visualize clusters if I only have distances?

I Example: objects whose distances are calculated from a set offeatures (genes, patients, metadata)

I Multidimensional scaling (MDS) methods can be used toperform dimensionality reduction, i.e. projecting data to somelow-dimensional subspace so that it can be more easilyvisualized.




Multidimensional Scaling (MDS) via distance geometry

~x1

~x0

~x2

I Example: solving an 3-D protein structure using NOE distancerestraints

I Consider N data points in an m-dimensional space

I We only know their distance from each other, not theircoordinates or dimensionality m.




~x1

~x0

~x2d10d20

d12

Pick ~x0 as an arbitrary origin. Then, by the law of cosines,

d2ij = d2

i0 + d2j0 − 2|~xi ||~xj | cos θ (1)

= d2i0 + d2

j0 − 2~xi · ~xj (2)

Therefore,

~xi · ~xj =1

2(d2

i0 + d2j0 − d2

ij )




~x1

~x0

~x2d10d20

d12

To extract ~xi from the distances d2ij , consider the matrix

G = XTX

where X is an N ×m matrix with data points as columns, such thatGij = ~xi · ~xj . Diagonalization of G yields

G = VTΛV = (VTΛ1/2)(Λ1/2V) = XTX

The eigenvector columns of V are the principal components (PC’s) of the

space. We can project to a low-dimensional subspace by using only the

PC’s with the largest eigenvalues λ1/2i .




MDS on all the ’x’ words

50 40 30 20 10 0 10 20PC1

30

20

10

0

10

20

30PC

2 xanthate

xanthates

xanthein

xantheins

xanthene

xanthenes

xanthicxanthin

xanthine

xanthines

xanthinsxanthoma

xanthomas

xanthomata

xanthone

xanthones

xanthous

xebec

xebecs

xenia

xenial

xenias

xenic

xenogamies

xenogamy

xenogenies

xenogeny

xenolith

xenoliths

xenon

xenons

xenophobe

xenophobes

xenophobia

xerarch

xeric

xerosere

xeroseres

xerosesxerosis

xerotic

xerus

xeruses

xi

xiphoid

xiphoids

xis

xu

xylan

xylansxylem xylems

xylene

xylenes

xylic

xylidinxylidine

xylidines

xylidins

xylocarp

xylocarps

xyloid

xylol

xylols

xylophone

xylophones

xylophonistxylophonists

xylose

xyloses

xylotomies

xylotomy

xylyl

xylyls

xyst

xysterxysters

xysti

xystoi

xystos

xysts

xystus

I Two largest PC’s of all the ’x’ words (distance(w penalty=5.0))




MDS on all the ’q’ words

70 60 50 40 30 20 10 0 10PC1

20

10

0

10

20

30PC

2

qaid

qaids

qindarqindars

qintar

qintars

qiviut

qiviuts

qoph

qophsqua

quackquacked

quackeries

quackery

quackingquackish

quackism

quackisms

quacks

quad

quadded

quadding

quadrangle

quadrangles

quadrangular

quadransquadrantquadrantes

quadrants

quadrat

quadrate

quadrated

quadrates

quadrating

quadrats

quadric

quadrics

quadrigaquadrigae

quadrilateral

quadrilaterals

quadrille

quadrilles

quadroonquadroons

quadruped

quadrupedal

quadrupedsquadruple

quadrupled

quadruplesquadruplet

quadruplets

quadrupling

quadsquaere

quaeres

quaestor

quaestorsquaffquaffed

quaffer

quaffers

quaffing

quaffs

quagquaggaquaggas

quaggier

quaggiest

quaggy

quagmire

quagmires

quagmirierquagmiriest

quagmiry

quags

quahaug

quahaugs quahogquahogsquai

quaich

quaiches quaichs

quaigh

quaighsquail

quailed

quailing

quailsquaint

quainter

quaintest

quaintly

quaintness

quaintnesses

quais

quake

quaked

quaker

quakers

quakes

quakier

quakiest

quakily

quaking

quaky

quale

qualia

qualification

qualifications

qualified

qualifier

qualifiers

qualifies

qualify

qualifying

qualitative

qualities

quality

qualm

qualmierqualmiest

qualmish

qualms

qualmy

quamash

quamashesquandang

quandangs

quandaries

quandaryquandong

quandongs

quant

quanta

quantal

quantedquanticquantics

quantified

quantifies

quantify

quantifying

quanting

quantitative

quantities

quantity

quantizequantized

quantizes

quantizing

quantongquantongs

quants

quantum

quarantine

quarantined

quarantines

quarantining

quarequark

quarks

quarrel

quarreled

quarrelingquarrelled

quarrelling

quarrels

quarrelsome

quarried

quarrierquarriers quarries

quarry

quarrying

quartquartan

quartans

quarte

quarter

quarterback

quarterbacked

quarterbacking

quarterbacks

quartered

quartering

quarterlies

quarterly

quartermaster

quartermasters

quartern

quarterns

quarters

quartesquartet

quartets quarticquartics

quartile

quartiles

quarto

quartos

quarts

quartz

quartzes

quasar

quasars quash

quashed

quashes

quashing

quasi

quassquasses

quassiaquassias

quassin

quassinsquatequatorze

quatorzes

quatrain

quatrains

quatre

quatres

quaver

quavered

quaverer

quaverers

quavering

quaversquavery

quay

quayage

quayages

quaylike

quays

quayside

quaysides

quean

queans

queasierqueasiest

queasily

queasinessqueasinesses

queasy

queazier

queaziest

queazy

queenqueened

queeningqueenlier

queenliest

queenlyqueens

queerqueered

queerer

queerest

queeringqueerish

queerly

queerness

queernesses

queers

quell

quelled

queller

quellers

quelling

quellsquench

quenchable

quenched

quencher

quenchers

quenches

quenchingquenchless

quenellequenellesquercine querida

queridas

queried

querierqueriersqueries

querist

querists

quern

querns

querulous

querulously

querulousnessquerulousnesses

query

querying

quest

quested

quester

questers

questing

question

questionable

questioned

questioner

questioners

questioningquestionnaire

questionnaires

questionniare

questionniares

questions

questor

questors

quests

quetzal

quetzales

quetzals

queue

queued

queueing

queuer

queuers

queues

queuing

quey

queys

quezal

quezales

quezalsquibble

quibbled

quibbler

quibblers

quibbles

quibbling

quiche

quiches

quick

quicken

quickened

quickening

quickens

quicker

quickestquickie

quickies

quickly

quickness

quicknesses

quicks

quicksand

quicksands

quickset

quicksetsquicksilver

quicksilvers

quid

quiddities

quiddity

quidnuncquidnuncs

quids

quiescence

quiescences

quiescent

quietquieted

quieten

quietened

quietening

quietens

quieter

quietersquietest

quieting

quietism

quietismsquietist

quietists

quietly

quietness

quietnesses

quiets

quietude

quietudes

quietus

quietusesquiff

quiffs

quillquillai

quillais

quilled

quillet

quillets

quilling

quills

quilt

quilted

quilter

quilters

quilting

quiltings

quilts

quinaries

quinary

quinate

quince

quinces

quincunx

quincunxesquinellaquinellas

quinic

quiniela

quinielas

quinin

quinina

quininas

quinine

quinines quinins

quinnat

quinnats

quinoaquinoas

quinoidquinoids

quinol

quinolinquinolins

quinols

quinonequinones

quinsies

quinsy

quint

quintain

quintains

quintal

quintals

quintan

quintans

quintar

quintars

quintessence

quintessences

quintessential

quintet

quintetsquintic

quintics

quintile

quintiles

quintin

quintins

quints

quintuple

quintupled

quintuplesquintuplet

quintuplets

quintupling

quip

quipped

quippingquippish

quippu

quippus

quipsquipster

quipsters

quipu

quipus

quire

quired

quires

quiring

quirk

quirkedquirkier

quirkiest

quirkily

quirking

quirks

quirky

quirt

quirted

quirting

quirts

quisling

quislings

quitquitch

quitchesquite

quitrent

quitrentsquits

quitted

quitter

quitters

quitting

quittor

quittors

quiver

quivered

quiverer

quiverers

quivering

quiversquivery

quixotequixotes

quixotic

quixotries

quixotryquiz

quizmaster

quizmasters

quizzed

quizzer

quizzers

quizzes

quizzing

quod

quods

quoin

quoined

quoining

quoinsquoit

quoited

quoiting

quoits

quomodo

quomodos

quondam

quorumquorums

quota

quotable

quotably

quotas

quotation

quotations

quotequoted

quoter

quoters

quotes

quothquotha

quotient

quotients

quoting

qurshqurshes

qurush

qurushes

I Two largest PC’s of all the ’q’ words (distance(w penalty=5.0))




MDS on GB1 peptide conformations

I Principal components track overall shape similarities (but notkinetics!)




Hierarchical Clustering

divisive agglomerative

I Most efficient algorithms are agglomerative




Agglomeration

Various linkage criteria can be used to rank which states toagglomerate

I single linkage (nearest-neighbor)

I complete linkage (furthest-neighbor)

I average linkage

I centroid linkage




Dendrograms

A dendrogram (”branch” plot) allows you to visualize the completehierarchy of clusterings

16661101311283410714189361337011912609122742013111714135738410410511243116140385832685357939922312172130971154849792912312659103108831141491244412981323564139102131981115610069374619134788750312580138154512096102540511459054674996117218255214252148869551097610626391471847332871421131467513763881012427611226530418194118012777713685143621440

20

40

60

80

100

120

140

160

RMSD Ward hierarchical clustering

from scipy import cluster

linkage = cluster.hierarchy.ward(distances)

plt.figure(figsize=(12,6))

plt.title(’RMSD Ward hierarchical clustering’)

cluster.hierarchy.dendrogram(linkage, no_labels=False, count_sort=’descendent’)

plt.savefig(’gb1_dendrogram.pdf’)



Eigenfaces

Classification

GOAL: Given some previous learning on training data, can weclassify new data as belonging to the correct group?



Eigenfaces

Distance classifiers

Distance classifier examples:

I face recognition

I character/handwriting recognition

Note that there are many other machine learning algorithms forclassification:

I Support Vector Machines (w/ kernel trick)

I Naive Bayes, Logistic Regression

I Neural Networks (e.g. Google Deep Dream, PSIPRED)



Eigenfaces

Eigenfaces

Images are just arrays of numbers, i.e. m-dimensional vectors ~x .



Eigenfaces

Eigenfaces

Steps:

1. Perform MDS (e.g.Principal ComponentAnalysis) on training data

2. Assign new data pointsusing the previously-learnedcluster centers

PC1

PC3

PC2



Eigenfaces

Character recognition

Linear Discriminant Analysis (LDA) of handwritten numerals

http://scikit-learn.org/stable/modules/manifold.html


http://scikit-learn.org/stable/modules/manifold.html

Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances,...

Documents

Transcript of Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances,...