Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances,...

34
Clustering Using clusters for classification Introduction to Bioalgorithms: Clustering Vincent Voelz Temple University October 12, 2015 Vincent Voelz Introduction to Bioalgorithms: Clustering

Transcript of Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances,...

Page 1: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Introduction to Bioalgorithms:Clustering

Vincent Voelz

Temple University

October 12, 2015

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 2: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

ClusteringExamples of clusteringDistance metrics

RMSD

Clustering algorithmsk-meansk-centers (k-medoids)Multidimensional Scaling (MDS)hierarchical clustering

Using clusters for classificationEigenfaces

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 3: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Protein conformations

conformations sampled by molecular dynamicssimulations and the transitions between them asthe network nodes and links, respectively. Thenetwork analysis allows us to identify the topo-logical properties that are common to both beta3s,which folds to a unique three-dimensional struc-ture,15,19 and a random heteropolymer which lacks

a single preferential conformation like the nativestate despite the fact that it has the same residuecomposition as beta3s. These properties include thepresence of several free-energy minima and highlyconnected conformations (hubs). On the otherhand, a hierarchical modularity20 in the proximityof the native state is peculiar of a folding sequence.

Figure 1. The beta3s conformation space network. The size and color coding of the nodes reflect the statistical weightwand average neighbor connectivity knn respectively. White, cyan, and red nodes have knn!30, 30%knn%70, and knnO70,respectively. Representative conformations are shown by a pipe colored according to secondary structure: white standsfor coil, red for a-helix, orange for bend, cyan for strand and the N terminus is in blue. The variable radius of the pipereflects structural variability within snapshots in a conformation. The yellow diamonds are folding TS conformations(TSE1, TSE2, see the text for details) characterized by a connectivity/weight ratio k=2 !wO0:3, a clustering coefficientC!0.3, and 60!knn!80. This Figure was made using visone (www.visone.de) and MOLMOL40 visualization tools.

300 The Protein Folding Network

Rao, F., & Caflisch, A. (2004). The Protein Folding Network. Journal ofMolecular Biology, 342(1), 299?306.

http://doi.org/10.1016/j.jmb.2004.06.063

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 4: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Gene Expression Microarray Data

-3.00

-2

.00

-1.00

0.

00

1.00

2.

00

3.00

Ovar

y

Testi

s

Hea

d (F)

Head

(M)

Integ

umen

t (F)

Integ

umen

t (M

)Fa

t bod

y (F)

Fat b

ody (

M)

Mid

gut (

F)M

idgu

t (M

)Ho

moc

yte (

F)Ho

moc

yte (

M)

Malp

ighi

an tu

bule

(F)

Malp

ighi

an tu

bule

(M)

A/M

SG (F

)

A/M

SG (M

)

PSG

(F)

PSG

(M)

Bmae46-sw07727 Bmae53-sw05012 Bmae10-sw09787 Bmae30-sw15265 Bmae38-sw19995 Bmae39-sw20808 Bmae3-sw15605 Bmae16-sw12821 Bmae21-sw11948 Bmae19-sw13684 Bmae18-sw07946 Bmae49-sw05060 Bmae22-sw11949 Bmae17-sw21466 Bmae2-sw12242 Bmae27-sw03697 Bmjhe2-sw00495 Bmae24-sw14575 Bmae4-sw09185 Bmae5-sw20245 Bmae48-sw05730 Bmae15-sw08854 Bmae40-sw09098 Bmae41-sw06296 Bmie1-sw20886 Bmae35-sw06237 Bmun1-sw00733 Bmie2-sw08442 Bmae36-sw16332 Bmae32-sw13499 Bmnlg6-sw12843 Bmbe2-sw01051 Bmae33-sw09501 Bmjhe1-sw14035 Bmae50-sw08757 Bmae43-sw07726 Bmae25-sw22280 Bmjhe4-sw03795 Bmae23-sw06448 Bmae52-sw13618 Bmae14-sw21154 Bmae11-sw15854 Bmae55-sw22224 Bmae13-sw12117 Bmnlg3-sw14720

Yu, Q.-Y., Lu, C., Li, W.-L., Xiang, Z.-H., & Zhang, Z. (2009). Annotationand expression of carboxylesterases in the silkworm, Bombyx mori. BMC

Genomics, 10(1), 553?14. http://doi.org/10.1186/1471-2164-10-553

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 5: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Enzyme Function Superfamilies

Schnoes, A. M., Brown, S. D., Dodevski, I., & Babbitt, P. C. (2009).Annotation Error in Public Databases: Misannotation of Molecular Function in

Enzyme Superfamilies. PLoS Computational Biology, 5(12), e1000605?13.http://doi.org/10.1371/journal.pcbi.1000605

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 6: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Evolutionary Phylogenetic Timetrees

18

Fig 1

at Temple U

niversity Law School Library on O

ctober 12, 2015http://m

be.oxfordjournals.org/D

ownloaded from

Hedges, S. B., Marin, J., Suleski, M., Paymer, M., & Kumar, S. (2015). Treeof life reveals clock-like speciation and diversification. Molecular Biology and

Evolution, 32(4), 835?845. http://doi.org/10.1093/molbev/msv037

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 7: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Distance metrics

A distance metric is any function d(i , j) that is mathematicallywell-behaved (non-negative, symmetric, etc.) and satisfies thetriangle inequality.

I d(i , i) = 0 conincidence

I d(i , i) ≥ 0 non-negative

I d(i , j) = d(j , i) symmetric

I d(i , k) ≤ d(i , j) + d(j , k)triangle inequality

i

j

kQ: is sequence similarity a distance metric?

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 8: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Sequence similarity is not necessarily a distance metric

A: It depends.Some similarity measures are metrics. Using our distance(s1,s2, w penalty=5.0, sigma=0.0), we find

I d(xylidines, xylose) = 26.33

I d(xylidines, xylophone) = 4.77

I d(xylophone, xylose) = 19.5

We have a violation of the triangle inequality d(xylidines, xylose)> d(xylidines, xylophone) + d(xylophone, xylose). Thus,distance() is not a metric!

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 9: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Protein structure RMSDs

I A protein structure of N atoms can be described as an arrayof 3-element (x , y , z) vectors ~vi for i = 1...N.

I The RMSD (root-mean-squared deviation) of two proteinstructures ~v and ~w is

RMSD =

√√√√ 1

N

N∑i

||~vi − ~wi ||2

I where ||~v − ~wi ||2 = (vx − wx)2i + (vy − wy )2

i + (vz − wz)2i .

I Usually, the RMSD reported is the value after optimallysuperimposing and orienting the two structures.

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 10: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Protein structure RMSDs - example

150 molecular structures for GB1 hairpin(GEWTYDDATKTFTVTE)

Razavi, A. M., & Voelz, V. A. (2015). Kinetic network models of tryptophan mutations in β-hairpins reveal the

importance of non-native interactions. Journal of Chemical Theory and Computation, 11, 2801?2812.

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 11: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Protein structure RMSDs - example

Distribution of RMSD values for all non-zero pairwise distances:

0 2 4 6 8 10 12 14

RMSD ( )

P(R

MSD

)

These results are similar to larger proteins:I ”Structure prediction” is if we predict Cα positions to < 3− 4A.

I Simulation of crystal structures stabilize around to 1− 2A

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 12: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Two kinds of clustering algorithms

Partitioning

Dividing samples into k groups

Hierarchical

A series of models at differentresolutions

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 13: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

k-means clustering

I GOAL: Given N objects x1, ....xN and a distance metricd(xi , xj), assign the objects into k groups C1, ...,Ck such thatthe average distance to the center of each group is minimized.

I Minimize the objective function:

f =K∑

k=1

N∑i∈Ck

d2(xi − µi )

where µi is the mean position in each cluster i .

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 14: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

k-means clustering: Step 1

Step 1: Randomly pick k cluster centers

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 15: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

k-means clustering: Step 2

Step 2: Calculate distances to each cluster center, and assigngroup membership

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 16: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

k-means clustering: Step 3

Step 3: Re-calculate the locations of the centers of each cluster

**

*Step 4: Repeat until no re-assignments occur, or until somedesired tolerance.

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 17: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

k-centers (k-medoids)

I A variant of k-means: instead of means, use an exemplar datapoint as the center of each cluster

Example: ”mean” protein atom coordinates are usually unphysicalVincent Voelz Introduction to Bioalgorithms: Clustering

Page 18: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Important issues with k-means/centers

Q: What if I don’t know the number of clusters ahead oftime?

I Cluster multiple times (there is stochasticity) for multiplevalues of k, and choose the best model based on statisticalcriteria

I Alternatively, covering algorithms exist (clustering variant)that use a distance threshold to decide cluster membership

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 19: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Important issues with k-means/centers

Q: What if I have a huge data set?

I Clustering requires calculating all pairwise distances, soO(N2). Data sets of N > 105 (typically) will not fit incomputer memory!

I One trick is to cluster on a subsample M � N of your data,then assign the rest. Clustering will be O(M2), butassignment is only O(N).

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 20: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Important issues with k-means/centers

Q: How can I visualize clusters if I only have distances?

I Example: objects whose distances are calculated from a set offeatures (genes, patients, metadata)

I Multidimensional scaling (MDS) methods can be used toperform dimensionality reduction, i.e. projecting data to somelow-dimensional subspace so that it can be more easilyvisualized.

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 21: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Multidimensional Scaling (MDS) via distance geometry

~x1

~x0

~x2

I Example: solving an 3-D protein structure using NOE distancerestraints

I Consider N data points in an m-dimensional space

I We only know their distance from each other, not theircoordinates or dimensionality m.

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 22: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

~x1

~x0

~x2d10d20

d12

Pick ~x0 as an arbitrary origin. Then, by the law of cosines,

d2ij = d2

i0 + d2j0 − 2|~xi ||~xj | cos θ (1)

= d2i0 + d2

j0 − 2~xi · ~xj (2)

Therefore,

~xi · ~xj =1

2(d2

i0 + d2j0 − d2

ij )

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 23: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

~x1

~x0

~x2d10d20

d12

To extract ~xi from the distances d2ij , consider the matrix

G = XTX

where X is an N ×m matrix with data points as columns, such thatGij = ~xi · ~xj . Diagonalization of G yields

G = VTΛV = (VTΛ1/2)(Λ1/2V) = XTX

The eigenvector columns of V are the principal components (PC’s) of the

space. We can project to a low-dimensional subspace by using only the

PC’s with the largest eigenvalues λ1/2i .

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 24: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

MDS on all the ’x’ words

50 40 30 20 10 0 10 20PC1

30

20

10

0

10

20

30PC

2 xanthate

xanthates

xanthein

xantheins

xanthene

xanthenes

xanthicxanthin

xanthine

xanthines

xanthinsxanthoma

xanthomas

xanthomata

xanthone

xanthones

xanthous

xebec

xebecs

xenia

xenial

xenias

xenic

xenogamies

xenogamy

xenogenies

xenogeny

xenolith

xenoliths

xenon

xenons

xenophobe

xenophobes

xenophobia

xerarch

xeric

xerosere

xeroseres

xerosesxerosis

xerotic

xerus

xeruses

xi

xiphoid

xiphoids

xis

xu

xylan

xylansxylem xylems

xylene

xylenes

xylic

xylidinxylidine

xylidines

xylidins

xylocarp

xylocarps

xyloid

xylol

xylols

xylophone

xylophones

xylophonistxylophonists

xylose

xyloses

xylotomies

xylotomy

xylyl

xylyls

xyst

xysterxysters

xysti

xystoi

xystos

xysts

xystus

I Two largest PC’s of all the ’x’ words (distance(w penalty=5.0))

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 25: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

MDS on all the ’q’ words

70 60 50 40 30 20 10 0 10PC1

20

10

0

10

20

30PC

2

qaid

qaids

qindarqindars

qintar

qintars

qiviut

qiviuts

qoph

qophsqua

quackquacked

quackeries

quackery

quackingquackish

quackism

quackisms

quacks

quad

quadded

quadding

quadrangle

quadrangles

quadrangular

quadransquadrantquadrantes

quadrants

quadrat

quadrate

quadrated

quadrates

quadrating

quadrats

quadric

quadrics

quadrigaquadrigae

quadrilateral

quadrilaterals

quadrille

quadrilles

quadroonquadroons

quadruped

quadrupedal

quadrupedsquadruple

quadrupled

quadruplesquadruplet

quadruplets

quadrupling

quadsquaere

quaeres

quaestor

quaestorsquaffquaffed

quaffer

quaffers

quaffing

quaffs

quagquaggaquaggas

quaggier

quaggiest

quaggy

quagmire

quagmires

quagmirierquagmiriest

quagmiry

quags

quahaug

quahaugs quahogquahogsquai

quaich

quaiches quaichs

quaigh

quaighsquail

quailed

quailing

quailsquaint

quainter

quaintest

quaintly

quaintness

quaintnesses

quais

quake

quaked

quaker

quakers

quakes

quakier

quakiest

quakily

quaking

quaky

quale

qualia

qualification

qualifications

qualified

qualifier

qualifiers

qualifies

qualify

qualifying

qualitative

qualities

quality

qualm

qualmierqualmiest

qualmish

qualms

qualmy

quamash

quamashesquandang

quandangs

quandaries

quandaryquandong

quandongs

quant

quanta

quantal

quantedquanticquantics

quantified

quantifies

quantify

quantifying

quanting

quantitative

quantities

quantity

quantizequantized

quantizes

quantizing

quantongquantongs

quants

quantum

quarantine

quarantined

quarantines

quarantining

quarequark

quarks

quarrel

quarreled

quarrelingquarrelled

quarrelling

quarrels

quarrelsome

quarried

quarrierquarriers quarries

quarry

quarrying

quartquartan

quartans

quarte

quarter

quarterback

quarterbacked

quarterbacking

quarterbacks

quartered

quartering

quarterlies

quarterly

quartermaster

quartermasters

quartern

quarterns

quarters

quartesquartet

quartets quarticquartics

quartile

quartiles

quarto

quartos

quarts

quartz

quartzes

quasar

quasars quash

quashed

quashes

quashing

quasi

quassquasses

quassiaquassias

quassin

quassinsquatequatorze

quatorzes

quatrain

quatrains

quatre

quatres

quaver

quavered

quaverer

quaverers

quavering

quaversquavery

quay

quayage

quayages

quaylike

quays

quayside

quaysides

quean

queans

queasierqueasiest

queasily

queasinessqueasinesses

queasy

queazier

queaziest

queazy

queenqueened

queeningqueenlier

queenliest

queenlyqueens

queerqueered

queerer

queerest

queeringqueerish

queerly

queerness

queernesses

queers

quell

quelled

queller

quellers

quelling

quellsquench

quenchable

quenched

quencher

quenchers

quenches

quenchingquenchless

quenellequenellesquercine querida

queridas

queried

querierqueriersqueries

querist

querists

quern

querns

querulous

querulously

querulousnessquerulousnesses

query

querying

quest

quested

quester

questers

questing

question

questionable

questioned

questioner

questioners

questioningquestionnaire

questionnaires

questionniare

questionniares

questions

questor

questors

quests

quetzal

quetzales

quetzals

queue

queued

queueing

queuer

queuers

queues

queuing

quey

queys

quezal

quezales

quezalsquibble

quibbled

quibbler

quibblers

quibbles

quibbling

quiche

quiches

quick

quicken

quickened

quickening

quickens

quicker

quickestquickie

quickies

quickly

quickness

quicknesses

quicks

quicksand

quicksands

quickset

quicksetsquicksilver

quicksilvers

quid

quiddities

quiddity

quidnuncquidnuncs

quids

quiescence

quiescences

quiescent

quietquieted

quieten

quietened

quietening

quietens

quieter

quietersquietest

quieting

quietism

quietismsquietist

quietists

quietly

quietness

quietnesses

quiets

quietude

quietudes

quietus

quietusesquiff

quiffs

quillquillai

quillais

quilled

quillet

quillets

quilling

quills

quilt

quilted

quilter

quilters

quilting

quiltings

quilts

quinaries

quinary

quinate

quince

quinces

quincunx

quincunxesquinellaquinellas

quinic

quiniela

quinielas

quinin

quinina

quininas

quinine

quinines quinins

quinnat

quinnats

quinoaquinoas

quinoidquinoids

quinol

quinolinquinolins

quinols

quinonequinones

quinsies

quinsy

quint

quintain

quintains

quintal

quintals

quintan

quintans

quintar

quintars

quintessence

quintessences

quintessential

quintet

quintetsquintic

quintics

quintile

quintiles

quintin

quintins

quints

quintuple

quintupled

quintuplesquintuplet

quintuplets

quintupling

quip

quipped

quippingquippish

quippu

quippus

quipsquipster

quipsters

quipu

quipus

quire

quired

quires

quiring

quirk

quirkedquirkier

quirkiest

quirkily

quirking

quirks

quirky

quirt

quirted

quirting

quirts

quisling

quislings

quitquitch

quitchesquite

quitrent

quitrentsquits

quitted

quitter

quitters

quitting

quittor

quittors

quiver

quivered

quiverer

quiverers

quivering

quiversquivery

quixotequixotes

quixotic

quixotries

quixotryquiz

quizmaster

quizmasters

quizzed

quizzer

quizzers

quizzes

quizzing

quod

quods

quoin

quoined

quoining

quoinsquoit

quoited

quoiting

quoits

quomodo

quomodos

quondam

quorumquorums

quota

quotable

quotably

quotas

quotation

quotations

quotequoted

quoter

quoters

quotes

quothquotha

quotient

quotients

quoting

qurshqurshes

qurush

qurushes

I Two largest PC’s of all the ’q’ words (distance(w penalty=5.0))

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 26: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

MDS on GB1 peptide conformations

I Principal components track overall shape similarities (but notkinetics!)

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 27: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Hierarchical Clustering

divisive agglomerative

I Most efficient algorithms are agglomerative

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 28: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Agglomeration

Various linkage criteria can be used to rank which states toagglomerate

I single linkage (nearest-neighbor)

I complete linkage (furthest-neighbor)

I average linkage

I centroid linkage

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 29: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Examples of clusteringDistance metricsClustering algorithms

Dendrograms

A dendrogram (”branch” plot) allows you to visualize the completehierarchy of clusterings

16661101311283410714189361337011912609122742013111714135738410410511243116140385832685357939922312172130971154849792912312659103108831141491244412981323564139102131981115610069374619134788750312580138154512096102540511459054674996117218255214252148869551097610626391471847332871421131467513763881012427611226530418194118012777713685143621440

20

40

60

80

100

120

140

160

RMSD Ward hierarchical clustering

from scipy import cluster

linkage = cluster.hierarchy.ward(distances)

plt.figure(figsize=(12,6))

plt.title(’RMSD Ward hierarchical clustering’)

cluster.hierarchy.dendrogram(linkage, no_labels=False, count_sort=’descendent’)

plt.savefig(’gb1_dendrogram.pdf’)

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 30: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Eigenfaces

Classification

GOAL: Given some previous learning on training data, can weclassify new data as belonging to the correct group?

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 31: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Eigenfaces

Distance classifiers

Distance classifier examples:

I face recognition

I character/handwriting recognition

Note that there are many other machine learning algorithms forclassification:

I Support Vector Machines (w/ kernel trick)

I Naive Bayes, Logistic Regression

I Neural Networks (e.g. Google Deep Dream, PSIPRED)

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 32: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Eigenfaces

Eigenfaces

Images are just arrays of numbers, i.e. m-dimensional vectors ~x .

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 33: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Eigenfaces

Eigenfaces

Steps:

1. Perform MDS (e.g.Principal ComponentAnalysis) on training data

2. Assign new data pointsusing the previously-learnedcluster centers

PC1

PC3

PC2

Vincent Voelz Introduction to Bioalgorithms: Clustering

Page 34: Introduction to Bioalgorithms: ClusteringI Clustering requires calculating all pairwise distances, so O(N2). Data sets of N >105 (typically) will not t in computer memory! I One trick

ClusteringUsing clusters for classification

Eigenfaces

Character recognition

Linear Discriminant Analysis (LDA) of handwritten numerals

http://scikit-learn.org/stable/modules/manifold.html

Vincent Voelz Introduction to Bioalgorithms: Clustering