Delbert Dueck Department of Electrical & Computer Engineering University of Toronto July 30, 2008...

Delbert DueckDepartment of Electrical & Computer Engineering

University of Toronto

July 30, 2008

Society for Mathematical Biology Conference

Affinity Propagation: Clustering by Passing Messages Between Data Points

Caravaggio’s “Vocazione di San Matteo” (The Calling of St. Matthew)

An interpretation of affinity propagation by Marc Mézard, Laboratoire de Physique Théorique et Modeles Satistique, Paris

Affinity Propagation: Clustering byPassing Messages Between Data Points

Delbert DueckProbabilistic and Statistical

Inference LabElectrical & Computer Engineering

University of Toronto

July 30, 2008Society for Mathematical Biology Conference

Where is theexemplar?

Exemplar-based clusteringTASK:

INPUTS: A set of real-valued pairwise similarities , {s(i,k)}, between data points and the number of exemplars (K) or a real-valued exemplar cost

OUTPUT: A subset of exemplar data points and an assignment of every other point to an exemplar

OBJECTIVE FUNCTION: Maximize the sum of similarities between data points and their exemplars, minus the exemplar costs

Identify a subset of data points as exemplars and assign every other data

point to one of those exemplars

Exemplar-based ClusteringWhy is this an important problem?

User-specified similarities offer a large amount of flexibility

The clustering algorithm can be uncoupled from the details of how similarities are computed

There is potential for significant improvement on existing algorithms

Greedy Method: k-medians clustering

Randomly chooseinitial exemplars,(data centers)

Assign data points tonearest centers

For each cluster,pick best newcenter

For each cluster,pick best newcenter

Assign data points tonearest centers

Convergence:Final set ofexemplars(centers)

Affinity Propagation

How well does k-mediansclustering work?

Olivetti face database contains 400 greyscale 64×64 images from 40 peopleSimilarity is based on sum-of-squared distance using a central 50×50 pixel window

Small enough problem to find exact solution

Example: Olivetti face images

Olivetti faces: squared error achieved byONE MILLION runs of k-medians clustering

Exact solution(using LP relaxation + days

of computation)

k-medians clustering, one million random restarts

for each k

Number of clusters, k

Squared error


Closing the performance gap:AFFINITY PROPAGATION

AFFINITY PROPAGATIONScience, 16 Feb. 2007

joint work with Brendan Frey

One-sentence summary:All data points are simultaneously

considered as exemplars, but exchangedeterministic messages while a good set

of exemplars gradually emerges.

Affinity Propagation: visualization

Affinity PropagationTASK:

INPUTS:A set of pairwise similarities, {s(i,k)}, where s(i,k) is a real number indicating how well-suited data point k is as an exemplar for data point i

e.g. s(i,k) = −‖xi − xk‖2, i≠k

For each data point k, a real number, s(k,k), indicating the a priori preference that it be chosen as an exemplar

e.g. s(k,k) = p ∀k

Identify a subset of data points as exemplars and assign every other data

point to one of those exemplars

Need notbe metric!

Affinity Propagation: message-passingAffinity propagation can be viewed as data points exchanging messages amongst themselvesIt can be derived as belief propagation (max-product) on a completely-connected factor graph

Sending responsibilities, rCandidateexemplar k

r(i,k)

Data point i

Competingcandidateexemplar k’

a(i,k’)

Sending availabilities, a

Candidateexemplar k

a(i,k)

Data point i

Supportingdata point i’

r(i’,k)

Affinity Propagation: update equationsSending responsibilities

Candidateexemplar k

r(i,k)

Data point i

Competingcandidateexemplar k’

a(i,k’)

Sending availabilities

Candidateexemplar k

a(i,k)

Data point i

Supportingdata point i’

r(i’,k)

Making decisions:

Affinity Propagation: MATLAB code01 N=size(S,1); A=zeros(N,N); R=zeros(N,N); % initialize messages02 S=S+1e-12*randn(N,N)*(max(S(:))-min(S(:))); % remove degeneracies03 lam=0.5; % Set damping factor04 for iter=1:100,05 Rold=R; % NOW COMPUTE RESPONSIBILITIES06 AS=A+S; [Y,I]=max(AS,[],2);07 for i=1:N, AS(i,I(i))=-realmax; end;08 [Y2,I2]=max(AS,[],2);09 R=S-repmat(Y,[1,N]);10 for i=1:N, R(i,I(i))=S(i,I(i))-Y2(i); end;11 R=(1-lam)*R+lam*Rold; % Dampen responsibilities12 Aold=A; % NOW COMPUTE AVAILABILITIES13 Rp=max(R,0); for k=1:N, Rp(k,k)=R(k,k); end;14 A=repmat(sum(Rp,1),[N,1])-Rp;15 dA=diag(A); A=min(A,0); for k=1:N, A(k,k)=dA(k); end;16 A=(1-lam)*A+lam*Aold; % dampen availabilities17 end;18 E=R+A; % pseudomarginals19 I=find(diag(E)>0); K=length(I); % indices of exemplars20 [tmp c]=max(S(:,I),[],2); c(I)=1:K; idx=I(c); % assignments

More code available at www.psi.toronto.edu/affinitypropagation

Recall Olivetti faces: squared error achieved by1 million runs of k-medians clustering


of computation)


for each K

Number of clusters, K

Squared error

Olivetti faces: squared error achieved by Affinity Propagation


of computation)


for each K

Number of clusters, K

Squared error

Affinity propagation,one run, 1000 times faster than 106 k-medians runs

A survey of applications investigated by other researchers and developers

VQ codebook design, Jiang et al., 2007 Image segmentation, Xiao et al., 2007Object classification, Fu et al., 2007Finding light sources using images, An et al., 2007Microarray analysis, Leone et al., 2007Computer network analysis, Code et al., 2007Audio-visual data analysis, Zhang et al., 2007Protein sequence analysis, Wittkop et al., 2007Protein clustering, Lees et al., 2007Analysis of cuticular hydrocarbons, Kent et al., 2007 …


Affinity Propagation:Applications in Bioinformatics

Detecting transcripts (genes) using microarray data(Data from Frey et al., Nature Genetics 2005)

s(segment i, segment k) = Similarity of expression patterns (columns) minus distance between segments in the DNA/genome

s(segment i, garbage) = tunable constant

# segments = 76,000 for chromosome 1

Mousetissues

DNA activityLow High

Position in DNA

…

Segment i Segment k

Mousetissues

DNA activityLow High

Position in DNA

…

Segment i Segment k

False positive rate (%)

Tru

e p

osi

tive

s (%

) R

EF

SE

Q

0 1 2 3 4 5

40

30

20

10

0

Ge

ne

re

con

stru

ctio

n

err

or

Number of clusters (“genes”)

0 2000 4000 6000 8000

-1.8

-2.0

-2.2

-2.4

-2.6

k-medians clustering(10,000 runs)

Affinity propagation

Random guessing

Detecting transcripts (genes) using microarray data(Data from Frey et al., Nature Genetics 2005)

Gene-drug interactions for 1259 drugs on 5985 genesThreshold to binary interaction matrix

GOAL: find small query set of genes on which new drugs could be tested to predict interactions for non-query genesHold out 10% of drugs as test sets(i,k) = #drugs interacting with both gene i and gene j

-5

-4

-3

-2

-1

0

1

2

3

4

5

drugs

yeast g

enes

new drugs

drugs(test set)

drugs(training set)

query setof genes

?????????

???????????????????????????

Application #2: Yeast gene-deletion strains(presented at RECOMB 2008)

10 20 30 40 50 60 70 807

7.2

7.4

7.6

7.8

8

8.2

8.4

8.6x 10

4

K (number of strain representatives)

uti

lity

(inte

ract

ions

cor

rect

ly p

redi

cted

on

trai

ning

dat

a)

0.916 0.918 0.92 0.922 0.924 0.926 0.928 0.93 0.932 0.934 0.9360.42

0.44

0.46

0.48

0.5

0.52

0.54

0.56

0.58

specificity (proportion of non-interactions correctly predicted in test data)

sen

siti

vity

(p

ropo

rtio

n of

inte

ract

ions

cor

rect

ly p

redi

cted

in t

est

data

)

K(number of strain representatives)

net s

imila

rity

(interaction

s correctly

predicted

on training data)

10 20 30 40 50 60 70 807

7.2

7.4

7.6

7.8

8

8.2

8.4

8.6x 10

4


uti

lity

(inte

ractions c

orr

ectly p

redic

ted o

n t

rain

ing d

ata

)

0.916 0.918 0.92 0.922 0.924 0.926 0.928 0.93 0.932 0.934 0.9360.42

0.44

0.46

0.48

0.5

0.52

0.54

0.56

0.58


sen

sit

ivit

y

(p

roport

ion o

f in

tera

ctions c

orr

ectly p

redic

ted in t

est

data

)

affinity propagation

greedy method (best of 10 restarts)

greedy method (best of 100 restarts)greedy method (best of 1000 restarts)



Affinity Propagationk-medians clustering (best of 10 restarts)k-medians clustering (best of 100 restarts)k-medians clustering (best of 1000 restarts)k-medians clustering (best of 10,000 restarts)k-medians clustering (best of 100,000 restarts)

Application #2: Yeast gene-deletion strains

10 20 30 40 50 60 70 807

7.2

7.4

7.6

7.8

8

8.2

8.4

8.6x 10

4


uti

lity

(inte

ract

ions

cor

rect

ly p

redi

cted

on

trai

ning

dat

a)

0.916 0.918 0.92 0.922 0.924 0.926 0.928 0.93 0.932 0.934 0.9360.42

0.44

0.46

0.48

0.5

0.52

0.54

0.56

0.58


sen

siti

vity

(p

ropo

rtio

n of

inte

ract

ions

cor

rect

ly p

redi

cted

in t

est

data

)

specificity(proportion of non-interactions correctly predicted in test data)

sens

itivi

ty(propo

rtion

of interactio

ns correctly

predicted

in te

st data)

10 20 30 40 50 60 70 807

7.2

7.4

7.6

7.8

8

8.2

8.4

8.6x 10

4


uti

lity

(inte

ractions c

orr

ectly p

redic

ted o

n t

rain

ing d

ata

)

0.916 0.918 0.92 0.922 0.924 0.926 0.928 0.93 0.932 0.934 0.9360.42

0.44

0.46

0.48

0.5

0.52

0.54

0.56

0.58


sen

sit

ivit

y

(p

roport

ion o

f in

tera

ctions c

orr

ectly p

redic

ted in t

est

data

)



greedy method (best of 100 restarts)greedy method (best of 1000 restarts)



Affinity Propagationk-medians clustering (best of 10 restarts)k-medians clustering (best of 100 restarts)k-medians clustering (best of 1000 restarts)k-medians clustering (best of 10,000 restarts)k-medians clustering (best of 100,000 restarts)

Application #2: Yeast gene-deletion strains

Some data points are potential treatments, TCorrespond to HIV strain sequences

Other data points are targets, R, (sequence fragments)Correspond to epitopes that immune system responds to

· · ·

Application #3: HIV vaccine design(presented at RECOMB 2008)

· · ·

· · · MGARASVLSGGELDRWEKIRLRPGGKKKYQLKHIVWASRELERF · · ·· · · MGARASVLSGGELDRWEKIRLRPGGKKKYRLKHIVWASRELERF · · ·

MGARASVLS

GARASVLSGARASVLSGG RASVLSGGK

ASVLSGGKLSVLSGGKLDVLSGGKLDK

LSGGKLDKW

SGGKLDKWEGGKLDKWEK

GKLDKWEKI

KLDKWEKIR

LDKWEKIRLDKWEKIRLRKWEKIRLRP

WEKIRLRPGEKIRLRPGGKIRLRPGGKIRLRPGGKKRLRPGGKKKLRPGGKKKYRPGGKKKYKPGGKKKYKLGGKKKYKLKGKKKYKLKH

KKKYKLKHIKKYKLKHIVKYKLKHIVWYKLKHIVWAKLKHIVWAS

LKHIVWASRKHIVWASREHIVWASRELIVWASRELEVWASRELERWASRELERF

RASVLSGGEASVLSGGELSVLSGGELDVLSGGELDRLSGGELDRWSGGELDRWEGGELDRWEKGELDRWEKIELDRWEKIRLDRWEKIRLDRWEKIRLRRWEKIRLRP

RPGGKKKYQ

PGGKKKYQLGGKKKYQLK GKKKYQLKHKKKYQLKHIKKYQLKHIVKYQLKHIVW

YQLKHIVWAQLKHIVWAS

RPGGKKKYR PGGKKKYRLGGKKKYRLKGKKKYRLKH

KKKYRLKHIKKYRLKHIV

KYRLKHIVWYRLKHIVWARLKHIVWAS

· · · MGARASVLSGGKLDKWEKIRLRPGGKKKYKLKHIVWASRELERF · · ·

s(T,R)

Application #3: HIV vaccine designThe net similarity of a vaccine portfolio is its coverage

Fraction of database 9-mers the vaccine containsHighest-possible coverage comes from artificially-constructed strainse.g. Mosaics (Fischer et al., Nature Medicine 2006 )

vaccineportfolio size

Natural strains Artificial Mosaic strains

(upper bound)Affinity Propagation

greedy method(k-medians variant)

K=20 77.54% 77.34% 80.84%

K=30 80.92% 80.14% 82.74%

K=38 82.13% 81.62% 83.64%

K=52 84.19% 83.53% 84.83%

SummaryExemplar-based clustering offers flexibility in choosing similarities between data pointse.g. non-Euclidean, discrete, or non-metric data spaces

Affinity Propagation achieves better clustering solutions than other methodsnumber of exemplars, K, is automatically determinedsimple update equations, easy implementationFAST: # binary scalar operations # input similarities

Many applications in bioinformaticsMicroarray data, yeast gene-deletion strains, HIV vaccine design

AcknowledgementsAffinity Propagation (

www.psi.toronto.edu/affinitypropagation )Brendan J. Frey(Electrical & Computer Engineering, University of Toronto)

Detecting transcripts (genes) using microarray dataTim Hughes + lab(Banting & Best Department of Medical Research, University of Toronto)

Yeast gene-deletion strains:Andrew Emili, Gabe Musso, Guri Giaever(Banting & Best Department of Medical Research, University of Toronto)

HIV vaccine design:Nebojsa Jojic, Vladimir Jojic(Microsoft Research)

Funding for this work provided by:

http://www.psi.toronto.edu/affinitypropagation

http://www.iscb.org/


QUESTIONS?

0 25 50 75 100 125 150

8000

7000

6000

5000

4000

3000

2000

exact solution


VSH (best of 20 restarts)

0 25 50 75 100 125 150

8000

7000

6000

5000

4000

3000

2000

exact solution


VSH (best of 20 restarts) exact solutionaffinity propagationVSH (best of 20 restarts)k-centers clustering(1 million restarts)

squared error

number of clusters (p)(k)

Linear program (exact)

medians

Comparison of affinity propagation, linear programming, the VSH and k-medians clustering

(400 Olivetti face images)

Error and timing comparison of affinity propagation and the VSH

(Results from Brusco & Kohn and Frey & Dueck)

Selecting the “right” number of centers

Preferences influence the number of detected centers

Does affinity propagation find the proper number of centers? Yes.

ITERATION # 0

-5 0 +5-5

0

+5

ITERATION #15

-5 0 +5-5

0

+5

non-exemplar exemplar

Delbert Dueck Department of Electrical & Computer Engineering University of Toronto July 30, 2008...

Documents

Transcript of Delbert Dueck Department of Electrical & Computer Engineering University of Toronto July 30, 2008...