Delbert Dueck Department of Electrical & Computer Engineering University of Toronto July 30, 2008 Society for Mathematical Biology Conference Affinity Propagation: Clustering by Passing Messages Between Data Points

Page 1:

Delbert Dueck

Department of Electrical & Computer Engineering

University of Toronto

July 30, 2008

Society for Mathematical Biology Conference

Affinity Propagation: Clustering by Passing Messages Between Data Points

Page 2:

Caravaggio’s “Vocazione di San Matteo” (The Calling of St. Matthew)

An interpretation of affinity propagation by Marc Mézard, Laboratoire de Physique Théorique et Modèles Statistiques, Paris

Where is the exemplar?

Affinity Propagation: Clustering by Passing Messages Between Data Points

Delbert Dueck

Probabilistic and Statistical Inference Lab

Electrical & Computer Engineering

University of Toronto

July 30, 2008

Society for Mathematical Biology Conference

Page 3:

Exemplar-based clustering

TASK: Identify a subset of data points as exemplars and assign every other data point to one of those exemplars

INPUTS: A set of real-valued pairwise similarities, {s(i,k)}, between data points, and either the number of exemplars (K) or a real-valued exemplar cost

OUTPUT: A subset of exemplar data points and an assignment of every other point to an exemplar

OBJECTIVE FUNCTION: Maximize the sum of similarities between data points and their exemplars, minus the exemplar costs
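The objective function can be written out as a short Python sketch. The function name `net_similarity`, the list-based assignment format, and the single shared `exemplar_cost` are assumptions made for illustration; the slides only state the objective in words.

```python
def net_similarity(S, exemplar_of, exemplar_cost=0.0):
    """Objective for exemplar-based clustering.

    S[i][k] is the similarity s(i,k); exemplar_of[i] is the index of the
    exemplar chosen for point i (an exemplar points to itself).
    """
    exemplars = set(exemplar_of)
    for k in exemplars:
        if exemplar_of[k] != k:
            raise ValueError(f"point {k} is used as an exemplar but is not its own exemplar")
    # Sum of similarities between non-exemplar points and their exemplars...
    total = sum(S[i][k] for i, k in enumerate(exemplar_of) if i != k)
    # ...minus one cost per exemplar (assumed identical for all exemplars).
    return total - exemplar_cost * len(exemplars)


# Toy similarities: negative squared distances between the 1-D points 0, 1, 3.
S = [[0, -1, -9],
     [-1, 0, -4],
     [-9, -4, 0]]
print(net_similarity(S, [1, 1, 1], exemplar_cost=2.0))  # -1 - 4 - 2 = -7.0
```

A larger (less negative) value means a better choice of exemplars under the given similarities and cost.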

Page 4:

Exemplar-based Clustering

Why is this an important problem?

User-specified similarities offer a large amount of flexibility

The clustering algorithm can be uncoupled from the details of how similarities are computed

There is potential for significant improvement on existing algorithms

Page 5:

Greedy Method: k-medians clustering

Randomly choose initial exemplars (data centers)

Assign data points to nearest centers

For each cluster, pick best new center

Assign data points to nearest centers

Convergence: Final set of exemplars (centers)
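The greedy loop above can be sketched in Python as follows. This is an illustrative version working directly on a similarity matrix; the function name `k_medians`, the tie-breaking rules, and the `max_iter` cap are assumptions, not details taken from the talk.

```python
import random


def k_medians(S, k, rng, max_iter=100):
    """Greedy exemplar search on a similarity matrix S (larger = more similar)."""
    n = len(S)
    exemplars = rng.sample(range(n), k)  # randomly choose initial exemplars
    for _ in range(max_iter):
        # Assign every point to its most similar exemplar.
        clusters = {e: [e] for e in exemplars}
        for i in range(n):
            if i not in clusters:
                nearest = max(exemplars, key=lambda e: S[i][e])
                clusters[nearest].append(i)
        # For each cluster, pick the member that best represents the others.
        new_exemplars = [
            max(members, key=lambda c: sum(S[i][c] for i in members if i != c))
            for members in clusters.values()
        ]
        if set(new_exemplars) == set(exemplars):
            break  # convergence: the exemplar set no longer changes
        exemplars = new_exemplars
    assignment = [max(exemplars, key=lambda e: S[i][e]) for i in range(n)]
    for e in exemplars:
        assignment[e] = e
    return sorted(exemplars), assignment


# Two well-separated groups of 1-D points; similarity = negative squared distance.
points = [0, 1, 2, 10, 11, 12]
S = [[-(a - b) ** 2 for b in points] for a in points]
exemplars, assignment = k_medians(S, k=2, rng=random.Random(0))
print(exemplars, assignment)
```

On harder data the result depends on the random initialization, which is why the method is usually restarted many times and the best run kept.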

Page 6:

Affinity Propagation

How well does k-medians clustering work?

Page 7:

Example: Olivetti face images

Olivetti face database contains 400 greyscale 64×64 images from 40 people

Similarity is based on sum-of-squared distance using a central 50×50 pixel window

Small enough problem to find exact solution

Page 8:

Olivetti faces: squared error achieved by ONE MILLION runs of k-medians clustering

[Figure: squared error versus number of clusters, k. Curves: exact solution (using LP relaxation + days of computation); k-medians clustering, one million random restarts for each k.]

Page 9:

Affinity Propagation

Closing the performance gap: AFFINITY PROPAGATION

Page 10:

AFFINITY PROPAGATION

Science, 16 Feb. 2007

joint work with Brendan Frey

One-sentence summary: All data points are simultaneously considered as exemplars, but exchange deterministic messages while a good set of exemplars gradually emerges.

Page 11:

Affinity Propagation: visualization

Page 12:

Affinity Propagation: visualization

Page 13:

Affinity Propagation

TASK: Identify a subset of data points as exemplars and assign every other data point to one of those exemplars

INPUTS: A set of pairwise similarities, {s(i,k)}, where s(i,k) is a real number indicating how well-suited data point k is as an exemplar for data point i

e.g. s(i,k) = −‖x_i − x_k‖², i≠k

Need not be metric!

For each data point k, a real number, s(k,k), indicating the a priori preference that it be chosen as an exemplar

e.g. s(k,k) = p ∀k
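As a small Python sketch of these inputs: the helper below fills the off-diagonal entries with negative squared Euclidean distances and puts a shared preference p on the diagonal. The function name is hypothetical, and defaulting p to the median of the off-diagonal similarities is an assumption borrowed from the published paper rather than from this slide.

```python
import statistics


def similarity_matrix(points, preference=None):
    """Build s(i,k) = -||x_i - x_k||^2 for i != k and s(k,k) = p."""
    n = len(points)
    S = [[-sum((a - b) ** 2 for a, b in zip(points[i], points[k]))
          for k in range(n)] for i in range(n)]
    if preference is None:
        # Assumed default: the median of all off-diagonal similarities.
        preference = statistics.median(
            S[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        S[k][k] = preference  # shared a priori preference p
    return S


S = similarity_matrix([(0, 0), (3, 4), (6, 8)])
print(S[0][1], S[0][2], S[0][0])
```

Any other real-valued similarity can be substituted for the distance-based one, since the similarities need not come from a metric.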

Page 14:

Affinity Propagation: message-passing

Affinity propagation can be viewed as data points exchanging messages amongst themselves

It can be derived as belief propagation (max-product) on a completely-connected factor graph

[Diagram, sending responsibilities, r: data point i sends r(i,k) to candidate exemplar k, using the availabilities a(i,k’) received from competing candidate exemplars k’.]

[Diagram, sending availabilities, a: candidate exemplar k sends a(i,k) to data point i, using the responsibilities r(i’,k) received from supporting data points i’.]

Page 15:

Affinity Propagation: update equations

Sending responsibilities: data point i sends r(i,k) to candidate exemplar k, using the availabilities a(i,k’) from competing candidate exemplars k’

Sending availabilities: candidate exemplar k sends a(i,k) to data point i, using the responsibilities r(i’,k) from supporting data points i’

Making decisions:
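The equation images on this slide did not survive extraction. The block below is a reconstruction, assumed from the published affinity propagation paper and checked against the MATLAB listing on the next page; it is not a copy of the slide itself.

```latex
% Responsibilities (data point i to candidate exemplar k)
r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \bigl\{ a(i,k') + s(i,k') \bigr\}

% Availabilities (candidate exemplar k to data point i), for i \neq k
a(i,k) \leftarrow \min\Bigl\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\} \Bigr\}

% Self-availability
a(k,k) \leftarrow \sum_{i' \neq k} \max\{0, r(i',k)\}

% Making decisions: exemplar chosen for data point i
c_i = \arg\max_{k} \bigl[ a(i,k) + r(i,k) \bigr]
```

In the MATLAB listing the decision step appears in a slightly different form: point k is declared an exemplar when r(k,k) + a(k,k) > 0, and every other point is then assigned to its most similar exemplar.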

Page 16:

Affinity Propagation: MATLAB code

N=size(S,1); A=zeros(N,N); R=zeros(N,N); % initialize messages
S=S+1e-12*randn(N,N)*(max(S(:))-min(S(:))); % remove degeneracies
lam=0.5; % set damping factor
for iter=1:100,
    Rold=R; % NOW COMPUTE RESPONSIBILITIES
    AS=A+S; [Y,I]=max(AS,[],2);
    for i=1:N, AS(i,I(i))=-realmax; end;
    [Y2,I2]=max(AS,[],2);
    R=S-repmat(Y,[1,N]);
    for i=1:N, R(i,I(i))=S(i,I(i))-Y2(i); end;
    R=(1-lam)*R+lam*Rold; % dampen responsibilities
    Aold=A; % NOW COMPUTE AVAILABILITIES
    Rp=max(R,0); for k=1:N, Rp(k,k)=R(k,k); end;
    A=repmat(sum(Rp,1),[N,1])-Rp;
    dA=diag(A); A=min(A,0); for k=1:N, A(k,k)=dA(k); end;
    A=(1-lam)*A+lam*Aold; % dampen availabilities
end;
E=R+A; % pseudomarginals
I=find(diag(E)>0); K=length(I); % indices of exemplars
[tmp c]=max(S(:,I),[],2); c(I)=1:K; idx=I(c); % assignments

More code available at www.psi.toronto.edu/affinitypropagation
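For readers without MATLAB, here is a rough pure-Python port of the listing above. It is a sketch that follows the same steps (noise to break ties, damped responsibility and availability updates, exemplars read off the diagonal of R+A); the function name, the `seed` argument, and the toy data are illustrative assumptions, and it is written for clarity rather than speed.

```python
import random
import statistics


def affinity_propagation(S, damping=0.5, iterations=100, seed=0):
    """Return (exemplars, assignment) for an N x N similarity matrix S.

    S[k][k] holds the preference of point k. Requires N >= 2.
    """
    n = len(S)
    rng = random.Random(seed)
    span = max(map(max, S)) - min(map(min, S))
    # Tiny noise removes degeneracies (ties), as in the MATLAB listing.
    S = [[s + 1e-12 * span * rng.gauss(0.0, 1.0) for s in row] for row in S]
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i,k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i,k)
    for _ in range(iterations):
        # Responsibilities: s(i,k) minus the best competing a(i,k') + s(i,k').
        for i in range(n):
            AS = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=AS.__getitem__)
            first = AS[best]
            second = max(AS[k] for k in range(n) if k != best)
            for k in range(n):
                new = S[i][k] - (second if k == best else first)
                R[i][k] = (1 - damping) * new + damping * R[i][k]
        # Availabilities: r(k,k) plus positive responsibilities from other points.
        for k in range(n):
            pos = [max(R[i][k], 0.0) for i in range(n)]
            pos[k] = R[k][k]
            total = sum(pos)
            for i in range(n):
                new = total - pos[i]
                if i != k:
                    new = min(new, 0.0)  # only a(k,k) is left unclipped
                A[i][k] = (1 - damping) * new + damping * A[i][k]
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], []  # no exemplar emerged within the iteration budget
    assignment = [max(exemplars, key=lambda k: S[i][k]) for i in range(n)]
    for k in exemplars:
        assignment[k] = k
    return exemplars, assignment


# Toy data: two groups of 1-D points, preference = median similarity (assumed).
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
n = len(points)
S = [[-(points[i] - points[k]) ** 2 for k in range(n)] for i in range(n)]
preference = statistics.median(S[i][k] for i in range(n) for k in range(n) if i != k)
for k in range(n):
    S[k][k] = preference
exemplars, assignment = affinity_propagation(S)
print(exemplars, assignment)
```

Because each availability update only needs column sums and each responsibility update only needs the two largest entries of a row, one iteration costs on the order of N² scalar operations for a dense similarity matrix.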

Page 17:

Recall Olivetti faces: squared error achieved by 1 million runs of k-medians clustering

[Figure: squared error versus number of clusters, K. Curves: exact solution (using LP relaxation + days of computation); k-medians clustering, one million random restarts for each K.]

Page 18:

Olivetti faces: squared error achieved by Affinity Propagation

[Figure: squared error versus number of clusters, K. Curves: exact solution (using LP relaxation + days of computation); k-medians clustering, one million random restarts for each K; affinity propagation, one run, 1000 times faster than 10^6 k-medians runs.]

Page 19:

A survey of applications investigated by other researchers and developers

VQ codebook design, Jiang et al., 2007
Image segmentation, Xiao et al., 2007
Object classification, Fu et al., 2007
Finding light sources using images, An et al., 2007
Microarray analysis, Leone et al., 2007
Computer network analysis, Code et al., 2007
Audio-visual data analysis, Zhang et al., 2007
Protein sequence analysis, Wittkop et al., 2007
Protein clustering, Lees et al., 2007
Analysis of cuticular hydrocarbons, Kent et al., 2007
…

Page 20:

Affinity Propagation

Affinity Propagation: Applications in Bioinformatics

Page 21:

Detecting transcripts (genes) using microarray data (Data from Frey et al., Nature Genetics 2005)

s(segment i, segment k) = Similarity of expression patterns (columns) minus distance between segments in the DNA/genome

s(segment i, garbage) = tunable constant

# segments = 76,000 for chromosome 1

[Figure: DNA activity (low to high) across mouse tissues versus position in DNA, with segment i and segment k marked.]

Page 22:

Detecting transcripts (genes) using microarray data (Data from Frey et al., Nature Genetics 2005)

[Figure: DNA activity (low to high) across mouse tissues versus position in DNA, with segment i and segment k marked.]

[Plots: true positives (%) against REFSEQ versus false positive rate (%); gene reconstruction error versus number of clusters (“genes”). Legend: k-medians clustering (10,000 runs); affinity propagation; random guessing.]

Page 23:

Application #2: Yeast gene-deletion strains (presented at RECOMB 2008)

Gene-drug interactions for 1259 drugs on 5985 genes

Threshold to binary interaction matrix

GOAL: find small query set of genes on which new drugs could be tested to predict interactions for non-query genes

Hold out 10% of drugs as test set

s(i,k) = # drugs interacting with both gene i and gene k

[Figure: matrix of yeast genes versus drugs (color scale −5 to 5), with drugs split into a training set and a test set; for new drugs, entries outside the query set of genes are shown as “?”.]
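The gene-gene similarity used for this application, the number of drugs interacting with both genes, can be sketched in Python as below. The function name and the 0/1 list-of-lists layout are assumptions for illustration; the thresholding step that produces the binary matrix is not specified on the slide and is taken as already done.

```python
def co_interaction_similarity(X):
    """X[g][d] is 1 if gene g interacts with drug d, else 0.

    Returns S with S[i][k] = number of drugs interacting with both gene i
    and gene k.
    """
    n = len(X)
    return [[sum(a * b for a, b in zip(X[i], X[k])) for k in range(n)]
            for i in range(n)]


# Three genes (rows) tested against four training drugs (columns).
X = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
S = co_interaction_similarity(X)
print(S[0][1], S[0][2])  # 2 0
```

Before clustering, the diagonal entries would be overwritten with exemplar preferences; how those were chosen for this application is not stated on the slide.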

Page 24:

Application #2: Yeast gene-deletion strains

[Figure, two panels. Left: net similarity (utility: interactions correctly predicted on training data) versus K (number of strain representatives). Right: sensitivity (proportion of interactions correctly predicted in test data) versus specificity (proportion of non-interactions correctly predicted in test data). Legend: affinity propagation; k-medians clustering (greedy method), best of 10, 100, 1000, 10,000, and 100,000 restarts.]

Page 25:

Application #2: Yeast gene-deletion strains

[Figure, two panels. Left: net similarity (utility: interactions correctly predicted on training data) versus K (number of strain representatives). Right: sensitivity (proportion of interactions correctly predicted in test data) versus specificity (proportion of non-interactions correctly predicted in test data). Legend: affinity propagation; k-medians clustering (greedy method), best of 10, 100, 1000, 10,000, and 100,000 restarts.]

Page 26:

Application #3: HIV vaccine design (presented at RECOMB 2008)

Some data points are potential treatments, T
Correspond to HIV strain sequences

Other data points are targets, R (sequence fragments)
Correspond to epitopes that immune system responds to

· · · MGARASVLSGGELDRWEKIRLRPGGKKKYQLKHIVWASRELERF · · ·
· · · MGARASVLSGGELDRWEKIRLRPGGKKKYRLKHIVWASRELERF · · ·
· · · MGARASVLSGGKLDKWEKIRLRPGGKKKYKLKHIVWASRELERF · · ·

[Figure: each strain sequence (T) is linked by similarities s(T,R) to the overlapping 9-mer fragments (R) it contains, e.g. MGARASVLS, GARASVLSG, ARASVLSGG, RASVLSGGK, …]

Page 27:

Application #3: HIV vaccine design

The net similarity of a vaccine portfolio is its coverage
Fraction of database 9-mers the vaccine contains

Highest-possible coverage comes from artificially-constructed strains
e.g. Mosaics (Fischer et al., Nature Medicine 2006)

Vaccine          Natural strains:       Natural strains:             Artificial Mosaic strains
portfolio size   Affinity Propagation   greedy method                (upper bound)
                                        (k-medians variant)
K=20             77.54%                 77.34%                       80.84%
K=30             80.92%                 80.14%                       82.74%
K=38             82.13%                 81.62%                       83.64%
K=52             84.19%                 83.53%                       84.83%
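Coverage as defined above, the fraction of database 9-mers that the vaccine contains, can be sketched in Python as follows. The function names are hypothetical, and counting every 9-mer occurrence in the database (rather than each distinct 9-mer once) is an assumption; the slide does not say which convention was used.

```python
def k_mers(sequence, k=9):
    """All overlapping length-k fragments of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def coverage(portfolio, database, k=9):
    """Fraction of database k-mer occurrences that appear in the portfolio."""
    covered = {m for strain in portfolio for m in k_mers(strain, k)}
    targets = [m for strain in database for m in k_mers(strain, k)]
    return sum(m in covered for m in targets) / len(targets)


# The three strain fragments shown on the previous page act as the database.
database = [
    "MGARASVLSGGELDRWEKIRLRPGGKKKYQLKHIVWASRELERF",
    "MGARASVLSGGELDRWEKIRLRPGGKKKYRLKHIVWASRELERF",
    "MGARASVLSGGKLDKWEKIRLRPGGKKKYKLKHIVWASRELERF",
]
# Coverage of a one-strain "portfolio" consisting of the first sequence.
print(coverage(database[:1], database))
```

Selecting the K strains that maximize this quantity is the exemplar-selection problem that the table compares across methods.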

Page 28:

Summary

Exemplar-based clustering offers flexibility in choosing similarities between data points
e.g. non-Euclidean, discrete, or non-metric data spaces

Affinity Propagation achieves better clustering solutions than other methods
number of exemplars, K, is automatically determined
simple update equations, easy implementation
FAST: # binary scalar operations ∝ # input similarities

Many applications in bioinformatics
Microarray data, yeast gene-deletion strains, HIV vaccine design

Page 29:

Acknowledgements

Affinity Propagation (www.psi.toronto.edu/affinitypropagation):
Brendan J. Frey (Electrical & Computer Engineering, University of Toronto)

Detecting transcripts (genes) using microarray data:
Tim Hughes + lab (Banting & Best Department of Medical Research, University of Toronto)

Yeast gene-deletion strains:
Andrew Emili, Gabe Musso, Guri Giaever (Banting & Best Department of Medical Research, University of Toronto)

HIV vaccine design:
Nebojsa Jojic, Vladimir Jojic (Microsoft Research)

Funding for this work provided by:

Page 30:

Affinity Propagation

QUESTIONS?

Page 31:

Affinity Propagation

Page 32:

Comparison of affinity propagation, linear programming, the VSH and k-medians clustering (400 Olivetti face images)

[Figure: squared error versus number of clusters (k). Curves: exact solution (linear program); affinity propagation; VSH (best of 20 restarts); k-medians clustering (1 million restarts).]

Page 33:

Error and timing comparison of affinity propagation and the VSH

(Results from Brusco & Kohn and Frey & Dueck)

Page 34:

Selecting the “right” number of centers

Preferences influence the number of detected centers

Does affinity propagation find the proper number of centers? Yes.

[Figure: data points on axes from −5 to +5, shown at ITERATION #0 and ITERATION #15, each marked as non-exemplar or exemplar.]