Delbert Dueck Department of Electrical & Computer Engineering University of Toronto July 30, 2008...
-
Upload
amina-ashdown -
Category
Documents
-
view
213 -
download
0
Transcript of Delbert Dueck Department of Electrical & Computer Engineering University of Toronto July 30, 2008...
Delbert DueckDepartment of Electrical & Computer Engineering
University of Toronto
July 30, 2008
Society for Mathematical Biology Conference
Affinity Propagation: Clustering by Passing Messages Between Data Points
Caravaggio’s “Vocazione di San Matteo” (The Calling of St. Matthew)
An interpretation of affinity propagation by Marc Mézard, Laboratoire de Physique Théorique et Modeles Satistique, Paris
Affinity Propagation: Clustering byPassing Messages Between Data Points
Delbert DueckProbabilistic and Statistical
Inference LabElectrical & Computer Engineering
University of Toronto
July 30, 2008Society for Mathematical Biology Conference
Where is theexemplar?
Exemplar-based clusteringTASK:
INPUTS: A set of real-valued pairwise similarities , {s(i,k)}, between data points and the number of exemplars (K) or a real-valued exemplar cost
OUTPUT: A subset of exemplar data points and an assignment of every other point to an exemplar
OBJECTIVE FUNCTION: Maximize the sum of similarities between data points and their exemplars, minus the exemplar costs
Identify a subset of data points as exemplars and assign every other data
point to one of those exemplars
Exemplar-based ClusteringWhy is this an important problem?
User-specified similarities offer a large amount of flexibility
The clustering algorithm can be uncoupled from the details of how similarities are computed
There is potential for significant improvement on existing algorithms
Greedy Method: k-medians clustering
Randomly chooseinitial exemplars,(data centers)
Assign data points tonearest centers
For each cluster,pick best newcenter
For each cluster,pick best newcenter
Assign data points tonearest centers
Convergence:Final set ofexemplars(centers)
Affinity Propagation
How well does k-mediansclustering work?
Olivetti face database contains 400 greyscale 64×64 images from 40 peopleSimilarity is based on sum-of-squared distance using a central 50×50 pixel window
Small enough problem to find exact solution
Example: Olivetti face images
Olivetti faces: squared error achieved byONE MILLION runs of k-medians clustering
Exact solution(using LP relaxation + days
of computation)
k-medians clustering, one million random restarts
for each k
Number of clusters, k
Squared error
Affinity Propagation
Closing the performance gap:AFFINITY PROPAGATION
AFFINITY PROPAGATIONScience, 16 Feb. 2007
joint work with Brendan Frey
One-sentence summary:All data points are simultaneously
considered as exemplars, but exchangedeterministic messages while a good set
of exemplars gradually emerges.
Affinity Propagation: visualization
Affinity Propagation: visualization
Affinity PropagationTASK:
INPUTS:A set of pairwise similarities, {s(i,k)}, where s(i,k) is a real number indicating how well-suited data point k is as an exemplar for data point i
e.g. s(i,k) = −‖xi − xk‖2, i≠k
For each data point k, a real number, s(k,k), indicating the a priori preference that it be chosen as an exemplar
e.g. s(k,k) = p ∀k
Identify a subset of data points as exemplars and assign every other data
point to one of those exemplars
Need notbe metric!
Affinity Propagation: message-passingAffinity propagation can be viewed as data points exchanging messages amongst themselvesIt can be derived as belief propagation (max-product) on a completely-connected factor graph
Sending responsibilities, rCandidateexemplar k
r(i,k)
Data point i
Competingcandidateexemplar k’
a(i,k’)
Sending availabilities, a
Candidateexemplar k
a(i,k)
Data point i
Supportingdata point i’
r(i’,k)
Affinity Propagation: update equationsSending responsibilities
Candidateexemplar k
r(i,k)
Data point i
Competingcandidateexemplar k’
a(i,k’)
Sending availabilities
Candidateexemplar k
a(i,k)
Data point i
Supportingdata point i’
r(i’,k)
Making decisions:
Affinity Propagation: MATLAB code01 N=size(S,1); A=zeros(N,N); R=zeros(N,N); % initialize messages02 S=S+1e-12*randn(N,N)*(max(S(:))-min(S(:))); % remove degeneracies03 lam=0.5; % Set damping factor04 for iter=1:100,05 Rold=R; % NOW COMPUTE RESPONSIBILITIES06 AS=A+S; [Y,I]=max(AS,[],2);07 for i=1:N, AS(i,I(i))=-realmax; end;08 [Y2,I2]=max(AS,[],2);09 R=S-repmat(Y,[1,N]);10 for i=1:N, R(i,I(i))=S(i,I(i))-Y2(i); end;11 R=(1-lam)*R+lam*Rold; % Dampen responsibilities12 Aold=A; % NOW COMPUTE AVAILABILITIES13 Rp=max(R,0); for k=1:N, Rp(k,k)=R(k,k); end;14 A=repmat(sum(Rp,1),[N,1])-Rp;15 dA=diag(A); A=min(A,0); for k=1:N, A(k,k)=dA(k); end;16 A=(1-lam)*A+lam*Aold; % dampen availabilities17 end;18 E=R+A; % pseudomarginals19 I=find(diag(E)>0); K=length(I); % indices of exemplars20 [tmp c]=max(S(:,I),[],2); c(I)=1:K; idx=I(c); % assignments
More code available at www.psi.toronto.edu/affinitypropagation
Recall Olivetti faces: squared error achieved by1 million runs of k-medians clustering
Exact solution(using LP relaxation + days
of computation)
k-medians clustering, one million random restarts
for each K
Number of clusters, K
Squared error
Olivetti faces: squared error achieved by Affinity Propagation
Exact solution(using LP relaxation + days
of computation)
k-medians clustering, one million random restarts
for each K
Number of clusters, K
Squared error
Affinity propagation,one run, 1000 times faster than 106 k-medians runs
A survey of applications investigated by other researchers and developers
VQ codebook design, Jiang et al., 2007 Image segmentation, Xiao et al., 2007Object classification, Fu et al., 2007Finding light sources using images, An et al., 2007Microarray analysis, Leone et al., 2007Computer network analysis, Code et al., 2007Audio-visual data analysis, Zhang et al., 2007Protein sequence analysis, Wittkop et al., 2007Protein clustering, Lees et al., 2007Analysis of cuticular hydrocarbons, Kent et al., 2007 …
Affinity Propagation
Affinity Propagation:Applications in Bioinformatics
Detecting transcripts (genes) using microarray data(Data from Frey et al., Nature Genetics 2005)
s(segment i, segment k) = Similarity of expression patterns (columns) minus distance between segments in the DNA/genome
s(segment i, garbage) = tunable constant
# segments = 76,000 for chromosome 1
Mousetissues
DNA activityLow High
Position in DNA
…
Segment i Segment k
Mousetissues
DNA activityLow High
Position in DNA
…
Segment i Segment k
False positive rate (%)
Tru
e p
osi
tive
s (%
) R
EF
SE
Q
0 1 2 3 4 5
40
30
20
10
0
Ge
ne
re
con
stru
ctio
n
err
or
Number of clusters (“genes”)
0 2000 4000 6000 8000
-1.8
-2.0
-2.2
-2.4
-2.6
k-medians clustering(10,000 runs)
Affinity propagation
Random guessing
Detecting transcripts (genes) using microarray data(Data from Frey et al., Nature Genetics 2005)
Gene-drug interactions for 1259 drugs on 5985 genesThreshold to binary interaction matrix
GOAL: find small query set of genes on which new drugs could be tested to predict interactions for non-query genesHold out 10% of drugs as test sets(i,k) = #drugs interacting with both gene i and gene j
-5
-4
-3
-2
-1
0
1
2
3
4
5
drugs
yeast g
enes
new drugs
drugs(test set)
drugs(training set)
query setof genes
?????????
???????????????????????????
Application #2: Yeast gene-deletion strains(presented at RECOMB 2008)
10 20 30 40 50 60 70 807
7.2
7.4
7.6
7.8
8
8.2
8.4
8.6x 10
4
K (number of strain representatives)
uti
lity
(inte
ract
ions
cor
rect
ly p
redi
cted
on
trai
ning
dat
a)
0.916 0.918 0.92 0.922 0.924 0.926 0.928 0.93 0.932 0.934 0.9360.42
0.44
0.46
0.48
0.5
0.52
0.54
0.56
0.58
specificity (proportion of non-interactions correctly predicted in test data)
sen
siti
vity
(p
ropo
rtio
n of
inte
ract
ions
cor
rect
ly p
redi
cted
in t
est
data
)
K(number of strain representatives)
net s
imila
rity
(interaction
s correctly
predicted
on training data)
10 20 30 40 50 60 70 807
7.2
7.4
7.6
7.8
8
8.2
8.4
8.6x 10
4
K (number of strain representatives)
uti
lity
(inte
ractions c
orr
ectly p
redic
ted o
n t
rain
ing d
ata
)
0.916 0.918 0.92 0.922 0.924 0.926 0.928 0.93 0.932 0.934 0.9360.42
0.44
0.46
0.48
0.5
0.52
0.54
0.56
0.58
specificity (proportion of non-interactions correctly predicted in test data)
sen
sit
ivit
y
(p
roport
ion o
f in
tera
ctions c
orr
ectly p
redic
ted in t
est
data
)
affinity propagation
greedy method (best of 10 restarts)
greedy method (best of 100 restarts)greedy method (best of 1000 restarts)
greedy method (best of 10000 restarts)
greedy method (best of 100000 restarts)
Affinity Propagationk-medians clustering (best of 10 restarts)k-medians clustering (best of 100 restarts)k-medians clustering (best of 1000 restarts)k-medians clustering (best of 10,000 restarts)k-medians clustering (best of 100,000 restarts)
Application #2: Yeast gene-deletion strains
10 20 30 40 50 60 70 807
7.2
7.4
7.6
7.8
8
8.2
8.4
8.6x 10
4
K (number of strain representatives)
uti
lity
(inte
ract
ions
cor
rect
ly p
redi
cted
on
trai
ning
dat
a)
0.916 0.918 0.92 0.922 0.924 0.926 0.928 0.93 0.932 0.934 0.9360.42
0.44
0.46
0.48
0.5
0.52
0.54
0.56
0.58
specificity (proportion of non-interactions correctly predicted in test data)
sen
siti
vity
(p
ropo
rtio
n of
inte
ract
ions
cor
rect
ly p
redi
cted
in t
est
data
)
specificity(proportion of non-interactions correctly predicted in test data)
sens
itivi
ty(propo
rtion
of interactio
ns correctly
predicted
in te
st data)
10 20 30 40 50 60 70 807
7.2
7.4
7.6
7.8
8
8.2
8.4
8.6x 10
4
K (number of strain representatives)
uti
lity
(inte
ractions c
orr
ectly p
redic
ted o
n t
rain
ing d
ata
)
0.916 0.918 0.92 0.922 0.924 0.926 0.928 0.93 0.932 0.934 0.9360.42
0.44
0.46
0.48
0.5
0.52
0.54
0.56
0.58
specificity (proportion of non-interactions correctly predicted in test data)
sen
sit
ivit
y
(p
roport
ion o
f in
tera
ctions c
orr
ectly p
redic
ted in t
est
data
)
affinity propagation
greedy method (best of 10 restarts)
greedy method (best of 100 restarts)greedy method (best of 1000 restarts)
greedy method (best of 10000 restarts)
greedy method (best of 100000 restarts)
Affinity Propagationk-medians clustering (best of 10 restarts)k-medians clustering (best of 100 restarts)k-medians clustering (best of 1000 restarts)k-medians clustering (best of 10,000 restarts)k-medians clustering (best of 100,000 restarts)
Application #2: Yeast gene-deletion strains
Some data points are potential treatments, TCorrespond to HIV strain sequences
Other data points are targets, R, (sequence fragments)Correspond to epitopes that immune system responds to
· · ·
Application #3: HIV vaccine design(presented at RECOMB 2008)
· · ·
· · · MGARASVLSGGELDRWEKIRLRPGGKKKYQLKHIVWASRELERF · · ·· · · MGARASVLSGGELDRWEKIRLRPGGKKKYRLKHIVWASRELERF · · ·
MGARASVLS
GARASVLSGARASVLSGG RASVLSGGK
ASVLSGGKLSVLSGGKLDVLSGGKLDK
LSGGKLDKW
SGGKLDKWEGGKLDKWEK
GKLDKWEKI
KLDKWEKIR
LDKWEKIRLDKWEKIRLRKWEKIRLRP
WEKIRLRPGEKIRLRPGGKIRLRPGGKIRLRPGGKKRLRPGGKKKLRPGGKKKYRPGGKKKYKPGGKKKYKLGGKKKYKLKGKKKYKLKH
KKKYKLKHIKKYKLKHIVKYKLKHIVWYKLKHIVWAKLKHIVWAS
LKHIVWASRKHIVWASREHIVWASRELIVWASRELEVWASRELERWASRELERF
RASVLSGGEASVLSGGELSVLSGGELDVLSGGELDRLSGGELDRWSGGELDRWEGGELDRWEKGELDRWEKIELDRWEKIRLDRWEKIRLDRWEKIRLRRWEKIRLRP
RPGGKKKYQ
PGGKKKYQLGGKKKYQLK GKKKYQLKHKKKYQLKHIKKYQLKHIVKYQLKHIVW
YQLKHIVWAQLKHIVWAS
RPGGKKKYR PGGKKKYRLGGKKKYRLKGKKKYRLKH
KKKYRLKHIKKYRLKHIV
KYRLKHIVWYRLKHIVWARLKHIVWAS
· · · MGARASVLSGGKLDKWEKIRLRPGGKKKYKLKHIVWASRELERF · · ·
s(T,R)
Application #3: HIV vaccine designThe net similarity of a vaccine portfolio is its coverage
Fraction of database 9-mers the vaccine containsHighest-possible coverage comes from artificially-constructed strainse.g. Mosaics (Fischer et al., Nature Medicine 2006 )
vaccineportfolio size
Natural strains Artificial Mosaic strains
(upper bound)Affinity Propagation
greedy method(k-medians variant)
K=20 77.54% 77.34% 80.84%
K=30 80.92% 80.14% 82.74%
K=38 82.13% 81.62% 83.64%
K=52 84.19% 83.53% 84.83%
SummaryExemplar-based clustering offers flexibility in choosing similarities between data pointse.g. non-Euclidean, discrete, or non-metric data spaces
Affinity Propagation achieves better clustering solutions than other methodsnumber of exemplars, K, is automatically determinedsimple update equations, easy implementationFAST: # binary scalar operations # input similarities
Many applications in bioinformaticsMicroarray data, yeast gene-deletion strains, HIV vaccine design
AcknowledgementsAffinity Propagation (
www.psi.toronto.edu/affinitypropagation )Brendan J. Frey(Electrical & Computer Engineering, University of Toronto)
Detecting transcripts (genes) using microarray dataTim Hughes + lab(Banting & Best Department of Medical Research, University of Toronto)
Yeast gene-deletion strains:Andrew Emili, Gabe Musso, Guri Giaever(Banting & Best Department of Medical Research, University of Toronto)
HIV vaccine design:Nebojsa Jojic, Vladimir Jojic(Microsoft Research)
Funding for this work provided by:
Affinity Propagation
QUESTIONS?
Affinity Propagation
0 25 50 75 100 125 150
8000
7000
6000
5000
4000
3000
2000
exact solution
affinity propagation
VSH (best of 20 restarts)
0 25 50 75 100 125 150
8000
7000
6000
5000
4000
3000
2000
exact solution
affinity propagation
VSH (best of 20 restarts) exact solutionaffinity propagationVSH (best of 20 restarts)k-centers clustering(1 million restarts)
squared error
number of clusters (p)(k)
Linear program (exact)
medians
Comparison of affinity propagation, linear programming, the VSH and k-medians clustering
(400 Olivetti face images)
Error and timing comparison of affinity propagation and the VSH
(Results from Brusco & Kohn and Frey & Dueck)
Selecting the “right” number of centers
Preferences influence the number of detected centers
Does affinity propagation find the proper number of centers? Yes.
ITERATION # 0
-5 0 +5-5
0
+5
ITERATION #15
-5 0 +5-5
0
+5
non-exemplar exemplar