Post on 13-Apr-2018
Institute of Bioinformatics
Johannes Kepler University Linz
Introduction to apcluster
Ulrich Bodenhoferbodenhofer@bioinf.jku.at
Webinar “Introduction to apcluster”, June 13, 2013 2
Outline
1. Introduction to affinity propagation (AP) clustering
2. The apcluster package, its algorithms, and visualiza-tion tools
3. Live apcluster demonstration
4. Question and Answer period
Webinar “Introduction to apcluster”, June 13, 2013 3
Affinity Propagation (AP) Clustering
Affinity propagation (AP) is an emergingnew clustering technique with increasingimportance in many fields (in particular,bioinformatics).
AP uses pairwise similarities as inputsand iteratively determines clusters alongwith samples that are representative forthe clusters, so-called exemplars.
AP is based on an iterative message pass-ing scheme in which samples compete forbecoming exemplars.
0.2 0.4 0.6 0.8
0.2
0.4
0.6
0.8
1.0
●
●
●●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
●
●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●●
●
● ●
●
●●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●
●●●
●●
●
●
●
●
●
●
●
●
●●●
●
●
●●
● ●
●
●
●
●
●
●
●
●●
●
●
●
●
●
● ●
●
●●
●
●●
●
●
●
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
160
159
158
157
156
155
154
153
152
151
150
149
148
147
146
145
144
143
142
141
140
139
138
137
136
135
134
133
132
131
130
129
128
127
126
125
124
123
122
121
120
119
118
117
116
115
114
113
112
111
110
109
108
107
106
105
104
103
102
101
100
99
98
97
96
95
94
93
92
91
90
89
88
87
86
85
84
83
82
81
80
79
78
77
76
75
74
73
72
71
70
69
68
67
66
65
64
63
62
61
60
59
58
57
56
55
54
53
52
51
210
209
208
207
206
205
204
203
202
201
200
199
198
197
196
195
194
193
192
191
190
189
188
187
186
185
184
183
182
181
180
179
178
177
176
175
174
173
172
171
170
169
168
167
166
165
164
163
162
161
50
49
48
47
46
45
44
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
Webinar “Introduction to apcluster”, June 13, 2013 4
Affinity Propagation:What’s Special About It?
Efficiently finds approximate exemplars (to find an optimal so-lution is actually NP-hard).
Deterministic, initialization-independent algorithm with verysimple update rules (can be viewed as max-sum algorithm in afactor graph).
Published in Science:
B. J. Frey and D. Dueck. Clustering by passing mes-sages between data points. Science, 315:972–976,2007. DOI: 10.1126/science.1136800.
Webinar “Introduction to apcluster”, June 13, 2013 5
Affinity Propagation: Message Matrices
Responsibility matrix R: R(i, k) is sent from i to k and corre-sponds to the accumulated evidence for how well-suited k isto serve as the exemplar for i (taking into account other possi-ble exemplars for i).
Availability matrix A: A(i, k) is sent from k to i and correspondsto the accumulated evidence for how appropriate it is for i tochoose k as its exemplar (taking into account other points thatwould choose k as exemplar).
Webinar “Introduction to apcluster”, June 13, 2013 6
Message Passing / Update Rules
Repeatedly perform the following updates (A is initialized with zeros):
For i, k ∈ {1, . . . , l}:
R(i, k)← S(i, k)−maxk′ 6=k
(A(i, k′) + S(i, k′)
)For i, k ∈ {1, . . . , l} s.t. i 6= k:
A(i, k)← min(0, R(k, k) +
∑j 6∈{i,k}
max(0, R(j, k)
))For k ∈ {1, . . . , l}:
A(k, k)←∑j 6=k
max(0, R(j, k)
)The “self-similarities” S(k, k) are called input preferences. They determine theindividual tendencies of samples to become exemplars.
Webinar “Introduction to apcluster”, June 13, 2013 7
How Exemplars Emerge: An Example
R A R+A
Webinar “Introduction to apcluster”, June 13, 2013 8
Function apcluster()
apcluster(s, x, p, q, ...)
s . . . quadratic similarity matrix or similarity measure (function)
x . . . data (vector or list; only if s was a function)
p . . . input preference (per sample or one value for all)
q . . . sets input preference to quantile of similarities (q ∈ [0, 1])
The function returns an APResult object that contains the cluster-ing result.
Webinar “Introduction to apcluster”, June 13, 2013 9
Affinity Propagation: Pro’s & Con’s
+ Exemplars are real samples, no hypothetical averages.
+ Only pairwise similarities necessary; the similarity measuresnot even need to satisfy symmetry or the triangle inequality(applicable to all sorts of kernels and correlation measures).
+ Algorithm (almost) deterministic, not sensitive to initialization.
+ Number of clusters need not be pre-specified.
– Adjustment of input preference parameter can be tricky.
– Quadratic similarity matrices required: does not scale to largedata sets.
Webinar “Introduction to apcluster”, June 13, 2013 10
Adjustment of Input Preference /Number of Clusters
The apcluster package further offers:
Function preferenceRange(s) computes input preferencebounds for given similarity matrix s:
Lower bound: 1–2 clusters
Upper bound: as many clusters as samples
Function apclusterK() performs an inverse search for aninput preference that facilitates a desired number of clustersK.
Function aggExCluster() performs agglomerative cluster-ing on top of AP clustering and allows for specifying cut levelswith a desired number of clusters.
Webinar “Introduction to apcluster”, June 13, 2013 11
So How to Scale AP to Large DatasetsThen?
Webinar “Introduction to apcluster”, June 13, 2013 11
So How to Scale AP to Large DatasetsThen?
Sub-sampling?
Webinar “Introduction to apcluster”, June 13, 2013 11
So How to Scale AP to Large DatasetsThen?
Sub-sampling?
Webinar “Introduction to apcluster”, June 13, 2013 11
So How to Scale AP to Large DatasetsThen?
Sub-sampling?
Massive loss ofinformation!
Webinar “Introduction to apcluster”, June 13, 2013 11
So How to Scale AP to Large DatasetsThen?
Sub-sampling?
Massive loss ofinformation!
Reduce columnobjects . . .
Webinar “Introduction to apcluster”, June 13, 2013 11
So How to Scale AP to Large DatasetsThen?
Sub-sampling?
Massive loss ofinformation!
Reduce columnobjects . . .
Webinar “Introduction to apcluster”, June 13, 2013 11
So How to Scale AP to Large DatasetsThen?
Sub-sampling?
Massive loss ofinformation!
Reduce columnobjects . . .
but keep all rowobjects
Webinar “Introduction to apcluster”, June 13, 2013 12
Leveraged Affinity Propagation
Start with a small, but reasonable, sub-sample of columns (po-tential exemplars).
Iteratively repeat AP on such sub-samples, keeping the exem-plars of the previous iteration. Messages are exchanged be-tween the row objects (all) and the potential exemplars/columnobjects (sub-sample).
No need to calculate the whole similarity matrix in advance.Instead, only the similarities with the sub-samples need to becomputed.
Webinar “Introduction to apcluster”, June 13, 2013 13
Function apclusterL()
apclusterL(s, x, frac, sweeps, p, q, ...)
s . . . similarity measure (function)
x . . . data (vector or list)
frac . . . fraction of samples to be considered in each iteration
sweeps . . . number of iterations
p, q . . . analogous to apcluster()
The function returns an APResult object that contains the cluster-ing result.
Webinar “Introduction to apcluster”, June 13, 2013 14
Functions for Visualizing Results
plot(x) with x being an APResult object: performancegraphs for assessing convergence
plot(x) with x being an AggExResult object: dendrogramof agglomerative clustering
plot(x, y) with x being an APResult object and y beingoriginal data: plot data along with cluster structure)
heatmap(x) with x being an APResult or AggExResult ob-ject: plot heatmap (dendrograms and color coding of clustersswitchable)
Webinar “Introduction to apcluster”, June 13, 2013 15
Similarity Measures
The apcluster package provides the following similarity measures:
negDistMat(): negative distances (resp. power thereof), interfaceanalogous to dist()
expSimMat(): generalization of Gauss/Laplace similarity (RBF/La-place kernel)
linSimMat(): Łukasiewicz similarity, interface analogous to dist()
corSimMat(): correlation, interface analogous to cor()
All functions in the apcluster package can also be used with custom-made similarity measures.
Webinar “Introduction to apcluster”, June 13, 2013 16
Similarity Measures: Examples[similarity of (x, y) and (0, 0)]
negDistMat(r=2, method="euclidean") negDistMat(r=1, method="manhattan")
- 2
- 1
0
1
2
- 2
- 1
0
1
2
- 8
- 6
- 4
- 2
0
- 2
- 1
0
1
2
- 2
- 1
0
1
2
- 4
- 3
- 2
- 1
0
Webinar “Introduction to apcluster”, June 13, 2013 16
Similarity Measures: Examples[similarity of (x, y) and (0, 0)]
negDistMat(r=2, method="euclidean") negDistMat(r=1, method="manhattan")
- 2
- 1
0
1
2
- 2
- 1
0
1
2
- 8
- 6
- 4
- 2
0
- 2
- 1
0
1
2
- 2
- 1
0
1
2
- 4
- 3
- 2
- 1
0
expSimMat(r=2, w=1, method="euclidean") expSimMat(r=1, w=1, method="manhattan")
- 2
- 1
0
1
2
- 2
- 1
0
1
2
0.0
0.5
1.0
- 2
- 1
0
1
2
- 2
- 1
0
1
2
0.0
0.5
1.0
Webinar “Introduction to apcluster”, June 13, 2013 16
Similarity Measures: Examples[similarity of (x, y) and (0, 0)]
negDistMat(r=2, method="euclidean") negDistMat(r=1, method="manhattan")
- 2
- 1
0
1
2
- 2
- 1
0
1
2
- 8
- 6
- 4
- 2
0
- 2
- 1
0
1
2
- 2
- 1
0
1
2
- 4
- 3
- 2
- 1
0
expSimMat(r=2, w=1, method="euclidean") expSimMat(r=1, w=1, method="manhattan")
- 2
- 1
0
1
2
- 2
- 1
0
1
2
0.0
0.5
1.0
- 2
- 1
0
1
2
- 2
- 1
0
1
2
0.0
0.5
1.0
linSimMat(w=1, method="euclidean") linSimMat(w=1, method="maximum")
- 2
- 1
0
1
2
- 2
- 1
0
1
2
0.0
0.5
1.0
- 2
- 1
0
1
2
- 2
- 1
0
1
2
0.0
0.5
1.0
Webinar “Introduction to apcluster”, June 13, 2013 17
Example:Two Clusters of Coiled Coil Sequences
ANSTILMV
CPQALVEKHILNRYKAETQ
AMVTLLVCRQKEEGHSTYKQRADQTWLRVKACEFTHLNIVADGKQTRSYLEAFHLMNRTVYDSEHMNTVYAFWLAINRVQTKLGTRAEKLSACIMNQDRTELKCKSVYQALTNAGINRWDELYEKNTADMQYRHAEIQRSTYLAILMTQKREDEVAQKNADGQRKEKLNVKRVQEAETAQRCLGLSKEQRKDQKELWNLV
VKKL.........................VKTLKAQNSELASTANMLREQVAQLKQKV..............NYALKQKVQAL....VQALKKRVQALKARNYAAKQKVQAL..............ALLEAIRLRNKVKAL....VVTLKELHSSTTLENDQLRQKVRQL..............LKLDNYHLENEVARL....LEAVRRKIRSLQEQNYHLENEVARLKKLVMKQLEDKVEELLSKNYHLENEVARL....MKQLEDKVEELLSKNYHLENEVARLKKLVMKQLEDKVEELLSKQYHLENEVARL....MKQLEDKVEELLSKKYHLENEVARL....LPILEDKVEELLSKNYHLENEVARL.......LEDKVEELLSKNYHLENEVARLKKLVVKELEDKVEELLSKNYHLENEVARLKKLV.......VEELLSKNYHLENEVARLKKLV.......VEELLSKNYHLENEVARLKKLL..........MVEKCWKLMDKVVRL....TKLLEDLVYFV..................AAALCGLVYLF..................ICRLCYRVLRH..................IEKLQTKVLELQRKLDNTTAAVQELGRENNLETQHKVLELTAENERLQKKVEQLSREL...TQQKVLELTSDNDRLRKRVEQLSREL...LKSWNETLTSRL..............VKQLKAKVEELKSKLWHLKNKVARLKKKN...LERVAEELNSIAGHL...........VVHLRQRTEELLRCNEQQAAELETCKEQL.......ERDFLKTTLHR...........SLNMVKVLGDLLKNSLTIINGL.......VETLKDVIYHAKELQLELKKKK.......IAYALDTIAENQAKNEHLQKENERLLRDW.......FKNYLTKVAYY.................VLLVLIKLYYD...................CTMLVGATYMLLVDN..............IDEWKSLTYDE...........LEELQAVHEYWRSMNRYS...........
L...N E.EN..L E...LN.....E
abcdefgabcdefgabcdefgabcdefga
DQTFMLVIINPQRVASTKEAHQTEGNKSD
CDFMRSTWALNYEK
AFLIKLGIQSADNE
CHKLIDTEA
HKQTWLMVIDGHNSVTEKL
EHIRYDQNSFILQTVDNEKFSQLIGKLNQTDEAY
EKLRVYDSAH
EGLQSTVRINVYKLEQRSVYENDMQRYLELVIDAKRVIKKL I
L
DVSLIQDVERILDYS...........ITTSISHVESLLDDRL..........IEDKIEEILSKIYHIENEIARIKKLIIEDKIEEILSKIYHIENEIARI....IEDKIEEIESKQKKIENEIARIKKLLQEDKLEETLSKIYHLENEIARV....IEDKIEEILSKIYHIENEI........EGCLIEIL.................FEDALLA...................FSDYLGLM...................IDYLAKKLHDIEEEY..........VPKWISIM..................VTKTIETHTDNI..............VAKRIEALENKI..............INSDFNAI..................LSSNFGAISSVLNDILSRLDKV........LEAINSEL..............IESAINDVHNFS..............IKNEIAAIKQEIAAIEQMI.......IKEEIAAIKDKIAAIKEYI.......IEEEIQAIKEEIAAIKYLI.......MEAEINTLKSKLELT...........LQHNIKTQVSQILRVEVDI..........FLDIWTYN...............MKGLIDEVDQDFTSR...........TAQELDCIKITQQVGVEL........LRNMAIDMGNEIGSQNRQV.......
E..I E..IE....K
I...I I...I I...Idefgabcdefgabcdefgabcdefga
(a) (b)
Webinar “Introduction to apcluster”, June 13, 2013 18
Example:AP Clustering of Chemical Compounds
2,5−
diflu
oron
itrob
enze
ne
2−ni
trop
hena
nthr
ene
9−ni
trop
hena
nthr
ene
1−am
ino−
8−ni
trop
yren
e
1−ni
trop
yren
e
2−ni
trob
enz[
j]ace
anth
ryle
ne
1−ni
troc
oron
ene
1−ni
trob
enzo
[a]p
yren
e
4−ni
trob
enz[
k]ac
ephe
nant
hryl
ene
8−ni
troq
uino
line
5−ni
troq
uino
line
5−ni
trob
enzi
mid
azol
e
1−m
ethy
l−6−
nitr
oind
azol
e
6−ni
troq
uino
line
2−ac
etox
y−7−
nitr
oflu
oren
e
2−m
ethy
l−7−
nitr
oflu
oren
e
2−cy
ano−
7−ni
trof
luor
ene
2−io
do−
7−ni
trof
luor
ene
2−hy
drox
y−7−
nitr
oflu
oren
e
2−flu
oro−
7−ni
trof
luor
ene
1−ni
trof
luor
anth
ene
1,3,
6,8−
tetr
anitr
opyr
ene
3−m
ethy
l−4−
nitr
obip
heny
l
2,4−
dini
trof
luor
anth
ene
3,4−
dini
trof
luor
anth
ene
3−ni
tro−
9−flu
oren
one
4,3'
−di
nitr
obip
heny
l
2−ni
tro−
m−
phen
ylen
edia
min
e
1−ch
loro
−2,
4−di
nitr
oben
zene
2,4,
7−tr
initr
o−9−
fluor
enon
e
2,4,7−trinitro−9−fluorenone
1−chloro−2,4−dinitrobenzene
2−nitro−m−phenylenediamine
4,3'−dinitrobiphenyl
3−nitro−9−fluorenone
3,4−dinitrofluoranthene
2,4−dinitrofluoranthene
3−methyl−4−nitrobiphenyl
1,3,6,8−tetranitropyrene
1−nitrofluoranthene
2−fluoro−7−nitrofluorene
2−hydroxy−7−nitrofluorene
2−iodo−7−nitrofluorene
2−cyano−7−nitrofluorene
2−methyl−7−nitrofluorene
2−acetoxy−7−nitrofluorene
6−nitroquinoline
1−methyl−6−nitroindazole
5−nitrobenzimidazole
5−nitroquinoline
8−nitroquinoline
4−nitrobenz[k]acephenanthrylene
1−nitrobenzo[a]pyrene
1−nitrocoronene
2−nitrobenz[j]aceanthrylene
1−nitropyrene
1−amino−8−nitropyrene
9−nitrophenanthrene
2−nitrophenanthrene
2,5−difluoronitrobenzene
Webinar “Introduction to apcluster”, June 13, 2013 19
Example: Leveraged Affinity Propagation
37 53 83 88 91 27 30 69 73 79 80
807978777675747372717069686766656463626119302928272625242322212018171615141312111098765432110510410310210110099989796959493929190898887868584838281605958575655545352515049484746454443424140393837363534333231
Webinar “Introduction to apcluster”, June 13, 2013 20
And Now . . .
Webinar “Introduction to apcluster”, June 13, 2013 20
And Now . . .
. . . let’s see apcluster at work!
Webinar “Introduction to apcluster”, June 13, 2013 21
Further Information
Paper about package:
U. Bodenhofer, A. Kothmeier, and S. Hochreiter. APCluster:an R package for affinity propagation clustering. Bioinfor-matics, 27(17):2463–2464, 2011. DOI: 10.1093/bioinformat-ics/btr406.
Package homepages:
http://www.bioinf.jku.at/software/apcluster/
http://cran.r-project.org/web/packages/
apcluster/index.html
Webinar “Introduction to apcluster”, June 13, 2013 22
Acknowledgments
Package co-authors:
Andreas Kothmeier
Johannes Palme
Webinar “Introduction to apcluster”, June 13, 2013 22
Acknowledgments
Package co-authors:
Andreas Kothmeier
Johannes Palme
Co-workers in related applications and projects:
Sepp Hochreiter
Günter Klambauer
Andreas Mayr
Webinar “Introduction to apcluster”, June 13, 2013 22
Acknowledgments
Package co-authors:
Andreas Kothmeier
Johannes Palme
Co-workers in related applications and projects:
Sepp Hochreiter
Günter Klambauer
Andreas Mayr
For the invitation to and organization of this webinar:
Ray DiGiacomo, Jr., and the Orange County R User Group