Introduction to apcluster - bioinf.jku.at of Bioinformatics Johannes Kepler University Linz...

Institute of Bioinformatics

Johannes Kepler University Linz

Introduction to apcluster

Ulrich Bodenhoferbodenhofer@bioinf.jku.at

Webinar “Introduction to apcluster”, June 13, 2013 2

Outline

1. Introduction to affinity propagation (AP) clustering

2. The apcluster package, its algorithms, and visualiza-tion tools

3. Live apcluster demonstration

4. Question and Answer period

Affinity Propagation (AP) Clustering

Affinity propagation (AP) is an emergingnew clustering technique with increasingimportance in many fields (in particular,bioinformatics).

AP uses pairwise similarities as inputsand iteratively determines clusters alongwith samples that are representative forthe clusters, so-called exemplars.

AP is based on an iterative message pass-ing scheme in which samples compete forbecoming exemplars.

0.2 0.4 0.6 0.8

●●

● ●

●●●

●●

● ●

●●

● ●

●●

●●●

●●

●●●

●●

●●●

●●

● ●

●●

● ●

●●

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 161

210 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100

Affinity Propagation:What’s Special About It?

Efficiently finds approximate exemplars (to find an optimal so-lution is actually NP-hard).

Deterministic, initialization-independent algorithm with verysimple update rules (can be viewed as max-sum algorithm in afactor graph).

Published in Science:

B. J. Frey and D. Dueck. Clustering by passing mes-sages between data points. Science, 315:972–976,2007. DOI: 10.1126/science.1136800.

Affinity Propagation: Message Matrices

Responsibility matrix R: R(i, k) is sent from i to k and corre-sponds to the accumulated evidence for how well-suited k isto serve as the exemplar for i (taking into account other possi-ble exemplars for i).

Availability matrix A: A(i, k) is sent from k to i and correspondsto the accumulated evidence for how appropriate it is for i tochoose k as its exemplar (taking into account other points thatwould choose k as exemplar).

Message Passing / Update Rules

Repeatedly perform the following updates (A is initialized with zeros):

For i, k ∈ {1, . . . , l}:

R(i, k)← S(i, k)−maxk′ 6=k

(A(i, k′) + S(i, k′)

)For i, k ∈ {1, . . . , l} s.t. i 6= k:

A(i, k)← min(0, R(k, k) +

∑j 6∈{i,k}

max(0, R(j, k)

))For k ∈ {1, . . . , l}:

A(k, k)←∑j 6=k

max(0, R(j, k)

)The “self-similarities” S(k, k) are called input preferences. They determine theindividual tendencies of samples to become exemplars.

How Exemplars Emerge: An Example

R A R+A

Function apcluster()

apcluster(s, x, p, q, ...)

s . . . quadratic similarity matrix or similarity measure (function)

x . . . data (vector or list; only if s was a function)

p . . . input preference (per sample or one value for all)

q . . . sets input preference to quantile of similarities (q ∈ [0, 1])

The function returns an APResult object that contains the cluster-ing result.

Affinity Propagation: Pro’s & Con’s

+ Exemplars are real samples, no hypothetical averages.

+ Only pairwise similarities necessary; the similarity measuresnot even need to satisfy symmetry or the triangle inequality(applicable to all sorts of kernels and correlation measures).

+ Algorithm (almost) deterministic, not sensitive to initialization.

+ Number of clusters need not be pre-specified.

– Adjustment of input preference parameter can be tricky.

– Quadratic similarity matrices required: does not scale to largedata sets.

Adjustment of Input Preference /Number of Clusters

The apcluster package further offers:

Function preferenceRange(s) computes input preferencebounds for given similarity matrix s:

Lower bound: 1–2 clusters

Upper bound: as many clusters as samples

Function apclusterK() performs an inverse search for aninput preference that facilitates a desired number of clustersK.

Function aggExCluster() performs agglomerative cluster-ing on top of AP clustering and allows for specifying cut levelswith a desired number of clusters.

So How to Scale AP to Large DatasetsThen?

Sub-sampling?

Massive loss ofinformation!

Sub-sampling?

Reduce columnobjects . . .

Sub-sampling?

but keep all rowobjects

Leveraged Affinity Propagation

Start with a small, but reasonable, sub-sample of columns (po-tential exemplars).

Iteratively repeat AP on such sub-samples, keeping the exem-plars of the previous iteration. Messages are exchanged be-tween the row objects (all) and the potential exemplars/columnobjects (sub-sample).

No need to calculate the whole similarity matrix in advance.Instead, only the similarities with the sub-samples need to becomputed.

Function apclusterL()

apclusterL(s, x, frac, sweeps, p, q, ...)

s . . . similarity measure (function)

x . . . data (vector or list)

frac . . . fraction of samples to be considered in each iteration

sweeps . . . number of iterations

p, q . . . analogous to apcluster()

The function returns an APResult object that contains the cluster-ing result.

Functions for Visualizing Results

plot(x) with x being an APResult object: performancegraphs for assessing convergence

plot(x) with x being an AggExResult object: dendrogramof agglomerative clustering

plot(x, y) with x being an APResult object and y beingoriginal data: plot data along with cluster structure)

heatmap(x) with x being an APResult or AggExResult ob-ject: plot heatmap (dendrograms and color coding of clustersswitchable)

Similarity Measures

The apcluster package provides the following similarity measures:

negDistMat(): negative distances (resp. power thereof), interfaceanalogous to dist()

expSimMat(): generalization of Gauss/Laplace similarity (RBF/La-place kernel)

linSimMat(): Łukasiewicz similarity, interface analogous to dist()

corSimMat(): correlation, interface analogous to cor()

All functions in the apcluster package can also be used with custom-made similarity measures.

Similarity Measures: Examples[similarity of (x, y) and (0, 0)]

negDistMat(r=2, method="euclidean") negDistMat(r=1, method="manhattan")

expSimMat(r=2, w=1, method="euclidean") expSimMat(r=1, w=1, method="manhattan")

linSimMat(w=1, method="euclidean") linSimMat(w=1, method="maximum")

Example:Two Clusters of Coiled Coil Sequences

ANSTILMV

CPQALVEKHILNRYKAETQ

AMVTLLVCRQKEEGHSTYKQRADQTWLRVKACEFTHLNIVADGKQTRSYLEAFHLMNRTVYDSEHMNTVYAFWLAINRVQTKLGTRAEKLSACIMNQDRTELKCKSVYQALTNAGINRWDELYEKNTADMQYRHAEIQRSTYLAILMTQKREDEVAQKNADGQRKEKLNVKRVQEAETAQRCLGLSKEQRKDQKELWNLV

VKKL.........................VKTLKAQNSELASTANMLREQVAQLKQKV..............NYALKQKVQAL....VQALKKRVQALKARNYAAKQKVQAL..............ALLEAIRLRNKVKAL....VVTLKELHSSTTLENDQLRQKVRQL..............LKLDNYHLENEVARL....LEAVRRKIRSLQEQNYHLENEVARLKKLVMKQLEDKVEELLSKNYHLENEVARL....MKQLEDKVEELLSKNYHLENEVARLKKLVMKQLEDKVEELLSKQYHLENEVARL....MKQLEDKVEELLSKKYHLENEVARL....LPILEDKVEELLSKNYHLENEVARL.......LEDKVEELLSKNYHLENEVARLKKLVVKELEDKVEELLSKNYHLENEVARLKKLV.......VEELLSKNYHLENEVARLKKLV.......VEELLSKNYHLENEVARLKKLL..........MVEKCWKLMDKVVRL....TKLLEDLVYFV..................AAALCGLVYLF..................ICRLCYRVLRH..................IEKLQTKVLELQRKLDNTTAAVQELGRENNLETQHKVLELTAENERLQKKVEQLSREL...TQQKVLELTSDNDRLRKRVEQLSREL...LKSWNETLTSRL..............VKQLKAKVEELKSKLWHLKNKVARLKKKN...LERVAEELNSIAGHL...........VVHLRQRTEELLRCNEQQAAELETCKEQL.......ERDFLKTTLHR...........SLNMVKVLGDLLKNSLTIINGL.......VETLKDVIYHAKELQLELKKKK.......IAYALDTIAENQAKNEHLQKENERLLRDW.......FKNYLTKVAYY.................VLLVLIKLYYD...................CTMLVGATYMLLVDN..............IDEWKSLTYDE...........LEELQAVHEYWRSMNRYS...........

L...N E.EN..L E...LN.....E

abcdefgabcdefgabcdefgabcdefga

DQTFMLVIINPQRVASTKEAHQTEGNKSD

CDFMRSTWALNYEK

AFLIKLGIQSADNE

CHKLIDTEA

HKQTWLMVIDGHNSVTEKL

EHIRYDQNSFILQTVDNEKFSQLIGKLNQTDEAY

EKLRVYDSAH

EGLQSTVRINVYKLEQRSVYENDMQRYLELVIDAKRVIKKL I

DVSLIQDVERILDYS...........ITTSISHVESLLDDRL..........IEDKIEEILSKIYHIENEIARIKKLIIEDKIEEILSKIYHIENEIARI....IEDKIEEIESKQKKIENEIARIKKLLQEDKLEETLSKIYHLENEIARV....IEDKIEEILSKIYHIENEI........EGCLIEIL.................FEDALLA...................FSDYLGLM...................IDYLAKKLHDIEEEY..........VPKWISIM..................VTKTIETHTDNI..............VAKRIEALENKI..............INSDFNAI..................LSSNFGAISSVLNDILSRLDKV........LEAINSEL..............IESAINDVHNFS..............IKNEIAAIKQEIAAIEQMI.......IKEEIAAIKDKIAAIKEYI.......IEEEIQAIKEEIAAIKYLI.......MEAEINTLKSKLELT...........LQHNIKTQVSQILRVEVDI..........FLDIWTYN...............MKGLIDEVDQDFTSR...........TAQELDCIKITQQVGVEL........LRNMAIDMGNEIGSQNRQV.......

E..I E..IE....K

I...I I...I I...Idefgabcdefgabcdefgabcdefga

(a) (b)

Example:AP Clustering of Chemical Compounds

2,5−

2−ni

9−ni

1−am

ino−

8−ni

1−ni

2−ni

1−ni

4−ni

8−ni

5−ni

l−6−

6−ni

2−ac

y−7−

l−7−

2−cy

ano−

7−ni

2−io

7−ni

2−hy

y−7−

2−flu

oro−

7−ni

1−ni

6,8−

l−4−

2,4−

3,4−

3−ni

tro−

9−flu

2−ni

tro−

1−ch

4−di

7−tr

o−9−

2,4,7−trinitro−9−fluorenone

1−chloro−2,4−dinitrobenzene

2−nitro−m−phenylenediamine

4,3'−dinitrobiphenyl

3−nitro−9−fluorenone

3,4−dinitrofluoranthene

2,4−dinitrofluoranthene

3−methyl−4−nitrobiphenyl

1,3,6,8−tetranitropyrene

1−nitrofluoranthene

2−fluoro−7−nitrofluorene

2−hydroxy−7−nitrofluorene

2−iodo−7−nitrofluorene

2−cyano−7−nitrofluorene

2−methyl−7−nitrofluorene

2−acetoxy−7−nitrofluorene

6−nitroquinoline

1−methyl−6−nitroindazole

5−nitrobenzimidazole

5−nitroquinoline

8−nitroquinoline

4−nitrobenz[k]acephenanthrylene

1−nitrobenzo[a]pyrene

1−nitrocoronene

2−nitrobenz[j]aceanthrylene

1−nitropyrene

1−amino−8−nitropyrene

9−nitrophenanthrene

2−nitrophenanthrene

2,5−difluoronitrobenzene

Example: Leveraged Affinity Propagation

37 53 83 88 91 27 30 69 73 79 80

807978777675747372717069686766656463626119302928272625242322212018171615141312111098765432110510410310210110099989796959493929190898887868584838281605958575655545352515049484746454443424140393837363534333231

And Now . . .

. . . let’s see apcluster at work!

Further Information

Paper about package:

U. Bodenhofer, A. Kothmeier, and S. Hochreiter. APCluster:an R package for affinity propagation clustering. Bioinfor-matics, 27(17):2463–2464, 2011. DOI: 10.1093/bioinformat-ics/btr406.

Package homepages:

http://www.bioinf.jku.at/software/apcluster/

http://cran.r-project.org/web/packages/

apcluster/index.html

Acknowledgments

Package co-authors:

Andreas Kothmeier

Johannes Palme

Acknowledgments

Package co-authors:

Andreas Kothmeier

Johannes Palme

Co-workers in related applications and projects:

Sepp Hochreiter

Günter Klambauer

Andreas Mayr

Acknowledgments

Package co-authors:

Andreas Kothmeier

Johannes Palme

Co-workers in related applications and projects:

Sepp Hochreiter

Günter Klambauer

Andreas Mayr

For the invitation to and organization of this webinar:

Ray DiGiacomo, Jr., and the Orange County R User Group

Introduction to apcluster - bioinf.jku.at of Bioinformatics Johannes Kepler University Linz...

Documents

Transcript of Introduction to apcluster - bioinf.jku.at of Bioinformatics Johannes Kepler University Linz...

DEPARTMENT OF ECONOMICS JOHANNES KEPLER ...cDepartment of Economics, Johannes Kepler University Linz, Austria. 2 1. Introduction Nearly twenty years after the introduction of the European

CURRICULUM VITAE CONTACT - JKU - Johannes Kepler ... · Social Sciences (Johannes Kepler University Linz) 2006-2011 PhD in Social and Economic Sciences (Johannes Kepler University

lukiyanchuklukiyanchuk.ru/publ/6.pdf · 2009. 1. 29. · D. Bäuerle, B. Luk'yanchuk*, and K. Piglmayer Angewandte Physik, Johannes-Kepler-Universität Linz, A-4040 Linz, Austria

JOHANNES KEPLER UNIVERSITY LINZ Institute of Computational ... · JOHANNES KEPLER UNIVERSITY LINZ Institute of Computational Mathematics Multigrid Methods for Isogeometric Analysis

SPACE FOR EXCHANGE. - JKU - Johannes Kepler Universität Linz · WELCOME 2 Welcome to the Johannes Kepler University Linz. Congratulations. You have been nominated by your home university

.NET Architecture Albrecht Wöß Institute for System Software Johannes Kepler University Linz © University of Linz, Institute for System Software, 2004.

Backjumping - Johannes Kepler University Linz

Johannes Kepler University Linz. - JKU

Johannes Kepler University Linz, Austria Handbook: …ssw.jku.at/Teaching/Lectures/SpezialLVA/Lightfoot/2011/Handbook.pdf · Department of Computing Johannes Kepler University Linz,

Peter Praxmarer GUP, Joh. Kepler University Linz praxmarer@gup.jku.at

Surrogate-Based Multi-Objective Optimization of Electrical ... · 1 Department of Electrical Drives and Power Electronics, Johannes Kepler University Linz, Linz 4040, Austria 2 Department

Inverse Problems in Semiconductor Devices Martin Burger Johannes Kepler Universität Linz.

GeoGebra. Supporting Institutions Johannes Kepler University Linz Mathematics Education RISCMathematics EducationRISC Austrian Ministry of Education Dr.

Johannes Kepler University Linz - Compact and Eﬃcient ...Johannes Kepler University Linz, Austria Abstract In several Java VMs, strings consist of two separate objects: metadata

A-4040 Linz, Austria, Europe Johannes Kepler University ... · • Research Institute for Symbolic Computation Johannes Kepler University A-4040 Linz, Austria, Europe Circular Programs

JOHANNES KEPLER UNIVERSITY LINZ Institute of ...Estimates for Saddle Point Problems Walter Zulehner Institute of Computational Mathematics, Johannes Kepler University Altenberger Str.

JOHANNES KEPLER UNIVERSITY LINZ Institute of Computational ... · Institute of Computational Mathematics, Johannes Kepler University Linz, A-4040 Linz, Austria; helmut.gfrerer@jku.at

Johannes Kepler University Linz - ul.ie file3 Welcome to the Johannes Kepler University Linz (JKU)! Congratulations! You have been nominated by your home university to study at JKU

T FLA - bioinf.jku.at

European Grid Initiative Dieter Kranzlmüller (GUP, Joh. Kepler University Linz) contact@eu-egi.org.