Bioinformatics kernels relations


Transcript of Bioinformatics kernels relations

Page 1: Bioinformatics kernels relations

Kernel Methods and Relational Learning in Bioinformatics

ir. Michiel Stock
Dr. Willem Waegeman

Prof. dr. Bernard De Baets

Faculty of Bioscience Engineering, Ghent University

November 2012

KERMIT

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 1 / 40

Page 2: Bioinformatics kernels relations

Outline

1 Introduction

2 Kernel methods

3 Learning relations

4 Case studies
   Enzyme function prediction
   Protein-ligand interactions
   Microbial ecology

5 Conclusions


Page 3: Bioinformatics kernels relations

Introduction

Introductory example

Problem statement

Predict protein-protein interactions based on high-throughput data.

Based on a gold standard. Typical features that can be used:

Yeast two-hybrid

Pfam profile

Phylogenetic profile

Localization

PSI-BLAST

Expression

...


Page 4: Bioinformatics kernels relations

Introduction

Machine learning is widely used in bioinformatics

In addition to all these applications, computational techniques are used to solve other problems, such as efficient primer design for PCR, biological image analysis and backtranslation of proteins (which is, given the degeneration of the genetic code, a complex combinatorial problem).

Machine learning consists in programming computers to optimize a performance criterion by using example data or past experience. The optimized criterion can be the accuracy provided by a predictive model (in a modelling problem), or the value of a fitness or evaluation function (in an optimization problem).

In a modelling problem, the 'learning' term refers to running a computer program to induce a model by using training data or past experience. Machine learning uses statistical theory when building computational models, since the objective is to make inferences from a sample. The two main steps in this process are to induce the model by processing the huge amount of data, and to represent the model and make inferences efficiently. It must be noticed that the efficiency of the learning and inference algorithms, as well as their space and time complexity and their transparency and interpretability, can be as important as their predictive accuracy. The process of transforming data into knowledge is both iterative and interactive. The iterative phase consists of several steps. In the first step, we need to integrate and merge the different sources of information into only one format. By using data warehouse techniques, the detection and resolution of outliers and inconsistencies are solved. In the second step, it is necessary to select, clean and transform the data. To carry out this step, we need to eliminate or correct the uncorrected data, as well as

Figure 1: Classification of the topics where machine learning methods are applied.

(Excerpt from Larrañaga et al., bib.oxfordjournals.org.)


Page 5: Bioinformatics kernels relations

Introduction

Bioinformatics deals with complex data

Bioinformatics data is typically:

high-dimensional (e.g., microarray or proteomics data)

structured (e.g., gene sequences, small molecules, interaction networks, phylogenetic trees...)

heterogeneous (e.g., vectors, sequences and graphs describing the same protein)

available in large quantities (e.g., more than 10^6 known protein sequences)

noisy (e.g., many features are not relevant)


Page 6: Bioinformatics kernels relations

Kernel methods

Formal definition of a kernel

Kernels are non-linear functions defined over objects x ∈ X .

Definition

A function k : X × X → R is called a positive definite kernel if it is symmetric, that is, k(x, x′) = k(x′, x) for any two objects x, x′ ∈ X, and positive semi-definite, that is,

∑_{i=1}^{N} ∑_{j=1}^{N} c_i c_j k(x_i, x_j) ≥ 0

for any N > 0, any choice of N objects x_1, . . . , x_N ∈ X, and any choice of real numbers c_1, . . . , c_N ∈ R.

Can be seen as generalized covariances.
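The definition can be checked numerically. A minimal sketch that builds a Gram matrix with the Gaussian (RBF) kernel and verifies symmetry and positive semi-definiteness; the toy data and bandwidth are illustrative assumptions, not part of the slides.

```python
import numpy as np

def rbf_kernel(x, xp, gamma=0.5):
    """k(x, x') = exp(-gamma * ||x - x'||^2), a standard positive definite kernel."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # 20 objects with 5 features each (toy data)
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

assert np.allclose(K, K.T)            # symmetry: k(x, x') = k(x', x)
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-9          # positive semi-definite (up to rounding)
print("smallest eigenvalue:", eigvals.min())
```

A Gram matrix with a negative eigenvalue would violate the quadratic-form condition above for the corresponding eigenvector's coefficients c_1, ..., c_N.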


Page 7: Bioinformatics kernels relations

Kernel methods

Interpretation of kernels

Suppose an object x has an implicit feature representation φ(x) ∈ F. A kernel function can be seen as a dot product in this feature space:

k(x, x′) = ⟨φ(x), φ(x′)⟩

Linear models in this feature space F can be made:

y(x) = wᵀφ(x) = ∑_n a_n k(x_n, x)


Page 8: Bioinformatics kernels relations

Kernel methods

Many kernel methods exist

Examples of popular kernel methods:

Support vector machines (SVM)

Regularized least squares (RLS)

Kernel principal component analysis (KPCA)

The learning algorithm is independent of the kernel representation!
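As an illustration of one such method, a minimal regularized least squares (RLS) sketch in the dual: the model y(x) = ∑_n a_n k(x_n, x) is obtained by solving (K + λI)a = y. The toy data and the linear kernel choice are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.01 * rng.normal(size=30)

K = X @ X.T                                   # Gram matrix (linear kernel)
lam = 1e-3                                    # regularization parameter
a = np.linalg.solve(K + lam * np.eye(30), y)  # dual coefficients a_n

y_hat = K @ a                                 # predictions on the training objects
print("max training error:", np.abs(y - y_hat).max())
```

Swapping the linear kernel for any other positive definite kernel changes only the Gram matrix, not the solver, which is exactly the independence of algorithm and representation the slide emphasizes.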


Page 9: Bioinformatics kernels relations

Kernel methods

Kernels for (protein) sequences

Spectrum kernel (SK)

The SK considers the number of k-mers m two sequences s_i and s_j have in common:

SK_k(s_i, s_j) = ∑_{m ∈ Σ^k} N(m, s_i) · N(m, s_j),

with N(m, s) the number of occurrences of the k-mer m in sequence s.

Used to predict the structure, function... of DNA, RNA or proteins.

A discriminative alternative to hidden Markov models.
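The spectrum kernel above is easy to implement by counting k-mers with a hash map; the toy DNA sequences are illustrative assumptions.

```python
from collections import Counter

def spectrum_kernel(s1, s2, k=3):
    """SK_k(s1, s2) = sum over k-mers m of N(m, s1) * N(m, s2)."""
    c1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    c2 = Counter(s2[i:i + k] for i in range(len(s2) - k + 1))
    return sum(count * c2[m] for m, count in c1.items())

# Shared 3-mers of the two sequences are ATT, TTA and TAC, once each:
print(spectrum_kernel("GATTACA", "ATTAC", k=3))  # → 3
```

Counting only over the k-mers that actually occur keeps the sum sparse, instead of iterating over the full alphabet power Σ^k.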


Page 10: Bioinformatics kernels relations

Kernel methods

Kernels for graphs (1)

Graph

A graph is a set of interconnected objects, called vertices (or nodes), that are connected through edges.

Graphs can show the structure of an object or interactions betweendifferent objects.

Graphs are important in bioinformatics!

Page 11: Bioinformatics kernels relations

Kernel methods

Kernels for graphs (2)

Graph kernel

Constructs a similarity between graphs.

Based on performing a random walk on both graphs and counting the number of matching walks. Usually very computationally demanding!

In chemoinformatics:

In structural bioinformatics:


Page 12: Bioinformatics kernels relations

Kernel methods

Kernels for graphs (3)

Diffusion kernel

Constructs a similarity between vertices within the same graph.

Also based on performing a random walk on a graph. Captures the long-range relationships between vertices. Inspired by the heat equation: the kernel quantifies how quickly 'heat' can spread from one node to another.
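A minimal diffusion kernel sketch: K = exp(−βL) with L the graph Laplacian, computed via an eigendecomposition since L is symmetric. The small path graph and the value of β are assumptions for illustration.

```python
import numpy as np

A = np.array([[0, 1, 0, 0],   # adjacency matrix of the path graph 0-1-2-3
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A           # graph Laplacian

beta = 1.0                               # diffusion time / 'temperature'
w, V = np.linalg.eigh(L)
K = V @ np.diag(np.exp(-beta * w)) @ V.T # diffusion kernel exp(-beta * L)

# 'Heat' spreads along edges: adjacent vertices end up more similar
# than vertices at opposite ends of the path.
assert K[0, 1] > K[0, 3]
assert np.linalg.eigvalsh(K).min() > 0   # valid (positive definite) kernel
```

The matrix exponential of −βL is always positive definite, so the diffusion kernel is a valid kernel between vertices for any β > 0.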


Page 13: Bioinformatics kernels relations

Kernel methods

Kernels for fingerprints

Objects that can be described by a long binary vector x can be compared with the Tanimoto kernel:

K_Tan(x_m, x_n) = ⟨x_m, x_n⟩ / (⟨x_m, x_m⟩ + ⟨x_n, x_n⟩ − ⟨x_m, x_n⟩).

Fingerprint representation of an object:
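The Tanimoto kernel above amounts to a ratio of dot products on binary vectors; the toy fingerprints below are illustrative assumptions.

```python
import numpy as np

def tanimoto_kernel(xm, xn):
    """Shared set bits divided by bits set in either fingerprint."""
    dot = np.dot(xm, xn)
    return dot / (np.dot(xm, xm) + np.dot(xn, xn) - dot)

a = np.array([1, 1, 0, 1, 0])
b = np.array([1, 0, 0, 1, 1])
# 2 shared bits, 4 bits set in the union: 2 / (3 + 3 - 2) = 0.5
print(tanimoto_kernel(a, b))  # → 0.5
```

On binary vectors the formula reduces to |intersection| / |union| of the set bits, which is why it is popular for molecular fingerprints.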


Page 14: Bioinformatics kernels relations

Learning relations

Kernels for pairs of objects

Problem statement

Predict the binding interaction between a given protein and a ligand (small molecule): learning as an alternative to molecular docking.

The problem deals with two types of objects:

proteins (graph kernel on the structure, sequence kernel, fingerprints...)

ligands (fingerprints, graph kernel...)

The label is for a pair of objects.


Page 15: Bioinformatics kernels relations

Learning relations

Kernels for pairs of objects

Pairwise kernel

Combine the kernel matrices of the individual objects to construct a kernelmatrix for pairs of objects.

Starting from individual kernels for the proteins and ligands:

(The slide embeds the KERMIT poster "Relational Learning and Ranking Algorithms for Bioinformatics Applications" by Michiel Stock, Willem Waegeman and Bernard De Baets. Its key points:)

Introductory example, chemogenomics: model the binding interactions between a set of proteins and a database of ligands to aid the process of drug design. With no known mechanistic information, one can build a statistical model based on a data set. Kernel methods allow for the generation of a joint feature representation of a pair containing a protein and a ligand: object kernels for the proteins and the ligands are combined into a pairwise kernel.

Conditional ranking algorithm: the ranking data can be seen as a graph; a function h ∈ H predicts a value from a feature representation Φ(e) of the edges,

h(e) = ⟨w, Φ(e)⟩ = ∑_{ē ∈ E} a_ē K_Φ(e, ē).

Given a training dataset T, this function can be learned using the algorithm

A(T) = argmin_{h ∈ H} L(h, T) + λ‖h‖²_H,

with L an appropriate loss function and λ a regularization parameter. To train a model for conditional ranking, a convex and differentiable approximation of the ranking loss is used:

L(h, T) = ∑_{v ∈ V} ∑_{e, ē ∈ E_v} (y_e − y_ē − h(e) + h(ē))².

In the most general case the Kronecker product pairwise kernel is used for the edges, which is simply the product of kernels between the nodes:

K_Φ(e, ē) = K_Φ(v, v′, v̄, v̄′) = K_φ(v, v̄) · K_φ(v′, v̄′).

By optimizing a ranking loss, the algorithms can also be used for conditional ranking of database objects by relevance to a query. In short, the framework is well suited for bioinformatics challenges: an efficient learning process, the ability to handle complex objects (graphs, trees, sequences...) and the ability to deal with information retrieval problems.

Functional ranking of enzymes: given structural information of an enzyme, infer its function by ranking annotated proteins of a database according to their predicted catalytic similarity (derived from the EC number) with the query protein. Using five state-of-the-art structural similarities, learning a conditional ranking model was always an improvement compared to the baseline ranking.

KERMIT
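The Kronecker product pairwise kernel can be sketched directly on Gram matrices: given object kernels Kp for the proteins and Kl for the ligands, the kernel between pairs (i, j) and (k, l) is Kp[i, k] · Kl[j, l]. The two small random Gram matrices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.normal(size=(3, 5))
Kp = P @ P.T                  # Gram matrix over 3 proteins
Lg = rng.normal(size=(4, 6))
Kl = Lg @ Lg.T                # Gram matrix over 4 ligands

K_pair = np.kron(Kp, Kl)      # 12 x 12 Gram matrix over (protein, ligand) pairs

# Entry for pairs (i, j) and (k, l) equals the product of the object kernels:
i, j, k, l = 1, 2, 0, 3
assert np.isclose(K_pair[i * 4 + j, k * 4 + l], Kp[i, k] * Kl[j, l])
```

Materializing the full Kronecker matrix scales as the product of the two Gram-matrix sizes, which illustrates the memory bottleneck for larger datasets noted later in the slides.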


Page 16: Bioinformatics kernels relations

Learning relations

Conditional ranking (1)

Motivation

Suppose one is not particularly interested in the exact value of the interaction, but in the order of the proteins for a given ligand.

(The slide shows the same KERMIT poster as on the previous page.)


Page 17: Bioinformatics kernels relations

Learning relations

Conditional ranking (2)

Based on a graph description, with e a pair of objects (an edge). Train the model

h(e) = ⟨w, Φ(e)⟩ = ∑_{ē ∈ E} a_ē K_Φ(e, ē)

using the algorithm

A(T) = argmin_{h ∈ H} L(h, T) + λ‖h‖²_H,

where we use a ranking loss:

L(h, T) = ∑_{v ∈ V} ∑_{e, ē ∈ E_v} (y_e − y_ē − h(e) + h(ē))².

*Figure 1: Example of a multi-graph. If the graph on the left were used for ranking the elements conditioned on C, then A scores better than E, which in turn ranks higher than D, and D ranks higher than B. There is no information about the relation of C with F and G, respectively; our model could be used to include these two instances in the ranking if features are available. Notice that in this setting an unconditional ranking of these objects is meaningless, as this graph is obviously intransitive. Figure reproduced from Pahikkala et al. (2010).

The proposed framework is based on the Kronecker product kernel for generating implicit joint feature representations of queries and the sets of objects to be ranked. Exactly this kernel construction allows a straightforward extension of the existing framework to dyadic relations and multi-task learning problems (Objectives 1 and 2). It has been proposed independently by three research groups for modeling pairwise inputs in different application domains (Basilico et al., 2004; Oyama et al., 2004; Ben-Hur et al., 2005). From a different perspective, it has been considered in structured output prediction methods for defining joint feature representations of inputs and outputs (Tsochantaridis et al., 2005; Weston et al., 2007). While the usefulness of Kronecker product kernels for pairwise learning has been clearly established, the computational efficiency of the resulting algorithms remains a major challenge. Previously proposed methods require the explicit computation of the kernel matrix over the data object pairs, thereby introducing bottlenecks in terms of processing and memory usage, even for modest dataset sizes. To overcome this problem, one typically applies sampling strategies over the kernel matrix for training. An alternative approach, known as the Cartesian kernel, has been proposed in (Kashima et al., 2009). This kernel exhibits interesting computational properties, but it can only be employed in selected applications, because it cannot make predictions for (couples of) objects that are not observed in the training dataset. When modeling interactions between two types of objects, one gets close to the field of collaborative filtering, as shown in (Pessiot et al., 2007). Matrix factorization methods, which are used especially in collaborative filtering, may be applied to conditional ranking problems by exploiting the known labels for pairs of objects in order to generate a latent feature representation that allows predicting these labels for pairs for which this information is missing. Such methods can be combined with our machine learning approach, as a preprocessing step in which additional latent features are generated (part of Objectives 1 and 2).
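The conditional ranking loss above can be sketched directly: squared differences of label and prediction deltas, summed per conditioning vertex. The toy edge labels, predictions and conditioning vertices are illustrative assumptions.

```python
import numpy as np

def conditional_ranking_loss(y, h, groups):
    """L(h, T) = sum_v sum_{e, e' in E_v} (y_e - y_e' - h(e) + h(e'))^2.

    y, h: arrays of edge labels and predictions; groups: conditioning
    vertex of each edge (edges are only compared within the same vertex).
    """
    loss = 0.0
    for v in set(groups):
        idx = [i for i, g in enumerate(groups) if g == v]
        for i in idx:
            for j in idx:
                loss += (y[i] - y[j] - h[i] + h[j]) ** 2
    return loss

y = np.array([3.0, 1.0, 2.0, 0.0])
h = np.array([2.5, 1.5, 2.0, 0.0])
groups = ["a", "a", "b", "b"]

# A model reproducing every within-group pairwise difference has zero loss:
assert conditional_ranking_loss(y, y, groups) == 0.0
print(conditional_ranking_loss(y, h, groups))  # → 2.0
```

The loss only penalizes misestimated differences within a conditioning vertex, so adding a constant to all predictions of one group leaves it unchanged, exactly what a ranking (rather than regression) objective requires.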


Page 18: Bioinformatics kernels relations

Case studies Enzyme function prediction

Predicting enzyme function

Problem statement

Predict the function (EC number) of an enzyme using structural information of the active site.

Data:

1730 enzymes with 21 different functions

four different structural similarities:

CavBase
maximum common subgraph
labeled point cloud superposition
fingerprints

active site of an enzyme:


Page 19: Bioinformatics kernels relations

Case studies Enzyme function prediction

EC numbers

EC number

A functional label of an enzyme, based on the reaction that is catalyzed.

Example: EC 2.7.6.1 = ribose-phosphate diphosphokinase


Page 20: Bioinformatics kernels relations

Case studies Enzyme function prediction

Defining catalytic similarity

Catalytic similarity

The catalytic similarity is the number of successive equal digits in the EC numbers of two enzymes, starting from the first digit.

(Figure: a set of enzymes with EC numbers 2.7.7.12, 2.7.7.34, 2.7.1.12, 4.2.3.90, 4.6.1.11 and an unannotated EC ?.?.?.?, with pairwise catalytic similarities on the edges, e.g. 3 for EC 2.7.7.12 vs. EC 2.7.7.34, 2 for EC 2.7.7.12 vs. EC 2.7.1.12, 1 for EC 4.2.3.90 vs. EC 4.6.1.11, and 0 across the top-level classes.)
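The catalytic similarity as defined above is straightforward to compute; the sketch below checks it against EC numbers appearing on the slide.

```python
def catalytic_similarity(ec1, ec2):
    """Number of successive equal digits of two EC numbers, from the first digit."""
    sim = 0
    for d1, d2 in zip(ec1.split("."), ec2.split(".")):
        if d1 != d2:
            break
        sim += 1
    return sim

assert catalytic_similarity("2.7.7.12", "2.7.7.34") == 3  # same sub-subclass
assert catalytic_similarity("2.7.7.12", "2.7.1.12") == 2  # same subclass
assert catalytic_similarity("4.2.3.90", "4.6.1.11") == 1  # same top-level class
assert catalytic_similarity("2.7.7.12", "4.2.3.90") == 0  # different classes
```

Note the comparison stops at the first mismatch: EC 2.7.1.12 and EC 2.7.7.12 share their last digit, but the similarity is still 2.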


Page 21: Bioinformatics kernels relations

Case studies Enzyme function prediction

Data exploration

(Figure: 3-D kernel PCA score plots — first, second and third component — of the cb, fp, mcs and lpcs data, together with hierarchical clusterings of the same four data sets, computed with hclust (complete linkage) on dist(D).)

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 21 / 40

Page 22: Bioinformatics kernels relations

Case studies Enzyme function prediction

Ranking enzymes

Ranking enzymes

For a query enzyme with unknown function, construct a ranking of a database of annotated enzymes, based on structure. The top of the ranking likely has the same function as the query.

unsupervised: for a given query enzyme with unknown function, rank the database according to the structural similarity with the query

supervised: first a ranking model h(v, v′) is constructed using an independent training set. Subsequently, for a given query enzyme v with unknown function, the enzymes vi from the database are ranked according to h(v, vi)
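The two schemes above can be sketched in a few lines. This is a minimal illustration on a toy similarity vector; the function names and the generic callable h are assumptions for exposition, not the actual ranking model used in the study:

```python
import numpy as np

def rank_unsupervised(K_query: np.ndarray) -> np.ndarray:
    """Rank database enzymes by decreasing kernel similarity to the query.

    K_query[i] = K(v, v_i): similarity between query v and database enzyme v_i.
    """
    return np.argsort(-K_query)

def rank_supervised(h, v, database) -> np.ndarray:
    """Rank database enzymes by a learned scoring function h(v, v_i)."""
    scores = np.array([h(v, vi) for vi in database])
    return np.argsort(-scores)

# Toy example: similarities of four database enzymes to one query.
K_query = np.array([0.2, 0.9, 0.5, 0.7])
print(rank_unsupervised(K_query))  # -> [1 3 2 0]
```

In the supervised case the only change is that the learned h(v, vi), rather than the raw kernel value, drives the sort.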

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 22 / 40

Page 23: Bioinformatics kernels relations

Case studies Enzyme function prediction

Results ranking enzymes

Table II. Summary of the results obtained for unsupervised and supervised ranking. For each combination of cavity-based similarity and type of performance measure, the performance is averaged over the different folds and queries, with the standard deviation between parentheses. For every row the best ranking model is marked in bold, while the worst model is indicated by an underscore.

         cb               fp               mcs              lpcs

Unsupervised
RA       0.9062 (0.0603)  0.8815 (0.0689)  0.8923 (0.0692)  0.8877 (0.0607)
MAP      0.9321 (0.1531)  0.7207 (0.2350)  0.8846 (0.1578)  0.7339 (0.2074)
AUC      0.9636 (0.0795)  0.8655 (0.1387)  0.9393 (0.0919)  0.8794 (0.1126)
nDCG     0.9922 (0.0329)  0.9349 (0.1424)  0.9812 (0.0498)  0.9471 (0.1112)

Supervised
RA       0.9951 (0.0170)  0.9950 (0.0150)  0.9944 (0.0112)  0.9952 (0.0156)
MAP      0.9991 (0.0092)  0.9954 (0.0432)  0.9989 (0.0076)  0.9835 (0.0797)
AUC      0.9976 (0.0005)  0.9967 (0.0184)  0.9975 (0.0024)  0.9934 (0.0368)
nDCG     0.9968 (0.0171)  0.9942 (0.0424)  0.9870 (0.0398)  0.9812 (0.0673)
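The AUC and nDCG columns can be computed from predicted scores and relevance labels. Below is a sketch using the standard definitions of both measures; these are generic implementations, not necessarily the exact variants used in the study:

```python
import numpy as np

def auc(scores: np.ndarray, labels: np.ndarray) -> float:
    """Probability that a relevant item is scored above an irrelevant one
    (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    pairs = pos[:, None] - neg[None, :]
    return ((pairs > 0).sum() + 0.5 * (pairs == 0).sum()) / pairs.size

def ndcg(scores: np.ndarray, gains: np.ndarray) -> float:
    """Normalized discounted cumulative gain of the ranking induced by scores."""
    order = np.argsort(-scores)
    discounts = 1.0 / np.log2(np.arange(2, len(scores) + 2))
    dcg = (gains[order] * discounts).sum()
    ideal = (np.sort(gains)[::-1] * discounts).sum()
    return dcg / ideal

scores = np.array([0.9, 0.2, 0.7, 0.1])
labels = np.array([1, 0, 1, 0])
print(auc(scores, labels))                  # -> 1.0
print(ndcg(scores, labels.astype(float)))   # -> 1.0
```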

[Figure 3: 3-D kernel PCA scatter plots of the cb, fp, mcs and lpcs data (axes: first, second and third component) and hierarchical clustering dendrograms of the cb, fp, mcs and lpcs data (hclust, complete linkage on dist(D)).]

Figure 3. (top) Kernel principal component analysis of the different cavity-based similarities. Enzymes are shown as points in the three-dimensional space spanned by the first three principal components and are colored according to the first digit of the EC number. (bottom) Hierarchical clustering of the enzymes in the feature space for the four different cavity-based similarities. The enzymes at the leaves of the tree are colored according to the first digit of their EC number. Color key: EC 1.x.x.x is red, EC 2.x.x.x is green, EC 3.x.x.x is blue, EC 4.x.x.x is cyan, EC 5.x.x.x is magenta and EC 6.x.x.x is grey.

all four cavity-based similarities. The statistical significance of the differences was confirmed with a paired Wilcoxon test and a conservative Bonferroni correction for multiple hypothesis testing (p < 10^-6). Moreover, for all kernels and performance measures, supervised ranking decreases the standard deviation of the error, implying that the models become more stable.

Three important reasons can be put forward to explain the improvement in performance. First of all, the traditional benefit of supervised learning plays an important role. One can expect supervised ranking methods to outperform unsupervised analogs, because they take ground-truth rankings into account during the training phase to guide towards retrieval of enzymes with a similar EC number. Conversely, unsupervised methods rely solely on the characterization of a meaningful similarity measure between enzymes, while ignoring EC numbers.

Second, we also advocate that supervised ranking methods have the ability to preserve the hierarchical structure of EC numbers in their predicted rankings. Figure 4 supports this claim. It summarizes the values used for ranking one fold of the test set obtained by the different models. So, for unsupervised ranking it visualizes K(v, v′), for supervised ranking the values h(v, v′) are shown. Each row of the heatmap corresponds to one query. For the supervised models one notices a much better correspondence with the ground truth. Furthermore, the different levels of catalytic similarity can be better distinguished¹. In addition, an example of the distributions of the predicted values within one query is visualized in Figure 5 by means of box plots, illustrating again that supervised models establish a better separation of the different levels of similarity with respect to the query enzyme. For this example query no quartiles are overlapping in any supervised model, unlike the

¹Notice that all the unsupervised heatmaps are symmetric, because they visualize a subset of the distance matrices. Conversely, the supervised heatmaps are approximately symmetric, since rankings are inferred row by row.
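The kernel PCA plots of Figure 3 follow a standard recipe: double-center the kernel matrix in feature space, take the leading eigenvectors, and scale them by the square roots of their eigenvalues. A minimal numpy sketch on a toy linear kernel; the actual cavity-based kernels are not reproduced here:

```python
import numpy as np

def kernel_pca(K: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Coordinates of each object on the leading principal components
    in the (implicit) feature space of the kernel K."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # double centering
    eigvals, eigvecs = np.linalg.eigh(Kc)        # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

# Toy example: linear kernel of four 2-D points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Z = kernel_pca(X @ X.T, n_components=2)
print(Z.shape)  # -> (4, 2)
```

For a linear kernel this reduces to ordinary PCA; the point of the kernel trick is that K can instead encode cavity-based similarities for which no explicit feature vectors exist.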

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 23 / 40

Page 24: Bioinformatics kernels relations

Case studies Enzyme function prediction

Supervised ranking preserves hierarchies (1)

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 24 / 40

Page 25: Bioinformatics kernels relations

Case studies Enzyme function prediction

Supervised ranking preserves hierarchies (2)

[Figure: box plots of the predicted values versus catalytic similarity (0, 1, 2, 4) for the unsupervised and supervised models on the cb, fp, mcs and lpcs data.]

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 25 / 40

Page 26: Bioinformatics kernels relations

Case studies Protein-ligand interactions

Predicting protein-ligand interactions

Problem statement

Predict the binding interaction between a given protein and a ligand (small molecule). Learning vs. molecular docking.

Training using the Karaman dataset:

317 kinase targets

38 kinase inhibitors

For each combination the dissociation constant Kd in nM is known.

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 26 / 40

Page 27: Bioinformatics kernels relations

Case studies Protein-ligand interactions

Karaman dataset

©2008 Nature Publishing Group http://www.nature.com/naturebiotechnology

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 27 / 40

Page 28: Bioinformatics kernels relations

Case studies Protein-ligand interactions

Building a model

Features

CavBase similarity for proteins

Tanimoto kernel from the fingerprints derived from ligands

Virtual docking results
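The Tanimoto kernel on binary fingerprints, listed above as the ligand feature, has a simple standard definition: the number of bits set in both fingerprints divided by the number set in either. A minimal sketch:

```python
import numpy as np

def tanimoto_kernel(x: np.ndarray, y: np.ndarray) -> float:
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    both = np.sum((x == 1) & (y == 1))     # shared on-bits
    either = np.sum((x == 1) | (y == 1))   # on-bits in either fingerprint
    return both / either if either else 1.0

a = np.array([1, 1, 0, 1, 0])
b = np.array([1, 0, 0, 1, 1])
print(tanimoto_kernel(a, b))  # -> 0.5
```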

Model types:

Classification by specifying a cutoff value, using RLS.

Conditional ranking: use one type of object as query to construct a ranking of the other type according to binding energy.
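The first model type, RLS classification with a Kd cutoff, has a closed-form solution. A minimal sketch, with a toy RBF kernel standing in for the protein/ligand kernels and the regularization parameter λ as a user choice:

```python
import numpy as np

def rbf_kernel(X: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Gaussian (RBF) kernel matrix between the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def rls_fit(K: np.ndarray, y: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Closed-form RLS: solve (K + lam*I) a = y; predict with K_test @ a."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def labels_from_kd(kd_nM: np.ndarray, cutoff_nM: float = 1000.0) -> np.ndarray:
    """Binding (+1) if the dissociation constant lies below the cutoff."""
    return np.where(kd_nM < cutoff_nM, 1.0, -1.0)

# Toy example: two binders, two non-binders.
X = np.array([[0.0], [1.0], [4.0], [5.0]])
K = rbf_kernel(X)
y = labels_from_kd(np.array([10.0, 500.0, 5000.0, 20000.0]))
a = rls_fit(K, y, lam=0.1)
print(np.sign(K @ a))  # -> [ 1.  1. -1. -1.]
```

For protein-ligand pairs the kernel would be built from the pairwise features (e.g. a product of the CavBase and Tanimoto kernels), but the solver is unchanged.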

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 28 / 40

Page 29: Bioinformatics kernels relations

Case studies Protein-ligand interactions

Protein-ligands results classification

Test sampling   Cutoff [nM]   AUC

new ligand      1000          0.621584 (0.104163)
new ligand      10000         0.653330 (0.107727)
new protein     1000          0.812184 (0.185627)
new protein     10000         0.801310 (0.157205)

Cutoff value hardly matters

Generalizing to new ligand harder than for new protein

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 29 / 40

Page 30: Bioinformatics kernels relations

Case studies Protein-ligand interactions

Protein-ligands results ranking

Testing scheme: new query for the same database

Query type   Ranking error

Ligand       0.324000 (0.129307)
Protein      0.32799 (0.088344)

Query type does not matter (much)

Using protein as query somewhat more reliable

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 30 / 40

Page 31: Bioinformatics kernels relations

Case studies Microbial ecology

Predicting microbial interactions

Problem statement

How do heterotrophic bacteria influence the growth of methanotrophic bacteria?

Dataset:

10 methanotrophs

27 heterotrophs

For each combination a time series of their collective growth (OD) was measured for 14 days.

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 31 / 40

Page 32: Bioinformatics kernels relations

Case studies Microbial ecology

Concept

[Diagram: methanotrophs take up methane and pass carbon compounds to the heterotrophs; the heterotrophs may return vitamins or antibiotics. Features: ⊗ of both organisms' features.]

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 32 / 40

Page 33: Bioinformatics kernels relations

Case studies Microbial ecology

Experimental setup

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 33 / 40

Page 34: Bioinformatics kernels relations

Case studies Microbial ecology

Optical density time series

[Figure: optical density (OD) time series over 15 days for two example co-cultures, Meth_5 with Hetero_2 and Meth_7 with Hetero_10, marking the maximal OD and the maximal increase in OD.]

Three types of labels were derived from these plots:

maximal optical density

maximal increase in optical density

time of maximal increase in optical density

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 34 / 40
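Extracting the three labels from a measured OD curve is straightforward. A minimal sketch, assuming daily sampling (the real curves span 14 days):

```python
import numpy as np

def od_labels(t: np.ndarray, od: np.ndarray):
    """Derive the three labels from an optical-density time series:
    maximal OD, maximal increase between samples, and its time."""
    max_od = od.max()
    incr = np.diff(od)
    max_incr = incr.max()
    t_max_incr = t[1:][np.argmax(incr)]   # time at which the jump is observed
    return max_od, max_incr, t_max_incr

t = np.arange(8.0)   # days (toy series)
od = np.array([0.01, 0.02, 0.05, 0.15, 0.22, 0.25, 0.26, 0.24])
print(od_labels(t, od))
```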

Page 35: Bioinformatics kernels relations

Case studies Microbial ecology

Labels for bacterial combinations

[Figure: heat map of the logarithm of the maximal optical density for all methanotroph (M 1 to M 9, NMS) by heterotroph (H 1 to H 25, NMS) combinations, with color key and histogram.]

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 35 / 40

Page 36: Bioinformatics kernels relations

Case studies Microbial ecology

Regression results

Pairwise regression of the labels using support vector regression. Testing is done by withholding each heterotroph in a leave-one-out scheme.

Label                 MSE/var   Spearman cor.

Max. OD               0.8248    0.6875
Max. incr. OD         0.7888    0.57708
Time max. incr. OD    0.9694    0.3839

This is a hard problem!

Exact experimental conditions very important!

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 36 / 40

Page 37: Bioinformatics kernels relations

Case studies Microbial ecology

Extra feature selection

Idea

Look for the most relevant genes for interaction in the heterotrophs using lasso regression (in combination with the LARS algorithm) or Regularized Random Forests.
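Lasso regression selects features by driving most coefficients to exactly zero. The study used the LARS algorithm, which traces the full regularization path; the coordinate-descent sketch below is a simpler stand-in that yields the same kind of answer, namely the set of nonzero coefficients:

```python
import numpy as np

def lasso_cd(X: np.ndarray, y: np.ndarray, alpha: float = 0.1,
             n_iter: int = 200) -> np.ndarray:
    """Lasso by cyclic coordinate descent.
    Minimizes 0.5/n * ||y - X w||^2 + alpha * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]          # residual without feature j
            rho = X[:, j] @ r / n
            # soft-thresholding update
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
    return w

# Toy example: only features 2 and 7 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.1 * rng.standard_normal(100)
w = lasso_cd(X, y, alpha=0.2)
print(np.flatnonzero(np.abs(w) > 1e-6))   # indices of selected features
```

In the gene-selection setting, X would hold gene presence/absence profiles of the heterotrophs and y one of the interaction labels.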

For example, max. OD seems to be determined by genes related to methenyltetrahydrofolate. Take with a large grain of salt!

LARS:

[Figure: LARS coefficient paths, standardized coefficients versus |beta|/max|beta|.]
********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** ***

* * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * 
* ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * 
********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * 
********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * 
********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * 
********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * 
********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** ***

[Figure: LASSO feature-selection results; only the panel title "LASSO" and numeric tick labels survived extraction]
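The LASSO panel on this slide refers to sparse feature selection: an L1 penalty drives the coefficients of irrelevant features to exactly zero, so the surviving features are the "selected" ones. As a minimal, self-contained sketch (not the thesis code; the toy data and the plain coordinate-descent solver are illustrative assumptions):

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimise (1/2n)||y - Xw||^2 + alpha * ||w||_1 by coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # precompute ||X_j||^2 per column
    for _ in range(n_iter):
        for j in range(p):
            # partial residual that excludes feature j's current contribution
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            w[j] = soft_threshold(rho / n, alpha) / (col_sq[j] / n)
    return w

# Toy data: only the first two of ten features carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(100)

w = lasso_cd(X, y, alpha=0.5)
selected = np.flatnonzero(np.abs(w) > 1e-6)
print(selected)  # the sparse solution keeps (essentially) features 0 and 1
```

The sparsity pattern of `w` is what a plot like the one on this slide summarises: how many, and which, features the LASSO retains at a given penalty level.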

ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 37 / 40

Page 38: Bioinformatics kernels relations

Conclusions

Take-home messages

Use kernels for complex structured data.

Relations can be learned by treating a pair of objects as a special kind of structured object.

Predicting a ranking is in many cases a more relevant answer to a research question.

Posing the right research question is of vital importance when building models!
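The second take-home message, treating a pair of objects as one structured object, is commonly realised with the Kronecker (tensor) product pairwise kernel, K((u, v), (u′, v′)) = k(u, u′) · k(v, v′) (cf. [1]). A minimal sketch with hypothetical toy data; the RBF base kernel and all arrays are illustrative assumptions, not the talk's actual setup:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kronecker_pairwise_kernel(K_u, K_v):
    """Pairwise kernel on couples: K((u,v),(u',v')) = k(u,u') * k(v,v')."""
    return np.kron(K_u, K_v)

# Hypothetical feature vectors for 3 "proteins" and 2 "ligands".
rng = np.random.default_rng(1)
U = rng.standard_normal((3, 4))
V = rng.standard_normal((2, 4))

K_u = rbf_kernel(U, U)
K_v = rbf_kernel(V, V)
K_pairs = kronecker_pairwise_kernel(K_u, K_v)
print(K_pairs.shape)  # (6, 6): one row/column per (protein, ligand) couple
```

Because the Kronecker product of two positive semi-definite matrices is again positive semi-definite, `K_pairs` is a valid kernel and can be fed to any kernel machine; for symmetric relations such as protein-protein interaction one would additionally symmetrise over the order of the pair.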


Page 39: Bioinformatics kernels relations

Conclusions

Further reading I

[1] A. Ben-Hur and W. S. Noble. Kernel methods for predicting protein-protein interactions. Bioinformatics, 21(Suppl 1):i38–i46, June 2005.

[2] S. Erdin, A. M. Lisewski, and O. Lichtarge. Protein function prediction: towards integration of similarity metrics. Current Opinion in Structural Biology, 21(2):180–188, Apr. 2011.

[3] L. Jacob and J.-P. Vert. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics, 24(19):2149–2156, Oct. 2008.

[4] T. Pahikkala, A. Airola, M. Stock, B. De Baets, and W. Waegeman. Efficient regularized least-squares algorithms for conditional ranking on relational data. Machine Learning, submitted, 2012.

[5] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, 2004.


Page 40: Bioinformatics kernels relations

Conclusions

Further reading II

[6] M. Stock. Learning pairwise relations in bioinformatics: three case studies. Master's thesis, Ghent University, 2012.

[7] J.-P. Vert, J. Qiu, and W. S. Noble. A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics, 8(S-10), Jan. 2007.

[8] W. Waegeman, T. Pahikkala, A. Airola, T. Salakoski, M. Stock, and B. De Baets. A kernel-based framework for learning graded relations from data. IEEE Transactions on Fuzzy Systems, 99:1, 2012.
