Bioinformatics kernels relations
-
Upload
michiel-stock -
Category
Documents
-
view
338 -
download
0
Transcript of Bioinformatics kernels relations
Kernel Methods and Relational Learning inBioinformatics
ir. Michiel StockDr. Willem Waegeman
Prof. dr. Bernard De Baets
Faculty of Bioscience EngineeringGhent University
November 2012
KERMIT
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 1 / 40
Outline
1 Introduction
2 Kernel methods
3 Learning relations
4 Case studiesEnzyme function predictionProtein-ligand interactionsMicrobial ecology
5 Conclusions
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 2 / 40
Introduction
Introductory example
Problem statement
Predict protein-protein interactions based on high-throughput data.
Based on a gold standardTypical features that can beused:
Yeast two-hybrid
Pfam profile
Phylogenetic profile
Localization
PSI-BLAST
Expression
...
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 3 / 40
Introduction
Machine learning is widely used in bioinformatics
In addition to all these applications, computa-tional techniques are used to solve other problems,such as efficient primer design for PCR, biologicalimage analysis and backtranslation of proteins (whichis, given the degeneration of the genetic code,a complex combinatorial problem).
Machine learning consists in programmingcomputers to optimize a performance criterionby using example data or past experience. Theoptimized criterion can be the accuracy provided bya predictive model—in a modelling problem—,and the value of a fitness or evaluation function—inan optimization problem.
In a modelling problem, the ‘learning’ term refers torunning a computer program to induce a model byusing training data or past experience. Machinelearning uses statistical theory when buildingcomputational models since the objective is to
make inferences from a sample. The two mainsteps in this process are to induce the model byprocessing the huge amount of data and to representthe model and making inferences efficiently. It mustbe noticed that the efficiency of the learning andinference algorithms, as well as their space andtime complexity and their transparency and inter-pretability, can be as important as their predictiveaccuracy. The process of transforming data intoknowledge is both iterative and interactive. Theiterative phase consists of several steps. In the firststep, we need to integrate and merge the differentsources of information into only one format. Byusing data warehouse techniques, the detection andresolution of outliers and inconsistencies are solved.In the second step, it is necessary to select, clean andtransform the data. To carry out this step, we need toeliminate or correct the uncorrected data, as well as
Figure 1: Classification of the topics wheremachine learningmethods are applied.
88 Larran‹ aga et al. at Biom
edische Bibliotheek on Novem
ber 2, 2010bib.oxfordjournals.org
Dow
nloaded from
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 4 / 40
Introduction
Bioinformatics deals with complex data
Bioinformatics data is typically:
in large dimension (e.g., microarrays or proteomics data)
structured (e.g., gene sequences, small molecules, interactionnetworks, phylogenetic trees...)
heterogeneous (e.g., vectors, sequences, graphs to describethe same protein)
in large quantities (e.g., more than 106 known proteinsequences)
noisy (e.g., many features are not relevant)
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 5 / 40
Kernel methods
Formal definition of a kernel
Kernels are non-linear functions defined over objects x ∈ X .
Definition
A function k : X × X → R is called a positive definite kernel if it issymmetric, that is, k(x, x′) = k(x′, x) for any two objects x, x′ ∈ X , andpositive semi-definite, that is,
N∑
i=1
N∑
j=1
cicjk(xi , xj) ≥ 0
for any N > 0, any choice of N objects x1, . . . , xN ∈ X , and any choice ofreal numbers c1, . . . , cN ∈ R.
Can be seen as generalized covariances.
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 6 / 40
Kernel methods
Interpretation of kernels
Suppose an object x has animplicit feature representationφ(x) ∈ F .A kernel function can be seenas a dot product in thisfeature space:
k(x, x′) = 〈φ(x), φ(x′)〉
Linear models in this featurespace F can be made:
y(x) = wTφ(x)
=∑
n
ank(xn, x)
�
X F
k h�(x),�(x0)i
dinsdag, 10 april 2012
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 7 / 40
Kernel methods
Many kernel methods exist
Examples of popular kernelmethods:
Support vector machine(SVM)
Regularized least squares(RLS)
Kernel principalcomponent analysis(KPCA)
Learning algorithm isindependent of the kernelrepresentation!
SVM
KPCA
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 8 / 40
Kernel methods
Kernels for (protein) sequences
Spectrum kernel (SK)
The SK considers the number of k-mers m two sequences si and sj have incommon.
SKk(si , sj) =∑
m∈Σk
N(m, si )∗N(m, sj)
with N(m, s) the number of k-mersm in sequence s.
To predict structure, function...of DNA, RNA or proteins.
A discriminative alternative forHidden Markov Models.
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 9 / 40
Kernel methods
Kernels for graphs (1)
Graph
Graphs are a set of interconnected objects, called vertices (or nodes), thatare connected through edges.
Graphs can show the structure of an object or interactions betweendifferent objects.
Graph are important in bioinformatics!ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 10 / 40
Kernel methods
Kernels for graphs (2)
Graph kernel
Constructing a similarity between graphs.
Based on performing arandom walk on both graphsand counting the number ofmatching walks.Usually very computationallydemanding!
In chemoinformatics:
In structural bioinformatics:
A B
zaterdag, 28 april 2012
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 11 / 40
Kernel methods
Kernels for graphs (3)
Diffusion kernel
Constructing a similarity between vertices within the same graph.
Also based on performing arandom walk on a graph.Captures the long-rangerelationships betweenvertices.Inspired by the heatequation. The kernelquantifies how quickly ‘heat’can spread from one node toanother.
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 12 / 40
Kernel methods
Kernels for fingerprints
Objects that can be describedby a long binary vector x canbe represented by theTanimoto kernel:
KTan(xm, xn) =
〈xm, xn〉〈xm, xm〉+ 〈xn, xn〉 − 〈xm, xn〉
.
Fingerprint representation ofan object:
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 13 / 40
Learning relations
Kernels for pairs of objects
Problem statement
Predict the binding interaction between a given protein and a ligand(small molecule). Learning Molecular docking.
The problem deals with twotypes of objects:
Proteins (graph kernel ofstructure, sequencekernel, fingerprints...)
Ligand (fingerprints,graph kernel...)
Label is for a pair of objects.
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 14 / 40
Learning relations
Kernels for pairs of objects
Pairwise kernel
Combine the kernel matrices of the individual objects to construct a kernelmatrix for pairs of objects.
Starting from individual kernels for the proteins and ligands:
Relational Learning and Ranking Algorithms for Bioinformatics Applications
KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics
Extra Logo’s
Michiel Stock, Willem Waegeman, Bernard De Baets
Introductory example: chemogenomics
Suppose one wants to model the binding interactions between a set of proteins and a database of ligands to aid the process of drug design. With no known mechanistic information one can build a statistical model based on a data set. Kernel methods allow for the generation of a joint feature representation of a pair containing a protein and a ligand.
proteins ligands
( , )( , )( , )
...( , )( , )
EC 2.7.7.12
EC 4.2.3.90
EC ?.?.?.?EC 2.7.7.34
EC 4.6.1.11
EC 2.7.1.12
1
0
0
3
0
2
02
0
zondag, 13 mei 2012
h(e) = hw,�(e)i =X
e2E
aeK�(e, e)
Relational Learning and Ranking Algorithms for Bioinformatics Applications
KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics
Michiel Stock, Willem Waegeman, Bernard De Baets
Introductory example: chemogenomicsSuppose one wants to model the binding interactions between a set of proteins and a database of ligands to aid the process of drug design. Our framework can be used to model pairwise relations between different types of objects.
Proteins Ligands
Object kernelsPairwise kernel
Data set
Conditional ranking algorithmThe ranking data can be seen as a graph, we want to predict some value using a feature representation of the edges:
h 2 H �(e)
Given a training dataset T, this function can be learned using the following algorithm:
A(T ) = argmaxh2H
L(h, T ) + �khk2H,
with an appropriate loss function and a regularization parameter. To train a model for conditional ranking a convex and differentiable approximation of the ranking loss is used:
L �
L(h, T ) =X
v2V
X
e,e2Ev
(ye � ye � h(e) + h(e))2
In the most general case the Kronecker product pairwise kernel is used for the edges, which is simply the product of some kernel between pairs of nodes:
K�(e, e) = K�(v, v0, v, v0) = K�(v, v)K�(v0, v0)
SVMRLS
...
Learning algorithm
By optimizing a ranking loss, our algorithms can also be used for conditional ranking, as shown on the right.In short, our framework is ideally suited for bioinformatics challenges:
- efficient learning process- can handle complex objects (graphs, trees, sequences...)- ability to deal with information retrieval problems
Database objects
Mor
e re
leva
nt
Query 1 Query 2
Mor
e re
leva
nt
Functional ranking of enzymesGiven structural information of an enzyme we want to infer its function. This is done by ranking annotated proteins of a database according to their predicted catalytic similarity (derived from the EC number) with the query-protein.
Using five state of the art structural similarities, we showed that learning a conditional ranking model is always an improvement compared on the baseline ranking.
KERMIT
Relational Learning and Ranking Algorithms for Bioinformatics Applications
KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics
Extra Logo’s
Michiel Stock, Willem Waegeman, Bernard De Baets
Introductory example: chemogenomics
Suppose one wants to model the binding interactions between a set of proteins and a database of ligands to aid the process of drug design. With no known mechanistic information one can build a statistical model based on a data set. Kernel methods allow for the generation of a joint feature representation of a pair containing a protein and a ligand.
proteins ligands
( , )( , )( , )
...
( , )( , )
EC 2.7.7.12
EC 4.2.3.90
EC ?.?.?.?EC 2.7.7.34
EC 4.6.1.11
EC 2.7.1.12
1
0
0
3
0
2
02
0
zondag, 13 mei 2012
h(e) = hw,�(e)i =X
e2E
aeK�(e, e)
Relational Learning and Ranking Algorithms for Bioinformatics Applications
KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics
Michiel Stock, Willem Waegeman, Bernard De Baets
Introductory example: chemogenomicsSuppose one wants to model the binding interactions between a set of proteins and a database of ligands to aid the process of drug design. Our framework can be used to model pairwise relations between different types of objects.
Proteins Ligands
Object kernelsPairwise kernel
Data set
Conditional ranking algorithmThe ranking data can be seen as a graph, we want to predict some value using a feature representation of the edges:
h 2 H �(e)
Given a training dataset T, this function can be learned using the following algorithm:
A(T ) = argmaxh2H
L(h, T ) + �khk2H,
with an appropriate loss function and a regularization parameter. To train a model for conditional ranking a convex and differentiable approximation of the ranking loss is used:
L �
L(h, T ) =X
v2V
X
e,e2Ev
(ye � ye � h(e) + h(e))2
In the most general case the Kronecker product pairwise kernel is used for the edges, which is simply the product of some kernel between pairs of nodes:
K�(e, e) = K�(v, v0, v, v0) = K�(v, v)K�(v0, v0)
SVMRLS
...
Learning algorithm
By optimizing a ranking loss, our algorithms can also be used for conditional ranking, as shown on the right.In short, our framework is ideally suited for bioinformatics challenges:
- efficient learning process- can handle complex objects (graphs, trees, sequences...)- ability to deal with information retrieval problems
Database objects
Mor
e re
leva
nt
Query 1 Query 2
Mor
e re
leva
nt
Functional ranking of enzymesGiven structural information of an enzyme we want to infer its function. This is done by ranking annotated proteins of a database according to their predicted catalytic similarity (derived from the EC number) with the query-protein.
Using five state of the art structural similarities, we showed that learning a conditional ranking model is always an improvement compared on the baseline ranking.
KERMIT
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 15 / 40
Learning relations
Conditional ranking (1)
Motivation
Suppose one is not particularly interested in the exact value of theinteraction but in the order of the proteins for a given ligand.
Relational Learning and Ranking Algorithms for Bioinformatics Applications
KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics
Extra Logo’s
Michiel Stock, Willem Waegeman, Bernard De Baets
Introductory example: chemogenomics
Suppose one wants to model the binding interactions between a set of proteins and a database of ligands to aid the process of drug design. With no known mechanistic information one can build a statistical model based on a data set. Kernel methods allow for the generation of a joint feature representation of a pair containing a protein and a ligand.
proteins ligands
( , )( , )( , )
...
( , )( , )
EC 2.7.7.12
EC 4.2.3.90
EC ?.?.?.?EC 2.7.7.34
EC 4.6.1.11
EC 2.7.1.12
1
0
0
3
0
2
02
0
zondag, 13 mei 2012
h(e) = hw,�(e)i =X
e2E
aeK�(e, e)
Relational Learning and Ranking Algorithms for Bioinformatics Applications
KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics
Michiel Stock, Willem Waegeman, Bernard De Baets
Introductory example: chemogenomicsSuppose one wants to model the binding interactions between a set of proteins and a database of ligands to aid the process of drug design. Our framework can be used to model pairwise relations between different types of objects.
Proteins Ligands
Object kernelsPairwise kernel
Data set
Conditional ranking algorithmThe ranking data can be seen as a graph, we want to predict some value using a feature representation of the edges:
h 2 H �(e)
Given a training dataset T, this function can be learned using the following algorithm:
A(T ) = argmaxh2H
L(h, T ) + �khk2H,
with an appropriate loss function and a regularization parameter. To train a model for conditional ranking a convex and differentiable approximation of the ranking loss is used:
L �
L(h, T ) =X
v2V
X
e,e2Ev
(ye � ye � h(e) + h(e))2
In the most general case the Kronecker product pairwise kernel is used for the edges, which is simply the product of some kernel between pairs of nodes:
K�(e, e) = K�(v, v0, v, v0) = K�(v, v)K�(v0, v0)
SVMRLS
...
Learning algorithm
By optimizing a ranking loss, our algorithms can also be used for conditional ranking, as shown on the right.In short, our framework is ideally suited for bioinformatics challenges:
- efficient learning process- can handle complex objects (graphs, trees, sequences...)- ability to deal with information retrieval problems
Database objects
Mor
e re
leva
nt
Query 1 Query 2
Mor
e re
leva
nt
Functional ranking of enzymesGiven structural information of an enzyme we want to infer its function. This is done by ranking annotated proteins of a database according to their predicted catalytic similarity (derived from the EC number) with the query-protein.
Using five state of the art structural similarities, we showed that learning a conditional ranking model is always an improvement compared on the baseline ranking.
KERMIT
Relational Learning and Ranking Algorithms for Bioinformatics Applications
KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics
Extra Logo’s
Michiel Stock, Willem Waegeman, Bernard De Baets
Introductory example: chemogenomics
Suppose one wants to model the binding interactions between a set of proteins and a database of ligands to aid the process of drug design. With no known mechanistic information one can build a statistical model based on a data set. Kernel methods allow for the generation of a joint feature representation of a pair containing a protein and a ligand.
proteins ligands
( , )( , )( , )
...
( , )( , )
EC 2.7.7.12
EC 4.2.3.90
EC ?.?.?.?EC 2.7.7.34
EC 4.6.1.11
EC 2.7.1.12
1
0
0
3
0
2
02
0
zondag, 13 mei 2012
h(e) = hw,�(e)i =X
e2E
aeK�(e, e)
Relational Learning and Ranking Algorithms for Bioinformatics Applications
KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics
Michiel Stock, Willem Waegeman, Bernard De Baets
Introductory example: chemogenomicsSuppose one wants to model the binding interactions between a set of proteins and a database of ligands to aid the process of drug design. Our framework can be used to model pairwise relations between different types of objects.
Proteins Ligands
Object kernelsPairwise kernel
Data set
Conditional ranking algorithmThe ranking data can be seen as a graph, we want to predict some value using a feature representation of the edges:
h 2 H �(e)
Given a training dataset T, this function can be learned using the following algorithm:
A(T ) = argmaxh2H
L(h, T ) + �khk2H,
with an appropriate loss function and a regularization parameter. To train a model for conditional ranking a convex and differentiable approximation of the ranking loss is used:
L �
L(h, T ) =X
v2V
X
e,e2Ev
(ye � ye � h(e) + h(e))2
In the most general case the Kronecker product pairwise kernel is used for the edges, which is simply the product of some kernel between pairs of nodes:
K�(e, e) = K�(v, v0, v, v0) = K�(v, v)K�(v0, v0)
SVMRLS
...
Learning algorithm
By optimizing a ranking loss, our algorithms can also be used for conditional ranking, as shown on the right.In short, our framework is ideally suited for bioinformatics challenges:
- efficient learning process- can handle complex objects (graphs, trees, sequences...)- ability to deal with information retrieval problems
Database objects
Mor
e re
leva
nt
Query 1 Query 2
Mor
e re
leva
ntFunctional ranking of enzymes
Given structural information of an enzyme we want to infer its function. This is done by ranking annotated proteins of a database according to their predicted catalytic similarity (derived from the EC number) with the query-protein.
Using five state of the art structural similarities, we showed that learning a conditional ranking model is always an improvement compared on the baseline ranking.
KERMIT
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 16 / 40
Learning relations
Conditional ranking (2)
Based on a graph description,with e a pair of objects.Train the model:
h(e) =< w,Φ(e) >=∑
e∈EaeK
Φ(e, e)
using the algorithm:
A(T ) = argminh∈H
L(h,T )+λ‖h‖2H.
Where we use a ranking loss:
L(h,T ) =∑
v∈V
∑
e,e∈Ev
(ye−ye−h(e)+h(e))2.
*Figure 1 Example of a multi-graph. If this graph, on the left, would be used for ranking the elements conditioned on C, then A scores better than E, which ranks higher than E, which on its turn ranks higher than D and D ranks higher than B. There is no information about the relation between C and F and G, respectively, our model could be used to include these two instances in the ranking if features are available. Notice that in this setting unconditional ranking of these objects is meaningless as this graph is obviously intransitive. Figure reproduced from (Pahikkala et al., 2010). The proposed framework is based on the Kronecker product kernel for generating implicit joint feature representations of queries and the sets of objects to be ranked. Exactly this kernel construction will allow a straightforward extension of the existing framework to dyadic relations and multi-task learning problems (Objectives 1 and 2). It has been proposed independently by three research groups for modeling pairwise inputs in different application domains (Basilico et al. 2004, Oyana et al. 2004, Ben-Hur et al. 2005). From a different perspective, it has been considered in structured output prediction methods for defining joint feature representations of inputs and outputs (Tsochantaridis et al., 2005, Weston et al., 2007). While the usefulness of Kronecker product kernels for pairwise learning has been clearly established, computational efficiency of the resulting algorithms remains a major challenge. Previously proposed methods require the explicit computation of the kernel matrix over the data object pairs, hereby introducing bottlenecks in terms of processing and memory usage, even for modest dataset sizes. To overcome this problem, one typically applies sampling strategies of the kernel matrix for training. An alternative approach known as the Cartesian kernel has been proposed in (Kashima et al., 2009). This kernel exhibits interesting computational properties, but it can be solely employed in selected applications, because it cannot make predictions for (couples of) objects that are not observed in the training dataset. When modeling interactions between two types of objects one gets close to the field of collaborative filtering, as shown in (Pessiot et al., 2007). Matrix factorization methods, which are used especially in collaborative filtering, may be applied to conditional ranking problems, by exploiting the known labels for pairs of objects in order to generate a latent feature representation that allows predicting these labels for pairs for which this information is missing. Such methods can be combined with our machine learning approach, as a preprocessing step in which additional latent features are generated (part of Objectives 1 and 2).
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 17 / 40
Case studies Enzyme function prediction
Predicting enzyme function
Problem statement
Predict the function (EC number) of an enzyme using structuralinformation of the active site.
Data:
1730 enzymes with 21different functions
four different structuralsimilarities
CavBasemaximum commonsubgraphlabeled point cloudsuperpositionfingerprints
active site of anenzyme:
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 18 / 40
Case studies Enzyme function prediction
EC numbers
EC number
A functional label of an enzyme, based on the reaction that is catalyzed.
Example: EC 2.7.6.1 = ribose-phosphate diphosphokinase
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 19 / 40
Case studies Enzyme function prediction
Defining catalytic similarity
Catalytic similarity
The catalytic similarity is the number of successive equal digits in the ECnumber between two enzymes, starting from the first digit.
EC 2.7.7.12
EC 4.2.3.90
EC ?.?.?.?EC 2.7.7.34
EC 4.6.1.11
EC 2.7.1.12
1
0
0
3
0
2
02
0
zondag, 13 mei 2012
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 20 / 40
Case studies Enzyme function prediction
Data exploration
Kernel PCA of the cb data
−800 −600 −400 −200 0 200 400 600−100
0 −
800
−60
0 −
400
−20
0
0 2
00 4
00
−400−200
0 200
400 600
8001000
First component
Seco
nd c
ompo
nent
Third
com
pone
nt
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●●●●●●●● ●●●●● ●● ●●●●● ●●●●●●●●● ●● ●●● ●●●●●● ●●●●●● ●●● ●●●● ●● ●●●●●●●●●●● ●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●●●● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●● ●●●●●● ●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●● ●●●●●●●●●● ●●●●●●●●●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●● ●●●●●●●●●●●●●● ● ● ●●●●●● ●●
●●●●●● ●●
●●
●●
●
●
●●●
●
●●●●●●
●
●
●●●●●●●
●●●●●●●●●
●●●
●●
●●●●●●
●
●
●●●
●●●●
●
●
●
●●●●
●
●●●●●
●●●●●
●●●●
●●●
●●●●●●●●●●
●●●●●●
●●●●●
●●●
●●●●●●●●
●
●●●●●●●●●●●●●●●●●●
●
●●●●●●●
●●●●●●●●●●●
●
●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●●●●●●●●●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●
●
●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●
●
●
●
●●●
●●●●●●●●●●●●●●●
●
●●●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●
●
●
●
●
●●●●●●●●●
●●
●●●●●●●●
●●●
●
●●●●●
●●●●●
●
●●●●●●●●
●
●●●●●●●●
●
●●●●●●●●●●
●●
●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●
●
●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●
●
●●
●●●●●
●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
Kernel PCA of the fp data
−1.0 −0.5 0.0 0.5 1.0 1.5 2.0 2.5
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
0.8
−2.0−1.5
−1.0−0.5
0.0 0.5
1.0 1.5
First component
Seco
nd c
ompo
nent
Third
com
pone
nt
●●●●●●●●●●●●●
●●●●●●●●●●●●
●●●
●●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●●
●●●
●●
●●
●●
●●
●●●
●
●
●
●
●
●
●●●●●●●
●
●●
●
●●●
●
●
●●●
●
●●●●
●●
●
●
●
●●
●
●
●●●●
●
●
●
●
●
●●●●●●
●●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●●●
●●
●
●
●●●
●
●●
●●
●
●●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●●
●
●●
●●
●
●
●
●●
●
●
●
●
●
●●●
●
●
●
●
●
●
●●●●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●●
●
●
●●
●
●●
●
●●●●●●
●
●
●
●
●●●
●
●
●
●
●●●
●●●●
●
●
●
●●●●●●●
●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●●●●
●●
●
●
● ●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●●
●
●
●●●●●
●●
●
●
●●●
●
●
●
●
●
●
●
●●●
●
●●
●
●●●
●●
●
●
●
●●
●●
● ●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●●●
●●
●
●
●
●
●
●●
●
●●
●●●●
●
●●
●
● ●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●●
●●
●●
●●
●●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●●
●
●
●●
●●●
●●
●
●
●●●
●
●
●●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●
●●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●●
●
●●
●
●
●
●
●●●
●●●
●●●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●●
●
●●●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●●●
●
●
●●●
●
● ●
●●
●
●●●
●●●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●●
● ●●●
●●
●●
● ●
●●
●●
●●●
●
●●
●
●
●●
●●●
●●
●●●
● ●
●
●●
●
●
●
●
●●
●
●
● ●
●●
●
●
●●
●●
●
●
●
●
●
●
●●●
●
● ●
●●●
●
●
●●
●
●●
● ●
●
●
●
●
●
●
●
●
●●
●
●●●
●
●●
●
●●
●
● ●●●
●
● ●
●●● ●
●
●
●
●● ●●
●●●
●●
●
●
●●
●
●
●
●●
●
●●●
●●
●●
●●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●●● ●●
●●
●●
●
●
●●
●
●●●
●●
●●
●
●
●●●●
●
●●●
●●
●
●
●
●
●●●
●●●
●
●
●
●
●
●
●
●
●●●●●●
●●
●
●
●●
●●
●●●
●
●
●
●
●
●
●
●●
●
●
●●●●●●
●●
●●
●
●●●●●
●
●●
●●
●
●
●●●
●
●
●
●●
●●●●
●
●●
●
●
●
●●
●●●
●●●
●
●●
●●●
●
●
●
●
●
● ●
●
●●●●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●●●●●
●
●●
●
●●
●●●
●
●
●
●
●
●
●●
●●
●
●●
●●
●
●●●● ●●
●●●
● ●
● ●
●●●
●
●●
●
●
●
●●
●●
●
●
● ●●
●
●
● ●●●
●
●
●
● ●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●●
●
●
●
●●
●
●●●
●
●
●●
●
●●
●●
●●
●
●
●
●
●●
●
●
●●
●●
●
●
●●
●●
●●
●
●●●
● ●
●
●●●
●
●●
●
●
●
●●
●
●●●
●
●
●
●
● ●●●
●
●
●●
●
●●
●
●
●●
●
●●
●
●●
●
●●
●
●
●
●● ●●
●
●
●
●
●●
●
●●●
●●
●
●●
●
●●
●●●
●●●
●
●●●
●
●
●●
●●●
●
●
●
●
●
●●
●●
●
●●
●
●
●●
●
●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●●
●
●
●●
●
●●
●●
●●
●●
●
●●
●●●●
●●●
●
●
●
●
●
●
●
●●
●
●●●●●●●
●
●●
●●●●
●
●
●●
●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
●
●●●
●
●●
●
●
●
●
●●
●
●
●●
●●●● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●●●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●● ●
●
●●
●●
●●●
●
●●
●●
●●
●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●●
●
●
●●
●
●
●
●●●
●
●
●
● ●●
●
●
●
●
●●
●
●
●
●●●●
●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●●
●
●●
●●
●
●
●
●
●●
●
●
●●●
●
●
●
●●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●●●●●
●●
●
●
●●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●●
●
●●
●●
●
●
●●
●●
●
●
●●●
●
●
●
●
●
●
●●●
●
●
●●
●●
●
●
●
●
●
●
●●●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●●●
● ●
●
Kernel PCA of the mcs data
−2 −1 0 1 2 3 4
−1.5
−1.0
−0.5
0.0
0.5
1.0
−1.0
−0.5
0.0
0.5
1.0
1.5
2.0
2.5
First component
Seco
nd c
ompo
nent
Third
com
pone
nt
●
●
●
●
●
●
●
●
●●●●●●●
●
●●
●●●●●●●
●●
●
●
●●
●
●
●
●●●
●●
●●
●●
●●
●
●●●●●
●
●●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●●●●
●●●
●●●
●●●●
●
●●
●●
●
●
●●●
●●
●
●●
●
●
●
●●
●
●●
● ●
●
●
●
●●
●
●●●
●
●
●
●
●
●
●●
●●● ●●
●
●●
●
●
●●
●●●
●
●
●
●●
● ●
●●
●●
●
●●●●
●
●
●●●
●●
●
●●●●●●
●●
●
●
●
●●●●
●●
●
●
●
●
●
●
●●●
●
●●
●
●●●●●
●
●●
●
●
●
●●●
●●
●
●●●●
●●●
●●●
●
●●●●
●
● ●●
●●●
●
●
●
●
●
●●●
●●
●
●●
●●
●
●●●
●●●●●
●
●
●
●●
●●
●●●●●
● ●●●
●●●●
●●
●
●●
●●●
●
●
●
●●●●●
●
●●●●
●
●●
●
●●●
●
●●
●
●
●
●●
●
●
●
●●
●●●●●●●●●●
●
●●●●●
●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●
●
●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●
●
●●●
●●
●
●
●●●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●●●●● ●●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●
●
●
● ●●
●
●
● ●
●●
● ●●
●
●
●●
●
●
●●
●
●
●
●
●●●
●
●
●●
● ●
●●
●
●●
●
●
●●
●
●
●
●●●
●
●●
●
●●
●●
●
● ●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●●
●
●● ●●
●
●
●
●
●
● ●
●
●
●●
●●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●●
●
●
●
●
●●●●
●●
●
●
●
●●
●
●
●●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●●
●
●
●●●
●
●
●●●
●
●
●
●●●
●
●
●
●●●
●●●
●
●●
●●●
●
●●●●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●●●●
●
●
●
●
●
●●
●
●
●
●
●●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●●
●●
●
●
●●
●
●
●●
●
●
●●●
●●●
●●
●
●
●●
●
●
●●
●
●●
●●●
●
●●
●
●●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●●
●
●
●●
●●●●
●
●●●
●
●●●
●●●
●
●
●
●
●●
●●
●
●
●●●●●
●
●
●
●
●
●
●●●●
●●●●
●●●
●●●●
●
●
●●●●
●●●
●●●●●●●
●●●●
●●●
●
●
●●
●
●●●●●●●●●●
●●●
●●●
●●●
●
●
●
●●●●●●●●●●
●●
●●
●
●
●●
●●
●●●●●●
●●●●●
●●
●
●
●●●
●
●●●●●●●
●
●●●●●
●●●
●
●
●
●●●●
●●●●●●
●●●●●
●
●●●●
●●
●●
●●
●●
●●●●
●●●●●●●
●
●●
●
●●●
●●
●
●●●●●●●●●●●●
●
●●●●●●●●●
●
●●●●●●
●●
●
●●
●
●
●●
●●
●●●
●●
●
●●●●●●
●●●●
●
●●
●●●●●●●●●●●●●●●●●
●●
●
●●●●●●●●●●●●●●●●
●
●
●
●●●●●
●
●●●●●●●●●
●●●
●●●●
●●●●●●
●●●●●●●
●
●●
●
●●●
●●
●●●●●
●
●●●●●●●●●●●●●
●●●●●
●
●●●●●
●●●●●●●
●●●●
●
●●●●●●●●●●●●●●
●●●●
●●●●●●●●●●●●●●●●●●●
●
●●●●●●●●●●
●
●●
●●●●
●●●●●●●
●
●●●●
●●●
●●●
●●●●●●●●●●●●
●
●●
●●●●●●●●●●●●●●●
●
●●●●●●●●●●●●●●
●
●●●●●●●●●
●
●●●●●●
●
●●●●●●●●●●●●●●●●
●
●
●
●●●●●●●●●●●
●
●●●●●
●●●●●●
●●●●●●●●●●●●●●●●
●
●●●●
●●
●
●●●●●●●
●●●●
●●●●
●●●●●●●●●
●
●
●
●●
●●●●●●
●●
●●
●●●●
●●
●
●●●
●●
●●
●●
●
●
●●●●●●●●●●●●●
●
●
●●●
●
●●●
●
●
●
●●
●
●●
●●●●●●●
●●●
●
●●●●●●●●●●●●●
●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
Kernel PCA of the lpcs data
−8 −6 −4 −2 0 2 4
−3−2
−1 0
1 2
3 4
−5−4
−3−2
−1 0
1 2
3
First component
Seco
nd c
ompo
nent
Third
com
pone
nt
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●
●●●●●●●
●●
●●●●●●●●●●
●
●●●
●
●●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●●●●
●
●●●●
●
●
●●●●
●
●●●●●●
●
●●
●
●●●
●
●●●
●
●●●●
●●
●
●●
●●●
●
●●●●●
●
●●●●●●
●
●
●
●
●
●●●
●
●
●
●
●●
●
●
●●
●●●
●●
●
●●●●●●
●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●
●●● ●● ●●●● ●● ● ●● ●●●●●●
●●●
●
● ●●●●●●● ●●●●● ●●●●●●● ●●●●●●●●●● ●●●●●●●
●
●●● ●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●● ●●●●●●●●● ●● ●●●●●●● ●●●●●●●●●●●●●●●●● ●●●●● ●●● ●●● ●● ●●●●●●●●●●●● ●●●● ●● ●●●●●●●●●●● ●●●●●●● ●●●●● ●●●● ●●●●●●
●
●●● ●●●●●●●●●●●●●●● ●●● ●●●●● ●●●●● ●●● ●●●●●●●●● ●●●●●●●●●●● ●●● ●●●●●●●●●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●●●●●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●●●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●●●
●
●●
●
●
●
●
●
●●●
●
●●
●
●
●
●●●●
●
●
●
●
●
●
●
●●
●●●●
●
●●
●
●
●
●●●●
●
●
●
●
●●
●
●●●
●
●
●
●●●
●
●
●
●●
●
●
●
●●●●
●
●
●●
●
●●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●●●●●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●●●
●
●●
●●●
●
●●●●●
●
●
●
●●●
●
●●
●
●●●●●
●
●●●●●
●●●●
●●●●
●
●
●
●●●
●
●
●●●
●
●●
●
●●
●
●●
●●
●●●
●
●
●●●●
●
●
●●
●●●●●●
●●
●●●●●●●●●●●
●
●●●●●●●●●●●●
●
●●●●●●●●
●
●●●●●●●●
050
100
150
200
250
300
Hierarchical clustering of the cb data
hclust (*, "complete")dist(D)
05
10
Hierarchical clustering of the fp data
hclust (*, "complete")dist(D)
02
46
810
1214
Hierarchical clustering of the mcs data
hclust (*, "complete")dist(D)
02
46
810
Hierarchical clustering of the lpcs data
hclust (*, "complete")dist(D)
donderdag, 8 november 2012
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 21 / 40
Case studies Enzyme function prediction
Ranking enzymes
Ranking enzymes
For a query enzyme with unknown function, construct a ranking of adatabase of annotated enzymes, based on structure. The top of theranking has likely the same function as the query.
unsupervised: for a given query enzyme with unknown function, rankthe database according to the structural similarity with the query
supervised: first a ranking model h(v , v ′) is constructed by using anindependent training set. Subsequently for a given query enzyme vwith unknown function, rank the enzymes vi from the databaseaccording to h(v , vi )
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 22 / 40
Case studies Enzyme function prediction
Results ranking enzymes7
Table IISUMMARY OF THE RESULTS OBTAINED FOR UNSUPERVISED AND SUPERVISED RANKING. FOR EACH COMBINATION OF CAVITY-BASED SIMILARITY
AND TYPE OF PERFORMANCE MEASURE THE PERFORMANCE IS AVERAGED OVER THE DIFFERENT FOLDS AND QUERIES, WITH THE STANDARDDEVIATION BETWEEN PARENTHESES. FOR EVERY ROW THE BEST RANKING MODEL IS MARKED IN BOLD, WHILE THE WORST MODEL IS INDICATED
BY AN UNDERSCORE.
cb fp mcs lpcs
Unsupervised
RA 0.9062 (0.0603) 0.8815 (0.0689) 0.8923 (0.0692) 0.8877 (0.0607)MAP 0.9321 (0.1531) 0.7207 (0.235) 0.8846 (0.1578) 0.7339 (0.2074)AUC 0.9636 (0.0795) 0.8655 (0.1387) 0.9393 (0.0919) 0.8794 (0.1126)nDCG 0.9922 (0.0329) 0.9349 (0.1424) 0.9812 (0.0498) 0.9471 (0.1112)
Supervised
RA 0.9951 (0.017) 0.995 (0.015) 0.9944 (0.0112) 0.9952 (0.0156)MAP 0.9991 (0.0092) 0.9954 (0.0432) 0.9989 (0.0076) 0.9835 (0.0797)AUC 0.9976 (0.0005) 0.9967 (0.0184) 0.9975 (0.0024) 0.9934 (0.0368)nDCG 0.9968 (0.0171) 0.9942 (0.0424) 0.987 (0.0398) 0.9812 (0.0673)
Kernel PCA of the cb data
−800 −600 −400 −200 0 200 400 600−100
0 −
800
−60
0 −
400
−20
0
0 2
00 4
00
−400−200
0 200
400 600
8001000
First component
Seco
nd c
ompo
nent
Third
com
pone
nt
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●●●●●●●● ●●●●● ●● ●●●●● ●●●●●●●●● ●● ●●● ●●●●●● ●●●●●● ●●● ●●●● ●● ●●●●●●●●●●● ●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●●●● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●● ●●●●●● ●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●● ●●●●●●●●●● ●●●●●●●●●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●● ●●●●●●●●●●●●●● ● ● ●●●●●● ●●
●●●●●● ●●
●●
●●
●
●
●●●
●
●●●●●●
●
●
●●●●●●●
●●●●●●●●●
●●●
●●
●●●●●●
●
●
●●●
●●●●
●
●
●
●●●●
●
●●●●●
●●●●●
●●●●
●●●
●●●●●●●●●●
●●●●●●
●●●●●
●●●
●●●●●●●●
●
●●●●●●●●●●●●●●●●●●
●
●●●●●●●
●●●●●●●●●●●
●
●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●●●●●●●●●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●
●
●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●
●
●
●
●●●
●●●●●●●●●●●●●●●
●
●●●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●
●
●
●
●
●●●●●●●●●
●●
●●●●●●●●
●●●
●
●●●●●
●●●●●
●
●●●●●●●●
●
●●●●●●●●
●
●●●●●●●●●●
●●
●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●
●
●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●
●
●●
●●●●●
●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
Kernel PCA of the fp data
−1.0 −0.5 0.0 0.5 1.0 1.5 2.0 2.5
−0.8
−0.6
−0.4
−0.2
0.0
0.2
0.4
0.6
0.8
−2.0−1.5
−1.0−0.5
0.0 0.5
1.0 1.5
First component
Seco
nd c
ompo
nent
Third
com
pone
nt
●●●●●●●●●●●●●
●●●●●●●●●●●●
●●●
●●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●●
●●●
●●
●●
●●
●●
●●●
●
●
●
●
●
●
●●●●●●●
●
●●
●
●●●
●
●
●●●
●
●●●●
●●
●
●
●
●●
●
●
●●●●
●
●
●
●
●
●●●●●●
●●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●●●
●●
●
●
●●●
●
●●
●●
●
●●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●●
●
●●
●●
●
●
●
●●
●
●
●
●
●
●●●
●
●
●
●
●
●
●●●●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●●
●
●
●●
●
●●
●
●●●●●●
●
●
●
●
●●●
●
●
●
●
●●●
●●●●
●
●
●
●●●●●●●
●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●●●●
●●
●
●
● ●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●●
●
●
●●●●●
●●
●
●
●●●
●
●
●
●
●
●
●
●●●
●
●●
●
●●●
●●
●
●
●
●●
●●
● ●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●●●
●●
●
●
●
●
●
●●
●
●●
●●●●
●
●●
●
● ●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●●
●●
●●
●●
●●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●●
●
●
●●
●●●
●●
●
●
●●●
●
●
●●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●
●●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●●
●
●●
●
●
●
●
●●●
●●●
●●●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●●
●
●●●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●●●
●
●
●●●
●
● ●
●●
●
●●●
●●●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●●
● ●●●
●●
●●
● ●
●●
●●
●●●
●
●●
●
●
●●
●●●
●●
●●●
● ●
●
●●
●
●
●
●
●●
●
●
● ●
●●
●
●
●●
●●
●
●
●
●
●
●
●●●
●
● ●
●●●
●
●
●●
●
●●
● ●
●
●
●
●
●
●
●
●
●●
●
●●●
●
●●
●
●●
●
● ●●●
●
● ●
●●● ●
●
●
●
●● ●●
●●●
●●
●
●
●●
●
●
●
●●
●
●●●
●●
●●
●●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●●● ●●
●●
●●
●
●
●●
●
●●●
●●
●●
●
●
●●●●
●
●●●
●●
●
●
●
●
●●●
●●●
●
●
●
●
●
●
●
●
●●●●●●
●●
●
●
●●
●●
●●●
●
●
●
●
●
●
●
●●
●
●
●●●●●●
●●
●●
●
●●●●●
●
●●
●●
●
●
●●●
●
●
●
●●
●●●●
●
●●
●
●
●
●●
●●●
●●●
●
●●
●●●
●
●
●
●
●
● ●
●
●●●●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●●●●●
●
●●
●
●●
●●●
●
●
●
●
●
●
●●
●●
●
●●
●●
●
●●●● ●●
●●●
● ●
● ●
●●●
●
●●
●
●
●
●●
●●
●
●
● ●●
●
●
● ●●●
●
●
●
● ●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●●
●
●
●
●●
●
●●●
●
●
●●
●
●●
●●
●●
●
●
●
●
●●
●
●
●●
●●
●
●
●●
●●
●●
●
●●●
● ●
●
●●●
●
●●
●
●
●
●●
●
●●●
●
●
●
●
● ●●●
●
●
●●
●
●●
●
●
●●
●
●●
●
●●
●
●●
●
●
●
●● ●●
●
●
●
●
●●
●
●●●
●●
●
●●
●
●●
●●●
●●●
●
●●●
●
●
●●
●●●
●
●
●
●
●
●●
●●
●
●●
●
●
●●
●
●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●●
●●●
●
●
●●
●
●●
●●
●●
●●
●
●●
●●●●
●●●
●
●
●
●
●
●
●
●●
●
●●●●●●●
●
●●
●●●●
●
●
●●
●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
● ●
●
●●●
●
●●
●
●
●
●
●●
●
●
●●
●●●● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●●●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●● ●
●
●●
●●
●●●
●
●●
●●
●●
●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●●
●
●
●●
●
●
●
●●●
●
●
●
● ●●
●
●
●
●
●●
●
●
●
●●●●
●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●●
●
●●
●●
●
●
●
●
●●
●
●
●●●
●
●
●
●●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●●●●●●
●●
●
●
●●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●●
●
●●
●●
●
●
●●
●●
●
●
●●●
●
●
●
●
●
●
●●●
●
●
●●
●●
●
●
●
●
●
●
●●●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●●●
● ●
●
Kernel PCA of the mcs data
−2 −1 0 1 2 3 4
−1.5
−1.0
−0.5
0.0
0.5
1.0
−1.0
−0.5
0.0
0.5
1.0
1.5
2.0
2.5
First component
Seco
nd c
ompo
nent
Third
com
pone
nt
●
●
●
●
●
●
●
●
●●●●●●●
●
●●
●●●●●●●
●●
●
●
●●
●
●
●
●●●
●●
●●
●●
●●
●
●●●●●
●
●●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●●●●
●●●
●●●
●●●●
●
●●
●●
●
●
●●●
●●
●
●●
●
●
●
●●
●
●●
● ●
●
●
●
●●
●
●●●
●
●
●
●
●
●
●●
●●● ●●
●
●●
●
●
●●
●●●
●
●
●
●●
● ●
●●
●●
●
●●●●
●
●
●●●
●●
●
●●●●●●
●●
●
●
●
●●●●
●●
●
●
●
●
●
●
●●●
●
●●
●
●●●●●
●
●●
●
●
●
●●●
●●
●
●●●●
●●●
●●●
●
●●●●
●
● ●●
●●●
●
●
●
●
●
●●●
●●
●
●●
●●
●
●●●
●●●●●
●
●
●
●●
●●
●●●●●
● ●●●
●●●●
●●
●
●●
●●●
●
●
●
●●●●●
●
●●●●
●
●●
●
●●●
●
●●
●
●
●
●●
●
●
●
●●
●●●●●●●●●●
●
●●●●●
●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●
●
●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●
●
●●●
●●
●
●
●●●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●●●●● ●●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●
●
●
● ●●
●
●
● ●
●●
● ●●
●
●
●●
●
●
●●
●
●
●
●
●●●
●
●
●●
● ●
●●
●
●●
●
●
●●
●
●
●
●●●
●
●●
●
●●
●●
●
● ●●
●
●
●
●
●●
●
●
●
●
●
●
●
●●●
●
●● ●●
●
●
●
●
●
● ●
●
●
●●
●●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●●
●
●
●
●
●●●●
●●
●
●
●
●●
●
●
●●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●●
●
●
●●●
●
●
●●●
●
●
●
●●●
●
●
●
●●●
●●●
●
●●
●●●
●
●●●●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●●●●
●
●
●
●
●
●●
●
●
●
●
●●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●●
●●
●
●
●●
●
●
●●
●
●
●●●
●●●
●●
●
●
●●
●
●
●●
●
●●
●●●
●
●●
●
●●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●●
●
●
●●
●●●●
●
●●●
●
●●●
●●●
●
●
●
●
●●
●●
●
●
●●●●●
●
●
●
●
●
●
●●●●
●●●●
●●●
●●●●
●
●
●●●●
●●●
●●●●●●●
●●●●
●●●
●
●
●●
●
●●●●●●●●●●
●●●
●●●
●●●
●
●
●
●●●●●●●●●●
●●
●●
●
●
●●
●●
●●●●●●
●●●●●
●●
●
●
●●●
●
●●●●●●●
●
●●●●●
●●●
●
●
●
●●●●
●●●●●●
●●●●●
●
●●●●
●●
●●
●●
●●
●●●●
●●●●●●●
●
●●
●
●●●
●●
●
●●●●●●●●●●●●
●
●●●●●●●●●
●
●●●●●●
●●
●
●●
●
●
●●
●●
●●●
●●
●
●●●●●●
●●●●
●
●●
●●●●●●●●●●●●●●●●●
●●
●
●●●●●●●●●●●●●●●●
●
●
●
●●●●●
●
●●●●●●●●●
●●●
●●●●
●●●●●●
●●●●●●●
●
●●
●
●●●
●●
●●●●●
●
●●●●●●●●●●●●●
●●●●●
●
●●●●●
●●●●●●●
●●●●
●
●●●●●●●●●●●●●●
●●●●
●●●●●●●●●●●●●●●●●●●
●
●●●●●●●●●●
●
●●
●●●●
●●●●●●●
●
●●●●
●●●
●●●
●●●●●●●●●●●●
●
●●
●●●●●●●●●●●●●●●
●
●●●●●●●●●●●●●●
●
●●●●●●●●●
●
●●●●●●
●
●●●●●●●●●●●●●●●●
●
●
●
●●●●●●●●●●●
●
●●●●●
●●●●●●
●●●●●●●●●●●●●●●●
●
●●●●
●●
●
●●●●●●●
●●●●
●●●●
●●●●●●●●●
●
●
●
●●
●●●●●●
●●
●●
●●●●
●●
●
●●●
●●
●●
●●
●
●
●●●●●●●●●●●●●
●
●
●●●
●
●●●
●
●
●
●●
●
●●
●●●●●●●
●●●
●
●●●●●●●●●●●●●
●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
Kernel PCA of the lpcs data
−8 −6 −4 −2 0 2 4
−3−2
−1 0
1 2
3 4
−5−4
−3−2
−1 0
1 2
3
First component
Seco
nd c
ompo
nent
Third
com
pone
nt
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●
●●●●●●●
●●
●●●●●●●●●●
●
●●●
●
●●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●●●●
●
●●●●
●
●
●●●●
●
●●●●●●
●
●●
●
●●●
●
●●●
●
●●●●
●●
●
●●
●●●
●
●●●●●
●
●●●●●●
●
●
●
●
●
●●●
●
●
●
●
●●
●
●
●●
●●●
●●
●
●●●●●●
●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●
●●● ●● ●●●● ●● ● ●● ●●●●●●
●●●
●
● ●●●●●●● ●●●●● ●●●●●●● ●●●●●●●●●● ●●●●●●●
●
●●● ●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●● ●●●● ●●●●●●●●● ●● ●●●●●●● ●●●●●●●●●●●●●●●●● ●●●●● ●●● ●●● ●● ●●●●●●●●●●●● ●●●● ●● ●●●●●●●●●●● ●●●●●●● ●●●●● ●●●● ●●●●●●
●
●●● ●●●●●●●●●●●●●●● ●●● ●●●●● ●●●●● ●●● ●●●●●●●●● ●●●●●●●●●●● ●●● ●●●●●●●●●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●●●●●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●●●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●●●
●
●●
●
●
●
●
●
●●●
●
●●
●
●
●
●●●●
●
●
●
●
●
●
●
●●
●●●●
●
●●
●
●
●
●●●●
●
●
●
●
●●
●
●●●
●
●
●
●●●
●
●
●
●●
●
●
●
●●●●
●
●
●●
●
●●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●●●●●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●●●
●
●●
●●●
●
●●●●●
●
●
●
●●●
●
●●
●
●●●●●
●
●●●●●
●●●●
●●●●
●
●
●
●●●
●
●
●●●
●
●●
●
●●
●
●●
●●
●●●
●
●
●●●●
●
●
●●
●●●●●●
●●
●●●●●●●●●●●
●
●●●●●●●●●●●●
●
●●●●●●●●
●
●●●●●●●●
050
100
150
200
250
300
Hierarchical clustering of the cb data
hclust (*, "complete")dist(D)
05
10
Hierarchical clustering of the fp data
hclust (*, "complete")dist(D)
02
46
810
1214
Hierarchical clustering of the mcs data
hclust (*, "complete")dist(D)
02
46
810
Hierarchical clustering of the lpcs data
hclust (*, "complete")dist(D)
donderdag, 8 november 2012
Figure 3. (top) Kernel principal component analysis of the different cavity-based similarities. Enzymes are shown as points in the three -dimensionalspace spanned by the first three principal components and are colored according to the first digit of the EC number. (bottom) Hierarchical clustering ofthe enzymes in the feature space for the four different cavity-based similarities. The enzymes at the leafs of the tree are colored according to the firstdigit of their EC number. Color key: EC 1.x.x.x is red, EC 2.x.x.x is green, EC 3.x.x.x is blue, EC 4.x.x.x is cyan, EC 5.x.x.x is magenta and EC 6.x.x.xis grey.
all four cavity-based similarities. The statistical significanceof the differences was confirmed with a paired Wilcoxontest and a conservative Bonferroni correction for multiplehypotheses testing (p < 10�6). Moreover, for all kernelsand performance measures, supervised ranking decreases thestandard deviation of the error, implying that the modelsbecome more stable.
Three important reasons can be put forward for explainingthe improvement in performance. First of all, the traditionalbenefit of supervised learning plays an important role. Onecan expect that supervised ranking methods outperform un-supervised analogs, because they take ground truth rankingsinto account during the training phase to guide towardsretrieval of enzymes with a similar EC number. Conversely,unsupervised methods solely rely on the characterization ofa meaningful similarity measure between enzymes, whileignoring EC numbers.
Second, we also advocate that supervised ranking meth-ods have the ability to preserve the hierarchical structure
of EC numbers in their predicted rankings. Figure 4 sup-ports this claim. It summarizes the values used for rankingone fold of the test set obtained by the different models.So, for unsupervised ranking it visualizes K(v, v0), forsupervised ranking the values h(v, v0) are shown. Eachrow of the heatmap corresponds to one query. For thesupervised models one notices a much better correspondencewith the ground truth. Furthermore, the different levels ofcatalytic similarity can be better distinguished1. In addition,an example of the distributions of the predicted valueswithin one query is visualized in Figure 5 by means of boxplots, illustrating again that supervised models establish abetter separation of the different levels of similarity withrespect to the query enzyme. For this example query noquartiles are overlapping in any supervised model, unlike the
1Notice that all the unsupervised heatmaps are symmetric, because theyvisualize a subset of the distance matrices. Conversely, the supervisedheatmaps are approximately symmetric, since rankings are inferred rowby row.
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 23 / 40
Case studies Enzyme function prediction
Supervised ranking preserves hierarchies (1)
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 24 / 40
Case studies Enzyme function prediction
Supervised ranking preserves hierarchies (2)
●●●●● ●
0 1 2 4
020
4060
8010
012
014
0
cat. similarity
pred
ictio
n
Unsupervisedcb
●
●
●
●●
●
●
●●
●
0 1 2 40.
200.
250.
300.
350.
400.
450.
50
cat. similarity
pred
ictio
n
Unsupervisedfp
●
●●●
●
●●●
●
●
●
0 1 2 4
0.04
0.06
0.08
0.10
0.12
0.14
0.16
0.18
cat. similarity
pred
ictio
n
Unsupervisedmcs
●●●
●●●●●●●●●●● ●●●●
●
●
0 1 2 4
0.0
0.2
0.4
0.6
0.8
1.0
cat. similarity
pred
ictio
n
Unsupervisedlpcs
●
●
●●
0 1 2 4
01
23
4
cat. similarity
pred
ictio
n
Supervisedcb
●●●
●●
●
●
0 1 2 4
−10
12
34
cat. similarity
pred
ictio
n
Supervisedfp
●
●
●
0 1 2 4
01
23
4
cat. similarity
pred
ictio
n
Supervisedmcs
●
●●●●●●
●
●●●
●
●
0 1 2 40
12
34
cat. similarity
pred
ictio
n
Supervisedlpcs
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 25 / 40
Case studies Protein-ligand interactions
Predicting protein-ligand interactions
Problem statement
Predict the binding interaction between a given protein and a ligand(small molecule). Learning Molecular docking.
Training using the Karamandataset:
317 kinase targets
38 kinase inhibitors
For each combination thedissociation coefficient Kd innM is known.
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 26 / 40
Case studies Protein-ligand interactions
Karaman dataset
©20
08 N
atur
e Pu
blis
hing
Gro
up h
ttp://
ww
w.n
atur
e.co
m/n
atur
ebio
tech
nolo
gy
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 27 / 40
Case studies Protein-ligand interactions
Building a model
Features
CavBase similarity for proteins
Tanimoto kernel from the fingerprints derived from ligands
Virtual docking results
Model types:
Classification by specifying a cutoff value, using RLS.
Conditional ranking, use one type of object to construct a ranking ofthe other type according to binding energy.
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 28 / 40
Case studies Protein-ligand interactions
Protein-ligands results classification
Test sampling Cutoff [nM] AUC
new ligand1000 0.621584 (0.104163)
10000 0.653330 (0.107727)
new protein1000 0.812184 (0.185627)
10000 0.801310 (0.157205)
Cutoff value hardly matters
Generalizing to new ligand harder than for new protein
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 29 / 40
Case studies Protein-ligand interactions
Protein-ligands results ranking
Testing scheme: new query for the same database
Query type Ranking error
Ligand 0.324000 (0.129307)Protein 0.32799 (0.088344)
Query type does not matter (much)
Using protein as query somewhat more reliable
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 30 / 40
Case studies Microbial ecology
Predicting microbial interactions
Problem statement
How do heterotrophic bacteria influence the growth of methanotrophicbacteria?
Dataset:
10 methanotrophs
27 heterotrophs
Of each combination a timeseries of their collectivegrowth (OD) was measuredfor 14 days.
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 31 / 40
Case studies Microbial ecology
Concept
Methanotrophs Heterotrophs
Methane
Carboncompounds
vitamins?antibiotics?
Features: ⌦
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 32 / 40
Case studies Microbial ecology
Experimental setup
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 33 / 40
Case studies Microbial ecology
Optical density time series
0 5 10 15
0.00
0.05
0.10
0.15
0.20
0.25
0.30
Meth_5 and Hetero_2
Time (days)
OD
●
●
●
●●
●
●
●
max OD
max increasment OD
0 5 10 15
0.00
0.05
0.10
0.15
0.20
Meth_7 and Hetero_10
Time (days)O
D
● ● ● ●
●
●
●
●
max OD
max increasment OD
Three types of labels were derived from these plots:
maximal optical density
maximal increase in optical density
time of maximal increase in optical densityir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 34 / 40
Case studies Microbial ecology
Labels for bacterial combinations
NM
S
M 9
M 6
M 7
M 3
M 1
M 5
M 2
M 4
M 8
Methanotrophs
H 6H 12H 10H 18NMSH 7H 9H 11H 4H 5H 25H 1H 3H 8H 16H 14H 24H 2H 22H 21H 17H 20H 19H 23H 13H 15
Het
erot
roph
s
Heat map of the log. of max. density
−6 −4 −2 0Value
02
46
Color Keyand Histogram
Cou
nt
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 35 / 40
Case studies Microbial ecology
Regression results
Pairwise regression of the labels using support vector regression. Testing isdone by withholding each heterotroph in a leave-one-out scheme.
Label MSE/var Spearman cor.
Max. OD 0.8248 0.6875Max. incr. OD 0.7888 0.57708Time max. incr. OD 0.9694 0.3839
This is a hard problem!
Exact experimental conditions very important!
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 36 / 40
Case studies Microbial ecology
Extra feature selection
Idea
Look for the most relevant genes for interaction in the heterotrophs usinglasso regression (in combination with the LARS algorithm) orRegularized Random Forests.
For example, max. OD seemsto be determined by genesrelated tomethenyltetrahydrofolate.Take with a large grain ofsalt!
LARS:
* * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** ***
0.0 0.2 0.4 0.6 0.8 1.0
−6
−4
−2
02
|beta|/max|beta|
Sta
ndar
dize
d C
oeffi
cien
ts * * * **********
**
* ** *** ** ************** ** *** *****
*************************************** ***
* * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ****************
*** ***** ***************************************
***
* * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** ***
*
*
* ********* ***
* ** ** * **************** **
*** ***** *************************************** ***
* * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** ***
* * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * **
* ** ** * ** ************** ** *** ***** *************************************** ***
* * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** **************** *** ***** *************************************** **** * * ********* *
**
* ** *** ** ************** ** *** ***** ************
*************************** ***
* * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * ********** * ** * ** ** * ** ************** ** *** ***** *************************************** ***
* * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** ***
* * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** **** * * ********* * ** * ** ** * ** ************** ** *** ***** *************************************** ***
LASSO
238
1234
445
220
8740
0 1 2 5 13 15 19 21 34 39 47 65
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 37 / 40
Conclusions
Take-home messages
Use kernels for complex structured data.
Relations can be learned by treating a pair of objects as aspecial kind of structured object.
Predicting a ranking is in many cases a more relevant answer toa research question.
Posing the right research question is of vital importance whenbuilding models!
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 38 / 40
Conclusions
Further reading I
[1] A. Ben-Hur and W. S. Noble. Kernel methods for predictingprotein-protein interactions. Bioinformatics, 21 Suppl 1:i38–46, June2005.
[2] S. Erdin, A. M. Lisewski, and O. Lichtarge. Protein functionprediction: towards integration of similarity metrics. Current Opinionin Structural Biology, 21(2):180–8, Apr. 2011.
[3] L. Jacob and J.-P. Vert. Protein-ligand interaction prediction: animproved chemogenomics approach. Bioinformatics, 24(19):2149–56,Oct. 2008.
[4] T. Pahikkala, A. Airola, M. Stock, B. De Baets, and W. Waegeman.Efficient regularized least-squares algorithms for conditional ranking onrelational data. Machine Learning, Submitted, 2012.
[5] B. Scholkopf, K. Tsuda, and J.-P. Vert. Kernel Methods inComputational Biology. 2004.
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 39 / 40
Conclusions
Further reading II
[6] M. Stock. Learning pairwise relations in bioinformatics: three casestudies. Master’s thesis, Ghent University, 2012.
[7] J.-P. Vert, J. Qiu, and W. S. Noble. A new pairwise kernel forbiological network inference with support vector machines. BMCBioinformatics, 8(S-10), Jan. 2007.
[8] W. Waegeman, T. Pahikkala, A. Airola, T. Salakoski, M. Stock, andB. De Baets. A kernel-based framework for learning graded relationsfrom data. IEEE Transactions on Fuzzy Systems, 99:1, 2012.
ir. Michiel Stock (KERMIT) Kernels for Bioinformatics November 2012 40 / 40