EMBL-EBI Tel. +44 (0) 1223 494 444
Wellcome Trust Genome Campus [email protected]
Hinxton, Cambridgeshire, CB10 1SD, UK www.ebi.ac.uk
ECPred: Enzyme Prediction
Using Combination of Classifiers
Ahmet Sureyya Rifaioglu (1), Tunca Dogan (2), Omer Sinan Sarac (3), Mehmet Volkan Atalay (1),
Maria Jesus Martin (2) and Rengul Cetin-Atalay (4)
ABSTRACT
Motivation: Efficient and accurate protein function
prediction methods are required to annotate proteins
with unknown functions. Recent studies show that
combining different methods enhances prediction
accuracy. In addition, data preparation and post-processing
of predictions are other important factors in the functional
annotation of proteins.
Results: Here we propose “ECPred”, a novel hierarchical
approach to predict Enzyme Commission (EC) numbers
using a combination of three classifiers: Blast-kNN, SPMap
and Pepstats-SVM. ECPred combines these methods and
gives a weighted mean score for each trained EC number.
ECPred uses hierarchical data preparation and
evaluation steps to increase the accuracy of the predictions.
ECPred is trained for 851 EC classes. Cross-validation
results show that ECPred can predict enzyme
functions with high performance (average F-score of 0.96).
METHODOLOGY
Blast-kNN: The k-nearest neighbor algorithm is combined with BLAST. A similarity search is run against the training
set of each EC number, and the k BLAST scores from the positive and negative training datasets are incorporated into the score as the sums Sp and Sn.
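The way Sp and Sn enter the Blast-kNN score is not preserved in this transcript. The sketch below assumes a normalized difference, (Sp - Sn) / (Sp + Sn), which stays between -1 and 1 like the SVM scores; that formula, the function name `blast_knn_score`, the default `k`, and the choice of taking the k highest scores from each set separately are all illustrative assumptions, not the authors' implementation.

```python
def blast_knn_score(pos_scores, neg_scores, k=5):
    """Combine BLAST scores of a query against one EC class (assumed formula).

    pos_scores / neg_scores: BLAST scores of the query against proteins in the
    positive / negative training dataset of the class.
    """
    # Sp and Sn: sums of the k highest scores in each dataset (assumption).
    sp = sum(sorted(pos_scores, reverse=True)[:k])
    sn = sum(sorted(neg_scores, reverse=True)[:k])
    if sp + sn == 0:
        return 0.0  # no hits at all: treat as an undecided score
    # Assumed normalized difference, bounded in [-1, 1] for non-negative scores.
    return (sp - sn) / (sp + sn)


if __name__ == "__main__":
    # Hypothetical BLAST scores, for illustration only.
    print(blast_knn_score([100, 80, 10], [20, 5], k=2))
```

A score near 1 means the nearest hits are mostly positive examples of the class; a score near -1 means they are mostly negative examples.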
SPMap: SPMap is a subsequence-based feature extraction method consisting of three main modules:
(i) Subsequence Extraction Module, (ii) Clustering Module, (iii) Probabilistic Profile Construction Module.
Pepstats-SVM: Pepstats is a feature-based method that calculates statistics for proteins, including
molecular weight, number of residues and charge. Each protein is represented as a 37-dimensional vector in Pepstats.
The vectors obtained from Pepstats and SPMap are fed to SVM classifiers independently to obtain
classification scores between -1 and 1. A weighted mean score is then calculated for each query protein and
each functional class, as the confidence of the prediction, by combining the two SVM scores with the Blast-kNN score.
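The combination step can be sketched as a plain weighted mean over the three method scores. The poster does not state how the weights are chosen, so the weights, the method keys and the function name `weighted_mean_score` below are hypothetical placeholders.

```python
def weighted_mean_score(scores, weights):
    """Weighted mean of per-method scores for one protein and one EC class.

    scores:  dict mapping method name -> score
    weights: dict mapping method name -> non-negative weight (hypothetical values)
    """
    total_weight = sum(weights[m] for m in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[m] * weights[m] for m in scores) / total_weight


if __name__ == "__main__":
    # Illustrative numbers only; not taken from the poster.
    scores = {"blast_knn": 0.8, "spmap_svm": 0.6, "pepstats_svm": 0.2}
    weights = {"blast_knn": 0.5, "spmap_svm": 0.3, "pepstats_svm": 0.2}
    print(weighted_mean_score(scores, weights))
```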
INTRODUCTION
• The volume of protein sequence data is increasing
exponentially and manual curation efforts are
insufficient to annotate proteins with unknown
functions.
• Therefore, effective automatic annotation methods
are required to overcome this problem.
• Enzymes are a special type of protein that catalyzes
biochemical reactions.
• The Enzyme Commission number (EC number) is a
numerical classification scheme for enzymes, based
on the chemical reactions they catalyze.
• Functional annotations of enzymes are crucial in
several fields of bioinformatics, such as identification
of diseases and drug target prediction.
• Here we present “ECPred”, an enzyme function
prediction tool based on EC numbers.
• ECPred incorporates a novel data preparation and
hierarchical evaluation method.
Hierarchical Evaluation of Predictions: A prediction passes the evaluation only if the prediction scores of the
predicted EC number and of all its parents are greater than their class-specific optimal thresholds.
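The rule can be sketched as follows, using the example scores and thresholds that appear in the poster's table. The helper names `parents` and `passes_hierarchical_evaluation` are hypothetical, and the strict comparison follows the wording "greater than"; how ties are treated is an assumption.

```python
def parents(ec):
    """Return the ancestors of an EC number, nearest first.

    Example: '1.1.1.1' -> ['1.1.1.-', '1.1.-.-', '1.-.-.-']
    """
    levels = [part for part in ec.split(".") if part != "-"]
    result = []
    for depth in range(len(levels) - 1, 0, -1):
        result.append(".".join(levels[:depth] + ["-"] * (4 - depth)))
    return result


def passes_hierarchical_evaluation(ec, scores, thresholds):
    """True if the EC number and all of its parents score above their thresholds."""
    return all(scores[node] > thresholds[node] for node in [ec] + parents(ec))


# Example values from the poster's table.
SCORES = {"1.-.-.-": 0.75, "1.1.-.-": 0.90, "1.1.1.-": 0.75, "1.1.99.-": 0.35,
          "1.97.-.-": 0.40, "1.1.1.1": 0.97, "1.1.1.2": 0.60, "1.1.1.97": 0.20}
THRESHOLDS = {"1.-.-.-": 0.7, "1.1.-.-": 0.8, "1.1.1.-": 0.7, "1.1.99.-": 0.8,
              "1.97.-.-": 0.7, "1.1.1.1": 0.95, "1.1.1.2": 0.8, "1.1.1.97": 0.7}

if __name__ == "__main__":
    for ec in ("1.1.1.1", "1.1.1.2", "1.1.99.-"):
        print(ec, passes_hierarchical_evaluation(ec, SCORES, THRESHOLDS))
```

With these values only 1.1.1.1 passes among the leaf classes: its own score and those of 1.1.1.-, 1.1.-.- and 1.-.-.- all exceed their thresholds, whereas 1.1.1.2 fails on its own threshold even though its parents pass.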
EVALUATION & RESULTS
• EC numbers having 50 or more annotations in UniProtKB/Swiss-Prot are selected for training.
• 5-fold cross-validation is performed and an optimal decision threshold is found for each EC number.
Subsequently, the hierarchical evaluation method is applied for prediction.
• 851 EC numbers are trained and the average F-score is 0.96.
CONCLUSION
• In this study, the ECPred method is proposed for enzyme function prediction, combining three classification
methods from different approaches: similarity-based, subsequence-based and feature-based.
• A novel data preparation method based on the EC hierarchy is proposed for the positive and negative training datasets.
• Individual thresholds are determined for each trained EC number.
• A hierarchical post-processing method is proposed to determine the reliable predictions to be presented.
• Cross-validation on UniProtKB/Swiss-Prot enzymes revealed very high classification performance.
• A web server for ECPred will be ready soon, where users can query sequences to obtain enzyme function
predictions.
Blast-kNN score terms:
Sp : sum of k-nearest positive BLAST scores
Sn : sum of k-nearest negative BLAST scores
EC number    Prediction score    Optimum threshold
1.-.-.-      0.75                0.7
1.1.-.-      0.90                0.8
1.1.1.-      0.75                0.7
1.1.99.-     0.35                0.8
1.97.-.-     0.40                0.7
1.1.1.1      0.97                0.95
1.1.1.2      0.60                0.8
1.1.1.97     0.20                0.7
DATA PREPARATION
• Each EC number is trained with its own training
dataset based on the level of corresponding EC
number on the hierarchy.
• Positive training data for EC number 1.1.1.-:
proteins that are associated with 1.1.1.- and proteins
associated with the descendants of 1.1.1.-
• Negative training data for EC number 1.1.1.-:
proteins that are associated with siblings of 1.1.1.-
and proteins associated with descendants of siblings
of 1.1.1.-
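The selection rule in the bullets above can be sketched in a few lines. The input format (a dictionary from protein identifier to a single EC number), the identifiers P1 to P6 and the function names are hypothetical; the poster does not describe how annotations are stored.

```python
def ec_prefix(ec):
    """Specified levels of an EC number, e.g. '1.1.-.-' -> ('1', '1')."""
    return tuple(part for part in ec.split(".") if part != "-")


def split_training(annotations, target):
    """Build positive and negative protein sets for one target EC number.

    Positives: proteins annotated with the target or one of its descendants.
    Negatives: proteins annotated with a sibling of the target or one of the
    sibling's descendants.
    """
    t = ec_prefix(target)
    depth = len(t)
    positives, negatives = set(), set()
    for protein, ec in annotations.items():
        p = ec_prefix(ec)
        if len(p) < depth:
            continue  # annotated above the target level: neither set
        if p[:depth] == t:
            positives.add(protein)
        elif p[:depth - 1] == t[:depth - 1]:
            negatives.add(protein)  # same parent, different branch
    return positives, negatives


if __name__ == "__main__":
    # Hypothetical annotations, for illustration only.
    annotations = {"P1": "1.1.1.1", "P2": "1.1.1.-", "P3": "1.1.2.3",
                   "P4": "1.1.99.1", "P5": "1.97.1.1", "P6": "1.1.-.-"}
    print(split_training(annotations, "1.1.1.-"))
```

For the target 1.1.1.-, P1 and P2 are positives, P3 and P4 are negatives, and P5 and P6 are left out because they belong to neither the target's branch nor a sibling's branch.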
1.-.-.-
    1.1.-.-
        1.1.1.-
            1.1.1.1 … 1.1.1.97
        1.1.2.-
            1.1.2.3  1.1.2.4
        …
        1.1.99.-
            1.1.99.1 … 1.1.99.32
    …
    1.97.-.-
        1.97.1.-
            1.97.1.1 … 1.97.1.99
(1) Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
(2) European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
(3) Computer Engineering Department, Istanbul Technical University, Istanbul, Turkey
(4) Informatics Institute, Middle East Technical University, Ankara, Turkey