Wajid Arshad Abbasi
2019
Department of Computer and Information Sciences
Pakistan Institute of Engineering and Applied Sciences
Nilore, Islamabad, Pakistan
Investigating Machine Learning Based
Prediction of Protein Interactions
-
Reviewers and Examiners
Foreign Reviewers
1. Dr. Brian J. Geiss, Associate Professor, Colorado State University (CSU), USA
2. Prof. Dr. Shihua Zhang, Professor, Chinese Academy of Science (CAS), China
3. Dr. Henri Xhaard, Assistant Professor, University of Helsinki, Finland
Thesis Examiners
1. Prof. Dr. Ijaz Mansoor Qureshi, Professor, Air University, Islamabad
2. Dr. Hammad Naveed, Associate Professor, NUCES, Islamabad
3. Dr. Imran Amin, Principal Scientist, NIBGE, Faisalabad
Head of the Department (Name): Dr. Asifullah Khan
Signature with Date: _________________________________
-
Thesis Submission Approval
This is to certify that the work contained in this thesis entitled Investigating machine
learning based prediction of protein interactions, was carried out by Wajid Arshad
Abbasi, and in my opinion, it is fully adequate, in scope and quality, for the degree of
Ph.D. Furthermore, it is hereby approved for submission for review and thesis defense.
Supervisor: ___________________________________
Name: Dr. Fayyaz ul Amir Afsar Minhas
Date: 28 March, 2019
Place: PIEAS, Islamabad.
Head, Department of Computer and Information Sciences: ___________________
Name: Dr. Asifullah Khan
Date: 28 March, 2019
Place: PIEAS, Islamabad.
-
Investigating Machine Learning Based
Prediction of Protein Interactions
Wajid Arshad Abbasi
Submitted in partial fulfillment of the requirements
for the degree of Ph.D.
2019
Department of Computer and Information Sciences
Pakistan Institute of Engineering and Applied Sciences
Nilore, Islamabad, Pakistan
-
Dedications
To my grandparents and my uncle Asif Habib Abbasi - who are not in this world
anymore but continue to live on in my heart. Also, to my parents, my wife Dr. Saiqa
Andleeb, and my daughter Zarnish Habib, whose love and support have been
fundamental in completing my thesis.
-
Author’s Declaration
I, Wajid Arshad Abbasi, hereby declare that my Ph.D. thesis titled “Investigating
machine learning based prediction of protein interactions” is my own work and has not
been submitted previously by me or anybody else for taking any degree from Pakistan
Institute of Engineering and Applied Sciences (PIEAS) or any other university/institute
in the country/world.
At any time if my statement is found to be incorrect (even after my graduation), the
university has the right to withdraw my Ph.D. degree.
____________________
(Wajid Arshad Abbasi)
28 March, 2019
PIEAS, Islamabad.
-
Plagiarism Undertaking
I, Wajid Arshad Abbasi, solemnly declare that the research work presented in the thesis
titled “Investigating machine learning based prediction of protein interactions” is solely
my research work with no significant contribution from any other person. Small
contribution/help wherever taken has been duly acknowledged or referred and the
complete thesis has been written by me.
I understand the zero-tolerance policy of the Higher Education Commission (HEC) and
Pakistan Institute of Engineering and Applied Sciences (PIEAS) towards plagiarism.
Therefore, I, as an author of the thesis titled above declare that no portion of my thesis
has been plagiarized and any material used as a reference is properly referred/cited.
I undertake that if I am found guilty of any formal plagiarism in the thesis titled above
even after the award of my Ph.D. degree, PIEAS reserves the right to withdraw/revoke
my Ph.D. degree and that HEC and PIEAS have the right to publish my name on the
HEC / PIEAS Website on which the names of students who submitted plagiarized
theses are placed.
____________________
(Wajid Arshad Abbasi)
28 March, 2019
PIEAS, Islamabad.
-
Copyrights Statement
The entire contents of this thesis entitled Investigating Machine Learning Based
Prediction of Protein Interactions by Wajid Arshad Abbasi are an intellectual
property of Pakistan Institute of Engineering & Applied Sciences (PIEAS). No portion
of the thesis should be reproduced without obtaining explicit permission from PIEAS.
-
Table of Contents
Dedications......................................................................................................... ii
Author’s Declaration ........................................................................................ iii
Plagiarism Undertaking..................................................................................... iv
Copyrights Statement ......................................................................................... v
Table of Contents .............................................................................................. vi
List of Figures .................................................................................................. xii
List of Tables .................................................................................................. xix
Acknowledgments ........................................................................................... xxi
Abstract ........................................................................................................ xxiii
List of Publications and Patents ..................................................................... xxv
List of Abbreviations and Symbols ............................................................... xxvi
1 Introduction ............................................................................................... 1
1.1 Motivations........................................................................................... 3
1.2 Problem Statement and Research Aims ............................................... 4
1.3 Dissertation Organization and Chapters’ Digest .................................. 5
2 Problem Formulation and Literature Survey ......................................... 8
2.1 Proteins ................................................................................................. 8
2.1.1 Protein Structures ........................................................................... 9
2.1.2 Protein Functions .......................................................................... 10
2.2 Protein Interactions and Complex Formation .................................... 12
2.2.1 Binding Affinity of Interacting Proteins ...................................... 13
2.2.2 Interfaces or Interaction Sites of Proteins .................................... 13
2.2.3 Types of Protein Interactions and Complexes .............................. 14
2.2.4 Biologically Significant Effects of Protein Interactions ............... 15
2.2.5 Problems of Interest in Protein Interactions ................................. 15
2.3 Experimental Methods ....................................................................... 16
2.4 Computational Methods ..................................................................... 17
2.4.1 Classical Computational Methods ................................................ 18
2.4.2 Machine Learning......................................................................... 21
2.4.2.1 Protein Interaction Prediction .................................................... 21
2.4.2.2 Protein Binding Affinity Prediction .......................................... 25
2.4.2.3 Protein Interface or Interaction Site Prediction ......................... 26
3 Issues in Host-Pathogen Protein Interaction Prediction ...................... 30
3.1 Methods .............................................................................................. 33
3.1.1 Datasets and Preprocessing .......................................................... 33
3.1.1.1 Human-HIV Interaction Dataset (HH) ...................................... 34
3.1.1.2 Human-Adenovirus Interaction Dataset (HA) .......................... 34
3.1.2 Classifiers ..................................................................................... 34
3.1.3 Feature Extraction ........................................................................ 36
3.1.4 Model Evaluation ......................................................................... 37
3.1.5 Performance Metrics .................................................................... 38
3.2 Results and Discussion ....................................................................... 40
3.2.1 Analysis of Evaluation Methodologies ........................................ 40
3.2.2 Metrics for HPI Prediction ........................................................... 43
3.3 Chapter Summary ............................................................................... 45
4 CaMELS: Calmodulin Interaction Learning System .......................... 48
4.1 Methods .............................................................................................. 49
4.1.1 Dataset and Preprocessing ............................................................ 49
4.1.1.1 CaM Interaction Site Dataset .................................................... 50
4.1.1.2 CaM Interaction Dataset ............................................................ 50
4.1.2 Machine Learning Models............................................................ 51
4.1.2.1 MIL Based CaM-Interaction Site Prediction ............................. 51
4.1.2.3 Interaction Prediction ................................................................ 55
4.1.3 Feature Extraction ........................................................................ 56
4.1.3.1 Window Level Feature Representation ..................................... 56
4.1.3.2 Protein Level Feature Representation ....................................... 58
4.1.4 Performance Evaluation ............................................................... 59
4.1.4.1 Evaluation of Interaction Prediction.......................................... 59
4.1.4.2 Evaluation of Interaction Site Prediction .................................. 62
4.1.5 Model Selection ............................................................................ 63
4.1.6 Webserver ..................................................................................... 63
4.2 Results and Discussion ....................................................................... 64
4.2.1 Interaction Prediction ................................................................... 64
4.2.1.1 Improved CaM Interaction Prediction ....................................... 66
4.2.1.2 Motifs Search Fails to Predict CaM Interactions ...................... 68
4.2.1.3 Importance of the Whole Protein Sequence .............................. 68
4.2.1.4 GO Term Enrichment Analysis ................................................. 68
4.2.1.5 Performance Evaluation on Validation Set ............................... 69
4.2.1.6 In Silico Mutation Analysis ....................................................... 70
4.2.1.7 Validation Through Wet-Lab Experiments ............................... 70
4.2.1.8 Feature Analysis ........................................................................ 70
4.2.2 Interaction Site Prediction ............................................................ 71
4.2.2.1 Improved CaM Interaction Site Prediction ............................... 72
4.2.2.2 Motifs Search Fails to Predict CaM Interaction Site ................. 73
4.2.2.3 Performance Evaluation on Validation Set ............................... 74
4.2.2.4 Validation Through Wet-Lab Experiments ............................... 76
4.2.2.5 Contribution of Amino Acids and Motifs Identification ........... 76
4.2.2.6 MIL Using SSGO Method ........................................................ 78
4.2.2.7 Analysis of Features in Interaction Site Prediction ................... 78
4.3 Chapter Summary ............................................................................... 78
5 ISLAND: In-Silico Protein Affinity Predictor ...................................... 80
5.1 Methods .............................................................................................. 81
5.1.1 Datasets and Preprocessing .......................................................... 82
5.1.2 Evaluation of the PPA-Pred2 Webserver ..................................... 82
5.1.3 Sequence Homology as Affinity Predictor ................................... 82
5.1.4 Proposed Methodology................................................................. 83
5.1.5 Sequence-Based Features ............................................................. 83
5.1.5.1 Explicit Features ........................................................................ 83
5.1.5.2 Kernel Representations.............................................................. 84
5.1.6 Complex Level Features Representation ...................................... 85
5.1.6.1 Feature Concatenation ............................................................... 86
5.1.6.2 Combining Kernels.................................................................... 86
5.1.7 Regression Models ....................................................................... 87
5.1.7.1 Ordinary Least-Squares Regression (OLSR) ............................ 87
5.1.7.2 Support Vector Regression (SVR) ............................................ 87
5.1.7.3 Random Forest Regression (RFR) ............................................ 88
5.1.8 Model Validation and Performance Assessment.......................... 88
5.1.9 Webserver ..................................................................................... 88
5.2 Results and Discussion ....................................................................... 89
5.2.1 Binding Affinity Prediction Through Sequence Homology ........ 89
5.2.2 Binding Affinity Prediction Through ISLAND ........................... 89
5.2.3 Comparison Using External Independent Test Dataset ................ 90
5.3 Chapter Summary ............................................................................... 91
6 Learning Protein Binding Affinity Using Privileged Information ...... 93
6.1 Methods .............................................................................................. 94
6.1.1 Datasets and Preprocessing .......................................................... 94
6.1.2 Proposed Approach ...................................................................... 95
6.1.2.1 Baseline Classifiers ................................................................... 96
6.1.2.2 LUPI-SVM ................................................................................ 97
6.1.3 Feature Representation ................................................................. 99
6.1.3.1 Sequence-Based Features ........................................................ 100
6.1.3.2 Structure-Based Features ......................................................... 100
6.1.4 Model Validation, Selection and Performance Assessment ....... 102
6.1.5 Webserver ................................................................................... 103
6.2 Results and Discussion ..................................................................... 104
6.2.1 Performance of Baseline Learners ............................................. 104
6.2.2 Performance of LUPI-SVM ....................................................... 105
6.2.3 Evaluation Through Validation Dataset ..................................... 107
6.2.4 Feature Analysis for Binding Affinity Prediction ...................... 108
6.2.5 Learned Models Using LUPI and Classical SVM...................... 109
6.3 Chapter Summary ............................................................................. 109
7 PAIRpred: A Webserver for Protein Interface Prediction ............... 111
7.1 Implementation................................................................................. 111
7.2 Usage ................................................................................................ 113
7.3 Results .............................................................................................. 114
7.4 Validation Through Wet-Lab Experiments ...................................... 115
8 Conclusions and Future Work ............................................................. 117
8.1 Conclusions ...................................................................................... 117
8.2 Future Work ..................................................................................... 119
8.2.1 Application of Learning Using Privileged Information ............. 120
8.2.2 Handling Data Sparsity in Protein Interaction Domain.............. 120
Appendix A: Predictions Through CaMELS ............................................ 122
References ..................................................................................................... 128
-
List of Figures
Figure 1.1 Central Dogma of Molecular Biology. A portion of DNA called
a gene is transcribed to RNA, which is used as a template to
synthesize proteins during translation ................................... 1
Figure 2.1 The chemistry of an amino acid (left panel) and properties of
the side chain (right panel). Every amino acid has a carbon
atom, called an alpha carbon (Cα), bonded to a carboxylic
acid (–COOH) group, an amine (–NH2) group, a hydrogen
atom, and an R group (side chain) that is unique for every
amino acid. Physicochemical properties of an amino acid are
determined by the nature of its side chain .............................. 8
Figure 2.2 Different levels of protein structure. Amino acids are
joined together in various combinations through covalent
bonds to form the primary structure. Different sections of
the primary structure fold through backbone hydrogen
bonding to form alpha helices and beta sheets. Elements of
the secondary structure fold again through side chain interactions
to form the tertiary structure, stabilized by ionic bonds, disulfide
bonds, hydrophobic interactions, and hydrogen bonding.
Protein quaternary structures are formed through interaction
or binding of two or more independent tertiary structures ..... 9
Figure 2.3 Protein Functions. Proteins perform their functions as
enzymes (Sucrase), antibodies (T-cell receptor), messengers
(Insulin), or structural components (Actin). The most
fundamental function that proteins perform, and which
underpins all the other biochemical functions, is their ability to
bind or interact with other proteins or macromolecules .......... 11
Figure 2.4 Protein Interaction. Two unbound proteins (Ligand and
Receptor) with complementarity in shape and charge
distribution interact with each other to form a protein
complex. Interface of the complex at 6Å distance threshold
is shown with sticks in magenta color .................................... 12
Figure 2.5 Types of protein interactions and complexes. Protein
complexes are homomeric if only one type of protein chain is
involved in the interaction; if various types of protein
chains are involved in complex formation, the
complexes are called heteromeric. Protein complexes
are further divided into stable or transient based on the duration of
interaction. Binding affinity is a measure of the strength of
interaction between the proteins involved in complex
formation. Binding affinity is measured in terms of the
dissociation constant (𝐾𝑑) and binding affinity is high for
low 𝐾𝑑 values. Stable complexes have high binding affinity and
weak transient complexes have low binding affinity .............. 14
Figure 2.6 Experimental methods to determine protein interactions,
binding affinity, and interaction site or interface. .................. 17
Figure 2.7 Classical Computational methods to predict protein
interactions, binding affinity, and interaction site or interface
of a protein complex. .............................................................. 18
Figure 2.8 Classical Computational methods for protein interaction,
binding affinity and interface prediction. (a) Interolog search;
(b) Docking. ........................................................................... 19
Figure 2.9 A general framework for developing machine learning
models for PPIs, binding affinity and interface prediction. ... 22
Figure 2.10 Machine learning methods for protein interactions, binding
affinity, and interface or interaction site prediction. .............. 27
Figure 3.1 A general framework of machine learning models used to
predict the host-pathogen protein interactions (HPIs)............ 31
Figure 3.2 A comparison of two different cross-validation (CV)
schemes on a toy dataset: K-fold (shown in left panel) and
Leave One Pathogen Protein Out (LOPO) (shown in right
panel). In both evaluation protocols, the number of folds is equal
to the number of pathogen proteins in the toy dataset. In K-fold
CV, folds are created randomly, while in LOPO folds are
created with respect to pathogen proteins. Overlap of data
occurs using K-fold CV for both host and pathogen proteins,
e.g., proteins 𝑝1 and ℎ1 occur in both train and test sets in each
fold, whereas with LOPO CV the overlap vanishes with
respect to pathogen proteins ................................................... 33
Figure 3.3 Precision-recall curves obtained through K-fold and LOPO
cross-validation. (a-d) Human-HIV and (e-h) Human-
Adenovirus interaction datasets. Mean area under the curves
across folds along with standard deviation is shown in
parentheses .............................................................................. 41
Figure 3.4 Receiver operating characteristic (ROC) curves obtained
through K-fold and LOPO cross-validation. (a-d) Human-
HIV and (e-h) Human-Adenovirus interaction datasets. Mean
area under the curves across folds along with standard
deviation is shown in parentheses ........................................... 43
Figure 3.5 Radar plots of the area under the ROC curve (AUC-ROC)
using two different cross-validation schemes for all models . 45
Figure 3.6 Radar plots of the area under the precision-recall curves
(AUC-PR) using two different cross-validation schemes for
all models ............................................................................... 46
Figure 4.1 MIL Framework for CaM interaction site prediction. The
protein sequence 𝑝 is represented as a line while the
annotated CaM interaction site as a box. All overlapping
windows with the annotated interaction site in 𝑝 constitute
positive examples (𝐵𝑝) and the rest of the windows constitute
negative examples (𝑁𝑝). The score obtained from the trained
discriminant function 𝑓(𝑥) should be higher for at least one
positive example than the scores generated for all negative
examples in 𝑝 ......................................................................... 50
Figure 4.2 MIL training algorithm with SSGO for CaM interaction site
prediction ........................................................................................ 53
Figure 4.3 The online user interface for CaMELS webserver. (a) This
webserver accepts FASTA file or plain sequence of a protein
for CaM interaction and interaction site prediction; (b)
Interaction prediction model; (c) Interaction site prediction
model ...................................................................................... 64
Figure 4.4 (a) Receiver Operating Characteristic (ROC); (b) Precision-
recall (PR) curves for CaM interaction prediction for all
models. The averaged area under the curve across folds is
shown in parentheses .............................................................. 65
Figure 4.5 (a) Precision-recall (PR) curves showing a comparison of
CaMELS with MI-1 and iLoops. The averaged area under the
curve across folds is shown in parentheses; (b) Violin plot
showing density distributions of scores for positive (CaM
interacting) and negative (non-interacting) proteins
generated through DFS and CaMELS. Dotted lines show
density quartiles ..................................................................... 66
Figure 4.6 (a) Receiver Operating Characteristic (ROC) curves; (b)
Precision-recall (PR) curves; (c) ROC0.1 curves; (d) RFPP
curves for CaM interaction site prediction across different
models. The averaged area under the curves across folds is
shown in parentheses .............................................................. 71
Figure 4.7 Predicted interaction sites of complexes of proteins with
CaM in the validation dataset through CaMELS. Calmodulin
(CaM) (grey with light shade); CaM interaction protein (grey
with dark shade); The predicted central residue of the
interaction site (sphere); Residues of the CaM interacting
protein within 5Å of CaM (stick form). (a) PDB ID: 1NWD;
(b) PDB ID: 1SY9; (c) PDB ID: 2M0K; (d) PDB ID: 5DOW;
(e) PDB ID: 1YRT ................................................................. 74
Figure 4.8 Interaction site prediction score through CaMELS for
proteins used in mutagenic studies. Location of the predicted
interaction site has been denoted with a red dot. (a) LCa; (b)
SGS3 of Nicotiana benthamiana ........................................... 75
Figure 4.9 Learned weight vectors of classifiers during training
CaMELS. (a) Weights obtained during training using AAC
feature representation; (b) Heat map of the weights obtained
during training using PDC feature representation; (c) Top 50
motifs learned during training using PDGT feature
representation. Actual weight value learned for each feature
during training is shown in the numeric column .................... 77
Figure 5.1 A general framework for protein affinity prediction using
machine learning techniques .................................................. 81
Figure 5.2 Techniques adopted for generating sequence-based feature
representation of a protein complex for developing machine
learning based protein binding affinity prediction models .... 86
Figure 5.3 The online user interface for ISLAND webserver. A user can
submit pair of plain sequence of proteins of interest for
binding affinity prediction ...................................................... 89
Figure 5.4 Cumulative histogram of absolute error between actual and
predicted binding affinity values through ISLAND and PPA-
Pred2 on external independent validation dataset .................. 91
Figure 6.1 A framework to classify protein complexes based on their
binding affinities through the paradigm of learning using
privileged information (LUPI). Privileged information (3D
structural information) is only required at training time (left
panel) to improve performance at test time (right panel),
when sequence information alone is used ............................... 94
Figure 6.2 Learning using privileged information with stochastic sub-
gradient optimization training ................................................ 99
Figure 6.3 Number of interacting residue pairs (NIRP) in the interface
of a protein complex. The frequency of non-repeating pairs
(considering A: B and B: A same) was computed from the
bound 3D structures of ligand (L) and receptor (R) of a
protein complex. Residues (shown as spheres) at a distance
cutoff of 8 angstroms (Å) are considered the interface of the
complex. The bottom panel of the figure shows the form of
feature vector extracted through this scheme ......................... 101
Figure 6.4 The online user interface for LUPI-SVM webserver. (a) A
user can submit pair of plain sequences of proteins of interest
for binding affinity prediction; (b) An elucidation of
predicted score ....................................................................... 103
Figure 6.5 (a) ROC and (b) PR curves showing a performance
comparison between LUPI-SVM (with 2-mer as input and
Moal Descriptors as privileged feature space) and the
baseline classifiers (XGBoost, classical SVM (SVM), and
Random Forest (RF) with 2-mer features) on the affinity
benchmark dataset. The average area under the ROC and PR
curve (AUC) is shown in parentheses ..................................... 104
Figure 6.6 Feature analysis using SHAP. The impact of 2-mer features
on model output is shown using SHAP values. The plot
shows the top 20 2-mers for the Ligand (L) or Receptor (R)
by the sum of their SHAP values over all samples. Feature
value is shown in color (Red: High; Blue: Low). The plot reveals, for
example, that a high value of L (EK) (count of ‘EK’ in a
protein sequence designated as a ligand) contributes more to
predicting low binding affinity complexes ............................ 108
Figure 6.7 Weight vectors of the trained classifiers for the ligand
Blosum features. (a) SVM with LUPI framework using
Blosum (Protein) as input and Moal Descriptors as privileged
feature space; (b) Classical SVM using Blosum (Protein)
features ................................................................................... 109
Figure 7.1 Flowchart of the PAIRpred webserver. PAIRpred takes a pair
of proteins in PDB or FASTA format. Upon successful
format validation, PAIRpred performs chain selection and
feature extraction from the given sequences or structures.
Extracted features are used to generate predictions from a
pre-trained SVM classifier. These prediction results are
available for download and as an email attachment. .............. 112
Figure 7.2 Web interface of the PAIRpred webserver. (a) Home page
with user input and files upload options; (b) Chain selection;
(c) job submission notification and view results options ....... 113
Figure 7.3 Input pdb files with modified B factor. B factors
of the Ligand pdb file are replaced with 'Ligand scores' and
of the receptor pdb file with 'Receptor scores' ....................... 114
Figure 8.1 Machine learning techniques to handle sparsity in labeled
training data. SMEs: Subject Matter Experts; LUPI: Learning
Using Privileged Information; MIL: Multiple Instance
Learning; GANs: Generative Adversarial Networks ............. 120
-
List of Tables
Table 3.1 Proposed biologist-centric metrics to assess the generalization
performance of HPI predictors over LOPO cross-validation
across all models ................................................................... 47
Table 4.1 Results showing the performance of CaMELS in comparison
to DFS for CaM Interaction prediction for all models .......... 67
Table 4.2 Results showing performance of CaMELS and MI-1 via Gene
Ontology term enrichment analysis .......................................... 69
Table 4.3 Results showing the performance of CaMELS for CaM
interaction site prediction in comparison to SVM (baseline),
mi-SVM and MI-1 across all models. AUCPR was unavailable
for SVM (baseline), mi-SVM and MI-1 ................................... 73
Table 5.1 Evaluation of PPA-Pred2 through its webserver on affinity
benchmark dataset 2.0 ............................................................... 90
Table 6.1 Protein complex classification results obtained using classical
SVM, Random Forest and XGBoost using input and privileged
features with LOCO cross-validation over the affinity
benchmark dataset ..................................................................... 105
Table 6.2 Protein complex classification results obtained through
classical SVM and LUPI across different features using LOCO
cross-validation over the affinity benchmark dataset ................ 106
Table 6.3 Comparison of classical SVM and LUPI-SVM on the external
independent validation dataset with training on affinity
benchmark dataset ..................................................................... 107
Table A1 Top 241 predicted CaM binding proteins from the proteome of
A. thaliana through CaMELS along with their predicted
interaction sites .......................................................................... 122
Table A2 241 CaM binders from interaction dataset along with predicted
binding sites through CaMELS ................................................. 124
-
xx
Table A3 List of 250 proteins used as negative set in the independent
validation dataset ....................................................................... 125
-
xxi
Acknowledgments
After humbly thanking Allah Almighty, I want to express my gratitude to those who
helped me to conduct this research work and enabled me to complete my Ph.D. First, I
am very thankful to my adviser, Dr. Fayyaz Ul Amir Afsar Minhas, for his time,
devotion, encouragement, motivation, guidance, support, and valuable discussions
that helped me to define and execute my thesis research. During the four years of my
Ph.D., we spent a significant amount of time in meetings, discussions,
brainstorming and presentation sessions. He always tried to deliver every bit of
knowledge to me and I always found myself much more relaxed, focused and motivated
after every meeting with him. I hope he will remain my mentor and guide for the rest of
my life.
I am also grateful to the members of my research committee: Dr. Sikander Majid
Mirza and Dr. Asifullah Khan for their feedback on the research proposal. I also want
to acknowledge the support and guidelines of our collaborators Prof. Dr. Asa Ben-Hur,
Colorado State University, USA and Dr. Imran Amin, Principal Scientist, NIBGE. I am
also thankful to other faculty members of the Department of Computer and Information
Sciences, PIEAS: Dr. Abdul Jalil, Dr. Mutawarra Hussain, Dr. Anila Usman, Dr. Abid
Mughal, Dr. Javaid Khurshid, Dr. Naeem Akhtar, and Dr. Shahzad Ahmad Qureshi for their
support, and especially to Dr. Muhammad Hanif Durad for providing me with a seat in
his lab. I also want to extend my sincere gratitude to my lab fellows: Amina Asif, Sadaf
Khan, Adiba Yaseen, Kanza Hamid, Abdul Hanan Basit, Bismillah Jan, Fahad ul
Hassan, Asif Khan, Muhammad Dawood and Hira Kamal for their support and help. I
would never forget those wonderful hikes and parties which I had with them during my
Ph.D. studies. I am also indebted to my other friends at PIEAS: Muhammad Imran,
Naveed Akhtar, Mohsin Sittar, Naveed Chohan, Noorul Wahab, Faheem Afsar,
Muhammad Bashir, and Mirqad Ayaz for their cooperation and moral support. If I
missed your name on this list and you think it belongs here, I apologize.
I would also like to express my gratitude towards my family, especially my
parents (Arshad Habib Abbasi and Safia Shaheen), Uncles (Arif Habib, Sardar Imtiaz
Abbasi, Tariq Habib, M. Shafiq Abbasi and Abid), Aunties (Razia Shaheen, Shaheen,
Balqees and Razia Sultan), brothers (Amjid, Badar Munir, Nayyer, Waseem, Asad Ali,
Zohaib, Umer, Rizwan, Shujahat Ali and Abdullah) and sisters (Fozia Arshad, Qudsia,
Fozia Aziz, Kiren, Uzma, Rubi, Faiza, Maryam and Nida) for their care, love and good
wishes. I gratefully acknowledge the support, encouragement, care, love, and
patience of my beloved wife Dr. Saiqa Andleeb. Without her support, it would have been
impossible for me to complete my doctorate. Also, thanks to my daughter Zarnish
Habib, and my niece and nephews (Ahmed, Ahsan, Ayan Mansoor, Mahrosh Habib, Zain, Esa,
Aryan Habib, Hashim Habib, and Mohid Habib); all of you are the reason for me to keep
going.
I must also extend my thanks to Muhammad Sadique Awan, Shabir Ahmed
Abbasi and Imran Abbasi at the University of Azad Jammu and Kashmir for their
support in official matters.
Lastly, I would like to acknowledge the Higher Education Commission (HEC)
of Pakistan for funding my Ph.D. studies via a grant (PIN: 213-58990-2PS2-046) under
the indigenous 5000 Ph.D. fellowship scheme. I am also thankful to HEC for providing me funds
under the International Research Support Initiative Program (IRSIP) to pursue my
Ph.D. research work at Colorado State University (CSU), USA. My primary reason
for being thankful for this scholarship is that it provided me with an opportunity to expand
my horizons and knowledge.
Wajid Arshad Abbasi
Abstract
Protein interactions are crucial in the cell for performing cellular functions and the study
of protein interactions is a very important domain of research in bioinformatics. In
reference to protein interactions, biologists are usually interested in three core
problems: determining pairwise protein interactions, determination of binding affinity,
and identification of the interface. Computational methods to solve these protein
interaction problems have emerged as an active research area due to tedious, costly, and
time-consuming experimental procedures. Our aim in this work is to develop novel
machine learning based methods for protein interaction, binding affinity and interaction
site prediction with improved generalization performance.
In this dissertation, we have developed host-pathogen protein interaction predictors
using machine learning. One of our findings is that existing methods for protein
interaction prediction that use K-fold cross-validation for performance assessment
report over-estimated accuracy values as K-fold cross-validation does not take pairwise
protein similarity between training and test examples into account. To control this data
redundancy at pathogen protein level, we have proposed and advocated the use of an
alternate evaluation scheme called Leave One Pathogen Protein Out (LOPO) cross-
validation along with some biologist centric metrics for designing protein-protein
interaction prediction methods.
We have also designed a novel machine learning model called CaMELS (CalModulin
intEraction Learning System) for interaction and interaction site prediction of
Calmodulin (CaM) which is a very important and highly conserved protein across all
eukaryotes. CaMELS relies on a novel implementation of multiple instance learning
solver for protein binding site prediction that leads to significant improvement in
predictive performance. One of our collaborators has confirmed the effectiveness of
CaMELS through wet-lab experiments as well.
We have also focused on the more generic problem of predicting binding affinity in
protein interactions and presented various sequence-based machine learning models.
For this purpose, we have developed a novel machine learning method which is based
on the framework of Learning Using Privileged Information (LUPI). Our state-of-the-
art method uses protein 3D structure as privileged information at training time while
expecting only protein sequence information during testing. This makes the
method applicable even to proteins whose 3D structures have not been solved.
We have also developed a webserver for an existing state-of-the-art protein-protein
interface prediction method called PAIRpred. The accuracy of this webserver has also
been validated by our collaborators through wet-lab experiments.
List of Publications and Patents
Journal Publications
Wajid Arshad Abbasi, Amina Asif, Asa Ben-Hur and Fayyaz ul Amir Afsar
Minhas, “Learning Protein Binding Affinity using Privileged Information”,
BMC Bioinformatics, vol. 19, 425, 2018.
Abdul Hanan Basit, Wajid Arshad Abbasi, Amina Asif and Fayyaz ul Amir
Afsar Minhas, “Training Large Margin Host-Pathogen Protein-Protein
Interaction Predictors”, Journal of Bioinformatics and Computational Biology,
vol. 16, 18500142, 2018.
Wajid Arshad Abbasi, Amina Asif, Saiqa Andleeb and Fayyaz ul Amir Afsar
Minhas, “CaMELS: In silico prediction of calmodulin binding proteins and their
binding sites”, Proteins: Structure, Function and Bioinformatics, vol. 85 (9), pp.
1724–1740, 2017.
Wajid Arshad Abbasi and Fayyaz ul Amir Afsar Minhas, “Issues in
performance evaluation for host–pathogen protein interaction prediction”,
Journal of Bioinformatics and Computational Biology, vol. 14 (3), 1650011,
2016.
Conference Publications
Adiba Yaseen, Wajid Arshad Abbasi and Fayyaz ul Amir Afsar Minhas,
“Protein binding affinity prediction using support vector regression and
interfecial features”, 15th International Bhurban Conference on Applied
Sciences and Technology (IBCAST), IEEE, 2018, pp. 194-198.
Kanza Hamid, Amina Asif, Wajid Arshad Abbasi, Durre Sabih and Fayyaz ul
Amir Afsar Minhas, “Machine Learning with Abstention for Automated Liver
Disease Diagnosis”, in Proceedings of the 15th International Conference on
Frontiers of Information Technology, IEEE, 2017, pp. 356-361.
Preprints
Wajid Arshad Abbasi, Fahad Ul Hassan, Adiba Yaseen, Fayyaz Ul Amir Afsar Minhas. “ISLAND: In-Silico Prediction of Proteins Binding Affinity
Using Sequence Descriptors”, arXiv:1711.0540.
Amina Asif, Wajid Arshad Abbasi, Farzeen Munir, Asa Ben-Hur, and Fayyaz ul Amir Afsar Minhas. “pyLEMMINGS: Large Margin Multiple
Instance Classification and Ranking for Bioinformatics Applications”,
arXiv:1711.04913.
List of Abbreviations and Symbols
𝑷𝒓 Pearson Correlation Coefficient
3-D Three Dimensional
AAC Amino Acid Composition
AUC Area under the ROC Curve
AUC-PR Area Under the Precision-Recall Curve
AUC-ROC Area Under the ROC Curve
BIP-BIANA Biologic Interactions and Network Analysis
BLOSUM Blocks Substitution Matrix
CaM Calmodulin
CaMELS Calmodulin Interaction Learning System
CV Cross-Validation
DFS Discriminant Function Scoring
DNA Deoxyribonucleic Acid
FHR False Hit Rate
FP Fluorescence Polarization
GO Gene Ontology
HIV Human Immunodeficiency Virus
HPIs Host-Pathogen Interactions
HTS High-Throughput Sequencing
IR Insulin Receptor Tyrosine Kinase
ISLAND In-Silico Protein Affinity Predictor
ITC Isothermal Titration Calorimetry
JBCB Journal of Bioinformatics and Computational Biology
LOCO Leave One Complex Out
LOPO Leave One Pathogen Protein Out
LUPI Learning Using Privileged Information
MDS Molecular Dynamic Simulation
MIL Multiple Instance Learning
MRFPP Median Rank of the First Positive Prediction
NCBI National Center for Biotechnology Information
NIRP Number of Interacting Residue Pairs
NMR Nuclear Magnetic Resonance
OLSR Ordinary Least-Squares Regression
PAIRpred Partner Aware Interacting Residue Predictor
PD-Blosum Position Dependent BLOSUM-62
PDC Position Dependent Composition
PDGT Position Dependent Gappy Triplet
PHISTO Pathogen-Host Interaction Search Tool
PPIs Protein-Protein Interactions
PR Area Under the Precision-Recall Curve
PseAAC Pseudo-Amino Acid Compositions
PSFMs Position Specific Frequency Matrices
PSSMs Position Specific Scoring Matrices
RB Retinoblastoma
RF Random Forest
RFPP Rank of the First Positive Prediction
RFR Random Forest Regression
RMSE Root Mean Squared Error
RNA Ribonucleic Acid
ROC Receiver Operating Characteristic
SPR Surface Plasmon Resonance
SSGO Stochastic Sub-Gradient Optimization
SVM Support Vector Machines
SVR Support Vector Regression
TAP Tandem Affinity Purification
THR True Hit Rate
UniProt Universal Protein Knowledgebase
WHO World Health Organization
XGBoost Extreme Gradient Boosting
Y2H Yeast Two-Hybrid
1 Introduction
In order to better understand the complexity of life and mechanism of biological systems,
we need to analyze dynamic interactions of these biological systems at the molecular level
[1]. In all living organisms, the cell is the fundamental unit of life and it is composed of
different molecules which perform all life-sustaining functions [2]. There are three
molecules in a cell that are primarily responsible for sustaining life: deoxyribonucleic acid
(DNA), ribonucleic acid (RNA), and proteins. These three molecules function under the
principle of Central Dogma of Molecular Biology [3]. This dogma operates in two steps:
first, a portion of DNA called a gene is copied to form a messenger RNA (transcription)
and then the messenger RNA is used as a template to synthesize proteins (translation) as
shown in Fig. 1.1. Proteins are the key molecules which perform almost all the biologically
significant functions at cellular level.
Figure 1.1. Central Dogma of Molecular Biology. Portion of DNA called a
gene is transcribed to RNA which is used as a template to synthesize proteins
during translation.
Functionally, proteins are the dominant players and, after water, the second most abundant
biomolecules present in the cell. The importance of proteins in our body can be
appreciated by considering the fact that 50% of the dry weight of the human body is protein
[4]. Proteins perform their functions in different forms: pepsin aids digestion as
an enzyme, insulin controls blood sugar as a hormone, calmodulin (CaM) affects
intracellular signaling, hemoglobin transports oxygen, histones play a role in gene
regulation, antibodies combat infectious diseases, and so on [5]. Considering this
huge functional diversity of proteins, it is important to decipher their working mechanisms
in order to completely understand cellular behavior.
Up to the 1970s, it was widely believed that a single protein performs a single
function under a dogma called ‘one gene/one enzyme/one function’ and protein
interactions were considered as purification artifacts [6], [7]. The idea of single protein-
single function was ultimately shown to be incorrect with the discovery of the involvement
of multiple proteins in DNA replication (e.g., DNA helicase, DNA primase) besides the
polymerase and the participation of more than 20 proteins in the protein import into
mitochondria [7]–[9]. Now, it is an established fact that proteins do not function in isolation
and more than 80% of all cellular proteins perform their biological functions by forming
complexes through protein-protein interactions (PPIs) [10], [11]. Proteins achieve their
functional diversity through interactions and such interactions mediate overall organismal
systems including metabolic pathways and cell-to-cell interactions [12]. Because of their
dominant role in biological processes, protein interactions are normally responsible for
healthy or diseased states in an organism. For example, the retinoblastoma (RB) protein is
a tumor suppressor which prevents abnormal cell division by binding to the E2F
transcription factor. When this interaction is perturbed by the absence or mutation of the
RB protein, E2F becomes freely available, leading to unrestrained cell division and tumor
formation [13]. Therefore, studying protein-protein interactions is crucial for understanding
basic cellular biology, the functions of previously uncharacterized proteins, and disease
mechanisms. Moreover, knowledge about protein interactions is also important in
therapeutics to develop effective and personalized drugs with fewer side effects because
more than 80% of current therapeutic targets are proteins [5].
Protein-protein interactions (PPIs) are the noncovalent physical contacts
established between amino acids in the 3-D structures of two or more proteins at
specific locations called binding or interaction sites. Interactions between physicochemically
complementary proteins are normally driven by hydrogen bonding, electrostatic forces, and
hydrophobic effects [14]. Biologists are normally interested in solving the following three
challenging problems related to PPIs.
a) Pairwise Protein Interactions: Whether two given proteins interact or not?
b) Binding Affinity: Strength of the interaction.
c) Interface or Interaction Site: Exact location of the interaction.
In this work, we have developed machine learning based computational methods
which would assist biologists in the wet lab for solving the aforementioned protein
interaction related challenges.
1.1 Motivations
The field of biology experienced two important conceptual shifts in the 20th century with
the rediscovery of Mendel’s laws and the discovery of restriction enzymes [15], [16]. Meanwhile,
the initial sequencing of the human genome in 2001 [17] and emerging high-throughput
sequencing (HTS) technologies empowered biologists to think of complex biological
questions in terms of molecules. Raw DNA and protein sequence data are growing at an
exponential rate in databases such as GenBank [18] and the Universal Protein
Knowledgebase (UniProt) [19]. The major challenge now is to analyze this large amount
of data, as most of the sequenced genes and proteins are uncharacterized and of unknown
function [20]. The task of analyzing data in proteomics is further complicated by
the involvement of complex protein interactions. This problem creates a wide scope for
researchers in the field of computer science and bioinformatics to analyze data in silico
and assist biologists in solving interesting biological problems.
Moreover, in the study of protein interactions called interactomics, experimental
methods are often laborious, time-consuming and expensive, making it difficult to
investigate all possible protein interactions within and across organisms. For instance, the
bacterium Bacillus anthracis has 5,508 protein-encoding genes [21], which when paired
with the 20,000 or so human genes [22], gives more than 100 million possible protein
interaction pairs to validate experimentally. It is not practical to verify all possible
interaction pairs through wet-lab experiments. Therefore, there is an extreme need for
computational approaches to support wet-lab methods by predicting and ranking probable
PPIs. Such computational approaches can assist biologists in focusing on the most likely
interactions.
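As a rough illustration of the scale quoted above, the number of candidate host-pathogen pairs can be worked out directly. The sketch below assumes, for simplicity, one protein per gene and counts each pathogen-host pairing once; the variable names are illustrative only.

```python
# Gene counts as quoted in the text. Treating one gene as one protein
# is a simplifying assumption made only for this estimate.
pathogen_genes = 5508    # Bacillus anthracis protein-encoding genes
host_genes = 20000       # approximate number of human genes

# Every pathogen protein could, in principle, be paired with every host protein.
candidate_pairs = pathogen_genes * host_genes

print(f"{candidate_pairs:,} candidate host-pathogen pairs")
```

The product is 110,160,000, which is consistent with the "more than 100 million" figure stated above.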
1.2 Problem Statement and Research Aims
Among computational approaches, application of machine learning techniques to
bioinformatics for the prediction of PPIs is a well-accepted idea. In machine learning based
predictors of PPIs, models are normally built by using sequence and structure information
of protein interactions which have been discovered through experimental methods.
The unavailability of structural information for most novel proteins limits the practical
use of these predictors, and therefore sequence information is the only practical choice.
Prediction of PPIs through machine learning using sequence data only is a challenging
problem because proteins in reality interact in a specific three-dimensional conformation.
Moreover, significant conformational changes, motion and
flexibility, alternate binding modes, the dependence of binding propensity on the binding
partner, and uncertainties in the annotation of available scientific data make the problem
of predicting PPIs hard. In every machine learning setting, it is vital to thoroughly
understand the nature of the problem, availability and amount of training data and the
prospective use of the system while designing the predictive model, its evaluation
methodology, and the performance metrics. However, this requirement is even more
crucial in bioinformatics in comparison to other application areas because of its role as a
tool for biological discovery. Also, with the growth in proteome data of different
organisms, there is a pressing need for PPIs predictors which can incorporate more and
more information from different sources to gain better generalization. To achieve these
goals, we have formulated and accomplished the following research aims in this study.
a) We have performed a survey of existing machine learning based computational
techniques for protein-protein interaction prediction in order to assess the suitability
and limitations of existing evaluation protocols and performance metrics for this
purpose.
b) We have designed sequence-based machine learning models of PPIs for interaction,
interaction site, and affinity prediction with improved generalization accuracy by
incorporating learning data at the proteome level and combining information of
interactions and interaction sites.
c) Generally, computational techniques exploit protein sequence and 3-D structural
information to develop predictive models for protein interactions. One of the major
issues limiting the applicability of techniques that use protein 3-D structural
information is the unavailability of solved 3-D structures of novel proteins. Our
aim in this study is also to design a machine learning based model for PPI
prediction which can use both protein structural and sequence information during
training but requires only sequence information for testing.
1.3 Dissertation Organization and Chapters’ Digest
This dissertation is divided into the following chapters.
In chapter 2 “Problem Formulation and Literature Survey”, we give the
required background of proteins, their structure, functions, and interactions along with
experimental procedures of determining these interactions. We also perform a literature
survey of existing computational techniques of predicting protein-protein interactions
(PPIs), interfaces/interaction sites, and protein binding affinity along with the formulation
of these problems as machine learning problems.
In chapter 3 “Issues in Host-Pathogen Protein Interaction Prediction”, we have
performed a survey of existing machine learning based host-pathogen protein interactions
(HPIs) prediction techniques. The objective of the survey was to assess the suitability and
limitations of existing evaluation protocols and performance metrics to design predictive
HPI models. In this chapter, we have investigated the usefulness of K-fold cross-validation
for evaluating the generalization performance of pairwise protein interaction predictors in
host-pathogen interactions (HPIs). K-fold cross-validation does not avoid redundancy
between the train and test data and results in an inflated accuracy. To control this data
redundancy at pathogen protein level, we have proposed and shown the effectiveness of a
new evaluation scheme called Leave One Pathogen Protein Out (LOPO) cross-validation.
We have also proposed and suggested the use of some biologist-centric metrics for HPIs
predictors. Our findings of this study have been published in the Journal of Bioinformatics
and Computational Biology (JBCB), 14(3), 2016, 1650011.
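The idea behind LOPO cross-validation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this work: each example is assumed to be a (host protein, pathogen protein, label) triple, and the function name lopo_splits and the toy identifiers (H1, P1, and so on) are hypothetical.

```python
from collections import defaultdict

def lopo_splits(pairs):
    """Yield (held_out_pathogen, train_idx, test_idx), one fold per pathogen protein.

    All examples involving the held-out pathogen protein go to the test fold,
    so that pathogen protein never appears in the corresponding training fold.
    """
    by_pathogen = defaultdict(list)
    for idx, (_host, pathogen, _label) in enumerate(pairs):
        by_pathogen[pathogen].append(idx)
    for pathogen, test_idx in by_pathogen.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(pairs)) if i not in held_out]
        yield pathogen, train_idx, test_idx

# Toy data (made up for illustration): (host, pathogen, interacts?)
pairs = [
    ("H1", "P1", 1), ("H2", "P1", 0),
    ("H1", "P2", 0), ("H3", "P2", 1),
    ("H2", "P3", 1),
]

folds = list(lopo_splits(pairs))
for pathogen, train_idx, test_idx in folds:
    print(pathogen, "train:", train_idx, "test:", test_idx)
```

In contrast to a random K-fold split, no pathogen protein contributes examples to both the training and the test side of a fold. Note that this grouping alone does not remove similarity between different pathogen proteins.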
In chapter 4 “CaMELS: Calmodulin Interaction Learning System”, we present
a machine learning based algorithm suite called CaMELS (CalModulin intEraction
Learning System) for CaM interaction and interaction site prediction using sequence
information alone. CaMELS models CaM interaction and interaction site prediction as two
separate classification problems and gives state-of-the-art accuracy for both tasks. To
predict CaM interaction, CaMELS uses a traditional support vector machine (SVM) along
with features extracted from the whole sequence of a protein instead of localized window
level features. For CaM interaction site prediction, in contrast, CaMELS uses the
Multiple Instance Learning (MIL) paradigm to handle imprecisions in binding site
annotations in the training data. To solve the multiple instance learning problem,
CaMELS uses a custom-built algorithm based on stochastic sub-gradient optimization
(SSGO) that allows faster and more effective learning. We have shown improved
generalization performance of CaMELS using a variety of evaluation techniques including
wet-lab experiments. Python code for training and evaluating CaMELS together with a
webserver implementation is available at the URL:
http://faculty.pieas.edu.pk/fayyaz/software.html#camels. We have published the outcomes
of this work in Proteins: Structure, Function, and Bioinformatics, 85(9), 2017, 1724–1740.
In chapter 5 “ISLAND: In-Silico Protein Affinity Predictor”, sequence-based
protein binding affinity prediction methods using machine learning have been explored.
Specifically, we present our findings that the true generalization performance of even the
state-of-the-art sequence-only predictor is far from satisfactory and that the development
of machine learning methods for binding affinity prediction with improved generalization
performance is still an open problem. We have also proposed a novel sequence-based
protein binding affinity predictor called ISLAND which gives better accuracy than existing
methods over the same validation set as well as on an external independent test dataset. A
cloud-based webserver implementation of ISLAND and its Python code are available at
http://faculty.pieas.edu.pk/fayyaz/software.html#island.
In chapter 6 “Learning Using Privileged Information for Protein Binding
Affinity Prediction”, we have developed a novel machine learning method for predicting
binding affinity which is based on the framework of learning using privileged information
(LUPI). This method uses protein 3D structure as privileged information at training time
while expecting only protein sequence information during testing. The proposed method
outperforms several baseline learners and a state-of-the-art binding affinity predictor not
only in cross-validation but also on an additional validation dataset. This suggests that the
LUPI framework implemented for this work may be useful in other areas of
bioinformatics as well. A Python implementation of the proposed method together with a
webserver is available at http://faculty.pieas.edu.pk/fayyaz/software.html#LUPI. The
outcomes of this work have been published in BMC Bioinformatics, 19, 425, 2018.
In chapter 7 “PAIRpred: A Webserver for Partner-Aware Protein Interface
Prediction”, a web server has been developed and deployed for PAIRpred which is a state-
of-the-art method for predicting partner-specific interface of a protein complex using either
sequence information alone or in conjunction with features derived from the unbound
structures of the two proteins in the complex. The web server is available at
http://faculty.pieas.edu.pk/fayyaz/software.html#pairpred. This webserver takes a pair of
proteins in FASTA or PDB format and produces downloadable predictions along with
the predicted interface highlighted in its output PDB files.
In chapter 8, “Conclusions and Future Work”, we have summarized the
conclusions drawn from this study along with the details of the projects to be completed in
the future.
2 Problem Formulation and Literature Survey
In this chapter, we start with a brief introduction of proteins and their characteristics to
assist the reader with relevant biological background. Then, we discuss interesting
biological problems in protein interactions, along with experimental and computational
methods of solving these problems. Further, we formulate biological questions in
protein-protein interaction as machine learning problems and perform a literature
survey of known machine learning techniques in this domain to highlight important
research questions.
2.1 Proteins
Proteins are the second most abundant macromolecules present in a cell [5]. Proteins
are made up of smaller units called amino acids. There are twenty naturally occurring
amino acids which are considered the raw material of all proteins. An amino acid is
an organic compound containing an amine (-NH2) and a carboxyl (-COOH) group together
with a side chain functional group as shown in Fig. 2.1 [14]. Every amino acid has a
unique functional group attached to it. Different amino acids have different
physicochemical properties (see Fig. 2.1) and are encoded by codons in the
deoxyribonucleic acid (DNA) through a linear relationship [14]. These 20 amino acids
are linked with each other through peptide bonds in various combinations to form
proteins with diverse structures and functions. Proteins vary in length from a hundred
to thousands of amino acids.
Figure 2.1. The chemistry of an amino acid (left panel) and properties of the side
chain (right panel). Every amino acid has a carbon atom, called an alpha carbon
(Cα), bonded to a carboxylic acid (–COOH) group, an amine (-NH2) group, a
hydrogen atom, and an R group (side chain) that is unique for every amino acid.
The physicochemical properties of an amino acid are determined by the nature of its side
chain.
2.1.1 Protein Structures
Proteins have four levels of structure: primary, secondary, tertiary, and quaternary.
These different levels of protein structure are shown in Fig. 2.2.
Figure 2.2. Different levels of protein structure. Different amino acids are joined
together in various combinations through covalent bonds to form the primary structure.
Different sections of the primary structure fold through backbone hydrogen
bonding to form alpha helices and beta sheets. Elements of secondary structure in turn
fold through side chain interactions to form the tertiary structure, stabilized by ionic
bonds, disulfide bonds, hydrophobic interactions, and hydrogen bonding. Protein
quaternary structures are formed through the interaction or binding of two or more
independent tertiary structures.
Proteins are also called
polypeptides: different amino acids are joined together in various combinations
through peptide bonds and form a linear string called the primary structure of the
protein. These peptide bonds are formed between the amino and carboxylic groups of
two amino acids, releasing one water molecule [14]. The primary structure of a protein
is also called its amino acid sequence and is normally available in FASTA format. Some
parts of protein sequences have a biologically significant pattern called a motif. For
example, the IQ Calmodulin-binding motif has the following sequence pattern:
[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY], where x represents any amino acid and
square brackets represent alternative amino acids.
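The motif notation above maps naturally onto a regular expression, which makes it easy to scan a sequence for candidate matches. The following is a minimal sketch, not part of the methods developed in this work: the helper name motif_to_regex and the toy sequences are made up for illustration, and the translation assumes that a lowercase x is used only as the wildcard symbol.

```python
import re

def motif_to_regex(pattern):
    """Translate the motif notation used in the text into a compiled regex.

    Assumption: lowercase 'x' appears only as the 'any amino acid' wildcard;
    square brackets already have the same meaning in regex syntax.
    """
    return re.compile(pattern.replace("x", "."))

IQ_MOTIF = "[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]"
iq_regex = motif_to_regex(IQ_MOTIF)

# Toy sequences constructed for illustration only (not real proteins).
with_motif = "MSIQAAARGAAARAAFKL"
without_motif = "MSIQAAARAAAARAAFKL"   # the conserved G is replaced by A

# Report the 0-based start position and matched segment of every hit.
hits = [(m.start(), m.group()) for m in iq_regex.finditer(with_motif)]
print(hits)
print(iq_regex.search(without_motif))
```

A match of this kind only indicates that a sequence segment is compatible with the pattern; it does not by itself establish that the segment binds calmodulin.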
The bonds which link the amino group to the alpha carbon and the alpha carbon
atom to the carbonyl carbon are free to rotate and allow various orientations of amino
acids in the polypeptide chain. Rotations about these bonds are described by the 𝜙 and 𝜓
torsion angles, and the allowed values of these angles are shown in the
Ramachandran plot [23]. These allowed backbone rotations let the polypeptide
fold into secondary structures such as alpha helices and beta sheets (see Fig. 2.2). These
secondary structures are formed and stabilized by hydrogen bonding between the
backbone atoms of the residues. An alpha helix is generated by hydrogen bonding between
neighboring residues, while beta sheets are formed through hydrogen bonding between distant
residues in the sequence [14].
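A torsion angle such as 𝜙 or 𝜓 is the dihedral angle defined by four consecutive backbone atoms (C, N, Cα, C for 𝜙 and N, Cα, C, N for 𝜓). The sketch below computes such an angle from four 3-D points using a standard vector formulation; the function name and the test coordinates are made up for illustration, and the sign convention of the result should be checked against the tool it is compared with.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle (degrees) about the p1-p2 bond for four 3-D points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)          # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Made-up coordinates: cis, trans and perpendicular arrangements.
cis = dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
trans = dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
perp = dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1))
print(cis, abs(trans), abs(perp))
```

Plotting the (𝜙, 𝜓) pair computed in this way for every residue of a structure yields its Ramachandran plot.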
Different secondary structures join together and fold into a protein's native 3-D
tertiary structure as shown in Fig. 2.2. In protein tertiary structure, various interactive
forces between atoms of side chains of residues in a polypeptide play an important role
[14]. These interactive forces include hydrogen bonding, ionic bonds, disulfide bonds,
and hydrophobic interactions. The amino acid sequence of a stable protein contains
enough information to fold into its native tertiary structure [24]. This 3-D structure of a
protein determines its functions [14]. Protein 3-D structures are available in the form
of coordinates of atoms of all residues in the PDB format.
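To make the notion of atomic coordinates in the PDB format concrete, the sketch below reads the fields of a single ATOM record using the fixed column positions of the legacy PDB text format. It is only an illustration: the helper names make_atom_line and parse_atom_line are hypothetical, the example atom is made up, and real files contain further record types and fields that are ignored here.

```python
def make_atom_line(serial, name, res, chain, resseq, x, y, z):
    """Build a minimal ATOM record so that the example does not depend on
    hand-counted spaces (hypothetical helper, for illustration only)."""
    return (f"ATOM  {serial:>5} {name:<4} {res:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

def parse_atom_line(line):
    """Extract a few fields from an ATOM record using fixed column slices."""
    return {
        "atom": line[12:16].strip(),      # atom name, columns 13-16
        "residue": line[17:20].strip(),   # residue name, columns 18-20
        "chain": line[21],                # chain identifier, column 22
        "resseq": int(line[22:26]),       # residue sequence number, columns 23-26
        "xyz": (float(line[30:38]),       # x, columns 31-38
                float(line[38:46]),       # y, columns 39-46
                float(line[46:54])),      # z, columns 47-54
    }

# A made-up backbone nitrogen atom of a methionine residue in chain A.
line = make_atom_line(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614)
record = parse_atom_line(line)
print(line)
print(record)
```

For real structures, an established parser is usually preferable to hand-written slicing, since it also handles alternate locations, insertion codes, and multi-model files.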
Protein quaternary structures are formed when various tertiary structures join
together (see Fig. 2.2). Protein quaternary structures result from protein interactions
and are also called protein complexes. The formation of these quaternary structures
involves the same kinds of interactive forces as tertiary structure formation.
2.1.2 Protein Functions
Proteins are involved in all the important biological processes and perform almost all
tasks at the cellular level. Proteins are diverse in their functions and are responsible for
cell shape, product manufacture, routine maintenance, waste cleanup, and inner
organization. Proteins perform their roles as enzymes, antibodies, structural
components, or messengers as shown in Fig. 2.3 [25]–[28]. There are thousands of
chemical reactions involved in different metabolic pathways within a cell. These
chemical reactions are catalyzed to proceed millions of times faster by proteins called
enzymes [25]. For example, sucrase catalyzes the hydrolysis of sucrose. Antibodies are
proteins which are used by the immune system of an organism to identify and neutralize
foreign invaders such as viruses or pathogenic bacteria, e.g., T-cell receptors are
proteins that act as antibodies. Antibodies perform their function through interaction
with an antigen present on the surface of the invading organism [26]. Proteins such as
actin also provide structural support to the cell and enable it to dynamically remodel
itself in response to internal or environmental stimuli [27]. Cells also communicate to
coordinate and perform basic activities such as tissue repair and immunity. Insulin is a
protein which helps in glucose and lipid metabolism by activating a cascade of cellular
processes through interaction with the insulin receptor tyrosine kinase (IR) [29]. Growth
hormones are proteins which stimulate tissue repair through cell regeneration in humans
[28].
Most protein functions involve the interaction of two or more proteins to form
a protein complex [14]. For example, enzymes must bind to their substrates to perform
catalysis and structural proteins bind together in order to gain strength and toughness
[14]. Similarly, antibodies and messenger proteins also perform their functions by
interacting with other proteins. Therefore, the study of protein interactions is of utmost
importance in biology to decipher functions of proteins, to characterize different
biological processes or pathways, to interpret disease mechanisms, and to design effective
drugs. Proteins interact with other proteins and with macromolecules such as DNA and RNA,
but in this dissertation, we specifically focus on protein-protein interactions (PPIs).
In the next sections of this chapter, we discuss how proteins interact with other
proteins, biologically significant problems in protein interactions, and how computational
techniques can contribute to handling these problems.
Figure 2.3. Protein functions. Proteins perform their functions as enzymes
(Sucrase), antibodies (T-cell receptor), messengers (Insulin), or structural components
(Actin). The most fundamental function that proteins perform, and which underpins all
the other biochemical functions, is their ability to bind or interact with other proteins
or macromolecules.
2.2 Protein Interactions and Complex Formation
Proteins generally do not function in isolation; they interact with each other to perform
vital roles in various biological processes and metabolic pathways [11], [14]. More
than 80% of all cellular proteins are involved in these types of interactions [10], [11].
Protein-protein interactions (PPIs) are physical contacts between residues of two
proteins (Ligand and Receptor) in a highly specific manner (see Fig. 2.4). These
interactions happen in a specific biomolecular context and are normally driven by
the same electrostatic forces and hydrophobic effects that are involved in protein
folding [14], [30]. Complementarity in shape and in charge distribution on the surface of
proteins are the two major factors which play a significant role in protein interactions
[14]. During an interaction, proteins can also go through conformational changes,
as described by the conformational selection model [14], [31]. These conformational
changes enable optimized interaction and support formation of a stable complex.
Protein complexes are formed through protein-protein interactions as
shown in Fig. 2.4. These protein interactions constitute the interactome of an organism.
Protein interactions can happen within the same organism (intra-species) and across
different organisms (inter-species), such as host-pathogen protein interactions (HPIs).
Studies of protein interactions from different perspectives such as molecular dynamics,
biochemistry, and signal transduction yield protein interaction networks. These protein
interaction networks, like metabolic pathways, help biologists to gain a better
understanding of underlying biological processes, to understand disease mechanisms,
and to aid studies for the design, discovery, and effectiveness of therapeutic drugs [32].
Figure 2.4. Protein interaction. Two unbound proteins (Ligand and Receptor) with
complementarity in shape and charge distribution interact with each other to form a
protein complex. The interface of the complex at a 6 Å distance threshold is shown as
sticks in magenta.
2.2.1 Binding Affinity of Interacting Proteins
Binding affinity is a measure of the strength of interaction between proteins which bind
reversibly in a protein complex [5], [7]. High binding affinity indicates tighter binding
between the proteins involved in an interaction. Experimentally, it can be measured in
terms of the dissociation constant, $K_d = [L][R]/[LR]$, which is the ratio of the product of the
concentrations of the free ligand and receptor proteins ($[L]$, $[R]$) to the concentration of the
protein complex ($[LR]$) [7]. Smaller values of $K_d$ indicate higher binding affinity and vice
versa. Thermodynamically, the formation of protein complexes through protein
interactions also involves a loss in free energy [5]. A higher loss in free energy indicates higher
binding affinity and results in a more stable protein complex. Therefore, binding
affinity can also be measured by taking the difference between the free energy of the
protein complex and the sum of the free energies of the unbound proteins. This difference is
called the change in the Gibbs free energy upon binding ($\Delta\Delta G$). Its value is usually
small in magnitude and ranges from -2.5 to -22 kcal/mol.
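The two measures of binding affinity described above can be related numerically. The following is a minimal sketch, not taken from the original text: it assumes the standard thermodynamic relation in which the binding free energy equals R·T·ln(Kd), with Kd expressed in molar units relative to a 1 M standard state, and a temperature of 298.15 K. The function names are hypothetical and chosen only for illustration.

```python
import math

# Gas constant in kcal/(mol*K); the temperature default is an assumption (25 degrees C).
R_KCAL = 1.987204e-3


def dissociation_constant(free_ligand, free_receptor, complex_conc):
    """Kd = [L][R] / [LR], with all concentrations in molar units."""
    if complex_conc <= 0:
        raise ValueError("complex concentration must be positive")
    return free_ligand * free_receptor / complex_conc


def kd_to_free_energy(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol, assuming dG = R * T * ln(Kd / 1 M)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def free_energy_to_kd(free_energy, temperature_k=298.15):
    """Inverse of kd_to_free_energy: Kd in molar units."""
    return math.exp(free_energy / (R_KCAL * temperature_k))


if __name__ == "__main__":
    # Hypothetical concentrations: 1 uM free ligand and receptor, 1 mM complex.
    kd = dissociation_constant(1e-6, 1e-6, 1e-3)
    print(f"Kd = {kd:.2e} M, free energy = {kd_to_free_energy(kd):.2f} kcal/mol")
```

Under these assumptions, a smaller Kd maps to a more negative free energy, which matches the statement that smaller Kd values indicate higher binding affinity.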
2.2.2 Interfaces or Interaction Sites of Proteins
When two proteins interact to form a protein complex, only a part of each protein is
involved in binding as shown in Fig. 2.4. This part on one protein is called the
interaction site of the protein, whereas all the interacting residue pairs on both proteins
constitute the interface of the protein complex (see Fig. 2.4). Therefore, in finding an
interaction site, we are only interested in the residues of one protein in a complex which
are involved in the interaction, without considering the residues of the other protein. In contrast,
while determining the interface of the protein complex, we find all residue pairs on both
proteins which are involved in the interaction. It is interesting to note that if the
interface of a protein complex is known, then we can easily extract the interaction sites of
the interacting proteins in the complex.
If we have a solved 3-D structure of a protein complex, then we can extract the interface
of the complex by considering all those residue pairs of interacting proteins whose alpha
carbon atoms lie within a distance threshold, typically chosen between 6.0 and 8.0 Angstroms [14], [33]. This approach of
extracting an interface from the protein complex is straightforward and has been used by
many researchers in the field. However, residue pairs within this distance are not always
guaranteed to be interacting [34].
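The distance-based definition of an interface, and the derivation of interaction sites from it, can be sketched as follows. This is an illustrative sketch only: it assumes the alpha carbon coordinates of each protein have already been read from a structure file into a dictionary keyed by residue identifier, and the function names and toy coordinates are hypothetical.

```python
import math


def extract_interface(ca_ligand, ca_receptor, cutoff=6.0):
    """Return all (ligand_residue, receptor_residue) pairs whose alpha carbon
    atoms lie within `cutoff` Angstroms of each other.

    ca_ligand, ca_receptor: dict mapping a residue id to its (x, y, z)
    alpha carbon coordinates in Angstroms.
    """
    return [
        (res_l, res_r)
        for res_l, xyz_l in ca_ligand.items()
        for res_r, xyz_r in ca_receptor.items()
        if math.dist(xyz_l, xyz_r) <= cutoff
    ]


def interaction_sites(interface_pairs):
    """Derive the per-protein interaction sites from the interface pairs."""
    ligand_site = sorted({res_l for res_l, _ in interface_pairs})
    receptor_site = sorted({res_r for _, res_r in interface_pairs})
    return ligand_site, receptor_site


if __name__ == "__main__":
    # Toy coordinates, invented purely for illustration.
    ligand = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0), 3: (20.0, 0.0, 0.0)}
    receptor = {10: (0.0, 5.0, 0.0), 11: (0.0, 30.0, 0.0)}
    for cutoff in (6.0, 8.0):
        pairs = extract_interface(ligand, receptor, cutoff)
        print(cutoff, pairs, interaction_sites(pairs))
```

The all-pairs comparison is quadratic in the number of residues, which is acceptable for a single complex; a spatial index would be a natural replacement for larger data sets.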
2.2.3 Types of Protein Interactions and Complexes
Protein-protein interactions (PPIs) and the formation of protein complexes can be
differentiated based on the permanence of these complexes and the number of different
protein chains that are involved in the interaction. Protein complexes can be homomeric
or heteromeric as shown in Fig. 2.5 [35], [36]. Homomeric protein complexes are
formed through the interaction of a single type of protein chain, and these complexes
are called dimers, trimers, and so on, based on the number of chains involved in
complex formation. Most transcription regulatory factors and scaffolding
proteins perform their functions as homomers [36]. In the formation of heteromeric protein
complexes, distinct protein chains are involved in the interaction. In cell signaling,
heteromeric protein complexes are involved in the biochemical cascade [36].
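The distinction between homomeric and heteromeric complexes can be expressed as a small rule. The sketch below is only an illustration under a simplifying assumption that is not made in the text: two chains are treated as the same type exactly when their amino acid sequences are identical. The function name and the example sequences are hypothetical.

```python
# Common names for complexes by number of chains; larger sizes fall back to "N-mer".
OLIGOMER_NAMES = {2: "dimer", 3: "trimer", 4: "tetramer"}


def classify_complex(chain_sequences):
    """Classify a complex by chain composition and by number of chains.

    chain_sequences: list of amino acid sequences, one per chain.
    Returns a tuple such as ("homomeric", "dimer").
    """
    n_chains = len(chain_sequences)
    if n_chains < 2:
        raise ValueError("a complex needs at least two chains")
    # Assumption: identical sequences mean the same type of chain.
    composition = "homomeric" if len(set(chain_sequences)) == 1 else "heteromeric"
    size = OLIGOMER_NAMES.get(n_chains, f"{n_chains}-mer")
    return composition, size


if __name__ == "__main__":
    print(classify_complex(["MKTAYIAK", "MKTAYIAK"]))          # two identical chains
    print(classify_complex(["MKTAYIAK", "GAVLIPFW", "MKTAYIAK"]))  # mixed chains
```

In practice, chain identity is often decided by sequence similarity above a threshold rather than exact equality; exact matching is used here only to keep the sketch short.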
Figure 2.5. Types of protein interactions and complexes. Protein complexes are
homomeric if one type of protein chain is involved in the interactions; if
various types of protein chains are involved in complex formation, then those
complexes are called heteromeric. Protein complexes are further divided into stable
or transient based on the duration of the interactions. Binding affinity is a measure of the
strength of interaction between the proteins involved in complex formation. Binding
affinity is measured in terms of the dissociation constant (Kd) and is high for low Kd values. Stable complexes have high binding affinity, whereas weak transient complexes have low binding affinity.
Protein interactions can be classified as stable or transient based on their
interaction duration (see Fig. 2.5) [37]. Stable protein interactions are those
interactions which persist for a long time and form permanent complexes for different
molecular roles [38]. Most homomeric and some heteromeric complexes involve stable
interactions. Core RNA polymerase and hemoglobin are examples of
stable complexes. In contrast, transient protein interactions occur reversibly for a short
duration in a specific molecular context [38]. For example, most protein interactions in
cell signaling are transient. Transient interactions control most cellular functions such
as protein folding, protein modification, and cell cycling. Folding and binding are
inseparable in the case of stable complexes whereas, in transient complexes, protein
folding and binding are two separate events.
In this dissertation, we generally focus on heteromeric transient protein
complexes regardless of their functions.
2.2.4 Biologically Significant Effects of Protein Interactions
Protein interactions normally take place in a specific molecular context. Interacting
proteins have certain underlying functional objectives which are expressed in various
ways. Some of the measurable, biologically significant effects of protein interactions are
listed as follows [39], [40].
• Activation or deactivation of a protein.
• Change in the interaction behavior of a protein through alteration of its binding specificity towards different binding partners.
• Regulation of cellular functionality through participation in either upstream or downstream events.
• Creation of a new binding mode in a protein.
2.2.5 Problems of Interest in Protein Interactions
Biologists and pharmacologists have various objectives in studying protein-protein
interactions. Some of them are listed as follows.
• To get an idea of the function and behavior of proteins.
• To determine the biological process or pathway in which a protein of unknown function is involved.
• To determine different binding modes of a protein.
• To determine the specificity of a protein towards multiple targets.
• To discover, design, and measure the effectiveness of drugs and therapeutic agents.
• To combat infectious diseases.
• To promote or inhibit protein interactions.
• To design new proteins.
To meet all the above objectives, biologists and drug designers are generally
interested in solving the following three related problems in protein interactions.
i) Protein Interaction: Whether two given proteins interact or not?
ii) Binding Affinity: What is the strength of their interaction?
iii) Interface or Interaction Site: What is the exact location of interaction?
We perform a literature survey of existing experimental and computational
methods of solving these problems in the following sections.
2.3 Experimental Methods
Several experimental methods have been developed to determine protein interaction,
binding affinity, and interface or interaction site as shown in Fig. 2.6. These
experimental procedures are performed in-vivo (within an organism) or in-vitro (outside
an organism). The problem of determining whether two given proteins interact or not can be
treated as a binary classification problem. Experimental methods of determining protein-
protein interactions are classified as small-scale or high throughput methods [41], [42].
Small-scale methods such as Co-immunoprecipitations [43] and Surface Plasmon
Resonance [44] are often used to detect one interaction at a time. High throughput
methods such as Yeast Two-Hybrid (Y2H) [45] and Tandem Affinity Purification
(TAP) [46] are used to get thousands of interactions at a time. Binding affinity is the
measure of the strength of interaction between two proteins. Experimental methods
such as Isothermal Titration Calorimetry (ITC) [47], Surface Plasmon Resonance
(SPR) [48], and Fluorescence Polarization (FP) [49] can be used to determine protein
binding affinity. The interface or binding site is the region of a protein that is involved in
the interaction. To determine the interface or binding site, there also exist
experimental procedures such as X-ray crystallography [50], Nuclear Magnetic
Resonance (NMR) [51], and different biological assays such as site-directed
mutagenesis [52]. A detailed discussion of these experimental techniques is out of the
scope of this dissertation as the primary focus of this study is on computational
techniques. Interested readers are referred to [43]–[52] for further details. Here, we
briefly describe some shortcomings of these experimental techniques.
Experimental techniques can accurately determine protein-protein interactions
(PPIs), but these techniques are expensive and time-consuming [39], [53], [54].
Meanwhile, high throughput methods produce many false positives and false negatives.
Moreover, these methods are difficult to reproduce and have limited coverage [41].
Furthermore, experimental methods depend on laboratory protocols and experimental
conditions which make it difficult to have an unbiased comparison across different
studies. Due to these shortcomings in experimental techniques, accurate computational
methods for protein interaction, binding affinity, and interface prediction are required.
2.4 Computational Methods
Cost and time constraints of experimental methods make them infeasible for large-scale
application at the interactome level of an organism. Therefore, there is high
demand for accurate computational approaches to support wet-lab methods by
predicting and ranking probable PPIs. Such computational approaches can assist
biologists in focusing on the most likely interactions [55]. Several computational
techniques exist in the literature for protein-protein interaction problems. These
computational techniques can roughly be categorized into classical and machine
learning based methods. In this study, we focus on machine learning based methods,
while classical methods are not within the scope of this dissertation. However, we give
a brief overview of these classical computational techniques in the next section to show
their limitations and to highlight the importance of machine learning based techniques
in solving protein interaction related problems.
Figure 2.6. Experimental methods to determine protein interactions, binding
affinity, and interaction site or interface.
2.4.1 Classical Computational Methods
A number of computational methods, other than machine learning based techniques,
have been developed to determine protein interaction, binding affinity, and interface or
interaction site of proteins in a protein complex as shown in Fig. 2.7. These methods
have been grouped as homology-based (Interolog Search, Phylogenetic Similarity, and
Template-Based), simulation-based (Molecular Dynamics Simulation), and others (Text
Mining, Network Topology Based, Docking, Energy Perturbation, and Empirical
Scoring). A detailed discussion of these methods is not within the scope of this study
as our primary focus is on machine learning techniques. However, interested readers
are referred to [7], [10], [56]–[58] for further study. Here, we provide a brief overview
of these techniques along with their inherent limitations.
Figure 2.7. Classical computational methods to predict protein interactions,
binding affinity, and interaction site or interface of a protein complex.
Homology-Based Methods: Homology-based methods rest on the basic
assumption that protein interactions are conserved among different organisms. In Interolog
search, protein-protein interactions are predicted based on the homology of proteins
across different organisms as shown in Fig. 2.8(a) [59]–[64]. Methods such as Molecular
Interaction Search Tool (MIST) and BIP-BIANA (Biologic Interactions and Network
Analysis) have been proposed and made accessible through their web servers for PPI
prediction through Interolog search [61], [65]. A similar approach followed in
homology-based methods is a phylogenetic