Investigating Machine Learning Based Prediction of...

Wajid Arshad Abbasi

2019

Department of Computer and Information Sciences

Pakistan Institute of Engineering and Applied Sciences

Nilore, Islamabad, Pakistan

Investigating Machine Learning Based

Prediction of Protein Interactions

Reviewers and Examiners

Foreign Reviewers

1. Dr. Brian J. Geiss, Associate Professor, Colorado State University (CSU), USA

2. Prof. Dr. Shihua Zhang, Professor, Chinese Academy of Sciences (CAS), China

3. Dr. Henri Xhaard, Assistant Professor, University of Helsinki, Finland

Thesis Examiners

1. Prof. Dr. Ijaz Mansoor Qureshi, Professor, Air University, Islamabad

2. Dr. Hammad Naveed, Associate Professor, NUCES, Islamabad

3. Dr. Imran Amin, Principal Scientist, NIBGE, Faisalabad

Head of the Department (Name): Dr. Asifullah Khan

Signature with Date: _________________________________

Thesis Submission Approval

This is to certify that the work contained in this thesis, entitled “Investigating machine learning based prediction of protein interactions”, was carried out by Wajid Arshad Abbasi, and in my opinion, it is fully adequate, in scope and quality, for the degree of Ph.D. Furthermore, it is hereby approved for submission for review and thesis defense.

Supervisor: ___________________________________

Name: Dr. Fayyaz ul Amir Afsar Minhas

Date: 28 March, 2019

Place: PIEAS, Islamabad.

Head, Department of Computer and Information Sciences: ___________________

Name: Dr. Asifullah Khan

Date: 28 March, 2019

Place: PIEAS, Islamabad.

Investigating Machine Learning Based

Prediction of Protein Interactions

Wajid Arshad Abbasi

Submitted in partial fulfillment of the requirements

for the degree of Ph.D.

2019

Department of Computer and Information Sciences

Pakistan Institute of Engineering and Applied Sciences

Nilore, Islamabad, Pakistan

Dedications

To my grandparents and my uncle Asif Habib Abbasi - who are not in this world

anymore but continue to live on in my heart. Also, to my parents, my wife Dr. Saiqa

Andleeb, and my daughter Zarnish Habib, whose love and support have been

fundamental in completing my thesis.

Author’s Declaration

I, Wajid Arshad Abbasi, hereby declare that my Ph.D. thesis titled “Investigating machine learning based prediction of protein interactions” is my own work and has not been submitted previously by me or anybody else for any degree at Pakistan Institute of Engineering and Applied Sciences (PIEAS) or any other university/institute in the country/world.

If my statement is found to be incorrect at any time (even after my graduation), the university has the right to withdraw my Ph.D. degree.

____________________

(Wajid Arshad Abbasi)

28 March, 2019

PIEAS, Islamabad.

Plagiarism Undertaking

I, Wajid Arshad Abbasi, solemnly declare that the research work presented in the thesis titled “Investigating machine learning based prediction of protein interactions” is solely my research work with no significant contribution from any other person. Any small contribution or help, wherever taken, has been duly acknowledged, and the complete thesis has been written by me.

I understand the zero-tolerance policy of the Higher Education Commission (HEC) and Pakistan Institute of Engineering and Applied Sciences (PIEAS) towards plagiarism. Therefore, I, as the author of the thesis titled above, declare that no portion of my thesis has been plagiarized and that any material used as a reference is properly cited.

I undertake that if I am found guilty of any formal plagiarism in the thesis titled above, even after the award of my Ph.D. degree, PIEAS reserves the right to withdraw/revoke my Ph.D. degree, and that HEC and PIEAS have the right to publish my name on the HEC/PIEAS website on which the names of students who submitted plagiarized theses are placed.

____________________

(Wajid Arshad Abbasi)

28 March, 2019

PIEAS, Islamabad.

Copyrights Statement

The entire contents of this thesis, entitled “Investigating Machine Learning Based Prediction of Protein Interactions” by Wajid Arshad Abbasi, are the intellectual property of Pakistan Institute of Engineering & Applied Sciences (PIEAS). No portion of the thesis should be reproduced without obtaining explicit permission from PIEAS.

Table of Contents

Dedications
Author’s Declaration
Plagiarism Undertaking
Copyrights Statement
Table of Contents
List of Figures
List of Tables
Acknowledgments
Abstract
List of Publications and Patents
List of Abbreviations and Symbols
1 Introduction
1.1 Motivations
1.2 Problem Statement and Research Aims
1.3 Dissertation Organization and Chapters’ Digest
2 Problem Formulation and Literature Survey
2.1 Proteins
2.1.1 Protein Structures
2.1.2 Protein Functions
2.2 Protein Interactions and Complex Formation
2.2.1 Binding Affinity of Interacting Proteins
2.2.2 Interfaces or Interaction Sites of Proteins
2.2.3 Types of Protein Interactions and Complexes
2.2.4 Biologically Significant Effects of Protein Interactions
2.2.5 Problems of Interest in Protein Interactions
2.3 Experimental Methods
2.4 Computational Methods
2.4.1 Classical Computational Methods
2.4.2 Machine Learning
2.4.2.1 Protein Interaction Prediction
2.4.2.2 Protein Binding Affinity Prediction
2.4.2.3 Protein Interface or Interaction Site Prediction
3 Issues in Host-Pathogen Protein Interaction Prediction
3.1 Methods
3.1.1 Datasets and Preprocessing
3.1.1.1 Human-HIV Interaction Dataset (HH)
3.1.1.2 Human-Adenovirus Interaction Dataset (HA)
3.1.2 Classifiers
3.1.3 Feature Extraction
3.1.4 Model Evaluation
3.1.5 Performance Metrics
3.2 Results and Discussion
3.2.1 Analysis of Evaluation Methodologies
3.2.2 Metrics for HPI Prediction
3.3 Chapter Summary
4 CaMELS: Calmodulin Interaction Learning System
4.1 Methods
4.1.1 Dataset and Preprocessing
4.1.1.1 CaM Interaction Site Dataset
4.1.1.2 CaM Interaction Dataset
4.1.2 Machine Learning Models
4.1.2.1 MIL Based CaM-Interaction Site Prediction
4.1.2.3 Interaction Prediction
4.1.3 Feature Extraction
4.1.3.1 Window Level Feature Representation
4.1.3.2 Protein Level Feature Representation
4.1.4 Performance Evaluation
4.1.4.1 Evaluation of Interaction Prediction
4.1.4.2 Evaluation of Interaction Site Prediction
4.1.5 Model Selection
4.1.6 Webserver
4.2.1 Interaction Prediction
4.2.1.1 Improved CaM Interaction Prediction
4.2.1.2 Motifs Search Fails to Predict CaM Interactions
4.2.1.3 Importance of the Whole Protein Sequence
4.2.1.4 GO Term Enrichment Analysis
4.2.1.5 Performance Evaluation on Validation Set
4.2.1.6 In Silico Mutation Analysis
4.2.1.7 Validation Through Wet-Lab Experiments
4.2.1.8 Feature Analysis
4.2.2 Interaction Site Prediction
4.2.2.1 Improved CaM Interaction Site Prediction
4.2.2.2 Motifs Search Fails to Predict CaM Interaction Site
4.2.2.3 Performance Evaluation on Validation Set
4.2.2.4 Validation Through Wet-Lab Experiments
4.2.2.5 Contribution of Amino Acids and Motifs Identification
4.2.2.6 MIL Using SSGO Method
4.2.2.7 Analysis of Features in Interaction Site Prediction
4.3 Chapter Summary
5 ISLAND: In-Silico Protein Affinity Predictor
5.1 Methods
5.1.2 Evaluation of the PPA-Pred2 Webserver
5.1.3 Sequence Homology as Affinity Predictor
5.1.4 Proposed Methodology
5.1.5 Sequence-Based Features
5.1.5.1 Explicit Features
5.1.5.2 Kernel Representations
5.1.6 Complex Level Features Representation
5.1.6.1 Feature Concatenation
5.1.6.2 Combining Kernels
5.1.7 Regression Models
5.1.7.1 Ordinary Least-Squares Regression (OLSR)
5.1.7.2 Support Vector Regression (SVR)
5.1.7.3 Random Forest Regression (RFR)
5.1.8 Model Validation and Performance Assessment
5.1.9 Webserver
5.2.1 Binding Affinity Prediction Through Sequence Homology
5.2.2 Binding Affinity Prediction Through ISLAND
5.2.3 Comparison Using External Independent Test Dataset
5.3 Chapter Summary
6 Learning Protein Binding Affinity Using Privileged Information
6.1 Methods
6.1.2 Proposed Approach
6.1.2.1 Baseline Classifiers
6.1.2.2 LUPI-SVM
6.1.3 Feature Representation
6.1.3.1 Sequence-Based Features
6.1.3.2 Structure-Based Features
6.1.4 Model Validation, Selection and Performance Assessment
6.1.5 Webserver
6.2 Results and Discussion
6.2.1 Performance of Baseline Learners
6.2.2 Performance of LUPI-SVM
6.2.3 Evaluation Through Validation Dataset
6.2.4 Feature Analysis for Binding Affinity Prediction
6.2.5 Learned Models Using LUPI and Classical SVM
6.3 Chapter Summary
7 PAIRpred: A Webserver for Protein Interface Prediction
7.1 Implementation
7.2 Usage
7.3 Results
7.4 Validation Through Wet-Lab Experiments
8 Conclusions and Future Work
8.1 Conclusions
8.2 Future Work
8.2.1 Application of Learning Using Privileged Information
8.2.2 Handling Data Sparsity in Protein Interaction Domain
Appendix A: Predictions Through CaMELS
References

List of Figures

Figure 1.1 Central Dogma of Molecular Biology. A portion of DNA called a gene is transcribed to RNA, which is used as a template to synthesize proteins during translation.

Figure 2.1 The chemistry of an amino acid (left panel) and properties of the side chain (right panel). Every amino acid has a carbon atom, called an alpha carbon (Cα), bonded to a carboxylic acid (–COOH) group, an amine (–NH2) group, a hydrogen atom, and an R group (side chain) that is unique for every amino acid. The physicochemical properties of an amino acid are determined by the nature of its side chain.

Figure 2.2 Different levels of protein structure. Amino acids join together in various combinations through covalent bonds to form the primary structure. Different sections of the primary structure fold through backbone hydrogen bonding to form alpha helices and beta sheets (secondary structure). Elements of the secondary structure fold again through side-chain interactions to form the tertiary structure, stabilized by ionic bonds, disulfide bonds, hydrophobic interactions, and hydrogen bonding. Protein quaternary structures are formed through the interaction or binding of two or more independent tertiary structures.

Figure 2.3 Protein functions. Proteins perform their functions as enzymes (Sucrase), antibodies (T-cell receptor), messengers (Insulin), or structural components (Actin). The most fundamental function that proteins perform, and which underpins all their other biochemical functions, is their ability to bind or interact with other proteins or macromolecules.

Figure 2.4 Protein interaction. Two unbound proteins (ligand and receptor) with complementarity in shape and charge distribution interact with each other to form a protein complex. The interface of the complex at a 6 Å distance threshold is shown with sticks in magenta color.

Figure 2.5 Types of protein interactions and complexes. Protein complexes are homomeric if one type of protein chain is involved in the interaction, and heteromeric if various types of protein chains are involved in complex formation. Protein complexes are further divided into stable or transient based on the duration of the interaction. Binding affinity is a measure of the strength of interaction between the proteins involved in complex formation. It is measured in terms of the dissociation constant (𝐾𝑑), and binding affinity is high for low 𝐾𝑑 values. Stable complexes have high binding affinity and weak transient complexes have low binding affinity.

Figure 2.6 Experimental methods to determine protein interactions, binding affinity, and interaction site or interface.

Figure 2.7 Classical computational methods to predict protein interactions, binding affinity, and interaction site or interface of a protein complex.

Figure 2.8 Classical computational methods for protein interaction, binding affinity and interface prediction. (a) Interolog search; (b) Docking.

Figure 2.9 A general framework for developing machine learning models for PPIs, binding affinity and interface prediction.

Figure 2.10 Machine learning methods for protein interactions, binding affinity, and interface or interaction site prediction.

Figure 3.1 A general framework of machine learning models used to predict host-pathogen protein interactions (HPIs).

Figure 3.2 A comparison of two different cross-validation (CV) schemes on a toy dataset: K-fold (left panel) and Leave One Pathogen Protein Out (LOPO) (right panel). In both evaluation protocols, the number of folds is equal to the number of pathogen proteins in the toy dataset. In K-fold CV, folds are created randomly, while in LOPO, folds are created with respect to pathogen proteins. With K-fold CV, data overlap occurs for both host and pathogen proteins, e.g., proteins 𝑝1 and ℎ1 occur in both train and test sets in each fold, whereas with LOPO CV the overlap vanishes with respect to pathogen proteins.

Figure 3.3 Precision-recall curves obtained through K-fold and LOPO cross-validation. (a-d) Human-HIV and (e-h) Human-Adenovirus interaction datasets. The mean area under the curves across folds, along with the standard deviation, is shown in parentheses.

Figure 3.4 Receiver operating characteristic (ROC) curves obtained through K-fold and LOPO cross-validation. (a-d) Human-HIV and (e-h) Human-Adenovirus interaction datasets. The mean area under the curves across folds, along with the standard deviation, is shown in parentheses.

Figure 3.5 Radar plots of the area under the ROC curve (AUC-ROC) using two different cross-validation schemes for all models.

Figure 3.6 Radar plots of the area under the precision-recall curves (AUC-PR) using two different cross-validation schemes for all models.

Figure 4.1 MIL framework for CaM interaction site prediction. The protein sequence 𝑝 is represented as a line and the annotated CaM interaction site as a box. All windows overlapping with the annotated interaction site in 𝑝 constitute positive examples (𝐵𝑝) and the rest of the windows constitute negative examples (𝑁𝑝). The score obtained from the trained discriminant function 𝑓(𝑥) should be higher for at least one positive example than the scores generated for all negative examples in 𝑝.

Figure 4.2 MIL training algorithm with SSGO for CaM interaction site prediction.

Figure 4.3 The online user interface for the CaMELS webserver. (a) The webserver accepts a FASTA file or plain sequence of a protein for CaM interaction and interaction site prediction; (b) Interaction prediction model; (c) Interaction site prediction model.

Figure 4.4 (a) Receiver operating characteristic (ROC) curves; (b) Precision-recall (PR) curves for CaM interaction prediction for all models. The averaged area under the curve across folds is shown in parentheses.

Figure 4.5 (a) Precision-recall (PR) curves showing a comparison of CaMELS with MI-1 and iLoops. The averaged area under the curve across folds is shown in parentheses; (b) Violin plot showing density distributions of scores for positive (CaM interacting) and negative (non-interacting) proteins generated through DFS and CaMELS. Dotted lines show density quartiles.

Figure 4.6 (a) Receiver operating characteristic (ROC) curves; (b) Precision-recall (PR) curves; (c) ROC0.1 curves; (d) RFPP curves for CaM interaction site prediction across different models. The averaged area under the curves across folds is shown in parentheses.

Figure 4.7 Interaction sites predicted through CaMELS for complexes of proteins with CaM in the validation dataset. Calmodulin (CaM) (grey with light shade); CaM interacting protein (grey with dark shade); the predicted central residue of the interaction site (sphere); residues of the CaM interacting protein within 5 Å of CaM (stick form). (a) PDB ID: 1NWD; (b) PDB ID: 1SY9; (c) PDB ID: 2M0K; (d) PDB ID: 5DOW; (e) PDB ID: 1YRT.

Figure 4.8 Interaction site prediction scores through CaMELS for proteins used in mutagenic studies. The location of the predicted interaction site is denoted with a red dot. (a) LCa; (b) SGS3 of Nicotiana benthamiana.

Figure 4.9 Learned weight vectors of classifiers during training of CaMELS. (a) Weights obtained during training using the AAC feature representation; (b) Heat map of the weights obtained during training using the PDC feature representation; (c) Top 50 motifs learned during training using the PDGT feature representation. The actual weight value learned for each feature during training is shown in the numeric column.

Figure 5.1 A general framework for protein affinity prediction using machine learning techniques.

Figure 5.2 Techniques adopted for generating a sequence-based feature representation of a protein complex for developing machine learning based protein binding affinity prediction models.

Figure 5.3 The online user interface for the ISLAND webserver. A user can submit a pair of plain protein sequences of interest for binding affinity prediction.

Figure 5.4 Cumulative histogram of the absolute error between actual and predicted binding affinity values through ISLAND and PPA-Pred2 on the external independent validation dataset.

Figure 6.1 A framework to classify protein complexes based on their binding affinities through the paradigm of learning using privileged information (LUPI). Privileged information (3D structural information) is only required at training time (left panel) to improve performance at test time (right panel) using sequence information alone.

Figure 6.2 Learning using privileged information with stochastic sub-gradient optimization training.

Figure 6.3 Number of interacting residue pairs (NIRP) in the interface of a protein complex. The frequency of non-repeating pairs (considering A:B and B:A the same) was computed from the bound 3D structures of the ligand (L) and receptor (R) of a protein complex. Residues (shown as spheres) within a distance cutoff of 8 angstroms (Å) are considered the interface of the complex. The bottom panel of the figure shows the form of the feature vector extracted through this scheme.

Figure 6.4 The online user interface for the LUPI-SVM webserver. (a) A user can submit a pair of plain protein sequences of interest for binding affinity prediction; (b) An elucidation of the predicted score.

Figure 6.5 (a) ROC and (b) PR curves showing a performance comparison between LUPI-SVM (with 2-mer as input and Moal Descriptors as privileged feature space) and the baseline classifiers (XGBoost, classical SVM (SVM), and Random Forest (RF) with 2-mer features) on the affinity benchmark dataset. The average area under the ROC and PR curves (AUC) is shown in parentheses.

Figure 6.6 Feature analysis using SHAP. The impact of 2-mer features on model output is shown using SHAP values. The plot shows the top 20 2-mers for the Ligand (L) or Receptor (R) by the sum of their SHAP values over all samples. Feature value is shown in color (red: high; blue: low), revealing, for example, that a high value of L (EK) (count of ‘EK’ in a protein sequence designated as a ligand) contributes more to predicting low binding affinity complexes.

Figure 6.7 Weight vectors of the trained classifiers for the ligand Blosum features. (a) SVM with LUPI framework using Blosum (Protein) as input and Moal Descriptors as privileged feature space; (b) Classical SVM using Blosum (Protein) features.

Figure 7.1 Flowchart of the PAIRpred webserver. PAIRpred takes a pair of proteins in PDB or FASTA format. Upon successful format validation, PAIRpred performs chain selection and feature extraction from the given sequences or structures. Extracted features are used to generate predictions from a pre-trained SVM classifier. These prediction results are available for download and as an email attachment.

Figure 7.2 Web interface of the PAIRpred webserver. (a) Home page with user input and file upload options; (b) Chain selection; (c) Job submission notification and view results options.

Figure 7.3 Input PDB files with modified B factors. B factors of the ligand PDB file are replaced with 'Ligand scores' and those of the receptor PDB file with 'Receptor scores'.

Figure 8.1 Machine learning techniques to handle sparsity in labeled training data. SMEs: Subject Matter Experts; LUPI: Learning Using Privileged Information; MIL: Multiple Instance Learning; GANs: Generative Adversarial Networks.

List of Tables

Table 3.1 Proposed biologist-centric metrics to assess the generalization performance of HPI predictors over LOPO cross-validation across all models.

Table 4.1 Results showing the performance of CaMELS in comparison to DFS for CaM interaction prediction for all models.

Table 4.2 Results showing the performance of CaMELS and MI-1 via Gene Ontology term enrichment analysis.

Table 4.3 Results showing the performance of CaMELS for CaM interaction site prediction in comparison to SVM (baseline), mi-SVM and MI-1 across all models. AUCPR was unavailable for SVM (baseline), mi-SVM and MI-1.

Table 5.1 Evaluation of PPA-Pred2 through its webserver on affinity benchmark dataset 2.0.

Table 6.1 Protein complex classification results obtained using classical SVM, Random Forest and XGBoost using input and privileged features with LOCO cross-validation over the affinity benchmark dataset.

Table 6.2 Protein complex classification results obtained through classical SVM and LUPI across different features using LOCO cross-validation over the affinity benchmark dataset.

Table 6.3 Comparison of classical SVM and LUPI-SVM on the external independent validation dataset with training on the affinity benchmark dataset.

Table A1 Top 241 predicted CaM binding proteins from the proteome of A. thaliana through CaMELS along with their predicted interaction sites.

Table A2 241 CaM binders from the interaction dataset along with binding sites predicted through CaMELS.

Table A3 List of 250 proteins used as the negative set in the independent validation dataset.

Acknowledgments

After humbly thanking Allah Almighty, I want to express my gratitude to those who

helped me to conduct this research work and enabled me to complete my PhD: First, I

am very thankful to my adviser, Dr. Fayyaz Ul Amir Afsar Minhas for his time,

devotion, encouragement, motivation, guidance, support, and valuable discussions

that helped me to define and execute my thesis research. During the four years of my

Ph.D., we spent a significant amount of time during meetings, discussions,

brainstorming and presentation sessions. He always tried to deliver every bit of

knowledge to me and I always found myself much more relaxed, focused and motivated

after every meeting with him. I hope he will remain my mentor and guide for the rest of

my life.

I am also grateful to the members of my research committee: Dr. Sikander Majid

Mirza and Dr. Asifullah Khan for their feedback on the research proposal. I also want

to acknowledge the support and guidelines of our collaborators Prof. Dr. Asa Ben-Hur,

Colorado State University, USA and Dr. Imran Amin, Principal Scientist, NIBGE. I am

also thankful to other faculty members of the Department of Computer and Information

Sciences, PIEAS: Dr. Abdul Jalil, Dr. Mutawarra Hussain, Dr. Anila Usman, Dr. Abid

Mughal, Dr. Javaid Khurshid, Dr. Naeem Akhtar, Dr. Shahzad Ahmad Qureshi for their

support, and especially to Dr. Muhammad Hanif Durad for providing me with a seat in

his lab. I also want to extend my sincere gratitude to my lab fellows: Amina Asif, Sadaf

Khan, Adiba Yaseen, Kanza Hamid, Abdul Hanan Basit, Bismillah Jan, Fahad ul

Hassan, Asif Khan, Muhammad Dawood and Hira Kamal for their support and help. I

would never forget those wonderful hikes and parties which I had with them during my

Ph.D. studies. I am also indebted to my other friends at PIEAS: Muhammad Imran,

Naveed Akhtar, Mohsin Sittar, Naveed Chohan, Noorul Wahab, Faheem Afsar,

Muhammad Bashir, and Mirqad Ayaz for their cooperation and moral support. If I

missed your name on this list and you think it belongs here, I apologize.

I would also like to express my gratitude towards my family, especially my

parents (Arshad Habib Abbasi and Safia Shaheen), Uncles (Arif Habib, Sardar Imtiaz

Abbasi, Tariq Habib, M. Shafiq Abbasi and Abid), Aunties (Razia Shaheen, Shaheen,

Balqees and Razia Sultan), brothers (Amjid, Badar Munir, Nayyer, Waseem, Asad Ali,

Zohaib, Umer, Rizwan, Shujahat Ali and Abdullah) and sisters (Fozia Arshad, Qudsia,

Fozia Aziz, Kiren, Uzma, Rubi, Faiza, Maryam and Nida) for their care, love and good

wishes. I would greatly acknowledge the support, encouragement, care, love, and

patience of my beloved wife Dr. Saiqa Andleeb. Without her support, it would be

impossible for me to complete my doctorate. Also, thanks to my daughter Zarnish

Habib, niece and nephews (Ahmed, Ahsan, Ayan Mansoor, Mahrosh Habib, Zain, Esa,

Aryan Habib, Hashim Habib, and Mohid Habib); all of you are the reason for me to keep

going.

I must also extend my thanks to Muhammad Sadique Awan, Shabir Ahmed

Abbasi and Imran Abbasi at the University of Azad Jammu and Kashmir for their

support in official matters.

Lastly, I would like to acknowledge the Higher Education Commission (HEC)

of Pakistan for funding my Ph.D. studies via a grant (PIN: 213-58990-2PS2-046) under

the Indigenous 5000 Ph.D. Fellowship scheme. I am also thankful to HEC for providing me funds

under the International Research Support Initiative Program (IRSIP) to pursue my

Ph.D. research work at Colorado State University (CSU), USA. I am grateful for this

scholarship chiefly because it provided me with an opportunity to expand

my horizons and knowledge.

Wajid Arshad Abbasi

Abstract

Protein interactions are crucial in the cell for performing cellular functions and the study

of protein interactions is a very important domain of research in bioinformatics. In

reference to protein interactions, biologists are usually interested in three core

problems: determining pairwise protein interactions, determination of binding affinity,

and identification of the interface. Computational methods to solve these protein

interaction problems have emerged as an active research area due to tedious, costly, and

time-consuming experimental procedures. Our aim in this work is to develop novel

machine learning based methods for protein interaction, binding affinity, and interaction

site prediction with improved generalization performance.

In this dissertation, we have developed host-pathogen protein interaction predictors

using machine learning. One of our findings is that existing methods for protein

interaction prediction that use K-fold cross-validation for performance assessment

report over-estimated accuracy values as K-fold cross-validation does not take pairwise

protein similarity between training and test examples into account. To control this data

redundancy at pathogen protein level, we have proposed and advocated the use of an

alternate evaluation scheme called Leave One Pathogen Protein Out (LOPO) cross-

validation along with some biologist centric metrics for designing protein-protein

interaction prediction methods.

We have also designed a novel machine learning model called CaMELS (CalModulin

intEraction Learning System) for interaction and interaction site prediction of

Calmodulin (CaM) which is a very important and highly conserved protein across all

eukaryotes. CaMELS relies on a novel implementation of multiple instance learning

solver for protein binding site prediction that leads to significant improvement in

predictive performance. One of our collaborators has confirmed the effectiveness of

CaMELS through wet-lab experiments as well.

We have also focused on the more generic problem of predicting binding affinity in

protein interactions and presented various sequence-based machine learning models.

For this purpose, we have developed a novel machine learning method which is based

on the framework of Learning Using Privileged Information (LUPI). Our state-of-the-

art method uses protein 3D structure as privileged information at training time while

expecting only protein sequence information during testing. This makes the method

applicable to novel proteins for which no solved 3D structure is available, while still

letting it benefit from structural information during training.

We have also developed a webserver for an existing state-of-the-art protein-protein

interface prediction method called PAIRPred. The accuracy of this webserver has also

been validated by our collaborators through wet-lab experiments.

List of Publications and Patents

Journal Publications

Wajid Arshad Abbasi, Amina Asif, Asa Ben-Hur and Fayyaz ul Amir Afsar

Minhas, “Learning Protein Binding Affinity using Privileged Information”,

BMC Bioinformatics, vol. 19, 425, 2018.

Abdul Hanan Basit, Wajid Arshad Abbasi, Amina Asif and Fayyaz ul Amir

Afsar Minhas, “Training Large Margin Host-Pathogen Protein-Protein

Interaction Predictors”, Journal of Bioinformatics and Computational Biology,

vol. 16, 18500142, 2018.

Wajid Arshad Abbasi, Amina Asif, Saiqa Andleeb and Fayyaz ul Amir Afsar

Minhas, “CaMELS: In silico prediction of calmodulin binding proteins and their

binding sites”, Proteins: Structure, Function and Bioinformatics, vol. 85 (9), pp.

1724–1740, 2017.

Wajid Arshad Abbasi and Fayyaz ul Amir Afsar Minhas, “Issues in

performance evaluation for host–pathogen protein interaction prediction”,

Journal of Bioinformatics and Computational Biology, vol. 14 (3), 1650011,

2016.

Conference Publications

Adiba Yaseen, Wajid Arshad Abbasi and Fayyaz ul Amir Afsar Minhas,

“Protein binding affinity prediction using support vector regression and

interfecial features”, 15th International Bhurban Conference on Applied

Sciences and Technology (IBCAST), IEEE, 2018, pp. 194-198.

Kanza Hamid, Amina Asif, Wajid Arshad Abbasi, Durre Sabih and Fayyaz ul

Amir Afsar Minhas, “Machine Learning with Abstention for Automated Liver

Disease Diagnosis”, in Proceedings of the 15th International Conference on

Frontiers of Information Technology, IEEE, 2017, pp. 356-361.

Preprints

Wajid Arshad Abbasi, Fahad Ul Hassan, Adiba Yaseen, Fayyaz Ul Amir Afsar Minhas. “ISLAND: In-Silico Prediction of Proteins Binding Affinity

Using Sequence Descriptors”, arXiv:1711.0540.

Amina Asif, Wajid Arshad Abbasi, Farzeen Munir, Asa Ben-Hur, and Fayyaz ul Amir Afsar Minhas. “pyLEMMINGS: Large Margin Multiple

Instance Classification and Ranking for Bioinformatics Applications”,

arXiv:1711.04913.

List of Abbreviations and Symbols

𝑷𝒓 Pearson Correlation Coefficient

3-D Three Dimensional

AAC Amino Acid Composition

AUC Area under the ROC Curve

AUC-PR Area Under the Precision-Recall Curve

AUC-ROC Area Under the ROC Curve

BIP-BIANA Biologic Interactions and Network Analysis

BLOSUM Blocks Substitution Matrix

CaM Calmodulin

CaMELS Calmodulin Interaction Learning System

CV Cross-Validation

DFS Discriminant Function Scoring

DNA Deoxyribonucleic Acid

FHR False Hit Rate

FP Fluorescence Polarization

GO Gene Ontology

HIV Human Immunodeficiency Virus

HPIs Host-Pathogen Interactions

HTS High-Throughput Sequencing

IR Insulin Receptor Tyrosine Kinase

ISLAND In-Silico Protein Affinity Predictor

ITC Isothermal Titration Calorimetry

JBCB Journal of Bioinformatics and Computational Biology

LOCO Leave One Complex Out

LOPO Leave One Pathogen Protein Out

LUPI Learning Using Privileged Information

MDS Molecular Dynamics Simulation

MIL Multiple Instance Learning

MRFPP Median Rank of the First Positive Prediction

NCBI National Center for Biotechnology Information

NIRP Number of Interacting Residue Pairs

NMR Nuclear Magnetic Resonance

OLSR Ordinary Least-Squares Regression

PAIRpred Partner Aware Interacting Residue Predictor

PD-Blosum Position Dependent BLOSUM-62

PDC Position Dependent Composition

PDGT Position Dependent Gappy Triplet

PHISTO Pathogen-Host Interaction Search Tool

PPIs Protein-Protein Interactions

PR Area Under the Precision-Recall Curve

PseAAC Pseudo-Amino Acid Compositions

PSFMs Position Specific Frequency Matrices

PSSMs Position Specific Scoring Matrices

RB Retinoblastoma

RF Random Forest

RFPP Rank of the First Positive Prediction

RFR Random Forest Regression

RMSE Root Mean Squared Error

RNA Ribonucleic Acid

ROC Receiver Operating Characteristic

SPR Surface Plasmon Resonance

SSGO Stochastic Sub-Gradient Optimization

SVM Support Vector Machines

SVR Support Vector Regression

TAP Tandem Affinity Purification

THR True Hit Rate

UniProt Universal Protein Knowledgebase

WHO World Health Organization

XGBoost Extreme Gradient Boosting

Y2H Yeast Two-Hybrid

1 Introduction

In order to better understand the complexity of life and mechanism of biological systems,

we need to analyze dynamic interactions of these biological systems at the molecular level

[1]. In all living organisms, the cell is the fundamental unit of life and it is composed of

different molecules which perform all life-sustaining functions [2]. There are three

molecules in a cell that are primarily responsible for sustaining life: deoxyribonucleic acid

(DNA), ribonucleic acid (RNA), and proteins. These three molecules function under the

principle of the Central Dogma of Molecular Biology [3]. This dogma operates in two steps:

first, a portion of DNA called a gene is copied to form a messenger RNA (transcription)

and then the messenger RNA is used as a template to synthesize proteins (translation) as

shown in Fig. 1.1. Proteins are the key molecules which perform almost all the biologically

significant functions at cellular level.

Figure 1.1. Central Dogma of Molecular Biology. A portion of DNA called a

gene is transcribed to RNA which is used as a template to synthesize proteins

during translation.

Functionally, proteins are the dominant players in the cell and its second most abundant

molecule after water. The importance of proteins in our body can be

appreciated by considering the fact that 50% of the dry weight of the human body is protein

[4]. Proteins perform their functions in different forms: for example, pepsin helps in digestion as

an enzyme, insulin controls blood sugar as a hormone, calmodulin (CaM) affects

intracellular signaling, hemoglobin transports oxygen, histones play a role in gene

regulation, antibodies combat infectious diseases and many more [5]. Considering this

huge functional diversity of proteins, it is important to decipher their working mechanism

in order to completely understand cellular behavior.

Up to the 1970s, it was widely believed that a single protein performs a single

function under a dogma called ‘one gene/one enzyme/one function’ and protein

interactions were considered as purification artifacts [6], [7]. The idea of single protein-

single function was ultimately shown to be incorrect with the discovery of the involvement

of multiple proteins in DNA replication (e.g., DNA helicase, DNA primase) besides the

polymerase and the participation of more than 20 proteins in protein import into

mitochondria [7]–[9]. Now, it is an established fact that proteins do not function in isolation

and more than 80% of all cellular proteins perform their biological functions by forming

complexes through protein-protein interactions (PPIs) [10], [11]. Proteins achieve their

functional diversity through interactions and such interactions mediate overall organismal

systems including metabolic pathways and cell-to-cell interactions [12]. Because of their

dominant role in biological processes, protein interactions are normally responsible for

healthy or diseased states in an organism. For example, the retinoblastoma (RB) protein is

a tumor suppressor which prevents abnormal cell division by binding to the E2F

transcription factor. When this interaction gets perturbed due to the absence or mutation of

RB protein, E2F will be freely available for unrestrained cell division and formation of the

tumor [13]. Therefore, studying protein-protein interactions is crucial for understanding

basic cellular biology, the functions of previously uncharacterized proteins, and disease

mechanisms. Moreover, knowledge about protein interactions is also important in

therapeutics to develop effective and personalized drugs with fewer side effects because

more than 80% of current therapeutic targets are proteins [5].

Protein-protein interactions (PPIs) are the noncovalent physical connections

established between amino acids in the 3-D structures of two or more proteins at the

specific locations called binding/interaction sites. Physicochemically complementary

protein interactions are normally steered by hydrogen bonding, electrostatic forces, and

hydrophobic effects [14]. Biologists are normally interested in solving the following three

main challenging problems related to PPIs.

a) Pairwise Protein Interactions: Whether two given proteins interact or not?

b) Binding Affinity: Strength of the interaction.

c) Interface or Interaction Site: Exact location of the interaction.

In this work, we have developed machine learning based computational methods

which would assist biologists in the wet lab for solving the aforementioned protein

interaction related challenges.

1.1 Motivations

The field of biology experienced two important conceptual shifts in the 20th century with

the discovery of Mendel’s laws and restriction enzymes [15], [16]. Meanwhile, the

complete sequencing of the human genome in 2001 [17] and emerging high-throughput

sequencing (HTS) technologies empowered biologists to think of complex biological

questions in terms of molecules. Raw data of DNA and protein sequences is growing at an

exponential rate in different databases such as GenBank [18] and Universal Protein

Knowledgebase (UniProt) [19]. The major challenge now is to analyze this large amount

of data as most of the genes and proteins sequenced are of unknown function and

uncharacterized [20]. The task of analyzing data in proteomics is further complicated because

of the involvement of complex protein interactions. This problem creates a wide scope for

researchers in the field of computer science and bioinformatics to analyze data in-silico

and assist biologists in solving interesting biological problems.

Moreover, in interactomics, the study of protein interactions, experimental

methods are often laborious, time-consuming and expensive, making it difficult to

investigate all possible protein interactions within and across organisms. For instance, the

bacterium Bacillus anthracis has 5,508 protein-encoding genes [21], which when paired

with the 20,000 or so human genes [22], gives more than 100 million possible protein

interaction pairs to validate experimentally. It is not practical to verify all possible

interaction pairs through wet-lab experiments. Therefore, there is an extreme need for

computational approaches to support wet-lab methods by predicting and ranking probable

PPIs. Such computational approaches can assist biologists in focusing on the most likely

interactions.
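The size of this search space follows from simple arithmetic. As a minimal sketch, assuming one protein per gene and that every pathogen protein can in principle be paired with every human protein (the variable names are illustrative only):

```python
# Gene counts quoted in the text; treating one gene as one protein is an assumption.
pathogen_genes = 5_508   # Bacillus anthracis protein-encoding genes [21]
human_genes = 20_000     # approximate number of human genes [22]

# Every pathogen protein paired with every human protein.
candidate_pairs = pathogen_genes * human_genes
print(f"{candidate_pairs:,} candidate host-pathogen pairs")  # 110,160,000
```

Even for a single pathogen, the number of candidate pairs is far beyond what can be screened experimentally, which is why ranking the most probable interactions is useful.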

1.2 Problem Statement and Research Aims

Among computational approaches, application of machine learning techniques to

bioinformatics for the prediction of PPIs is a well-accepted idea. In machine learning based

predictors of PPIs, models are normally built by using sequence and structure information

of protein interactions which have been discovered through experimental methods.

Unavailability of structural information for most novel proteins limits the practical

use of these predictors, and therefore sequence information is the only practical choice.

Prediction of PPIs through machine learning using sequence data only is a challenging

problem because proteins in reality interact in a specific three-dimensional conformation.

Moreover, significant conformational changes in proteins, motion and

flexibility, alternate binding modes, the dependence of binding propensity on the binding

partner, and uncertainties in the annotation of available scientific data make the problem

of the prediction of PPIs hard. In every machine learning setting, it is vital to thoroughly

understand the nature of the problem, availability and amount of training data and the

prospective use of the system while designing the predictive model, its evaluation

methodology, and the performance metrics. However, this requirement is even more

crucial in bioinformatics in comparison to other application areas because of its role as a

tool for biological discovery. Also, with the growth in proteome data of different

organisms, there is a pressing need for PPIs predictors which can incorporate more and

more information from different sources to gain better generalization. To achieve these

goals, we have formulated and accomplished the following research aims in this study.

a) We have performed a survey of existing machine learning based computational techniques for protein-protein interaction prediction in order to assess the suitability and limitations of existing evaluation protocols and performance metrics for this purpose.

b) We have designed sequence-based machine learning models of PPIs for interaction, interaction site, and affinity prediction with improved generalization accuracy by incorporating learning data at the proteome level and combining information of interactions and interaction sites.

c) Generally, computational techniques exploit protein sequence and 3-D structural information to develop predictive models for protein interactions. One of the major issues limiting the applicability of techniques that use protein 3-D structural information is the unavailability of solved 3-D structures of novel proteins. Our aim in this study is therefore also to design a machine learning based model for PPI prediction that can use both protein structural and sequence information during training but requires only sequence information for testing.

1.3 Dissertation Organization and Chapters’ Digest

This dissertation is divided into the following chapters.

In chapter 2 “Problem Formulation and Literature Survey”, we give the

required background of proteins, their structure, functions, and interactions along with

experimental procedures of determining these interactions. We also perform a literature

survey of existing computational techniques of predicting protein-protein interactions

(PPIs), interfaces/interaction sites, and protein binding affinity along with the formulation

of these problems as machine learning problems.

In chapter 3 “Issues in Host-Pathogen Protein Interaction Prediction”, we have

performed a survey of existing machine learning based host-pathogen protein interactions

(HPIs) prediction techniques. The objective of the survey was to assess the suitability and

limitations of existing evaluation protocols and performance metrics to design predictive

HPI models. In this chapter, we have investigated the usefulness of K-fold cross-validation

for evaluating the generalization performance of pairwise protein interaction predictors in

host-pathogen interactions (HPIs). K-fold cross-validation does not avoid redundancy

between the train and test data and results in an inflated accuracy. To control this data

redundancy at pathogen protein level, we have proposed and shown the effectiveness of a

new evaluation scheme called Leave One Pathogen Protein Out (LOPO) cross-validation.

We have also proposed and suggested the use of some biologist-centric metrics for HPIs

predictors. The findings of this study have been published in the Journal of Bioinformatics

and Computational Biology (JBCB), 14(3), 2016, 1650011.
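The idea behind LOPO cross-validation can be illustrated with a short sketch. It assumes that each example is a (host protein, pathogen protein, label) triple and that a fold holds out every example involving one pathogen protein; the function name `lopo_splits` and the toy identifiers are hypothetical, and the exact protocol used in this work is described in chapter 3.

```python
from collections import defaultdict

def lopo_splits(pairs):
    """Yield (held_out_pathogen, train_indices, test_indices).

    `pairs` is a list of (host_id, pathogen_id, label) tuples. Each fold holds
    out every example that involves one pathogen protein, so that protein
    never appears in the training set of its own fold.
    """
    by_pathogen = defaultdict(list)
    for index, (_host, pathogen, _label) in enumerate(pairs):
        by_pathogen[pathogen].append(index)
    for pathogen, test_indices in by_pathogen.items():
        held_out = set(test_indices)
        train_indices = [i for i in range(len(pairs)) if i not in held_out]
        yield pathogen, train_indices, test_indices

# Toy data with made-up identifiers (not taken from any real dataset).
toy_pairs = [
    ("H1", "P1", 1), ("H2", "P1", 0),
    ("H1", "P2", 0), ("H3", "P2", 1),
    ("H2", "P3", 1),
]

for pathogen, train, test in lopo_splits(toy_pairs):
    print(pathogen, "train:", train, "test:", test)
```

In contrast, a random K-fold split may place examples of the same pathogen protein in both the training and test folds. Note that this sketch groups examples by protein identity only; it does not by itself account for sequence similarity between different pathogen proteins.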

In chapter 4 “CaMELS: Calmodulin Interaction Learning System”, we present

a machine learning based algorithm suite called CaMELS (CalModulin intEraction

Learning System) for CaM interaction and interaction site prediction using sequence

information alone. CaMELS models CaM interaction and interaction site prediction as two

separate classification problems and gives state-of-the-art accuracy for both tasks. To

predict CaM interaction, CaMELS uses a traditional support vector machine (SVM) along

with features extracted from the whole sequence of a protein instead of localized window

level features. For CaM interaction site prediction, CaMELS uses the

Multiple Instance Learning (MIL) paradigm to handle imprecisions in binding site

annotations in the training data. To solve the multiple instance learning problem,

CaMELS uses a custom-built algorithm based on stochastic sub-gradient optimization

(SSGO) that allows faster and more effective learning. We have shown improved

generalization performance of CaMELS using a variety of evaluation techniques including

wet-lab experiments. Python code for training and evaluating CaMELS together with a

webserver implementation is available at the URL:

http://faculty.pieas.edu.pk/fayyaz/software.html#camels. We have published the outcomes

of this work in Proteins: Structure, Function, and Bioinformatics, 85(9), 2017, 1724–1740.

In chapter 5 “ISLAND: In-Silico Protein Affinity Predictor”, sequence-based

protein binding affinity prediction methods using machine learning have been explored.

Specifically, we present our findings that the true generalization performance of even the

state-of-the-art sequence-only predictor is far from satisfactory and that the development

of machine learning methods for binding affinity prediction with improved generalization

performance is still an open problem. We have also proposed a sequence-based novel

protein binding affinity predictor called ISLAND which gives better accuracy than existing

methods over the same validation set as well as on an external independent test dataset. A

cloud-based webserver implementation of ISLAND and its Python code are available at

http://faculty.pieas.edu.pk/fayyaz/software.html#island.

In chapter 6 “Learning Using Privileged Information for Protein Binding

Affinity Prediction”, we have developed a novel machine learning method for predicting

binding affinity which is based on the framework of learning using privileged information

(LUPI). This method uses protein 3D structure as privileged information at training time

while expecting only protein sequence information during testing. The proposed method

outperforms several baseline learners and a state-of-the-art binding affinity predictor not

only in cross-validation but also on an additional validation dataset. This suggests that the

LUPI framework implemented for this work may also be useful in other areas of

bioinformatics. A Python implementation of the proposed method together with a

webserver is available at http://faculty.pieas.edu.pk/fayyaz/software.html#LUPI. The

outcomes of this work have been published in BMC Bioinformatics, 19, 2018, 425.

In chapter 7 “PAIRpred: A Webserver for Partner-Aware Protein Interface

Prediction”, a web server has been developed and deployed for PAIRpred which is a state-

of-the-art method for predicting partner-specific interface of a protein complex using either

sequence information alone or in conjunction with features derived from the unbound

structures of the two proteins in the complex. The web server is available at

http://faculty.pieas.edu.pk/fayyaz/software.html#pairpred. This webserver takes a pair of

proteins in FASTA or PDB format and produces downloadable predictions along with

the predicted interface highlighted in its output PDB files.

In chapter 8, “Conclusions and Future Work”, we have summarized the

conclusions drawn from this study along with the details of the projects to be completed in

the future.

2 Problem Formulation and Literature Survey

In this chapter, we start with a brief introduction of proteins and their characteristics to

assist the reader with relevant biological background. Then, we discuss interesting

biological problems in protein interactions, along with experimental and computational

methods of solving these problems. Further, we formulate biological questions in

protein-protein interaction as machine learning problems and perform a literature

survey of known machine learning techniques in this domain to highlight important

research questions.

2.1 Proteins

Proteins are the second most abundant macromolecules present in a cell [5]. Proteins

are made up of smaller units called amino acids. There are twenty naturally occurring

amino acids which are considered as the raw material of all proteins. An amino acid is

an organic compound containing an amine (-NH2) and a carboxyl (-COOH) group together

with a side chain functional group as shown in Fig. 2.1 [14]. Every amino acid has a

unique functional group attached to it. Different amino acids have different

physicochemical properties (see Fig. 2.1) and are encoded by codons in the

Deoxyribonucleic acid (DNA) through a linear relationship [14]. These 20 amino acids

are linked with each other through peptide bonds in various combinations to form

proteins with diverse structures and functions. Proteins vary in length from a hundred

to thousands of amino acids.

Figure 2.1. The chemistry of an amino acid (left panel) and properties of the side

chain (right panel). Every amino acid has a carbon atom, called an alpha carbon

(Cα), bonded to a carboxylic acid (–COOH) group, an amine (-NH2) group, a

hydrogen atom, and an R group (side chain) that is unique for every amino acid.

The physicochemical properties of an amino acid are determined by the nature of its side

chain.

2.1.1 Protein Structures

Proteins have four levels of structure: primary, secondary, tertiary, and quaternary.

These different levels of protein structures are shown in Fig. 2.2. Proteins are also called

polypeptides where different amino acids are joined together in various combinations

through peptide bonds and form a linear string called the primary structure of the

protein. These peptide bonds are formed between the amino and carboxylic groups of

two amino acids with the release of one water molecule [14]. The primary structure of a protein

is also called its amino acid sequence and is normally available in FASTA format. Some

parts of protein sequences have a biologically significant pattern called a motif. For

example, the IQ calmodulin-binding motif has the following sequence pattern:

[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY], where x represents any amino acid and

square brackets represent alternative amino acids.

Figure 2.2. Different levels of protein structure. Different amino acids are joined

together in various combinations through covalent bonds to form the primary structure.

Different sections of the primary structure fold through backbone hydrogen

bonding to form alpha helices and beta sheets. Elements of secondary structure fold further

through side chain interactions to form the tertiary structure, stabilized by ionic

bonds, disulfide bonds, hydrophobic interactions, and hydrogen bonding. Protein

quaternary structures are formed through interaction or binding of two or more

independent tertiary structures.
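The IQ motif pattern quoted above maps directly onto a regular expression: each bracketed set becomes a character class and each x becomes a wildcard. The following is a minimal sketch using Python's standard `re` module; the function name `find_iq_motifs` and the toy sequence are illustrative assumptions, not part of CaMELS.

```python
import re

# [FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY], with x (any amino acid) written as "."
IQ_MOTIF = re.compile(r"[FILV]Q.{3}[RK]G.{3}[RK].{2}[FILVWY]")

def find_iq_motifs(sequence):
    """Return (start_index, matched_segment) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in IQ_MOTIF.finditer(sequence.upper())]

# Synthetic sequence built to contain one match (not a real protein).
toy_sequence = "AAAIQAAARGAAAKAAFGGG"
print(find_iq_motifs(toy_sequence))  # [(3, 'IQAAARGAAAKAAF')]
```

Start indices are zero-based. A pattern match only flags a candidate segment; it does not by itself establish that the protein binds calmodulin.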

The bonds which link the amino group to the alpha carbon and the alpha carbon

atom to the carbonyl carbon are free to rotate and allow various orientations of amino

acids in the polypeptide chain. Rotations about these bonds are described by the 𝜙 and 𝜓

torsion angles, and the allowed values of these angles are shown in the

Ramachandran plot [23]. These allowed rotations let the polypeptide

fold into secondary structures such as alpha helices and beta sheets (see Fig. 2.2). These

secondary structures are formed and stabilized by hydrogen bonding between the

backbone atoms of the residues. An alpha helix is generated by hydrogen bonding between

neighboring residues, while beta sheets are formed through hydrogen bonding between distant

residues in the sequence [14].
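A torsion angle is computed from the coordinates of four consecutive atoms: for residue i, 𝜙 is defined by C(i-1), N(i), Cα(i), C(i) and 𝜓 by N(i), Cα(i), C(i), N(i+1). The sketch below shows one common way to compute such an angle with plain Python; the function name `torsion_angle` and the toy coordinates are illustrative assumptions and are not taken from any software described in this thesis.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_angle(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond, in the range [-180, 180]."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Components of b0 and b2 perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))

# Toy coordinates (not from a real structure): outer atoms on the same side
# of the central bond, then rotated a quarter turn about it.
same_side = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
quarter_turn = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(same_side, quarter_turn)
```

Applying such a function to the backbone atoms of every residue gives the (𝜙, 𝜓) pairs that are plotted in a Ramachandran plot.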

Different secondary structures join together and fold into the native 3-D tertiary

structure of a protein, as shown in Fig. 2.2. In protein tertiary structure, various interactive

forces between atoms of side chains of residues in a polypeptide play an important role

[14]. These interactive forces include hydrogen bonding, ionic bonds, disulfide bonds,

and hydrophobic interactions. The amino acid sequence of a stable protein contains

enough information to fold into its native tertiary structure [24]. This 3-D structure of a

protein determines its functions [14]. Protein 3-D structures are available in the form

of coordinates of atoms of all residues in the PDB format.

Protein quaternary structures are formed when various tertiary structures join

together (see Fig. 2.2). Protein quaternary structures result from protein interactions

and are also called protein complexes. The formation of these quaternary structures

involves the same kinds of interactive forces as tertiary structure formation.

2.1.2 Protein Functions

Proteins are involved in all the important biological processes and perform almost all

tasks at the cellular level. Proteins are diverse in their functions and are responsible for

cell shape, product manufacture, routine maintenance, waste cleanup, and inner

organization. Proteins perform their roles as enzymes, antibodies, structural

components, or messengers as shown in Fig. 2.3 [25]–[28]. There are thousands of

chemical reactions involved in different metabolic pathways within a cell. These


chemical reactions are catalyzed to proceed millions of times faster by proteins called

enzymes [25]. For example, sucrase catalyzes the hydrolysis of sucrose. Antibodies are

proteins which are used by the immune system of an organism to identify and neutralize

foreign invaders such as viruses or pathogenic bacteria, e.g., T-cell receptors are

proteins that act as antibodies. Antibodies perform their function through interaction

with an antigen present on the surface of the invading organism [26]. Proteins such as

Actin also provide structural support to the cell and enable it to dynamically remodel

itself in response to internal or environmental stimuli [27]. Cells also communicate to

coordinate and perform basic activities such as tissue repair and immunity, and messenger

proteins carry these signals. For example, insulin is a

protein which helps in glucose and lipid metabolism by activating a cascade of cellular

processes through interaction with the insulin receptor tyrosine kinase (IR) [29]. Growth

hormones are proteins which stimulate tissue repair through cell regeneration in humans

[28].

Most protein functions involve the interaction of two or more proteins to form

a protein complex [14]. For example, enzymes must bind to their substrates to perform

catalysis and structural proteins bind together in order to gain strength and toughness

[14]. Similarly, antibodies and messenger proteins also perform their functions by

interacting with other proteins. Therefore, the study of protein interactions is of utmost

importance in biology to decipher functions of proteins, to characterize different

biological processes or pathways, to interpret disease mechanisms, and to design effective

drugs. Proteins interact with other proteins and with macromolecules such as DNA and RNA,

but in this dissertation we focus specifically on protein-protein interactions (PPIs).

In the next sections of this chapter, we discuss how proteins interact with other

proteins, biologically significant problems in protein interactions, and how computational

techniques can contribute to handling these problems.

Figure 2.3. Protein functions. Proteins perform their functions as enzymes

(Sucrase), antibodies (T-cell receptor), messengers (Insulin), or structural components

(Actin). The most fundamental function that proteins perform, and which underpins all

the other biochemical functions, is their ability to bind or interact with other proteins

or macromolecules.

2.2 Protein Interactions and Complex Formation

Proteins generally do not function in isolation; they interact with each other to perform

vital roles in various biological processes and metabolic pathways [11], [14]. More

than 80% of all cellular proteins are involved in these types of interactions [10], [11].

Protein-protein interactions (PPIs) are physical contacts between residues of two

proteins (Ligand and Receptor) in a highly specific manner (see Fig. 2.4). These

interactions happen in a specific biomolecular context and are normally driven by

the same electrostatic forces and hydrophobic effects as are involved in protein

folding [14], [30]. Complementarity in shape and in charge distribution on the surface of

proteins are the two major factors which play a significant role in protein interactions

[14]. During an interaction, proteins can also undergo conformational changes,

as described by the conformational selection model [14], [31]. These conformational

changes enable optimized interaction and support formation of a stable complex.

Protein complexes are formed through protein-protein interactions as

shown in Fig. 2.4. These protein interactions constitute the interactome of an organism.

Protein interactions can happen within the same organism (intra-species) and across

different organisms (inter-species) such as host-pathogen protein interactions (HPIs).

Studies of protein interactions from different perspectives such as molecular dynamics,

biochemistry, and signal transduction yield protein interaction networks. These protein

interaction networks, like metabolic pathways, help biologists to gain a better

understanding of underlying biological processes, to understand disease mechanisms,

and to aid studies for the design, discovery, and effectiveness of therapeutic drugs [32].

Figure 2.4. Protein interaction. Two unbound proteins (Ligand and Receptor) with

complementarity in shape and charge distribution interact with each other to form a

protein complex. The interface of the complex at a 6 Å distance threshold is shown with

sticks in magenta color.

2.2.1 Binding Affinity of Interacting Proteins

Binding affinity is a measure of the strength of interaction between proteins which bind

reversibly in a protein complex [5], [7]. High binding affinity indicates tighter binding

between proteins involved in an interaction. Experimentally, it can be measured in

terms of the dissociation constant (𝐾𝑑 = [𝐿][𝑅] / [𝐿𝑅]), which is the ratio between the

concentration of free ligand and receptor proteins ([𝐿], [𝑅]) and the concentration of

protein complex ([𝐿𝑅]) [7]. Smaller values of 𝐾𝑑 show high binding affinity and vice

versa. Thermodynamically, the formation of protein complexes through protein

interactions also involves a loss in free energy [5]. A larger loss in free energy indicates higher

binding affinity and results in a more stable protein complex. Therefore, binding

affinity can also be measured by taking the difference between the free energy of the

protein complex and the sum of the free energies of the unbound proteins. This difference is

called the change in the Gibbs free energy upon binding (∆∆𝐺). Expressed in this way, binding

affinity typically ranges from -2.5 to -22 kcal/mol.
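The dissociation constant and the free energy change upon binding are linked by the standard thermodynamic relation ∆G = RT ln(Kd), with Kd expressed in molar units relative to a 1 M standard state. This relation is textbook thermodynamics rather than something stated in the text above, and the temperature of 298.15 K and the function names in the following Python sketch are assumptions made for illustration.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3


def kd_to_delta_g(kd_molar, temperature_k=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def delta_g_to_kd(delta_g_kcal, temperature_k=298.15):
    """Dissociation constant (mol/L) from a binding free energy in kcal/mol."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))
```

Under these assumptions, a nanomolar complex (Kd = 1e-9 M) corresponds to roughly -12.3 kcal/mol, which lies within the range quoted above, and a smaller Kd always gives a more negative free energy change.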

2.2.2 Interfaces or Interaction Sites of Proteins

When two proteins interact to form a protein complex, only a part of each protein is

involved in binding as shown in Fig. 2.4. This part of one protein is called the

interaction site of the protein, whereas all the interacting residue pairs on both proteins

constitute the interface of the protein complex (see Fig. 2.4). Therefore, in finding an

interaction site, we are only interested in the residues of one protein in a complex which

are involved in the interaction, without considering the residues of the other protein. In contrast,

while determining the interface of the protein complex, we find all residue pairs on both

proteins which are involved in the interaction. Note that if the interface of a protein

complex is known, then the interaction sites of the

interacting proteins in the complex can easily be extracted.

If we have a solved 3-D structure of a protein complex, then we can extract the interface

of the complex by considering all those residue pairs of interacting proteins whose alpha

carbon atoms are within a distance threshold, typically chosen between 6.0 and 8.0 Angstroms [14], [33]. This approach of

extracting an interface from the protein complex is straightforward and has been used by

many researchers in the field. However, residue pairs within this distance are not always

guaranteed to be interacting [34].
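The distance-based extraction described above can be sketched in a few lines of Python. The sketch assumes that the alpha carbon coordinates of each chain have already been read from a PDB file into a dictionary keyed by residue identifier; this input format, the function names, and the default cutoff of 6.0 Angstroms are illustrative choices rather than the procedure of any particular tool.

```python
import math


def extract_interface(ligand_ca, receptor_ca, cutoff=6.0):
    """Return residue pairs whose C-alpha atoms lie within `cutoff` Angstroms.

    ligand_ca and receptor_ca map a residue identifier to an (x, y, z) tuple.
    """
    pairs = []
    for lig_id, lig_xyz in ligand_ca.items():
        for rec_id, rec_xyz in receptor_ca.items():
            if math.dist(lig_xyz, rec_xyz) <= cutoff:
                pairs.append((lig_id, rec_id))
    return pairs


def interaction_sites(pairs):
    """Derive the interaction site of each protein from the interface pairs."""
    ligand_site = sorted({lig_id for lig_id, _ in pairs})
    receptor_site = sorted({rec_id for _, rec_id in pairs})
    return ligand_site, receptor_site
```

The second function reflects the observation made earlier that the interaction sites follow directly from a known interface: each site is the set of residues of one protein that appear in at least one interface pair. Because the all-against-all loop is quadratic in the number of residues, a spatial index would be preferable for very large complexes.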

2.2.3 Types of Protein Interactions and Complexes

Protein-protein interactions (PPIs) and the formation of protein complexes can be

differentiated based on the permanence of these complexes and the number of different

protein chains that are involved in the interaction. Protein complexes can be homomeric

or heteromeric as shown in Fig. 2.5 [35], [36]. Homomeric protein complexes are

formed through the interaction of a single type of protein chain, and these complexes

are called dimers, trimers, and so on, based on the number of chains involved in

complex formation. Most of the transcription regulatory factors and scaffolding

proteins perform their functions as homomers [36]. In the formation of heteromeric protein

complexes, distinct protein chains are involved in the interaction. In cell signaling,

heteromeric protein complexes are involved in the biochemical cascade [36].

Figure 2.5. Types of protein interactions and complexes. Protein complexes are

homomeric if one type of protein chain is involved in the interaction; if

various types of protein chains are involved in complex formation, the

complexes are called heteromeric. Protein complexes are further divided into stable

or transient based on the duration of the interaction. Binding affinity is a measure of the

strength of interaction between the proteins involved in complex formation. Binding

affinity is measured in terms of the dissociation constant (Kd), and binding affinity is high for low Kd values. Stable complexes have high binding affinity, and weak transient complexes have low binding affinity.


Protein interactions can be classified as stable or transient based on their

interaction duration (see Fig. 2.5) [37]. Stable protein interactions are those

that persist for a long time and form permanent complexes for different

molecular roles [38]. Stable interactions are involved in most homomeric and in some

heteromeric complexes. Core RNA polymerase and Hemoglobin are examples of

stable complexes. In contrast, transient protein interactions occur reversibly for a short

duration in a specific molecular context [38]. For example, most protein interactions in

cell signaling are transient. Transient interactions control most cellular functions such

as protein folding, protein modification, and cell cycling. Folding and binding are

inseparable in the case of stable complexes whereas, in transient complexes, protein

folding and binding are two separate events.

In this dissertation, we generally focus on heteromeric transient protein

complexes regardless of their functions.

2.2.4 Biologically Significant Effects of Protein Interactions

Protein interactions normally take place in a specific molecular context. Interacting

proteins have certain underlying functional objectives which are expressed in various

ways. Some of the measurable, biologically significant effects of protein interactions are

listed as follows [39], [40].

Activation or deactivation of a protein.

Changing the interaction behavior of a protein by altering its binding specificity towards different binding partners.

Regulation of cellular functionality by participating in either upstream or downstream events.

Creation of a new binding mode in a protein.

2.2.5 Problems of Interest in Protein Interactions

Biologists and pharmacologists have various objectives in studying protein-protein

interactions. Some of them are listed as follows.

To get an idea of the function and behavior of proteins.

To determine the biological process or pathway in which a protein of unknown function is involved.

To determine different binding modes of a protein.

To determine the specificity of a protein towards multiple targets.

To discover, design, and measure the effectiveness of drugs and therapeutic agents.

To combat infectious diseases.

To promote or inhibit protein interactions.

To design new proteins.

To meet all the above objectives, biologists and drug designers are generally

interested in solving the following three related problems in protein interactions.

i) Protein Interaction: Whether two given proteins interact or not?

ii) Binding Affinity: What is the strength of their interaction?

iii) Interface or Interaction Site: What is the exact location of interaction?

We perform a literature survey of existing experimental and computational

methods of solving these problems in the following sections.

2.3 Experimental Methods

Several experimental methods have been developed to determine protein interaction,

binding affinity, and interface or interaction site as shown in Fig. 2.6. These

experimental procedures are performed in vivo (within an organism) or in vitro (outside

an organism). The problem of knowing whether two given proteins interact or not can be

taken as a binary classification problem. Experimental methods of determining protein-

protein interactions are classified as small-scale or high throughput methods [41], [42].

Small-scale methods such as Co-immunoprecipitations [43] and Surface Plasmon

Resonance [44] are often used to detect one interaction at a time. High throughput

methods such as Yeast Two-Hybrid (Y2H) [45] and Tandem Affinity Purification

(TAP) [46] are used to get thousands of interactions at a time. Binding affinity is the

measure of the strength of interaction between two proteins. Experimental methods

such as Isothermal Titration Calorimetry (ITC) [47], Surface Plasmon Resonance

(SPR) [48], and Fluorescence Polarization (FP) [49] can be used to determine protein

binding affinity. The interface or binding site is the region of the proteins that is involved in

the interaction. To determine the interface or binding site, there also exist some

experimental procedures such as X-ray crystallography [50], Nuclear Magnetic

Resonance (NMR) [51] and different biological assays such as site-directed

mutagenesis [52]. A detailed discussion of these experimental techniques is out of the

scope of this dissertation as the primary focus of this study is on computational

techniques. Interested readers are referred to [43]–[52] for further details. Here, we

briefly outline some shortcomings of these experimental techniques.

Experimental techniques can accurately determine protein-protein interactions

(PPIs) but these techniques are expensive and time-consuming [39], [53], [54].

Meanwhile, high throughput methods produce many false positives and false negatives.

Moreover, these methods are difficult to reproduce and have limited coverage [41].

Furthermore, experimental methods depend on laboratory protocols and experimental

conditions which make it difficult to have an unbiased comparison across different

studies. Due to these shortcomings in experimental techniques, accurate computational

methods for protein interaction, binding affinity, and interface prediction are required.

2.4 Computational Methods

Cost and time constraints of experimental methods make them infeasible for large-

scale application at the interactome level of an organism. Therefore, there is high

demand for accurate computational approaches to support wet-lab methods by

predicting and ranking probable PPIs. Such computational approaches can assist

biologists in focusing on most likely interactions [55]. Several computational

techniques exist in the literature for protein-protein interaction problems. These

computational techniques can roughly be categorized into classical and machine

learning based methods. In this study, we focus on machine learning based methods

while classical methods are not within the scope of this dissertation. However, we give

a brief overview of these classical computational techniques in the next section to show

their limitations and to highlight the importance of machine learning based techniques

in solving protein interactions related problems.

Figure 2.6. Experimental methods to determine protein interactions, binding

affinity, and interaction site or interface.


2.4.1 Classical Computational Methods

A number of computational methods, other than machine learning based techniques,

have been developed to determine protein interaction, binding affinity, and interface or

interaction site of proteins in a protein complex as shown in Fig. 2.7. These methods

have been grouped as homology-based (Interolog Search, Phylogenetic Similarity, and

Template based), simulation-based (Molecular Dynamics Simulation), and others (Text

Mining, Network Topology Based, Docking, Energy Perturbation and Empirical

Scoring). A detailed discussion of these methods is not within the scope of this study

as our primary focus is on machine learning techniques. However, interested readers

are referred to [7], [10], [56]–[58] for further study. Here, we provide a brief overview

of these techniques along with their inherent limitations.

Homology-Based Methods: Homology-based methods rest on the basic

assumption that protein interactions are conserved among different organisms. In Interolog

search, protein-protein interactions are predicted based on the homology of proteins

across different organisms as shown in Fig. 2.8(a) [59]–[64]. Methods such as Molecular

Interaction Search Tool (MIST) and BIP-BIANA (Biologic Interactions and Network

Analysis) have been proposed and made accessible through their web servers for PPI

prediction through Interolog search [61], [65]. A similar approach followed in

Figure 2.7. Classical computational methods to predict protein interactions,

binding affinity, and interaction site or interface of a protein complex.


homology-based methods is a phylogenetic
