The Pennsylvania State University
The Graduate School
Integrative Biosciences
QUANTITATIVE FUNCTIONAL MEASUREMENT OF A PROTEIN
USING PHYLOGENETIC PROFILES
A Dissertation in
Integrative Biosciences
by
Kyung Dae Ko
© 2009 Kyung Dae Ko
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
August 2009
The dissertation of Kyung Dae Ko was reviewed and approved* by the following:
Randen L. Patterson Assistant Professor of Biology Dissertation Advisor Chair of Committee
Réka Albert Professor of Physics and Biology
Anton Nekrutenko Associate Professor of Biochemistry and Molecular Biology
Michael N. Teng Assistant Professor of Biochemistry and Molecular Biology
Damian van Rossum Assistant Professor, Research of Biology
Peter Hudson Willaman Professor of Biology Director, Huck Institutes of the Life Sciences
*Signatures are on file in the Graduate School
ABSTRACT
In principle, the amino acid sequence of a protein encodes its structural, functional,
and evolutionary characteristics. Computational methods for investigating these
characteristics are a powerful resource, but they are limited in their ability to
annotate protein characteristics accurately. To overcome this drawback, I have
developed a unified computational pipeline, called the Gestalt Domain Detection
Algorithm Basic Local Alignment Tool (GDDA-BLAST), for measuring the structural,
functional, and evolutionary characteristics of a protein using phylogenetic profiles.
In homology detection, GDDA-BLAST performs better than other methods such as SAM
and PSI-BLAST.
Using GDDA-BLAST, I also implemented a classification library to find
quantitative thresholds capable of inferring protein function from phylogenetic profiles.
Using this library, I identified RNA-binding proteins (RBPs) containing structurally
unique motifs with 2695 expanded Position Specific Scoring Matrix (PSSM) profiles in a
testing dataset of 37 positive and 118 negative sequences, achieving 100% specificity,
96.8% accuracy, and 86.5% sensitivity. For the specific nucleotide-binding folds (dsRNA
vs. dsDNA, ssRNA vs. dsDNA, and ssRNA vs. ssDNA), the results exceeded those
obtained using Support Vector Machine (SVM) learning algorithms. Using this method, I
also identified 29 and 168 novel RBPs in the yeast and human proteomes, respectively.
These results suggest that this method can be used to create PSSM databases for the
quantitative measurement and classification of any protein function.
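The three reported rates are mutually consistent, which can be checked with a short calculation. The sketch below assumes the confusion-matrix counts that the percentages imply (32 true positives, 5 false negatives, 118 true negatives, and no false positives). These counts are inferred here from the stated dataset size and rates; they are not given in the abstract.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as fractions."""
    sensitivity = tp / (tp + fn)          # fraction of positives recovered
    specificity = tn / (tn + fp)          # fraction of negatives rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Assumed counts: inferred from 37 positives, 118 negatives, and the reported rates.
sens, spec, acc = classification_metrics(tp=32, fn=5, tn=118, fp=0)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} accuracy={acc:.1%}")
```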
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
Chapter 1 Introduction
1.1 Current computational methods for the prediction of protein characteristics
1.2 Motivation and Objective
Chapter 2 GDDA (Gestalt Domain Detection Algorithm) – BLAST (Basic Local Alignment Tool) with Phylogenetic Profiles
2.1 Backgrounds and Motives
2.2 GDDA-BLAST with phylogenetic profiles
2.3 The prediction of functional characteristics of proteins by GDDA-BLAST
2.4 The investigation of evolutionary relations among proteins using GDDA-BLAST
2.5 The prediction of structural boundaries of ion-channels using GDDA-BLAST
2.6 The discovery of novel lipid-binding domains in vitro
2.7 Summary and discussion
Chapter 3 The Performance of GDDA-BLAST in homology detection
3.1 The backgrounds and Motives
3.2 Results and Discussion
3.2.1 Datasets for the performance evaluation
3.3 Homology detection methods for the performance evaluation
3.4 The performance evaluation
3.5 Summary and discussion
Chapter 4 The identification of RNA binding proteins using the quantitative functional measurement
4.1 A classification library for RNA binding proteins
4.2 The identification of RNA binding proteins
4.3 The investigation of functional relations among RRM containing proteins
4.4 Summary
Chapter 5 Summary and Discussion
5.1 Summary
5.2 Discussion
Chapter 6 Future Perspectives
Bibliography
LIST OF FIGURES
Figure 1-1: Homology-based methods.
Figure 1-2: The schemes of machine learning methods.
Figure 1-3: The schematic diagram of a phylogenetic profile method for function inferences.
Figure 2-1: The workflow of GDDA-BLAST.
Figure 2-2: GDDA-BLAST model of the ATP-binding Ankyrin Repeat in TRPV1.
Figure 2-3: Water Channel (Aquaporin) Phylogeny.
Figure 2-4: GDDA-BLAST models of the ion transport domain in TRPC channels.
Figure 2-5: Functional Information via GDDA-BLAST analysis.
Figure 3-1: The statistical information of protein families.
Figure 3-2: Five hierarchical levels of SCOP classification.
Figure 3-3: The schemes of homology-based methods.
Figure 3-4: The ROC graphs for the performance evaluation of GDDA-BLAST.
Figure 4-1: The structures of RNA binding proteins.
Figure 4-2: The overview of EMSA.
Figure 4-3: The problems of functional annotations in conventional programs.
Figure 4-4: The pipeline of GDDA-BLAST for the identification of RNA binding proteins.
Figure 4-5: The false positive sequences in phylogenetic profiles from GDDA-BLAST.
Figure 4-6: A classification library for the identification of RNA binding proteins.
Figure 4-7: A residue-based phylogenetic profile.
Figure 4-8: The sequence of a hypothetic protein for describing derivation of the feature vector of a protein.
Figure 4-9: Thresholds for the positive sequences in training sets.
Figure 4-10: The identification of RRM containing proteins in a testing dataset containing 20 positive and 137 negative sequences.
Figure 4-11: The threshold for the classification of single-stranded RNA binding proteins.
Figure 4-12: The classification between double-stranded DNA and single-stranded RNA binding proteins.
Figure 4-13: The classification among other DNA and RNA binding proteins.
Figure 4-14: The dendrogram of control sequences.
Figure 4-15: The proteins with U2AF-homology motif (UHM).
Figure 6-1: The functional dendrogram for the prediction of UHM proteins. Orange boxes indicate the UHM control sequences.
Figure 6-2: The prediction of UHM candidate proteins.
Figure 6-3: The inference of new annotations from reference annotation of NP_005869.
LIST OF TABLES
Table 4-1: The comparison of the performances between Interproscan and SVM.
Table 4-2: Comparison with the sensitivities of other methods for the identification of RRM containing proteins in training and testing sets.
Table 4-3: Comparison with the sensitivities of other methods for the identification of the four single-fold RNA binding protein groups in training sets.
Table 4-4: The results of identification for five RNA binding protein groups in yeast and human proteomes.
Table 4-5: The comparison with other methods.
Table 4-6: The classification among six types of RNA binding proteins such as double-stranded RNA binding vs. double-stranded DNA binding proteins, single-stranded RNA binding vs. double-stranded DNA binding proteins, and single-stranded RNA binding vs. single-stranded DNA binding proteins.
ACKNOWLEDGEMENTS
Most of all, I thank God for guiding and supporting me through my graduation, and I am
indebted to the many people who have significantly helped me shape my dissertation.
It is difficult to overstate my gratitude to my advisor, Dr. Randen Patterson. With his
enthusiasm and his inspiration, I was able to endure a long journey at Penn State
University. In my research, he gave me encouragement, advice, and lots of good ideas,
and I could not have finished my dissertation without him. I would like to extend my
gratitude to Dr. Damian van Rossum for many things. Whenever I lost my path in my
research, he always guided me back to the light. I am also grateful to my thesis committee
members, Dr. Réka Albert, Dr. Anton Nekrutenko, and Dr. Michael N. Teng. They
provided me with insightful comments and enthusiastic support during my thesis.
I would like to express my appreciation for my colleagues in the lab, including YooJin
Hong and Gaurav Bhardwaj. I will never forget their support during my research.
I also express my thanks to my friends: Bob and Marin Ford, and Ken and Joyce Layton.
I cannot thank you enough for your support of me and my family.
Finally, I owe everything to my family: my father, my mother, my brother, my
father-in-law, and my mother-in-law for their love and support. I would like to express
my deepest gratitude to my wife for her patience, love, and support. My daughter, Grace,
you always give me a happy smile whenever I am exhausted. To them I dedicate my thesis.
Chapter 1
Introduction
Proteins are involved in the development and processes of a cell. Therefore,
analyzing the structures and functions of proteins is important for understanding the
pathways of cellular interactions. The study of protein evolution also provides clues to
genetic history, revealing the various combinations that arose during molecular
evolution. The identification and classification of proteins are therefore important for
analyzing their structural and functional characteristics and for investigating their
molecular evolutionary history.
A protein usually consists of domains, which can be independent of the rest of the
protein chain. Domains fold autonomously and can bind to ligands or other domains [1].
Domains are components of the protein structure and can act as functional units
within the protein. Domains may also occur in various evolutionarily related proteins
across species. Therefore, the detection of domains plays a very important role in the
identification and classification of proteins.
However, overwhelmed by the number of proteins predicted from genomes, we face several
obstacles in annotating the structural, functional, and evolutionary properties of
proteins. First, even though experimental methods identify many uncharacterized proteins
in proteomes, annotating these proteins takes longer than identifying them, and in some
cases an existing erroneous annotation can lead to the false annotation of a new protein.
Second, annotation requires an accurate, subjective, and contextual definition of a
protein function because a protein may have multiple functions. Because of these two
problems, accurate structural and functional annotation of a protein is a
challenging task in all biological fields [2]. In this chapter, we first review current
computational methods for the prediction of protein characteristics and describe their
deficiencies, which motivated the development of GDDA (Gestalt
Domain Detection Algorithm)-BLAST. We then conclude the chapter by discussing
the motivations and objectives of our research in more detail.
1.1 Current computational methods for the prediction of protein characteristics
In principle, the amino acid sequence of a protein contains structural,
functional, and evolutionary characteristics, and these characteristics have been
investigated using many computational methods such as homology detection, machine
learning, and phylogenetic profiles.
Among these methods, the simplest and fastest is homology detection.
According to the literature, homologous proteins generally have highly similar
structures and functions [53,54]. By establishing homology between a new protein and a
reference protein, we can infer information such as the function, structure, and
evolution of the new protein. Algorithms for homology detection fall into three
categories: sequence-sequence comparison, sequence-profile comparison, and
profile-profile comparison [3].
First, as shown in Figure 1-1(a), sequence-sequence comparison measures the
similarity between a new sequence and a reference sequence. If their identity is high,
they are taken to have structural and functional relationships. Based on these
relationships, we can infer the characteristics of the new protein [2].
However, if their sequence identity is not high enough to reveal their relationship,
sequence-sequence comparison algorithms lose the sensitivity to detect the functional or
structural relationship between the sequences. Even when such algorithms cannot detect a
relationship, empirical analyses show that some sequences with low identity still
have functional and/or structural relationships because they are distantly
related in evolution [4].
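As a concrete illustration of sequence-sequence comparison, the sketch below computes the percent identity of two sequences that are assumed to be already aligned (equal length, with '-' marking gaps). The function name, the gap convention, the choice of gap-free columns as the denominator, and the example sequences are illustrative assumptions. Real tools such as BLAST score alignments with substitution matrices and gap penalties rather than raw identity.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free aligned columns in which both sequences share a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

# Hypothetical toy alignment: 4 of the 5 gap-free columns match.
print(percent_identity("EQLAKQ", "EALAK-"))
```

Under this definition, a pair falling below a chosen identity cutoff would simply be reported as unrelated, which is exactly the loss of sensitivity described above.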
To increase the sensitivity for detecting remote homologues, instead of
comparing two proteins directly by aligning their sequences, the test sequence is
compared with profiles, which capture the information common to known protein
sequences in the same family [4] (Figure 1-1(b)). After a multiple alignment of
related sequences in the same family is built, a PSSM (Position Specific Scoring Matrix)
or HMM (Hidden Markov Model) profile is generated from the information common to the
aligned sequences. Using a PSSM or HMM, sequence-profile
comparison methods such as PSI-BLAST and SAM increase the sensitivity to detect
distantly homologous sequences with low sequence identity [5,6].
Figure 1-1: Homology-based methods. (a) Sequence-sequence comparison. (b) Sequence-
profile comparison. (c) Profile-profile comparison.
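The sequence-profile idea can be sketched as follows. The example builds a simple log-odds PSSM from a toy ungapped multiple alignment and scores candidate sequences against it. The alignment, the pseudocount, the uniform background frequency, and the function names are all illustrative assumptions; PSI-BLAST and SAM use more elaborate sequence weighting, pseudocount, and gap models.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from equal-length, ungapped sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def score_sequence(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence of profile length."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must match profile length")
    return sum(column[aa] for column, aa in zip(pssm, sequence))

# Hypothetical family alignment and two candidate sequences.
family = ["EAKQAA", "EAKQGA", "DAKQAA", "EAKEAA"]
pssm = build_pssm(family)
print(score_sequence(pssm, "EAKQAA"))  # family-like sequence: positive score
print(score_sequence(pssm, "WWWWWW"))  # unrelated sequence: negative score
```

Because each column carries its own scores, a sequence can score well by matching the conserved positions even when its overall identity to any single family member is low.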
However, if an unknown protein is too distant from the related protein family,
the profile is not sensitive enough to recognize that the protein belongs to the family.
Profile-profile comparison methods such as FFAS [7] and Prof_sim [3] were
developed to solve this problem. As shown in Figure 1-1(c), such a method first
generates a profile from the multiple alignment of sequences related to the unknown
sequence. Then, by comparing the profile of the unknown sequence with the profiles of
reference sequences, it can discover homologous pairs of profiles.
Even though homology-based methods improve the ability to detect functional
and structural relations among proteins, they still have problems in predicting the
properties of proteins. First, they are still not sensitive enough to detect distantly
homologous sequences below 10% sequence identity [8]. Two sequences with very low
identity are generally judged to be unrelated by homology-based methods because the
probability of aligning them by chance is statistically high. However, Sander and
Schneider [55] have shown that sequences below 10% sequence identity can still have high
secondary structural similarity. Russ et al. [56] have also concluded that a small number
of conserved residues, at 8% identity, can build 3D folds with similar functions in proteins.
Second, homology-based methods cannot reliably predict the properties of specific
proteins such as enzymes from their homologous pairs because the important residues of
these proteins are not well conserved among sequences, even those with high sequence
similarity [2]. For example, several studies [57,58] report that enzymes above 40%
sequence identity generally share catalytic functional relationships [59].
However, because of a high false-negative rate, information about these functional
relationships is sometimes lost even for sequences above 60% sequence identity. Thus, even
though sequence similarity generally correlates with functional or structural similarity,
this correlation can be affected by evolutionary events such as domain shuffling,
which involves the addition, deletion, and redistribution of domains [60,61].
Finally, if the existing annotations in databases contain errors, homology-based
methods amplify these erroneous annotations and propagate them through
the databases [2]. In principle, adding more reference sequences to the databases
helps homology-based methods predict the properties of a protein more accurately.
However, if one of these sequences carries an erroneous annotation, the new prediction
inherits the error. In addition, if iterative computational methods such as
PSI-BLAST and SAM use these databases to detect homologous pairs, the
error may propagate through an entire PSSM or HMM.
Machine learning methods predict the functional properties of proteins on the
basis of sequence-derived features. Because they use physical or
chemical features extracted from protein sequences, they are independent of sequence
similarity. Among the many machine learning algorithms, SVMs (Support Vector Machines)
and ANNs (Artificial Neural Networks) are widely used for the functional classification
of proteins [9].
SVMs are classified into two groups: linear and non-linear. While non-linear
SVMs perform better than linear SVMs at classifying proteins with diverse sequences or
structures, linear SVMs are widely used for general protein
classification because they are easy to implement. Figure 1-2(a) illustrates the
procedure for building an SVM. Using feature vectors, the SVM first constructs a
hyper-plane that divides the feature vectors into two classes with a maximum margin.
Eq. (1.1) and Eq. (1.2) are used for linear and non-linear classification,
respectively [9]. With the feature vectors projected into a multi-dimensional space,
members and non-members of a functional class are separated by the hyper-plane in that
space. Finally, a new protein is classified as a member or non-member according to the
side of the hyper-plane on which its feature vector falls, alongside other proteins with
similar features [9].

$$
\begin{aligned}
\mathbf{w} \cdot \mathbf{x}_i + b &\ge +1 \quad \text{for } \gamma_i = +1 \text{ (positive class)} \\
\mathbf{w} \cdot \mathbf{x}_i + b &\le -1 \quad \text{for } \gamma_i = -1 \text{ (negative class)}
\end{aligned}
\tag{1.1}
$$

where w is the vector normal to the hyper-plane, x_i is a feature vector, b is a
parameter, and γ_i is the group index.

$$
K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\lVert \mathbf{x}_j - \mathbf{x}_i \rVert^{2} / (2\sigma^{2})}
\tag{1.2}
$$

where x_i and x_j are feature vectors, and σ is the standard deviation.
As shown in Figure 1-2(b), an ANN has three layers (input, hidden, and output),
and each layer consists of nodes and connections. Each node contains a
classification function, which determines whether each input feature belongs to the
member class or not. Based on the output from each node, the weights of the connections
among all nodes are adjusted using Eq. (1.3). After the ANN has been trained for
two-class classification on training datasets, the trained classifier can predict the
functions of proteins [9].

$$
g\Big(\sum_{j} w_{0j}\, h_j\Big), \qquad h_j = \sigma\Big(\sum_{i} w_{ji}\, x_i + w_j\Big)
\tag{1.3}
$$

where w_0j is the output weight from hidden node j to the output node, g is the output
function, h_j is the value of hidden-layer node j, x_i is the i-th component of the
feature vector of a protein, whose components are its computed descriptors, w_ji is the
input weight from input node i to hidden node j, w_j is the threshold weight from an
input node of value 1 to hidden node j, and σ is an activation function.
Because machine learning methods use physical or chemical features rather than sequence
similarity for functional prediction, they can identify functional or structural
properties of proteins such as enzymes. However, biased results can be produced by
the number of sequences and the properties of the features in the datasets because the
accuracy of prediction depends on the training sets and feature extraction methods [9].
In fact, because the training datasets for machine learning models cannot be fully
representative of the members and non-members of a particular functional class of
proteins, inadequate sampling of training and testing datasets can reduce prediction
accuracy. Because of this problem, machine learning methods are not applied to classify
proteins for which knowledge of their specific functions is insufficient. In addition,
it is very important to develop efficient methods for extracting features from sequences
because the feature descriptors directly affect performance.
Figure 1-2: The schemes of machine learning methods. (a) Schematic diagram illustrating
the process of the training and prediction of the functional class of proteins using SVM
[9]. (b) Schematic diagram illustrating the process of the prediction of functional class of
proteins using ANN [9].
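To make Eqs. (1.1) to (1.3) concrete, the sketch below evaluates a linear SVM decision rule, the Gaussian kernel, and a single forward pass through a one-hidden-layer ANN. All weights, thresholds, and feature values are made-up illustrative numbers, the logistic function is assumed for both σ and g in Eq. (1.3), and no training is performed. It is a sketch of the formulas only, not the setup used in [9].

```python
import math

def linear_svm_class(w, b, x):
    """Classify x by the side of the hyper-plane w.x + b = 0 on which it falls.

    Eq. (1.1) is the margin constraint that training points satisfy;
    prediction only needs the sign of the decision value.
    """
    decision = sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
    return 1 if decision >= 0 else -1

def gaussian_kernel(x_i, x_j, sigma):
    """Eq. (1.2): K(x_i, x_j) = exp(-||x_j - x_i||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def logistic(z):
    """Assumed choice for both the activation function and the output function."""
    return 1.0 / (1.0 + math.exp(-z))

def ann_output(x, w_hidden, w_threshold, w_output):
    """Eq. (1.3): h_j = sigma(sum_i w_ji x_i + w_j), output = g(sum_j w_0j h_j)."""
    hidden = [
        logistic(sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + w_j)
        for row, w_j in zip(w_hidden, w_threshold)
    ]
    return logistic(sum(w_0j * h_j for w_0j, h_j in zip(w_output, hidden)))

# Hypothetical two-descriptor feature vectors.
member, non_member = [2.0, 1.0], [-1.0, -2.0]
print(linear_svm_class(w=[1.0, 1.0], b=-0.5, x=member))      # positive side
print(linear_svm_class(w=[1.0, 1.0], b=-0.5, x=non_member))  # negative side
print(gaussian_kernel(member, member, sigma=1.0))            # identical vectors
print(ann_output(member, w_hidden=[[0.5, -0.2], [0.1, 0.3]],
                 w_threshold=[0.0, -0.1], w_output=[1.0, -1.0]))
```

Note that σ denotes the kernel width in Eq. (1.2) but the activation function in Eq. (1.3); the code keeps the two apart by using `sigma` for the former and `logistic` for the latter.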
A phylogenetic profile method encodes the presence or absence of proteins
across genomes to infer functional relationships among proteins. The basic idea of
the method is that functionally related proteins tend to co-evolve in
their organisms because of evolutionary constraints [10]. Thus, if two proteins are
found in the same set of organisms, their phylogenetic profiles are similar, and
the proteins may be functionally related to each other. Figure 1-3 describes the
procedure of the phylogenetic profile method for the functional prediction of proteins [11].
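A minimal sketch of this idea is given below. Each protein is represented by a binary presence/absence vector over a set of genomes, and pairs of proteins are compared by the fraction of genomes in which their profiles agree. The genome set, the protein names, and the choice of simple matching as the similarity measure are illustrative assumptions; other measures, such as Hamming distance or mutual information, are also used in practice.

```python
def phylogenetic_profile(present_in, genomes):
    """Encode presence (1) or absence (0) of a protein in each genome, in order."""
    return [1 if genome in present_in else 0 for genome in genomes]

def profile_agreement(profile_a, profile_b):
    """Fraction of genomes in which two profiles agree (simple matching)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    agree = sum(1 for a, b in zip(profile_a, profile_b) if a == b)
    return agree / len(profile_a)

# Hypothetical genomes and protein occurrences.
genomes = ["G1", "G2", "G3", "G4", "G5"]
occurrences = {
    "P1": {"G1", "G2", "G4"},
    "P2": {"G1", "G2", "G4"},  # same pattern as P1: candidate functional link
    "P3": {"G3", "G5"},        # complementary pattern
}
profiles = {name: phylogenetic_profile(found, genomes)
            for name, found in occurrences.items()}
print(profiles["P1"])
print(profile_agreement(profiles["P1"], profiles["P2"]))
print(profile_agreement(profiles["P1"], profiles["P3"]))
```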
However, phylogenetic profiles derived from genomes are often not informative
because they provide no information about the proteins themselves. Moreover, while
phylogenetic profiles from prokaryotic genomes describe the functional relationships of
proteins clearly, those from eukaryotic genomes are less informative
for predicting functional relationships, despite some successful studies of specific
protein function predictions [62]. In addition, the accuracy of the analysis is low
because the available genomes and genome sequences are limited.
Figure 1-3: The schematic diagram of a phylogenetic profile method for function inferences [11].
1.2 Motivation and Objective
This thesis is motivated by the need to overcome the drawbacks discussed above in the
prediction of protein characteristics. In principle, we can
infer the functional, structural, and evolutionary properties of a protein from its
sequence alone because its primary amino acid sequence contains information about these
characteristics. However, no accurate method predicts all three properties of
a protein together using only its sequence.
To solve this problem, we have developed a unified computational pipeline,
called GDDA-BLAST, for measuring the structural, functional, and evolutionary
characteristics of a protein using phylogenetic profiles. In our previous studies,
GDDA-BLAST identified structural and functional domain boundaries in TRPC ion channels
and generated a phylogenetic tree of evolutionarily related RT sequences that
approximates their evolutionary relationships [12,13].
Building on this previous research, the objectives of this dissertation are to
improve the performance of GDDA-BLAST in homology detection and to develop a
method for the quantitative functional measurement of a protein. To achieve these
objectives, we investigate thresholds for the identification of RNA binding proteins and
design a new phylogenetic profile for their functional annotation.
In this thesis, Chapter 2 describes the background and pipeline of GDDA-BLAST
and introduces our previous research using GDDA-BLAST, which has been
validated by the literature and by wet-lab experiments. Chapter 3 explains the background
of the performance evaluation and compares the performance of GDDA-BLAST with that of
other methods. Chapter 4 reviews computational classifiers for the identification of RNA
binding proteins and proposes a new method to identify RNA binding proteins through the
quantitative measurement of GDDA-BLAST. In Chapter 5, we summarize the results of our
evaluations and discuss the implications of GDDA-BLAST. Finally, Chapter 6 presents
conclusions and recommendations for future research.
Chapter 2
GDDA (Gestalt Domain Detection Algorithm) – BLAST (Basic Local Alignment Tool) with Phylogenetic Profiles
2.1 Backgrounds and Motives
Despite decades of research, identifying the structural, functional, and
evolutionary characteristics of a protein from its amino acid sequence remains an
unsolved problem. For example, homology detection, used to infer the function and
structure of an unknown protein, is limited in its ability to identify homologous pairs
among highly divergent protein sequences [3]. Indeed, if
the identity of a pairwise alignment between protein sequences drops below 25%, the
alignment is not reliable for matching the two sequences and is
treated as a random event [14]. However, a small number of conserved residues, at
8% identity, can coordinate the 3-D fold and/or function of proteins, whereas two
proteins with 88% identity can still have independent structures and functions [15].
These studies raise fundamental questions about the
structure, sequence, and function of a protein. Which residues within an amino acid
sequence are important in determining the function and/or structure of a protein? Do
proteins with similar sequence and structure have a common ancestor? Furthermore, if
sequence and structure similarity suggest a shared evolutionary history, do weak
similarities mean that proteins have different evolutionary histories? All of these
questions are essentially connected to the relationship among the sequence, structure,
and function of a protein. However, none of these questions has been clearly resolved,
either experimentally or theoretically.
For example, common computational alignment programs such as BLAST and
FASTA fail to detect remote homologous sequences with sufficient statistical
significance [16]. To improve the performance of sequence alignment, Blake and
Cohen [17] built amino acid substitution matrices to measure the properties of amino acid
residues in a sequence. More recently, advanced sequence comparison methods have
been developed that use the features shared by related sequences in the same protein
family. Based on approaches such as templates [18,19], profiles [20,21], and HMMs
(Hidden Markov Models) [22,23], several popular programs such as PSI-BLAST [5] and
SAM [24] have improved the sensitivity of detecting distant homologues. In addition,
threading algorithms have been developed to improve the detection of homologous pairs
in the twilight zone [25]. Despite these improvements, these methods still cannot
annotate the relationships between the function and structure of a protein.
The purpose of all these methods is essentially to explore the information encoded in
sequences. Owing to recent advances in computer technology for knowledge bases and
the analysis of complex data, valuable information can be teased out of protein
sequences more accurately. Therefore, by integrating several advanced methods such as
phylogenetic profiles, RPS (Reverse Position-Specific)-BLAST, and profile databases for
the analysis of biological data, we proposed a unified framework, called GDDA-BLAST,
for inferring structural, functional, and evolutionary information from sequences. In
this chapter, we introduce the concept and background of GDDA-BLAST. We then
describe several studies that used this computational assay and their results.
2.2 GDDA-BLAST with phylogenetic profiles
A phylogenetic profile is a vector that encodes the presence or absence of a protein across
different genomes in order to predict functional relations and physical interactions between
proteins [26,27]. This approach has been applied to one entire sequence with one protein
(single profile method) or to separate segments of a sequence with different proteins
(multiple profile method). In principle, when proteins have similar patterns in their
sequences, the proteins may interact with each other directly or share a common
functional role in their pathways. Thus, the underlying hypothesis of phylogenetic profiling
is that functionally linked proteins tend to be inherited or eliminated in a correlated
manner, so that homologues of these proteins exist in the same subset of organisms.
Similarly, GDDA-BLAST creates a matrix that encodes the presence of alignments
of domain profiles across different proteins [12].
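As a minimal sketch of the classical phylogenetic profile described above, the following Python builds presence/absence vectors across genomes and compares them. The family names, the toy genomes, and the use of a Hamming distance are illustrative assumptions, not data or choices taken from this work.

```python
from typing import Dict, List, Set


def build_profile(family: str, genomes: List[Set[str]]) -> List[int]:
    """Return a 0/1 vector: 1 where the genome contains a homologue of `family`."""
    return [1 if family in genome else 0 for genome in genomes]


def hamming_distance(a: List[int], b: List[int]) -> int:
    """Count the genomes in which two profiles disagree."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy data: each set lists the protein families found in one genome.
genomes: List[Set[str]] = [
    {"famA", "famB", "famC"},
    {"famC"},
    {"famA", "famB"},
    {"famA", "famB", "famC"},
]

profiles: Dict[str, List[int]] = {
    name: build_profile(name, genomes) for name in ("famA", "famB", "famC")
}

# famA and famB are always inherited or lost together, so their profiles match.
same_pattern = hamming_distance(profiles["famA"], profiles["famB"])
different_pattern = hamming_distance(profiles["famA"], profiles["famC"])
```

Families with identical or near-identical vectors are the candidates for a functional link under the correlated-inheritance hypothesis.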
The basic idea of GDDA-BLAST is to collect a set of profiles that align to the
query sequence. These profiles can be obtained from various knowledge-base sources
such as PDB (Protein Data Bank), Pfam, SMART, and CDD (Conserved Domain
Database) from NCBI (National Center for Biotechnology Information), and/or from the actual
sequence of a representative protein domain. Then, RPS-BLAST is used to compare
query sequences with these profiles. RPS-BLAST searches protein sequences
against a database of PSSMs (position-specific scoring matrices) at
high speed, and it is informative for identifying the possible function(s) that the
query protein may have. However, it is not sensitive enough to identify divergent sequences. To
overcome this limitation, GDDA-BLAST employs new methods for aligning the
query sequence to the profiles with RPS-BLAST.
First, we utilize single-domain profile databases for pairwise comparisons. Since
RPS-BLAST searches for aligned profiles in a whole profile database, the search
becomes very slow when a thousand sequences are used to search the profiles. By dividing
the whole profile database into a number of single-domain profile databases, we increase
the speed of profile searches. Next, we record and quantify non-seeded alignments from
the unmodified query sequence and “seeded” alignments from modified query sequences. The
modified query sequences are generated with a “seed” from the profile to create a
consistent initiation site. This consistent site helps RPS-BLAST to extend an alignment
between highly divergent sequence segments. This approach is designed to amplify and
encode the alignments possible for any given query sequence. Seeds can be obtained at
multiple proportions (e.g. 3-50% “seed” size) from any region of the profile sequence
(e.g. N-terminal, middle, C-terminal). These seeds are inserted at each position of the
query, one at a time. Therefore, a query of N amino acids generates 2*N distinct test
sequences for each seed. Each of these test sequences is aligned by RPS-BLAST against
the parent profile.
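The seeding step can be sketched as follows. This is an illustration under stated assumptions rather than the GDDA-BLAST implementation: the function names, the rounding of the seed size, and the toy sequences are hypothetical, and a single insertion pass as written yields N sequences per seed, so it does not by itself reproduce the 2*N count quoted above.

```python
from typing import Iterator, Tuple


def take_seed(profile_consensus: str, fraction: float, region: str = "N") -> str:
    """Cut a seed covering `fraction` of the profile consensus from one region."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    size = max(1, round(len(profile_consensus) * fraction))
    if region == "N":
        return profile_consensus[:size]
    if region == "C":
        return profile_consensus[-size:]
    if region == "middle":
        start = (len(profile_consensus) - size) // 2
        return profile_consensus[start:start + size]
    raise ValueError("region must be 'N', 'middle' or 'C'")


def seeded_queries(query: str, seed: str) -> Iterator[Tuple[int, str]]:
    """Yield (position, modified query) with the seed inserted before each residue."""
    for pos in range(len(query)):
        yield pos, query[:pos] + seed + query[pos:]


# Hypothetical consensus and query, used only to show the shape of the output.
consensus = "ACDEFGHIKLMNPQRSTVWY"
n_seed = take_seed(consensus, 0.25, "N")
c_seed = take_seed(consensus, 0.25, "C")
mid_seed = take_seed(consensus, 0.25, "middle")
tests = list(seeded_queries("MKT", "AC"))
```

Each modified sequence would then be passed to RPS-BLAST against the parent profile; that external call is deliberately left out of the sketch.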
Based on these innovations, we developed GDDA-BLAST to improve the
performance of RPS-BLAST. As shown in Figure 2-1, the computational pipeline consists
of five procedures. First, we obtain domain profiles from multiple knowledge-base
sources such as Pfam, SMART and CDD, or from real sequences. Then we modify the
query sequence with a seed from the profile to create a consistent initiation site. Next,
each of these modified sequences is aligned against the parent profile by RPS-BLAST.
Figure 2-1: The workflow of GDDA-BLAST (panel labels: Seeding rps-BLAST; Signal
collection; Phylogenetic profiles). (i and ii) The algorithm begins with a
modification of the query amino acid sequence at each amino acid position via the
insertion of a seed sequence from the profile of interest. These seeds are obtained from
the profile consensus sequences from the Conserved Domain Database (CDD). (iii–v) Signals
are collected from optimal alignments between the “seeded” sequences and profiles by
using RPS-BLAST and are incorporated as a composite score into an N by M data matrix
[13].
In the fourth procedure, the results are filtered by thresholds on % coverage and
% identity, computed using Eq. (2.1) and Eq. (2.2):

$$\mathrm{Coverage}(\%) = 100 \times \frac{len_{alignment}}{len_{profile}} \tag{2.1}$$

$$\mathrm{Identity}(\%) = 100 \times \frac{num_{identical} - len_{seed}}{len_{alignment} - len_{seed}} \tag{2.2}$$

where
$len_{alignment}$ = the alignment length = $q_{end} - q_{start} + 1$
$q_{start}$ = the start position of a modified query sequence in the alignment
$q_{end}$ = the end position of a modified query sequence in the alignment
$len_{profile}$ = the length of the consensus sequence of a given profile
$len_{seed}$ = the sequence length of the seed inserted into the query
$num_{identical}$ = the number of identical residues in the alignment
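The filters of Eq. (2.1) and Eq. (2.2) can be sketched in Python as below. The sketch assumes that both the identical-residue count and the alignment length are corrected by subtracting the seed length, which is how Eq. (2.2) is read here; the threshold defaults and the example numbers are hypothetical and are not parameters reported in this work.

```python
def alignment_length(q_start: int, q_end: int) -> int:
    """Alignment length on the modified query: q_end - q_start + 1."""
    return q_end - q_start + 1


def coverage_pct(q_start: int, q_end: int, len_profile: int) -> float:
    """Eq. (2.1): percentage of the profile consensus covered by the alignment."""
    return 100.0 * alignment_length(q_start, q_end) / len_profile


def identity_pct(num_identical: int, q_start: int, q_end: int, len_seed: int) -> float:
    """Eq. (2.2) as read here: identity after removing the seed's contribution."""
    corrected_length = alignment_length(q_start, q_end) - len_seed
    if corrected_length <= 0:
        return 0.0  # the alignment is no longer than the seed itself
    return 100.0 * (num_identical - len_seed) / corrected_length


def passes_thresholds(coverage: float, identity: float,
                      min_coverage: float = 60.0,    # hypothetical default
                      min_identity: float = 10.0) -> bool:  # hypothetical default
    """Keep an alignment only if it clears both thresholds."""
    return coverage >= min_coverage and identity >= min_identity


# Hypothetical alignment: residues 11-90 of a seeded query against a 100-residue profile.
cov = coverage_pct(q_start=11, q_end=90, len_profile=100)
ident = identity_pct(num_identical=30, q_start=11, q_end=90, len_seed=10)
keep = passes_thresholds(cov, ident)
```

Subtracting the seed from the numerator and the denominator keeps the inserted residues, which match the profile by construction, from inflating the identity.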
The phylogenetic profile is finally generated from the filtered sequence alignments
and represented as an M (# of profiles) by N (# of queries) matrix. Then, a dendrogram is
produced from this profile on the basis of the Pearson correlation between query sequences
using Eq. (2.3). This dendrogram is used to predict the functional relationships
among query sequences. If a phylogenetic tree is instead built on the basis of the Euclidean distances
between the phylogenetic profiles from Eq. (2.4), we can also measure the evolutionary
distances among sequences. In the next chapters, we will introduce our studies, which
provide experimental results that support our functional and evolutionary predictions
using GDDA-BLAST.

$$r_{XY} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N\,\sigma_X\,\sigma_Y} \tag{2.3}$$

$$D(X, Y) = \sqrt{\sum_{i=1}^{M} (x_i - y_i)^2} \tag{2.4}$$

where $\bar{X}$ and $\bar{Y}$ are the averages of the values in X and Y, and $\sigma_X$ and $\sigma_Y$ are the standard
deviations of these values.
2.3 The prediction of functional characteristics of proteins by GDDA-BLAST
Since seeding allows RPS-BLAST to extend the alignment between highly
divergent sequences, we identified divergent domains in proteins using GDDA-BLAST
[12]. In particular, when we used multiple domain profiles as the parent profiles, we detected
multiple functional properties of a protein by GDDA-BLAST. For example, ankyrin
repeats can perform a number of functions such as ATP-binding, lipid-binding and
calmodulin-binding [28,29]. However, no current domain-detection algorithm
can resolve their multi-functional nature. Thus, to detect their multi-functional
characteristics, we generated multiple phylogenetic profiles for the vanilloid TRP (TRPV)
family using multiple sets of domain profiles: 131 peripheral lipid-binding (PLB), 98
integral lipid-binding (ILB), 58 trafficking (TRFK), 10 calmodulin-binding (CBD), 4
ankyrin repeat (ANK), and 574 ATP-binding (ATP) profiles. As shown in Figure 2-2 (a), we
observed signals for all of these profiles within the ankyrin repeats of the TRPV1 channel
at varying levels of intensity. To validate our predictions, we focused on the signal of the
ATP-binding domains among these signals.
Lishko et al. recently crystallized the ankyrin repeats of TRPV1 and TRPV2 and
found their structures to be highly similar [28]. They also discovered that both ankyrin
repeats bound to calmodulin, while only TRPV1 was capable of binding ATP in their
assays [28]. Indeed, when we obtained phylogenetic profiles for TRPV1 and TRPV2 using
GDDA-BLAST, we observed calmodulin signals in the ankyrin repeats of both TRPV1
and TRPV2. Comparing the ATP-binding signals of the two proteins, TRPV1 has a
robust ATP signal within its ankyrin repeats, while the ATP signal of TRPV2 is only
18% of that of TRPV1 (Figure 2-2 (b)). This result suggests that TRPV1 may bind ATP but
TRPV2 may not.
In addition, we predicted the conserved residues from the alignments of
ATP-binding domain profiles by GDDA-BLAST. As shown in Figure 2-2 (c), the top-scoring residue
in TRPV1 is E211, which coordinates the N6 amine binding of ATP in the active pocket.
Taken together, these results suggest that GDDA-BLAST can predict the functional
properties of a protein, in agreement with the experimental results from the literature.
Figure 2-2: GDDA-BLAST model of the ATP-binding ankyrin repeat in TRPV1 [12].
(a) GDDA-BLAST results for the human TRPV1 channel using 131 peripheral lipid-binding
(PLB), 98 integral lipid-binding (ILB), 58 trafficking (TRFK), 10 calmodulin-binding
(CBD), 4 ankyrin repeat (ANK), and 574 ATP profiles. (b) GDDA-BLAST
results for the screen of 574 ATP profiles in the ankyrin repeat domain of various TRP
channels were integrated to quantify the area under the curve and plotted in a bar graph.
(c) Left: Quantification of amino acid positions in the human TRPV1 ankyrin repeat which are
identical or similar in alignments with ATP profiles. Right: Crystal structure of the rat
TRPV1 ankyrin repeat complexed with ATP (PDB: 2PNN). Residues depicted in yellow
are homologous to those derived in human TRPV1.
2.4 The investigation of evolutionary relations among proteins using GDDA-BLAST
To determine evolutionary relationships between homologous proteins, we need to
measure evolutionary rates among the proteins. We assumed that this rate information can
be measured using a phylogenetic profile from GDDA-BLAST. As shown in Figure 2-1,
phylogenetic profiles from GDDA-BLAST are encoded as vectors. Each “seeded”
query can return either no alignment or an alignment that ranges over % identity
and % coverage using RPS-BLAST, and we encode this information into the N × M matrix
of these vectors. Then, Euclidean distances are generated from this N × M
matrix on the basis of the simple hypothesis that the distance between each pair of queries (N) in
the matrix is proportional to the rate of evolutionary divergence.
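The distance computation described above, together with the Pearson correlation of Eq. (2.3), can be sketched in plain Python as follows. The matrix orientation (one row per query, one column per profile), the use of the population standard deviation, and the toy scores are assumptions made for illustration; they are not values produced by GDDA-BLAST.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two profile vectors, in the spirit of Eq. (2.3)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return cov / (sd_x * sd_y)


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two profile vectors, in the spirit of Eq. (2.4)."""
    if len(x) != len(y):
        raise ValueError("vectors must be of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(rows: List[Sequence[float]]) -> List[List[float]]:
    """Pairwise Euclidean distances between all queries (rows)."""
    return [[euclidean(a, b) for b in rows] for a in rows]


# Hypothetical composite scores: 3 queries (rows) x 4 profiles (columns).
scores = [
    [3.0, 0.0, 4.0, 1.0],
    [6.0, 0.0, 8.0, 2.0],
    [0.0, 5.0, 0.0, 0.0],
]
r_01 = pearson(scores[0], scores[1])
dist = distance_matrix(scores)
```

In this toy matrix the first two queries hit the same profiles with proportional scores, so their correlation is 1 even though their Euclidean distance is non-zero. This illustrates why the two measures are used for different purposes (functional dendrogram versus distance tree).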
Indeed, Figure 2-3 presents the results of our characterization of 20 water-channel
(aquaporin) proteins with 23,605 profiles from the NCBI-CDD database [12]. In
this result, we find four distinct families with rates that accord with
previous studies employing multiple sequence alignment [30]. From random
considerations, the probability of organizing these twenty sequences correctly into 4
families is 9 × 10^-13. Therefore, these results demonstrate that phylogenetic profiles
derived by GDDA-BLAST can contain evolutionary rate information that is
independent of methods based on multiple sequence alignment. We believe that rigorous
analyses on benchmark training sets will enable us to make more refined and statistically
robust measurements among distantly related and/or rapidly evolving proteins.
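One way to arrive at a probability of the magnitude quoted above is to assume that each of the twenty sequences is assigned independently, and with equal probability, to one of the four families. This assumption is ours; the text does not state how the value was derived.

$$P = \left(\frac{1}{4}\right)^{20} = \frac{1}{4^{20}} = \frac{1}{1{,}099{,}511{,}627{,}776} \approx 9.1 \times 10^{-13}$$

Under this reading, the quoted value corresponds to the chance of a single fully correct assignment among all $4^{20}$ equally likely ones.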
Figure 2-3: Water channel (aquaporin) phylogeny [12]. Twenty Zea mays aquaporin
channels (plasma membrane intrinsic proteins (PIPs), tonoplast intrinsic proteins (TIPs),
Nod26-like intrinsic proteins (NIPs), and small and basic intrinsic proteins (SIPs)) were
screened with GDDA-BLAST. The Euclidean distance is generated from the composite
scores and plotted in an unrooted tree using the MEGA3 minimum evolution algorithm
[31]. The scale bar reflects the Euclidean distance between sequences, and the color coding
reflects the distinct and known classes of aquaporins. Our results are in excellent accord
with the findings of Chaumont et al. [32].
2.5 The prediction of structural boundaries of ion-channels using GDDA-BLAST
A recent study by Mio et al. obtained a cryo-EM structure of TRPC3 (Transient
Receptor Potential Channel 3) and modeled the six transmembrane helices with the
atomic structures of the potassium channels KcsA and Kv1.2 [33]. Interestingly, these
authors also determined that TRPC3 contains a globular, and presumably hydrophobic,
inner core surrounded by a signal-sensing antenna derived from the cytosolic N- and
C-termini (Figure 2-4(a)). We wondered whether these channel constituents could be
computationally modeled with GDDA-BLAST by generating phylogenetic profiles from
sequences that comprise the appropriate structural elements/biological functions of
interest.
Initially, we queried human TRPC channels with a curated set of 98
transmembrane domain containing profiles to generate our GDDA-BLAST phylogenetic
profiles. The distribution of the alignments which are above threshold is plotted in
Figure 2-4(b). The results from this experiment accurately model the channel domain in
human TRPC channels when compared with transmembrane predictions by the hidden
Markov model TMHMM and the domain detection algorithm SMART [34,35].
We then tested whether keyword searches of the NCBI Conserved Domain Database (CDD) could
be used to add further points of information to our phylogenetic profiles. We
collected 536 profiles in CDD that contain the keywords channel,
transmembrane, integral membrane, or pump, and repeated our analysis
(Figure 2-4(b)).
Figure 2-4: GDDA-BLAST models of the ion transport domain in TRPC channels.
(a) 3D reconstruction of TRPC3 channel derived by Mio et al.[33]. Blue lines depict the
plasma-membrane. The scale on the left depicts the cryo-electron microscopic images of
horizontal slices parallel to the plasma-membrane (images 6-9) progressing into the
cytosol (images 10-15). The globular inner shell can be seen as a circular density in the
center of the images. (b) GDDA-BLAST results for human TRPC channels using 98
curated integral lipid-binding (ILB) profiles and 576 profiles parsed with key words for
(channel, transmembrane, integral membrane, and/or pump). The latter were also
analyzed with different % coverage thresholds. Ion transport boundaries in TRPC
channels predicted by SMART (default settings) are noted with the N-terminal boundary
denoted by an arrow. GDDA-BLAST results predict that the globular inner shell domain
is located to the left of the arrow.
We observe that alignments against these profiles also model the channel domain
boundaries. In addition, a pronounced peak is evident in TRPC3/6/7 that significantly
differs in TRPC1/4/5. This signal likely represents the hydrophobic globular inner-core
domain in TRPC3 identified by Mio et al.[33], and suggests that the channel domains in
TRPC1/4/5 are likely different structurally and/or functionally from TRPC3/6/7.
To determine whether these signals are robust, we recalculated the data using % coverage
thresholds ranging between 60% and 100% (Figure 2-4(b)). Surprisingly, a 60%
threshold does not significantly alter the domain boundaries, but does increase the signal
in our results. Overall, the GDDA-BLAST model of TRPC ion-channel domains is in
excellent accord with other computational models and experimental evidence.
2.6 The discovery of novel lipid-binding domains in vitro
Using GDDA-BLAST with lipid-binding profiles, we also predicted regions of lipid binding in
proteins whose functions are not annotated by any conventional algorithm.
Then, we designed an assay to validate our predictions for these proteins. As shown
in Figure 2-5 (a), we observed multiple peaks in the histograms generated from these
alignments. Next, we cloned representative regions from each of these proteins and
prepared purified protein from bacteria. These purified proteins were subjected to liposomal
assays containing lipids which mimic the plasma membrane of animal cells.
Strikingly, each of the fragments containing GDDA-BLAST signals was positive
for lipid binding (Figure 2-5 (b)), whereas our negative controls were not. Although the
physiological relevance of these lipid-binding domains remains to be determined, these
results clearly demonstrate that phylogenetic profiles generated using ontological
relationships are effective for identifying putative functions within protein domains.
Figure 2-5: Functional information via GDDA-BLAST analysis. (a) GDDA-BLAST
results for three human proteins of unknown function (AAH33897, NP_872401, and
CAB45695.2) using 131 peripheral lipid-binding (PLB) profiles. The white bars depict
regions that we cloned for liposomal experiments in (b). (b) Western analysis of purified
CAB45695, AAH33897, and NP_872401 fragments cloned into a His vector (1 mg load).
These fragments were tested for binding to liposomes containing phosphatidylcholine
(PC), phosphatidylethanolamine (PE), phosphatidylserine (PS), and phosphatidylinositol (PI).
All fragments bound to liposomes except fragment 1 (CAB45695: aa 70-180) and the
His tag, in perfect accord with the predictions of GDDA-BLAST.
2.7 Summary and discussion
In summary, in this chapter we introduced a new tool that uses phylogenetic profiles to infer
structural, functional and evolutionary information from the amino acid sequence of a
protein. GDDA-BLAST is a unified computational pipeline for
measuring the structural, functional and evolutionary characteristics of a protein using
phylogenetic profiles with a carefully selected set of profiles. GDDA-BLAST rests on two
hypotheses. First, the primary amino acid sequence contains information
on the structure, function and evolution (SF&E) of a protein; second, this SF&E information can
be inferred from the sequence by a unified method, even if the pairwise identity of
sequences is below 25%.
Based on these hypotheses, GDDA-BLAST consists of five procedures. First, we
utilize single-domain profile databases for pairwise comparisons. Then, we modify the
query with a “seed”. This seed can be generated from a profile by taking any fraction of
the profile, such as the N-terminus or C-terminus. The seed is inserted into every position of
the query, one position at a time, creating a consistent initiation site. This site allows RPS-BLAST to
extend an alignment even between highly divergent sequences. This resampling strategy
is designed to amplify and encode the alignments possible for any given query sequence.
Next, the results are filtered using thresholds such as % identity and % coverage. Finally, the
phylogenetic profiles are generated by representing each sequence as a vector of
non-negative numbers. These profiles can be used to create a dendrogram of functional
relationships among proteins using Pearson correlations or a phylogenetic tree using
Euclidean distances.
In our previous studies, GDDA-BLAST accurately modeled structural and
functional relationships in TRP channels through these procedures. This is supported by
our findings that GDDA-BLAST predicts: (i) the ion-channel domains of TRP channels,
(ii) lipid-binding and trafficking function within the previously uncharacterized TRP_2
domain, and (iii) the multi-functional (lipid-, calmodulin-, and ATP-binding) nature of
ankyrin repeats within TRP channels. Our experimental evidence demonstrates that
TRP_2 in TRPC3 is a lipid/trafficking domain that contributes to DAG-sensitive
vesicle fusion. The models of TRPC channels by GDDA-BLAST also recapitulate
experimental evidence from other laboratories. For example, the homologous C-terminal
domain of TRPC6 was recently reported to bind both PIP3 and calmodulin in various ion
channels, yet it is undetectable by conventional methods [12].
GDDA-BLAST readily predicts this domain and its functions. GDDA-BLAST
also accurately models the ATP-binding activity contained in the ankyrin repeats of the
structurally resolved TRPV channels [12]. We also observe a segmented signal in
TRPC3/6/7 when tested by GDDA-BLAST with transmembrane domain profiles, which
likely represents the globular inner-core domain observed in the cryo-EM structure
obtained by Mio et al. [33]. In addition, GDDA-BLAST predicts that all ion channels
resident in the plasma membrane likely contain peripheral lipid-binding and trafficking
domains, based on the multiple lipid-binding domains that we observed in all channels
tested (e.g. aquaporins and Na+, K+, Cl-, and Ca2+ channels). All of these channels have
been demonstrated, empirically, to interact with lipids [63].
From these results, we concluded that GDDA-BLAST measurements can be
treated as “fingerprints” of structural, functional and evolutionary information. Through
the careful choice of knowledge-base profiles related to either structural or functional
qualities, GDDA-BLAST provides results which can be used to infer evolutionary rate
information, create functional models and identify structural boundaries for protein
sequences, even if no prior information exists. Perhaps most importantly, GDDA-BLAST
has the capacity to inform laboratory experiments of key amino acids essential to protein
function, thus speeding the discovery process. Our studies here demonstrate one way of
using phylogenetic profiles to quantitatively probe knowledge bases to obtain structural,
functional and evolutionary information within the same unified framework. Future
work should determine which data points collected by GDDA-BLAST are
informative for structural, functional and evolutionary annotation, and which ones are
sufficiently noisy that they are detrimental to the total information content. This will
enable us to understand and harness the underlying mechanisms of our algorithm,
optimizing and refining our approach. To this end, we suggest methods to
improve the performance of GDDA-BLAST in the next chapter.
Chapter 3
The Performance of GDDA-BLAST in homology detection
3.1 Background and Motivation
Since proteins with similar sequences can share similar structures, the homology
between a known protein and an unknown protein is used to predict the structure and
function of the new protein. In the modeling procedure, a new sequence is
usually compared against all the known sequences in a database. If homology is
established, the structure and function of the new protein can be inferred from the
homologous protein.
To identify such a relationship, the similarity between the sequences is
calculated from their sequence alignment. If the similarity between two sequences is above
a threshold such as 25%, the new and known sequences have been proposed to be
closely related [4]. If their sequence identity is not high enough to establish the
relationship, we need to decide whether they are related or not. Sequence-sequence
comparison algorithms generally cut off pairwise alignments below 25% identity.
However, empirical analyses have shown that some sequences with low identity still have
functional and/or structural relationships because these sequences are distantly related in
their evolution [14].
A main reason for this problem is the influence of evolution. Even though
sequences can change significantly through mutations and insertions, many
proteins still have the same folds and close functional relationships despite low sequence
similarity. However, the sensitivity of homology-based
methods in detecting homologous proteins drops suddenly below 25% sequence identity because these methods
discard alignments below 25%.
To detect homologous sequences with weak identities, one possible solution is
to increase the sensitivity of sequence comparison. To increase sensitivity, we need
to modify the process by which sequence similarity is calculated. For example, instead of
comparing two sequences directly, many programs use statistical information about protein
families, such as PSSMs (Position-Specific Scoring Matrices) and HMMs (hidden Markov
models), as illustrated in Figure 3-1 [6]. While a PSSM contains the frequencies of the residues at
specific positions of the sequence, an HMM holds the probabilities of the residues that
occur at those positions.
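The position-specific frequencies mentioned above can be illustrated with a short Python sketch that tabulates residue frequencies per column of a small alignment. The toy alignment and the function name are hypothetical. Practical PSSMs, such as those built by PSI-BLAST, additionally apply sequence weighting, pseudocounts, and log-odds scaling, none of which is attempted here.

```python
from collections import Counter
from typing import Dict, List


def column_frequencies(alignment: List[str]) -> List[Dict[str, float]]:
    """Residue frequencies for each column of an alignment; gaps ('-') are ignored."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    table: List[Dict[str, float]] = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        counts = Counter(residues)
        total = len(residues)
        table.append({aa: n / total for aa, n in counts.items()} if total else {})
    return table


# Hypothetical four-sequence alignment of a three-residue segment.
alignment = ["MKV", "MRV", "MKI", "MK-"]
freqs = column_frequencies(alignment)
```

Each dictionary in `freqs` plays the role of one column of the profile: a conserved position concentrates its mass on one residue, while a variable position spreads it across several.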
Even though conventional homology-based methods such as PSI
(Position-Specific Iterative)-BLAST and SAM (Sequence Alignment and Modeling system)
increase the sensitivity of distant-homologue detection on the basis of PSSMs or
HMMs, they still fail to detect sequences with very weak similarities, such as below 10%,
because of stringent thresholds for defining significant sequence similarity [3]. In an
attempt to rectify these shortcomings, GDDA (Gestalt
Domain Detection Algorithm)-BLAST was developed to increase the sensitivity of
RPS-BLAST by amplifying alignments with low identities. With this increased sensitivity,
GDDA-BLAST detects signals of divergent alignments between domain profiles and the
protein sequence that other computational algorithms cannot detect.
Based on these signals, GDDA-BLAST can search for homologous pairs among a
large number of proteins more sensitively. In addition, using multiple domain profiles
from various knowledge-base sources such as PDB, Pfam, SMART, CDD and/or real
sequences, GDDA-BLAST can also generate phylogenetic profiles from which we
are able to derive biological information related to structure, function and evolution
from the sequences.
To evaluate the performance of GDDA-BLAST, we need an objective
measurement for functional, structural and evolutionary predictions. Among these
predictions, we first evaluate the performance for structural homology detection.
Thus, we selected the PDB40D-J dataset, which contains 935 sequences from SCOP, to
measure its performance. Using these sequences, we compared the performance of
two methods, PSI-BLAST and SAM-T21K, in detecting homologous pairs in the PDB40D-J
dataset with that of GDDA-BLAST. In this chapter, we explain the procedures and dataset for the
performance evaluation, and suggest methods to improve the performance of
GDDA-BLAST.
Figure 3-1: The statistical information of protein families. (a) An example of a 49-residue
sample profile, generated from the four probe sequences located at the left [52].
(b) A model HMM, modeling sequences of a's and b's as two regions of potentially
different residue composition [6].
3.2 Results and Discussion
3.2.1 Datasets for the performance evaluation
For the evaluation, we used a structural benchmark dataset from the Structural
Classification of Proteins (SCOP) database. The SCOP database provides detailed and
comprehensive information on the structural and evolutionary relationships of proteins
whose structures have already been determined in wet-lab experiments. The unit of classification
in SCOP is the protein domain: small proteins with a single domain are treated as a
whole, and the domains within large proteins are classified individually. As Figure 3-2
depicts, the classification in the database consists of five hierarchical levels based on
evolutionary and structural relationships [36].
In the classification, if the sequence identities between proteins are over 30%, or if
the functions and structures of the proteins are very similar to each other even at low identity,
these proteins are clustered into the same family, which has a common evolutionary origin.
Proteins whose identities are low but whose common evolutionary origin is probable
are categorized into superfamilies. If proteins in different superfamilies and families have
the same major secondary structures, these proteins belong to a common fold. Finally, the
different folds are divided into classes for user convenience. Based on the secondary
structures of which the folds are composed, they are assigned to one of five classes:
i) all alpha, ii) all beta, iii) alpha and beta, iv) alpha plus beta, and v) multi-domain
[36].
Among these hierarchies, we use the sequences in superfamilies to evaluate the
performance of homology detection algorithms because the proteins in superfamilies
represent the boundaries of groups which share the same structural and functional
features or have common evolutionary origins [16]. Among the many datasets that include
superfamilies, we selected the PDB40-J dataset from the literature, which contains 935
sequences whose sequence identities are less than 40% [16]. In addition, we extracted 289
sequences in the twilight zone, where the sequence identities are below 25%, from PDB40-J
because most homology-based algorithms lose their sensitivity to detect homologous
sequences in this region.
[Figure 3-2 labels: 1086 folds; 1777 superfamilies; 3464 protein domains; 97178 protein domains from different species]
Figure 3-2: Five hierarchical levels of SCOP classification [36]. The unit of classification
in SCOP is the protein domain. Small proteins with a single domain are treated as a
whole, and the domains within large proteins are classified individually.
We calculated the sensitivity and specificity from the numbers of true and
false homologous pairs that PSI-BLAST, SAM and GDDA-BLAST detected in these
two datasets, and then compared the performance of the methods on the basis of their sensitivity
and specificity. The measuring procedures are discussed in the
following sections.
3.3 Homology detection methods for the performance evaluation
To evaluate the performance of GDDA-BLAST, we compared its performance to
those of PSI-BLAST and SAM because they are representative
homology-based methods. As shown in Figure 3-3(a), PSI-BLAST iteratively searches for a set
of sequences which may be homologues, either for a fixed number of iterations or until it cannot find new
homologues. In the PSI-BLAST procedure, gapped BLAST first collects an initial set of
homologues from a sequence database such as NR (Non-Redundant protein database)
for a given query sequence. Then, weighted multiple alignments are generated using the
query sequence and the homologues whose scores are above a specified cut-off value. Next,
a new PSSM is constructed on the basis of the multiple alignments. Using this PSSM, it
searches the database for new homologues. These procedures are repeated until the
results satisfy the conditions given by the user [5].
Using an HMM instead of a PSSM, SAM follows a procedure similar to that of
PSI-BLAST (Figure 3-3(b)). First, SAM creates an initial HMM from a given query
sequence. After searching a sequence database for potential homologues with the initial
HMM, it selects, from among the potential homologues, new sequences that have reliable
local alignment scores with the HMM. After multiple alignments are generated using these
new sequences, a new HMM is created from the multiple alignments. These procedures
are repeated for a fixed number of iterations [22]. For the performance evaluations with our datasets, we
used the default parameters of PSI-BLAST and SAM, such as an E-value of 0.001 and three
iterations.
Figure 3-3: The schemes of homology-based methods. (a) The scheme of PSI-BLAST
with sequence profiles. (b) The scheme of SAM with HMM
3.4 The performance evaluation
After PSI-BLAST, SAM and GDDA-BLAST collect potential homologues in our
dataset, we evaluate their performance in the following steps. First, we calculate the
similarity scores between test and reference sequences. Then, we rank the test sequences in
ascending order on the basis of their similarity scores. After counting the number of true and
false positives and negatives within a sliding window, we draw a Receiver Operating
Characteristic (ROC) curve.
For the similarity score of PSI-BLAST and SAM, we calculated the E-value, which
represents the number of hits expected by chance when searching a database of a
particular size, using Eq. (3.1). For the similarity score of GDDA-BLAST, we used the Hybrid
LogWeighted scoring scheme of Eq. (3.2). This scoring scheme consists of two steps.
First, we calculate the scores of three phylogenetic profiles: # of hits, % of max.
coverage, and % of avg. identity. Then, we adjust their scores on the basis of the
frequency of the domains aligned with the queries.

$$E = K\,m\,n\,e^{-\lambda S} \tag{3.1}$$

where K and λ are parameters, m is the length of a domain sequence, n is the length of a
query sequence, and S is the bit score.

$$Sim(x, y) = \prod_{T \in \{H, I, V\}} PC\big(adj(T_x),\, adj(T_y)\big) \tag{3.2}$$

where H is the number of hit alignments, I is the average identity, V is the maximum
coverage, PC is the Pearson correlation, and adj is the frequency-based adjustment.
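As a small numerical illustration of Eq. (3.1), the sketch below computes E-values and ranks hits by them. It assumes the usual negative exponent of the Karlin-Altschul form; the parameter values, hit names, and scores are illustrative and are not taken from the evaluation in this chapter. A single domain length m is shared by all hits only to keep the example short.

```python
import math
from typing import List, Tuple


def e_value(score: float, m: int, n: int, k: float, lam: float) -> float:
    """Eq. (3.1): E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


def rank_by_e_value(hits: List[Tuple[str, float]], m: int, n: int,
                    k: float, lam: float) -> List[Tuple[str, float]]:
    """Return (name, E-value) pairs in ascending order of E-value (best hit first)."""
    scored = [(name, e_value(score, m, n, k, lam)) for name, score in hits]
    return sorted(scored, key=lambda item: item[1])


# Illustrative parameters and hits (hypothetical, chosen only for the example).
K_ILLUSTRATIVE = 0.041
LAMBDA_ILLUSTRATIVE = 0.267
hits = [("hitA", 35.0), ("hitB", 80.0), ("hitC", 50.0)]
ranked = rank_by_e_value(hits, m=120, n=300,
                         k=K_ILLUSTRATIVE, lam=LAMBDA_ILLUSTRATIVE)
```

Because the E-value decreases exponentially with the score, ranking in ascending order of E-value places the highest-scoring hits first.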
In detail, after calculating the E-values of all potential homologues in PSI-BLAST and
SAM, we rank the potential homologues in ascending order of E-value.
Then, changing the window size, we count the number of true and false positive and
negative homologous pairs among the potential homologues. For GDDA-BLAST, after
calculating the Pearson correlation among the three phylogenetic profiles, we adjust the value of
each phylogenetic profile on the basis of the frequency of the domains aligned with the
queries. Then we multiply the individual scores together to obtain the total score. Based on these scores,
we count the number of true and false positive and negative homologous pairs among the
potential homologues with a sliding window.
Since the ROC curve is a simple way to represent the relationship between the
false positive rate (FPR), which is 1 − specificity, and the sensitivity, we calculate the
sensitivity and specificity for the detection of true homologous pairs from the numbers of
true and false positive and negative homologous pairs. The sensitivity measures the
proportion of true positives that are detected, Eq. (3.3), and the specificity measures the
proportion of true negatives that are correctly rejected, Eq. (3.4).

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{3.3}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{3.4}$$

where TP is the number of true positives, TN is the number of true negatives, FP is
the number of false positives, and FN is the number of false negatives.
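The ranking and counting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the example hits are hypothetical, and the sliding window is modelled as a cutoff that grows by one ranked hit at a time.

```python
def roc_points(hits):
    """Return (false positive rate, sensitivity) points for ranked hits.

    hits: list of (e_value, is_true_homolog) pairs. Hits are ranked in
    ascending order of E-value and the window grows one hit at a time.
    """
    ranked = sorted(hits, key=lambda h: h[0])
    positives = sum(1 for _, is_true in ranked if is_true)
    negatives = len(ranked) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true and one false pair")

    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
        fn = positives - tp
        tn = negatives - fp
        sensitivity = tp / (tp + fn)   # Eq. (3.3)
        specificity = tn / (tn + fp)   # Eq. (3.4)
        points.append((1.0 - specificity, sensitivity))
    return points

# Hypothetical hits: (E-value, whether the pair is a true homologue).
hits = [(1e-10, True), (1e-5, True), (0.01, False), (0.5, True), (3.0, False)]
for fpr, sens in roc_points(hits):
    print(fpr, sens)
```

Plotting the returned points with the false positive rate on the X-axis and the sensitivity on the Y-axis gives the ROC curve.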
Based on the sensitivity and specificity from these equations, we first plotted the
performance of the three methods on the PDB40D-J dataset. In Figure 3-4 (a), the X-axis
represents the false positive rate and the Y-axis represents the sensitivity. Even though we
could measure the performance of PSI-BLAST only up to a false positive rate of 0.3,
because of a limitation in the data that PSI-BLAST reports, the overall performance of
GDDA-BLAST is better than those of PSI-BLAST and SAM. In particular, at false
positive rates below 0.05 (the red circle on the left of Figure 3-4 (a)), GDDA-BLAST is
more sensitive than the other methods in detecting homologous pairs. Based on these
results, we concluded that GDDA-BLAST performs better than the other methods for the
detection of structural homologues in a dataset whose sequence identities are below 40%.

Figure 3-4: The ROC graphs for the performance evaluation of GDDA-BLAST. (a) The
comparison of the performance of GDDA-BLAST, PSI-BLAST, and SAM using the
superfamily dataset. (b) The comparison of the performance of GDDA-BLAST,
PSI-BLAST, and SAM using the twilight zone dataset.
Since many homology-based methods lose sensitivity for the detection of
potential homologues in the twilight zone, we also measured the performance of these
methods with sequences in this zone. Although all three methods lose sensitivity in
detecting homologous pairs, the overall performance of GDDA-BLAST is still better than
those of the others. In the range of false positive rates below 0.05 (the red circle on the
left of Figure 3-4 (b)), SAM performs better than GDDA-BLAST below 0.02, but
GDDA-BLAST surpasses SAM in sensitivity above that point. Therefore, these
two ROC curves show that GDDA-BLAST outperforms SAM and PSI-BLAST overall for
the detection of homologous sequences in superfamilies and the twilight zone.
3.5 Summary and discussion
In this chapter, we evaluated the performance of GDDA-BLAST for homology
detection. For the evaluation, we selected PDB40D-J to measure the number of true
homologous pairs detected by GDDA-BLAST. PDB40D-J contains 935 sequences with
pair-wise identities of less than 40%, drawn from the superfamilies of the Structural
Classification of Proteins (SCOP) database. We also extracted 289 sequences with less
than 25% pair-wise identity from PDB40D-J to evaluate the performance in the twilight
zone. A total of 26,374 domain profiles from CDD and PDB were used as profiles for
GDDA-BLAST.
First, after aligning the sequences, we calculated the similarity scores between each
test and reference sequence to evaluate the performance of GDDA-BLAST, PSI-BLAST,
and SAM. For the similarity score of PSI-BLAST and SAM, we used the E-value, which
represents the number of hits expected by chance when searching a database of a
particular size. For the similarity score of GDDA-BLAST, the Hybrid LogWeighted
scoring scheme was used. This scheme consists of two steps. First, we calculate the
scores of three phylogenetic profiles: the number of hits, the percentage of maximum
coverage, and the percentage of average identity. Then, these scores are adjusted on the
basis of the frequency of the domains aligned with the queries. Next, the test sequences
are ranked in ascending order of their similarity scores. Based on the numbers of true and
false positives and negatives within a sliding window, the receiver operating
characteristic (ROC) curve of each method is drawn.
As shown in Figure 3-4, the performance of GDDA-BLAST is better than those of
PSI-BLAST and SAM on the superfamily and twilight zone datasets. At very low false
positive rates (<0.05), the sensitivity of GDDA-BLAST is higher than those of
PSI-BLAST and SAM. This means that GDDA-BLAST is more sensitive in detecting
homologous pairs than the other methods.
Even though GDDA-BLAST outperforms SAM and PSI-BLAST for the detection
of structural homologues in superfamilies and the twilight zone with the PDB40D-J
dataset, GDDA-BLAST still has disadvantages that we need to address. First, we should
develop a method to build domain profiles that generate the best phylogenetic
profiles for predicting proteins with a specific function or structure. Generally, we used
domain profiles selected from CDD and PDB to generate the phylogenetic profiles for the
analysis of proteins. Despite being useful for the functional prediction of some proteins,
these profiles are not sufficient to predict the functions of many proteins because some
domains in the profiles introduce noise into the phylogenetic profiles. For example,
if we use domain profiles from CDD to predict evolutionary relationships among RT
sequences, the phylogenetic tree built with all domain profiles is worse than a
phylogenetic tree built with domain profiles from the RT sequences themselves [13].
Second, we need to develop the best scoring scheme for comparing performance
in homology detection, because the performance depends on the score assigned to each
sequence. Homology-based methods generally report the potential homologues with
their scores after searching for them in the reference database. Among many scores, the
E-value and the hit score are popular standards for detecting homologous pairs. Since the
E-value depends on the size of the database and the hit score is determined by the
number of identical residues, these methods sometimes fail to detect remote
homologous sequences with low sequence identities. To overcome these problems,
GDDA-BLAST uses the Pearson correlation to measure the similarity among the
phylogenetic profiles of sequences. Although the Pearson correlation is independent of
sequence identity and database size, it is not sufficient by itself to measure the similarity
between sequences because it is too sensitive to noise in the phylogenetic profiles.
Therefore, we need to implement a scoring system to measure the similarity of
phylogenetic profiles.
Finally, we have to design residue-based phylogenetic profiles to collect accurate
information from sequences. Several studies [15,65] have shown that a small number of
conserved residues in sequences with 8% sequence identity can coordinate the 3D fold
and/or function of proteins, with large portions of these proteins comprising
heteromorphic pairs. Therefore, if we can extract from a sequence the features of the key
residues that determine the functional and structural characteristics of a protein, we could
accurately measure the similarity among the residue-based phylogenetic profiles of the
sequences.
Chapter 4
The identification of RNA binding proteins using the quantitative functional measurement
RNAs in a cell have many functions, serving as carriers of genetic information,
catalysts of biochemical reactions, adapter molecules in protein synthesis, and regulators
of RNA splicing and telomere maintenance [37]. To identify the functions of RNAs, we
should understand the functions of RNA binding proteins, because RNA interacts with a
diversity of proteins to regulate a multitude of additional cellular functions such as
pre-mRNA processing, splicing, and translation [38]. Therefore, if we identify the RNA
binding proteins related to a specific biological process, we can discover the functions of
RNAs in that process. However, since RNA structures vary, the structures of the proteins
that interact with RNAs can be very diverse. Indeed, RNA binding proteins can be
classified into six families on the basis of their basic binding motifs [39], and the proteins
in the same family do not share common structures, as shown in Figure 4-1.
For example, while an arginine-rich motif is an unstructured motif [40,41], a motif
in the αβ protein domain family consists of several antiparallel β sheets and α helices
[42]. In addition, a multimeric motif is composed of multiple proteins or repeats of the
same structural motif [43,44], whereas a zinc-finger motif contains several zinc-finger
peptides and α helices [45,46].
Figure 4-1: The structures of RNA binding proteins. (a) The structure of arginine-rich
protein family [40,41]. (b) The structure of all-helical protein family [47]. (c) The
structure of αβ protein [42]. (d) The structure of zinc finger protein family [45,46]. (e)
The structure of multimeric protein family [43,44]. (f) The structure of RNA-targeting
enzyme [48]
In addition, even within the same RBP family, the RNA interaction sites need not
be conserved. Taken together, these properties make it difficult to identify RNA binding
proteins in silico and in vitro. In this chapter, we first review existing methods for the
identification of RNA binding proteins. Then, we introduce a computational assay to
overcome their disadvantages. Finally, we analyze and discuss the results in more detail.
4.1 A classification library for RNA binding proteins
The RNA electrophoretic mobility shift assay (EMSA) is generally used for the
identification of RNA binding proteins in vitro. In principle, nucleic acid probes bound by
a protein move more slowly because the speed of molecules through the gel is
determined by their size and charge [49]. Based on this property, we can
electrophoretically separate a protein-DNA or protein-RNA complex from unbound
probes, as shown in Figure 4-2. However, although the identification is accurate, it takes
a long time to identify RNA binding proteins in this way.
To speed up the analysis, multiple algorithms such as homology-based methods,
support vector machines (SVMs), and phylogenetic methods have been developed for the
identification of RNA binding proteins in silico. Among these methods, SVM is very
popular because it can be easily implemented. Bock and Gough [66] first showed that
SVM is applicable to predicting RNA-binding proteins from protein primary sequence.
More recently, Yu et al. [67] predicted the functional classes of RNA binding proteins on
the basis of their targets, such as rRNA, mRNA, tRNA, and viral RNA, using a variety of
sequence-based information. Even though these computational methods are reliable for
the identification of RNA binding proteins, their predictions are still limited [37].
For example, to evaluate their performance briefly, we used Interproscan and SVM
to identify 54 RNA binding proteins that contain an RRM. As shown in Table 4-1, the
accuracies of these methods are 96.3% and 62.96%, respectively. In terms of accuracy,
Interproscan performs better than SVM. However, while Interproscan can only detect the
regions of RRMs, SVM can predict the functions of these proteins (Figure 4-3). To
overcome this limitation, we need a new method that can both predict the regions of RNA
binding domains and annotate the functions of a protein using quantitative measurements.

Figure 4-2: The overview of EMSA [50]. A protein-DNA or protein-RNA mixture is
separated from other probes on the basis of the difference in molecular size.
To address this need, we applied GDDA-BLAST to the identification of RNA
binding proteins. Following the procedure in Figure 4-4, we clustered 16 positive
sequences with RRMs and 25 negative sequences. After drawing the dendrogram of the
sequences, we found a problem in identifying RNA binding proteins with GDDA-BLAST.

As shown in Figure 4-5, the dendrogram of the sequences analyzed by
GDDA-BLAST contains false positive sequences. Because of these sequences, we cannot
predict the function of RNA binding proteins accurately. Thus, we need new strategies to
eliminate these sequences. The first is to define a threshold to filter the sequences, and
the second is to design a new phylogenetic profile.

Table 4-1: The comparison of the performances between Interproscan and SVM

                  Interproscan   SVM
True positive     52             34
False negative    2              20
Accuracy          96.3%          62.96%

Figure 4-3: The problems of functional annotations in conventional programs. (a) The
functional prediction using Interproscan. (b) The functional annotation from NCBI.
Figure 4-4: The pipeline of GDDA-BLAST for the identification of RNA binding
proteins. After generating the phylogenetic profiles from the positive alignments, the
Pearson correlations among these profiles are calculated. Based on these values, the
sequences are clustered by hierarchical clustering.

To apply these strategies to GDDA-BLAST, we implemented a classification
library for the identification of RNA binding proteins. As shown in Figure 4-6, we first
collect real sequences from a biological database (DB) such as NCBI. Then, we generate
domain profiles from the real sequences. After aligning the sequences against the domain
profiles with GDDA-BLAST, we calculate, on the basis of the positive alignments, the
normalized scores of all residues in a query, the average score of each query, and the
norm of each average score. Next, the false positive sequences are filtered out of the
queries by a threshold derived from the norms of the average scores for all queries.
After filtering, we generate a residue distribution matrix of the positive sequences to
investigate the functional or structural relationships of these sequences. Finally, we
cluster the positive sequences using hierarchical clustering and Pearson's correlation.
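The final clustering step can be illustrated with a small, self-contained sketch. It is not the implementation used in this work: the profile vectors are hypothetical, the distance 1 − r and the average-linkage rule are assumptions (the text specifies only hierarchical clustering with Pearson's correlation), and the function names are invented for the example.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation between two equally long profile vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (norm_u * norm_v)

def cluster(profiles, n_clusters):
    """Average-linkage agglomerative clustering on the distance 1 - r."""
    names = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical four-feature profiles, for illustration only.
profiles = {
    "rbp1": [9, 8, 1, 0],
    "rbp2": [8, 9, 0, 1],
    "neg1": [0, 1, 9, 8],
    "neg2": [1, 0, 8, 9],
}
groups = cluster(profiles, 2)
print(groups)
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the dendrogram.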
Figure 4-5: The false positive sequences in phylogenetic profiles from GDDA-BLAST.
Red boxes represent the negative sequences. The group of positive sequences contains
several negative sequences.

To define a threshold for the elimination of false positive sequences, the
normalized scores of residues are first calculated from the total scores of residues
in the positive alignments using Eq. (4.1). In an alignment between a query
and a domain profile, if two residues are identical, the residue in the query is
assigned a score of 2, and if two residues are similar, it is assigned a score of 1. After
assigning scores to all residues in the query, we add all the scores to obtain the total score
for the query. We then normalize the score of each residue by the average of the total
score.
Then, we calculate the average score of each sequence for filtering the sequences
using Eq. (4.2). Finally, because these scores are proportional to the length of the query,
we calculate the norms of the average scores with Eq. (4.3) to reduce the effect of length.
Figure 4-6: A classification library for the identification of RNA binding proteins. (i-ii)
Domain profiles are generated on the basis of real sequences from NCBI database. (iii-iv)
The modified sequences are aligned against the parent profile by rps-BLAST to collect
positive alignments. (v-vi) The sequences are divided into positive and negative groups
using a threshold calculated from the average residue scores of queries. (vii-viii) The
functional dendrogram is built using a residue distribution matrix generated from the
positive sequences.
To investigate the functional or structural relations among the sequences, we also
designed a new residue-based phylogenetic profile. As shown in Figure 4-7, the matrix
contains the compositions of the 20 amino acids and 3 descriptors of chemical features.
The composition of an amino acid is the number of occurrences of that amino acid
divided by the total number of amino acids.
The 3 descriptors represent the global composition of specific chemical groups:
composition (C), transition (T), and distribution (D). Composition (C) is the number of
amino acids in a particular chemical group divided by the total number of amino acids in
all chemical groups. Transition (T) is the frequency of transitions from one chemical
group to another along the sequence. Distribution (D) is the chain length within which
the first, 25%, 50%, 75%, and 100% of the amino acids of a specific chemical group are
located [51].
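The three descriptors can be sketched in a few lines. This is an illustrative reading of the definitions above, not the code used in this work: the test sequence is hypothetical, the transition is normalized by the number of adjacent residue pairs (an assumption), and fractional ranks in the distribution are rounded up (also an assumption, chosen so that 25% of 10 residues corresponds to the third residue).

```python
def composition(seq, group):
    """Percentage of residues in seq that belong to group."""
    return 100 * sum(1 for r in seq if r in group) / len(seq)

def transition(seq, group_a, group_b):
    """Percentage of adjacent residue pairs that switch between the two groups."""
    switches = sum(
        1 for x, y in zip(seq, seq[1:])
        if (x in group_a and y in group_b) or (x in group_b and y in group_a)
    )
    return 100 * switches / (len(seq) - 1)

def distribution(seq, group):
    """Chain length, as a percentage of the sequence, that contains the
    first, 25%, 50%, 75% and 100% of the residues of the group."""
    positions = [i for i, r in enumerate(seq, start=1) if r in group]
    n = len(positions)
    if n == 0:
        return [0.0] * 5
    # Rank of the residue that completes each fraction (ceiling, at least 1).
    ranks = [1] + [max(1, -(-n * q // 100)) for q in (25, 50, 75, 100)]
    return [100 * positions[k - 1] / len(seq) for k in ranks]

# Hypothetical two-letter sequence with 4 As and 6 Bs.
seq = "ABBAABBBAB"
print(composition(seq, "A"), composition(seq, "B"))
print(transition(seq, "A", "B"))
print(distribution(seq, "A"))
print(distribution(seq, "B"))
```

For real proteins, each group would be a set of amino acids, for example a hypothetical hydrophobic group passed as a string of one-letter codes.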
Figure 4-8 shows an example of a hypothetical protein sequence that has
10 As and 16 Bs. The compositions of these two amino acids are
10*100/(10+16)=38.5% for A and 16*100/(10+16)=61.5% for B. The transition of A is
(10/26)*100=38.46% and the transition of B is (16/26)*100=61.54%. The first, 25%,
50%, 75%, and 100% of the As are located within the first 1, 4, 12, 17, and 26 residues,
respectively. Thus, the D descriptor for the As is (1/26)*100=3.8%, (4/26)*100=15.4%,
(12/26)*100=46.1%, (17/26)*100=65.4%, and 100%. In the same way, the D descriptor
for the Bs is 7.5%, 23.1%, 53.8%, 79.9%, and 92.3% [64].

The residue scores used to derive the threshold, Eqs. (4.1) to (4.3), are:

$$\text{Normalized score of a residue} = \text{Raw score} - \frac{\text{sum of total scores}}{\text{length of a query}} \tag{4.1}$$

$$\text{Average score of a query} = \frac{\text{sum of positive normalized scores}}{\text{number of residues with positive scores}} \tag{4.2}$$

$$\text{Norm of average score} = \frac{\text{average score of a query} \times 100}{\text{length of a query} \times 2} \tag{4.3}$$
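Equations (4.1) to (4.3) can be followed step by step in the sketch below. The data layout is an assumption made for illustration: each positive alignment is represented as a dictionary from a query position to "identical" or "similar", raw residue scores are summed over all positive alignments, and a query without any positive normalized score is given a norm of 0. The function names are hypothetical.

```python
def raw_residue_scores(alignments, query_length):
    """Sum residue scores over positive alignments: 2 if identical, 1 if similar."""
    points = {"identical": 2, "similar": 1}
    raw = [0] * query_length
    for alignment in alignments:
        for position, match in alignment.items():
            raw[position] += points[match]
    return raw

def norm_of_average_score(raw):
    """Apply Eqs. (4.1)-(4.3) to the raw residue scores of one query."""
    length = len(raw)
    mean_total = sum(raw) / length
    normalized = [score - mean_total for score in raw]   # Eq. (4.1)
    positive = [score for score in normalized if score > 0]
    if not positive:
        return 0.0
    average = sum(positive) / len(positive)              # Eq. (4.2)
    return average * 100 / (length * 2)                  # Eq. (4.3)

# Hypothetical query of four residues with three positive alignments.
alignments = [
    {0: "identical", 2: "similar", 3: "identical"},
    {0: "identical", 2: "similar", 3: "identical"},
    {3: "identical"},
]
raw = raw_residue_scores(alignments, query_length=4)
print(raw, norm_of_average_score(raw))
```

Because raw scores accumulate over many alignments, the norm is not bounded by 100, which is consistent with thresholds in the hundreds.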
Figure 4-7: A residue-based phylogenetic profile. It consists of 20 amino acid
compositions and 19 chemical features such as composition (C), transition (T), and
distribution (D)
4.2 The identification of RNA binding proteins
For the first training and testing proteins, we selected RRM-containing proteins
because these proteins are abundant across species. To find a threshold for the
identification of RRM-containing proteins, we collected 15 positive sequences with
RRMs and 24 negative sequences from PDB (Protein Data Bank) and the Swiss
database as a training set. Then, we generated RRM domain profiles from real
sequences in the NCBI database. These domain profiles from real sequences help
GDDA-BLAST amplify weak positive alignments. After calculating the norm of the
average score of each query, we drew the distribution of the norms of the average scores
for all sequences.
Figure 4-8: The sequence of a hypothetical protein used to describe the derivation of the
feature vector of a protein. The sequence index indicates the position of an amino acid in
the sequence. The index for each type of amino acid in the sequence (A or B) indicates
the position of the first, second, third, … of that type of amino acid (the first, second,
third, … A is at position 1, 3, 4, …). A/B transition indicates the position of AB or BA
pairs in the sequence [64].

As shown in Figure 4-9(a), the sequences in the training dataset are completely
separated into positive and negative groups. Measuring the boundary of each group, the
minimum of the positive group is 263.4171 and the maximum of the negative group is
127.2489. We finally selected the minimum of the positive group as the
threshold for RRM-containing proteins.
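The threshold rule described above is simple enough to state as code. In this minimal sketch the two boundary values are those reported in the text, while the remaining norms and the function names are hypothetical and serve only to make the example runnable.

```python
def select_threshold(positive_norms, negative_norms):
    """Use the minimum norm of the positive group as the threshold,
    provided that the two groups are completely separated."""
    threshold = min(positive_norms)
    if max(negative_norms) >= threshold:
        raise ValueError("positive and negative groups overlap")
    return threshold

def is_positive(norm, threshold):
    """Classify a query as positive if its norm reaches the threshold."""
    return norm >= threshold

# Boundary values from the text; the other norms are hypothetical.
positive_norms = [263.4171, 310.0, 420.5]
negative_norms = [127.2489, 95.0, 40.2]

threshold = select_threshold(positive_norms, negative_norms)
print(threshold, is_positive(300.0, threshold), is_positive(150.0, threshold))
```

Raising an error when the groups overlap is a design choice of the sketch; it makes explicit that the rule relies on the complete separation seen in the training set.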
Figure 4-9: Thresholds for the positive sequences in training sets (a) The thresholds of
two groups in the training set containing 15 positive and 24 negative sequences. (b) The
thresholds of two groups in the expanded training set containing 55 positive and 151
negative sequences
Next, we used 55 positive sequences and 151 negative sequences from a yeast
database and PDB to extend the training dataset. In particular, to measure an accurate
threshold, we added 127 sequences from PDB that are proven not to bind nucleic acids
[38]. Since the sequences are completely separated and the minimum and maximum do
not change in this analysis (Figure 4-9(b)), we kept the value from the first training
dataset as the threshold. Based on this threshold, we tried to identify RRM-containing
proteins in a testing dataset, which contains 20 positive and 137 negative sequences, and
we calculated the accuracy of the identification using Eq. (4.4). As shown in Figure 4-10,
we identified the 20 positive sequences in the testing set, and the accuracy is 100%.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4.4}$$

where TP is the number of true positives, TN is the number of true negatives, FP is the
number of false positives, and FN is the number of false negatives.

Figure 4-10: The identification of RRM containing proteins in a testing dataset containing
20 positive and 137 negative sequences.

Then, we compared our performance with two other popular algorithms,
Interproscan and SVM. We observe that, while SVM does not perform well in either the
training or the testing dataset, phylogenetic profiles and Interproscan provide robust
measures (Table 4-2).

To extend these findings, we performed similar analyses for single-fold
RBP classes. These classes include KH1, double-stranded RNA binding, and zinc fingers.
The results of these experiments are provided in Table 4-3. We observe that phylogenetic
profiles have 100% accuracy for all single-fold RBPs tested. In comparison, SVM
performs poorly in all of these datasets, while Interproscan performs well.
Table 4-2: Comparison with the sensitivities of other methods for the identification of
RRM containing proteins in training and testing sets.

Set            Method                    TP   FN   Sensitivity (%)
Training set   Phylogenetic classifier   55    0   100
Training set   Interproscan              54    1   98.18
Training set   SVM                       35   20   63.63
Testing set    Phylogenetic classifier   20    0   100
Testing set    Interproscan              20    0   100
Testing set    SVM                       12    8   60
Table 4-3: Comparison with the sensitivities of other methods for the identification of the
four single-fold RNA binding protein groups in training sets.

Group                         Method                    TP   FN   Sensitivity (%)
Double-stranded RNA binding   Phylogenetic classifier   17    0   100
Double-stranded RNA binding   Interproscan              16    1   94.11
Double-stranded RNA binding   SVM                       11    6   64.71
KH I                          Phylogenetic classifier   11    0   100
KH I                          Interproscan              11    0   100
KH I                          SVM                        5    6   45.46
zf-ccch                       Phylogenetic classifier   16    0   100
zf-ccch                       Interproscan              16    0   100
zf-ccch                       SVM                        9    7   56.25
zf-cchc                       Phylogenetic classifier   17    0   100
zf-cchc                       Interproscan              17    0   100
zf-cchc                       SVM                        9    8   52.94

This methodology is also scalable. Thus, we screened the yeast and human
proteomes for RRM, double-stranded RNA binding, KH1, and zinc-finger domains. As
shown in Table 4-4, our method detects both known and unknown members of these
classes of RBPs in both proteomes. In the case of the yeast proteome, we determined that
our method detects all of the previously identified RBPs. These results suggest that the
number of RBPs is underestimated by 31% and 34% for the yeast and human proteomes,
respectively.

Table 4-4: The results of identification for five RNA binding protein groups in yeast and
human proteomes. 29 and 168 novel potential proteins for these groups are identified in
yeast and human proteomes, respectively.

Group                                Yeast   Yeast (novel)   Human   Human (novel)
RRM                                  54      23              372     67
Double-stranded RNA binding domain   2       2               39      12
KH1                                  6       2               56      39
zf-ccch                              8       0               41      10
zf-cchc                              11      2               11      40
Total                                81      29              519     168
In 2008, Shazman et al. demonstrated that SVM methods could be improved by
incorporating electrostatic surface patch information into their analyses [38]. This
provided an excellent benchmark dataset for our study, as well as another algorithm with
which to compare our performance. Our preliminary experiment, which used this dataset
as a testing set, was performed with class-specific domain profiles. However, these
class-specific profiles were insensitive on this dataset.
Therefore, we wondered whether merely increasing the number of domain
profiles would improve our results. To this end, we performed a keyword search for
“RNA-binding” in the PROSITE database. The results of this search
were then manually confirmed to ensure the specificity of these sequences. Importantly,
the structure of these sequences was not taken into account. Subsequently, additional
sequences were identified and domain profiles were generated from the non-redundant
NCBI database using PSI-BLAST.
Using this expanded PSSM library (2695 profiles), we then analyzed a training set
containing 100 single-stranded RBPs and 127 negative sequences (Figure 4-11). Under
these conditions, we see a clear separation of positive and negative sequences. In our
testing dataset, which comprises 37 positive and 118 negative sequences from the
Shazman et al. study, we achieve 100% specificity, 96.8% accuracy, and 86.5%
sensitivity. In comparison, they reported 90% specificity, 88% accuracy, and 80%
sensitivity (Table 4-5). Thus, with the expansion of the PSSM library, our results rival
those previously obtained.

[Figure 4-11 annotations: 2695 single-stranded RNA binding profiles; training set of 100
positive and 127 negative sequences; high threshold 228.1489; low threshold 196.8542.]
Figure 4-11: The threshold for the classification of single-stranded RNA binding
proteins. 100 positive and 127 negative sequences in a training set are classified using
2695 single-stranded RNA binding profiles. The threshold is 228.1489.

Table 4-5: The comparison with other methods. The table summarizes the results of
identification for single-stranded RNA binding proteins using the GDDA-classifier,
Interproscan, and support vector machine (SVM). The number of single-stranded RNA
binding proteins is 37, and the number of non-nucleotide binding proteins is 118.

Method                                                 Sensitivity (%)   Specificity (%)   Accuracy (%)
Phylogenetic classifier                                86.5              100               96.8
SVM with features from electrostatic surface patches   80                90                88
Interproscan                                           78.4              100               94.8
SVM with features from amino acid sequences            78.4              96.6              92.3
Using the same paradigm, we generated additional profiles for double-stranded
RNA binding (dsRNA), single-stranded DNA binding (ssDNA), and double-stranded
DNA binding (dsDNA) proteins (Table 4-6). We then compared our results with those of
the Shazman study for the classification of ssRNA and dsDNA binding domains [38]. They
obtained 50% specificity, 51% accuracy, and 53% sensitivity for this dataset. Our results
using either our ssRNA or our dsDNA PSSM library are much improved: 97%/83%
specificity, 91%/80% accuracy, and 86%/76% sensitivity, respectively.
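The percentages reported here follow directly from confusion-matrix counts through Eqs. (3.3), (3.4), and (4.4). As a hedged worked example, the sketch below recomputes the ssRNA vs. dsDNA result obtained with the ssRNA profiles from the counts in the accuracy annotation of Figure 4-12, (32+33)*100/71. Reading 32 as the true positives among the 37 ssRNA binding sequences and 33 as the true negatives among the 34 dsDNA binding sequences is an assumption, chosen because it reproduces the specificity listed in Table 4-6. The function name is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity and accuracy as percentages."""
    sensitivity = 100 * tp / (tp + fn)                 # Eq. (3.3)
    specificity = 100 * tn / (tn + fp)                 # Eq. (3.4)
    accuracy = 100 * (tp + tn) / (tp + tn + fp + fn)   # Eq. (4.4)
    return sensitivity, specificity, accuracy

# 37 ssRNA binding (positive) and 34 dsDNA binding (negative) test sequences.
# Assumed split of the 65 correct calls: 32 true positives, 33 true negatives.
sens, spec, acc = classification_metrics(tp=32, fn=5, tn=33, fp=1)
print(round(sens, 2), round(spec, 2), round(acc, 2))
```

The same function applies to any row of the comparison tables once the underlying counts are known.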
Table 4-6: The six classification tests among nucleic acid binding proteins: double-
stranded RNA binding vs. double-stranded DNA binding proteins, single-stranded RNA
binding vs. double-stranded DNA binding proteins, and single-stranded RNA binding vs.
single-stranded DNA binding proteins, each tested with two sets of profiles.

Comparison        Profiles        Sensitivity (%)   Specificity (%)   Accuracy (%)
dsRNA vs. dsDNA   dsRNA binding   100               100               100
dsRNA vs. dsDNA   dsDNA binding   76.47             85                79.6
ssRNA vs. dsDNA   ssRNA binding   86.47             97.06             91.55
ssRNA vs. dsDNA   dsDNA binding   76.47             83.78             80.28
ssRNA vs. ssDNA   ssRNA binding   86.47             94.12             88.89
ssRNA vs. ssDNA   ssDNA binding   94.12             94.59             94.44
We also compared our results on additional testing sets curated from PROSITE
for the classification of dsRNA vs. dsDNA and ssRNA vs. ssDNA binding proteins.
Although they attempted these comparisons, Shazman et al. concluded that further
refinement of their method was needed to accomplish them [38]. In contrast, we obtain
robust measurements for these comparisons (Figure 4-13), in particular for dsRNA
binding, where we achieve 100% accuracy.
[Figure 4-12 plots: the norm of average score (Y-axis) against query length (X-axis) for
dsDNA binding and ssRNA binding sequences. Testing set: 34 double-stranded DNA
binding and 37 single-stranded RNA binding sequences.
(a) 2695 single-stranded RNA binding profiles; threshold 228.1489;
accuracy = (32+33)*100/71 = 91.55%.
(b) 2275 double-stranded DNA binding profiles; threshold 207.7592;
accuracy = (26+31)*100/71 = 80.28%.]
Figure 4-12: The classification between double-stranded DNA and single-stranded RNA
binding proteins. (a) The accuracy with the single-stranded RNA binding profiles is
91.55%, and (b) the accuracy with the double-stranded DNA binding profiles is 80.28%.
[Figure 4-13 plots: the norm of average score (Y-axis) against query length (X-axis).
Panels (a) and (b) use a testing set of 34 double-stranded DNA binding and 20
double-stranded RNA binding sequences; panels (c) and (d) use a testing set of 17
single-stranded DNA binding and 37 single-stranded RNA binding sequences.
(a) 101 double-stranded RNA binding profiles; threshold 366.2998;
accuracy = (34+20)*100/54 = 100%.
(b) 2275 double-stranded DNA binding profiles; threshold 207.7592;
accuracy = (26+17)*100/54 = 79.6%.
(c) 2695 single-stranded RNA binding profiles; threshold 228.1489; accuracy 88.89%.
(d) 1753 single-stranded DNA binding profiles; threshold 1756.5251;
accuracy = (35+16)*100/54 = 94.44%.]
Figure 4-13: The classification among other DNA and RNA binding proteins. (a) The
accuracy with the double-stranded RNA binding profiles is 100%. (b) The accuracy with
the double-stranded DNA binding profiles is 79.6%. (c) The accuracy with the
single-stranded RNA binding profiles is 88.89%. (d) The accuracy with the
single-stranded DNA binding profiles is 94.44%.
4.3 The investigation of functional relations among RRM containing proteins
After filtering false positive sequences by the thresholds, we can investigate the
functional relations among the true positive sequences using hierarchical clustering and
the Pearson correlation. Since residue-based features are more informative than
domain-based features, we generated a phylogenetic profile from features derived from
amino acid residues. These features represent the structural information of a protein
encoded in its amino acid sequence. They consist of the compositions of the 20 amino
acids, together with 3 compositions, 1 transition, and 15 distributions of special chemical
groups such as the hydrophobic, positively charged, and negatively charged groups. Using
this phylogenetic profile, we first clustered 14 control sequences of RRM-containing
proteins. As shown in Figure 4-14, these sequences are clustered separately into RNA
binding and non-RNA binding groups.
Next, we added sequences from the U2AF-homology motif (UHM) group to the
control sequences. The UHM is a non-canonical type of RRM that is involved in
constitutive or alternative pre-mRNA splicing [68]. UHMs may also bind ULMs in
splicing factors [68]. The 6 proteins with RRMs in Figure 4-15(a) were taken from the
literature [69]. Since these proteins bind other proteins instead of RNAs, we expected
them to cluster into the non-RNA binding group in the dendrogram of control sequences.
As shown in Figure 4-15(b), the UHM sequences cluster together within the
non-RNA binding group. This result suggests that, if we add unknown RRM-positive
sequences to these control sequences, we can predict their functions on the basis of this
dendrogram using residue-based phylogenetic profiles.
Figure 4-14: The dendrogram of control sequences. According to the literature [38],
Nucleolin, SXL, PABP, HUD, hnRNPA1, and PTB bind RNAs, whereas 2J8A, 1JMT,
1RK8, 2I2Y, 1OPI, and 2PE8 bind other proteins instead of RNAs. These proteins are
divided into RNA binding and non-RNA binding groups.
Figure 4-15: The proteins with U2AF-homology motif (UHM). (a) The domain
architectures of UHM proteins from the literature [68]. (b) The functional dendrogram of
UHM proteins. They tend to cluster together.
4.4 Summary
For the quantitative functional evaluation, GDDA-BLAST was applied to
investigate functional information in RNA-binding proteins. Before this information can
be investigated, however, false positive sequences must be filtered out of the
phylogenetic profiles generated by GDDA-BLAST. To eliminate these sequences, we
designed two strategies. One is a threshold for filtering, and the other is a new
phylogenetic profile containing accurate information about proteins. To find the
threshold for filtering false positives, we proposed a computational classification assay
for the identification of RNA-binding proteins. This assay consists of eight procedures,
and we determined a threshold for each of five RNA-binding protein families: (1) RRM,
(2) double-stranded RNA binding domain, (3) K-homology domain, (4) zf-CCCH, and (5)
zf-CCHC domain.
Specifically, to find the threshold for filtering false positive sequences, we first
calculate the normalized scores of all residues in each sequence. After calculating the
norms from the average scores for all sequences, we select the threshold from the norms.
For example, using 55 positive and 151 negative sequences with RRMs as a training
dataset, I plotted the distribution of the norms. As shown in Figure 4-9, the sequences
separate into positive and negative groups. The minimum norm in the positive group is
263.4171 and the maximum norm in the negative group is 127.2489. Of these two values,
the minimum in the positive group was chosen as the threshold. Using this threshold, I
identified 20 positive sequences among the 157 sequences of a test dataset with 100%
accuracy.
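
The threshold rule in this example can be written as a short sketch. It assumes that
the positive and negative training norms do not overlap and that a sequence whose norm
equals the threshold counts as positive (the text does not say whether the cutoff is
inclusive). Only the two extreme values, 263.4171 and 127.2489, come from the text;
the other norms and the function names are hypothetical.

```python
def pick_threshold(positive_norms, negative_norms):
    """Return the smallest positive-group norm as the cutoff.

    Raises ValueError if the two groups overlap, because the rule described in
    the text relies on a clean gap between them.
    """
    lowest_positive = min(positive_norms)
    highest_negative = max(negative_norms)
    if lowest_positive <= highest_negative:
        raise ValueError("positive and negative norms overlap; no clean cutoff")
    return lowest_positive


def classify(norms, threshold):
    """Label a sequence positive when its norm reaches the threshold (assumed inclusive)."""
    return [norm >= threshold for norm in norms]


# 263.4171 and 127.2489 are the reported extremes; the rest are made up.
training_positive = [263.4171, 300.0, 410.5]
training_negative = [127.2489, 90.0, 15.2]
threshold = pick_threshold(training_positive, training_negative)
labels = classify([270.0, 130.0, 263.4171], threshold)
```

Choosing the minimum of the positive group, rather than a value inside the gap, keeps
every training positive above the cutoff while staying as strict as possible toward
negatives.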
Following the same procedures, we found the thresholds for the other proteins
containing dsRNA, KH, zf-CCCH, and zf-CCHC domains. Using these thresholds, we
succeeded in classifying the positive sequences in each testing dataset with 100%
accuracy. These results show that false positive sequences can be filtered accurately
using these thresholds. Finally, I identified 82 known and 26 unknown RNA-binding
proteins in the yeast proteome using the classification library with the same thresholds.
In addition, we identified RBPs containing structurally unique motifs using 2695
expanded PSSM profiles in a testing dataset with 37 positive and 118 negative sequences.
We achieved 100% specificity, 96.8% accuracy, and 86.5% sensitivity. For the specific
folds (dsRNA vs. dsDNA, dsRNA vs. dsDNA and ssRNA vs. ssDNA), we also
achieved higher accuracies than an SVM using structural features.
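
As a consistency check on the three reported figures, the sketch below computes
specificity, accuracy, and sensitivity from a confusion matrix. The individual counts
(32 true positives, 5 false negatives, 118 true negatives, 0 false positives) are not
stated in the text; they are an inferred assumption, chosen because they are compatible
with 37 positives, 118 negatives, and the reported percentages.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as percentages."""
    sensitivity = 100.0 * tp / (tp + fn)   # fraction of true binders recovered
    specificity = 100.0 * tn / (tn + fp)   # fraction of non-binders rejected
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Inferred counts (assumption): 37 positives = 32 + 5, 118 negatives = 118 + 0.
sensitivity, specificity, accuracy = confusion_metrics(tp=32, fn=5, tn=118, fp=0)
rounded = (round(sensitivity, 1), round(specificity, 1), round(accuracy, 1))
```

Under these assumed counts the rounded values are 86.5, 100.0, and 96.8, matching the
percentages above.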
To implement a new phylogenetic profile, we used a variety of sequence-based
information from proteins. The new phylogenetic profile contains the compositions of the
20 amino acids and 3 descriptors of specific chemical groups, such as the hydrophobic,
negatively charged, and positively charged groups. The composition of an amino acid is
the number of occurrences of that amino acid divided by the total number of amino acids.
The 3 descriptors represent the global composition of specific chemical groups and
consist of composition (C), transition (T), and distribution (D). Composition (C) is the
number of amino acids in a particular group divided by the total number of amino acids in
all chemical groups. Transition (T) is the frequency of transitions from one chemical
group to another along a sequence. Distribution (D) is the chain length within which the
first, 25%, 50%, 75%, and 100% of the amino acids of a specific chemical group are
located.
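
A minimal sketch of these descriptors is given below. Several details are assumptions
made only for illustration: the membership of each chemical group (GROUPS), the use of
sequence length as the denominator for composition (C), counting a transition only
between adjacent residues that both belong to a group, and expressing distribution (D)
as a fraction of sequence length. None of these choices is confirmed by the text.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical group assignment, for illustration only.
GROUPS = {
    "hydrophobic": set("AVLIMFWC"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def label(residue):
    """Return the chemical group of a residue, or None if it is in no group."""
    for name, members in GROUPS.items():
        if residue in members:
            return name
    return None


def composition(seq, group):
    """C: residues in the group divided by sequence length (assumed denominator)."""
    return sum(label(r) == group for r in seq) / len(seq)


def transition(seq):
    """T: fraction of adjacent residue pairs that switch between two groups."""
    labels = [label(r) for r in seq]
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a and b and a != b)
    return changes / (len(seq) - 1)


def distribution(seq, group):
    """D: relative positions of the first, 25%, 50%, 75%, and 100% group residues."""
    positions = [i for i, r in enumerate(seq, start=1) if label(r) == group]
    if not positions:
        return [0.0] * 5
    picks = [positions[0]]
    for fraction in (0.25, 0.50, 0.75, 1.00):
        k = max(1, math.ceil(fraction * len(positions)))
        picks.append(positions[k - 1])
    return [p / len(seq) for p in picks]


def residue_profile(seq):
    """20 amino acid compositions + 3 C + 1 T + 15 D values = 39 features."""
    features = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    features += [composition(seq, g) for g in GROUPS]
    features.append(transition(seq))
    for g in GROUPS:
        features += distribution(seq, g)
    return features


example = "AKDAKDAKDA"  # made-up sequence
profile = residue_profile(example)
```

The resulting vector has 39 entries, which agrees with the feature counts given in
Section 4.3 (20 amino acid compositions, 3 compositions, 1 transition, and 15
distributions).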
Using this new residue-based phylogenetic profile, we first clustered the control
sequences containing RRM using hierarchical clustering and Pearson correlation.
Figure 4- shows that the sequences are clustered into RNA-binding and non-RNA-binding
groups. To test this phylogenetic profile, we added UHM sequences, which bind proteins,
to the control sequences. When all of these sequences were clustered together, they again
separated into RNA-binding and non-RNA-binding groups, and the UHM sequences were
located in the non-RNA-binding group. Based on these results, we conclude that the new
phylogenetic profile would be helpful for inferring the functional relationships among
RNA-binding proteins.
To apply this assay to annotate the functional characteristics of a protein, we should
consider three issues in its development. First, we need to define a format for the
annotation because the meaning of a function changes with the viewpoint of the
annotation. Second, we need to develop methods for extracting accurate features from a
sequence for the phylogenetic profiles. Finally, we need a statistical standard for
deciding the functional relationships between proteins.
Chapter 5
Summary and Discussion
5.1 Summary
This thesis described the development of a unified computational method for
simultaneously measuring the structural, functional, and evolutionary characteristics of a
protein from its amino acid sequence. As computational and biological techniques have
advanced, a huge number of putative proteins have been predicted from genomes.
Despite much research aimed at annotating these proteins accurately, several obstacles
remain in annotating their structural, functional, and evolutionary properties. First, even
though experimental methods identify many uncharacterized proteins in proteomes,
annotating these proteins takes longer than identifying them, and an existing erroneous
annotation can in some cases lead to a false annotation of a new protein. Second,
annotation requires an accurate, subject- and context-specific definition of protein
function because many proteins have multiple functions. Because of these problems,
accurate structural and functional annotation of a protein is a challenging task in all
biological fields.
In spite of these obstacles, the structural, functional, and evolutionary
characteristics of a protein can be determined from its amino acid sequence because the
protein is composed of that sequence. Many computational methods, such as homology
detection, machine learning, and phylogenetic methods, have investigated these
characteristics using only the amino acid sequence. These methods are powerful for the
annotation of some proteins, but they are not sufficient to annotate all proteins
accurately. For example, homology-based methods usually predict the functions of
proteins with high sequence similarity accurately. However, if the pair-wise sequence
similarity between sequences is lower than 25%, they are not sensitive enough to identify
these distantly homologous sequences. In addition, even when the similarity of some
proteins, such as some enzymes, is very high, the methods cannot detect their
homologous relations because some residues in the proteins are not conserved among the
sequences. Finally, if the existing annotations in databases contain errors, homology-
based methods amplify these erroneous annotations and propagate them through the
databases.
Since machine learning methods predict the functional properties of proteins on
the basis of sequence-derived features, they are independent of sequence similarity.
However, their results can be biased by the number of datasets and the choice of
sequence-derived features because their accuracy depends on the training sets and on the
methods used to extract features from sequences. Phylogenetic methods infer functional
relationships among proteins on the basis of the presence or absence of each protein
across genomes. While phylogenetic profiles from prokaryotic genomes describe the
functional relationships of proteins clearly, phylogenetic profiles from eukaryotic
genomes are less informative for predicting functional relationships, despite some
successful studies on specific functional predictions. In addition, the accuracy of the
analysis is limited by the number of available genomes and genome sequences.
In an attempt to overcome these drawbacks, we have developed a unified
computational pipeline, called GDDA-BLAST, for measuring the structural, functional,
and evolutionary characteristics of a protein using phylogenetic profiles. Our central
hypothesis is that structural, functional, and evolutionary information can be inferred
from the sequence by a unified method, even if the pair-wise identity of the sequences is
below 25%. “Seeding” and the “phylogenetic profile” are the important innovative
processes among the five procedures of GDDA-BLAST.
“Seeding” is a resampling strategy designed to amplify and encode the
alignments possible for any given query sequence. A seed can be generated from a
profile by taking any fraction of the profile from its N-terminus or C-terminus. The seed
is then inserted into the query at every position, one position at a time, creating a
consistent initiation site. This site allows rps-BLAST to extend an alignment even
between highly divergent sequences.
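
The seeding step can be illustrated with a short sketch. It makes simplifying
assumptions that are not taken from the GDDA-BLAST implementation: the profile is
represented by a plain consensus string rather than a PSSM, the seed length is a
rounded fraction of that string, and insertion positions include both ends of the
query. The function names and the example sequences are hypothetical.

```python
def make_seed(consensus, fraction, from_n_terminus=True):
    """Take a fraction of the profile consensus from the N- or C-terminus."""
    length = max(1, round(fraction * len(consensus)))
    return consensus[:length] if from_n_terminus else consensus[-length:]


def seeded_queries(query, seed):
    """Yield (position, modified query) with the seed inserted at each position."""
    for position in range(len(query) + 1):
        yield position, query[:position] + seed + query[position:]


# Made-up consensus and query, for illustration only.
n_seed = make_seed("MKTAYIAKQR", 0.3)
c_seed = make_seed("MKTAYIAKQR", 0.3, from_n_terminus=False)
variants = list(seeded_queries("ACD", "XX"))
```

Each modified query would then be aligned against the profile separately, so that the
inserted seed provides the same starting point for the alignment regardless of how
divergent the rest of the query is.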
While a phylogenetic profile generally encodes the presence or absence of a
protein in known genomes, the phylogenetic profile from GDDA-BLAST is a vector
in which each entry quantifies the existence of alignments with a domain profile.
Together, these profiles form an M (number of profiles) by N (number of queries) matrix.
From this matrix, we create either a dendrogram of functional relationships among
proteins by calculating the Pearson correlation or a phylogenetic tree by measuring the
Euclidean distance.
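
The sketch below builds a small M by N matrix and compares two query columns with both
measures. The scores, identifiers, and function names are hypothetical; in particular,
the entries here are arbitrary numbers standing in for whatever alignment-derived
quantity the pipeline stores.

```python
import math


def profile_matrix(hit_scores, domain_ids, query_ids):
    """Rows are domain profiles (M); columns are query sequences (N)."""
    return [[hit_scores.get((d, q), 0.0) for q in query_ids] for d in domain_ids]


def column(matrix, j):
    """Extract the phylogenetic profile (column vector) of query j."""
    return [row[j] for row in matrix]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Made-up scores: query "qB" has exactly twice the signal of query "qA".
domains = ["d1", "d2", "d3"]
queries = ["qA", "qB"]
scores = {("d1", "qA"): 1.0, ("d2", "qA"): 2.0, ("d3", "qA"): 3.0,
          ("d1", "qB"): 2.0, ("d2", "qB"): 4.0, ("d3", "qB"): 6.0}
matrix = profile_matrix(scores, domains, queries)
r = pearson(column(matrix, 0), column(matrix, 1))
d = euclidean(column(matrix, 0), column(matrix, 1))
```

In this toy case the correlation is 1 even though the Euclidean distance is not 0:
Pearson correlation compares the shape of two profiles and ignores overall scale,
whereas Euclidean distance also reflects differences in magnitude.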
To evaluate the performance of this computational pipeline, we measured the
number of true homologous pairs it detects. For this evaluation, we selected PDB40D-J,
which contains 935 sequences whose pair-wise identities are less than 40% within
superfamilies of the structural classification of proteins (SCOP) database. We then
extracted 289 sequences with pair-wise identities below 25% from PDB40D-J to evaluate
the performance in the twilight zone. A total of 26374 domain profiles from CDD and
PDB were used as the profiles for GDDA-BLAST.
First, we calculated the similarity scores between each test and reference sequence
for the potential homologues predicted by GDDA-BLAST, PSI-BLAST, and SAM. For
PSI-BLAST and SAM, we used the E-value, which is the number of hits expected by
chance when searching a database of a particular size. For GDDA-BLAST, we used the
Hybrid LogWeighted scoring scheme, which consists of two steps. First, we calculate
the scores of three phylogenetic profiles: the number of hits, the percentage of maximum
coverage, and the percentage of average identity. Then, these scores are adjusted on the
basis of the frequency with which the domains align with the queries. Next, the test
sequences are ranked in ascending order of similarity score. By counting the numbers of
true and false positives and negatives within a sliding window, we plot a receiver
operating characteristic (ROC) curve for each method. As shown in Figure 3-4, the
performance of GDDA-BLAST is comparable to that of SAM and PSI-BLAST on both
the superfamily and the twilight zone datasets.
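
The ROC construction can be sketched as below. Note that this is the standard
cumulative threshold sweep over a ranked list, swapped in for the sliding-window
counting described above, whose window size is not given. It assumes that a lower
score means a stronger hit (as with E-values) and does not treat tied scores
specially. The example scores and labels are made up.

```python
def roc_points(scored_pairs):
    """Return ROC points (false positive rate, true positive rate).

    scored_pairs: iterable of (score, is_true_homolog); lower score = stronger hit.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[0])
    n_pos = sum(1 for _, is_true in ranked if is_true)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up E-value-like scores with known homology labels.
toy_hits = [(1e-30, True), (1e-10, True), (0.01, False), (0.5, True), (3.0, False)]
points = roc_points(toy_hits)
auc = area_under_curve(points)
```

The area under the curve gives a single number for comparing methods on the same
dataset; a value near 1 means true homologs are consistently ranked ahead of
non-homologs.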
For the quantitative functional evaluation, GDDA-BLAST was applied to
investigate functional information in RNA-binding proteins. When GDDA-BLAST is
used to identify RNA-binding proteins quantitatively, false positive sequences must be
filtered out of the phylogenetic profiles it generates. To achieve this, we devised two
strategies: a quantitative threshold and a residue-based phylogenetic profile. First, we
implemented the classification library to find the quantitative thresholds for RNA-binding
proteins. Using this library and its thresholds, we identified RNA-binding proteins
containing RRM, dsRNA, KH, zf-CCCH, and zf-CCHC domains in their testing datasets
with 100% accuracy. We then identified 82 known and 26 unknown RNA-binding
proteins in the yeast proteome with the same thresholds and classification library.
After filtering out the false positive sequences, we built new phylogenetic profiles
from the true positive sequences to investigate the functional relationships among them.
This new phylogenetic profile consists of the compositions of the 20 amino acids and 3
descriptors of chemical features of amino acid residues. The composition of an amino
acid is the number of occurrences of that amino acid divided by the total number of
amino acids. The 3 descriptors represent the global composition of specific chemical
groups and consist of composition (C), transition (T), and distribution (D). Using these
new phylogenetic profiles, we clustered RRM-containing sequences by hierarchical
clustering and Pearson correlation. As shown in Figure 4-12, the sequences are divided
accurately into RNA-binding and non-RNA-binding classes. This functional dendrogram
would be a good reference for predicting the functions of unknown proteins.
5.2 Discussion
Using a resampling technique and phylogenetic profiles, we have successfully
developed a unified framework that can quantitatively measure functional, structural,
and evolutionary relations among proteins. The experiments in our research suggest that
this computational assay has the potential to resolve challenging problems in homology
detection and functional prediction. However, the assay still has some drawbacks that
need to be addressed.
For homology detection, we should first develop a method for building domain
profiles that generate the best phylogenetic profiles for functionally or structurally
characterized proteins. We generally use domain profiles from CDD and PDB for the
analysis of proteins. Although they are useful for the functional prediction of some
proteins, these profiles do not satisfy all requirements for the analysis of protein
characteristics because some domains in the profiles introduce noise into the
phylogenetic profiles.
Second, we should implement the best scoring scheme for comparing sequences
because the performance of homology detection depends on the score of each sequence.
Many homology-based methods use the e-value and hit score to detect homology, but
these methods sometimes fail to detect remote homologous sequences with low sequence
identities because the e-value depends on the size of the database and the hit score is
determined by the number of identical residues. To overcome these problems, GDDA-
BLAST uses the Pearson correlation value to measure the similarity among the
phylogenetic profiles of sequences. Although it is independent of sequence identity and
database size, the Pearson correlation by itself is not sufficient to measure the similarity
between sequences because it is too sensitive to noise in the phylogenetic profiles.
Therefore, we need to implement a scoring system that is robust to noise when measuring
the similarity of phylogenetic profiles. Finally, we need to design residue-based
phylogenetic profiles that contain accurate information from sequences. If we can
discover the key residues that determine the functional and structural characteristics of a
protein from its sequence, we could extract unique features from the sequences alone to
generate accurate phylogenetic profiles.
In particular, to apply this assay to investigate the functional characteristics of a
protein, we should resolve three issues in its development. First, we need to define a
format that reliably accommodates multiple contexts because some proteins have multiple
functions in different organisms. Second, we should develop methods for extracting
accurate features from the amino acid sequence of a protein to generate phylogenetic
profiles suited to the purpose of the analysis, such as function, structure, or evolution.
Finally, we have to develop a statistical measure to support the biological meaning of our
results. Even though GDDA-BLAST still has many obstacles to overcome, we expect that
this pipeline will be one of the innovative tools for uncovering undiscovered information
in a protein sequence.
Chapter 6
Future Perspectives
Currently, much research is devoted to developing methods for annotating the
functional or structural characteristics of proteins. Even though advanced computational
technology allows researchers to analyze a huge number of proteins automatically and at
high speed, many problems remain for the accurate annotation of proteins. One of the
main problems is that the definition of biological function is ambiguous and varies with
the context in which the function is used [70]. For example, from a biochemical
perspective, the function of a protein kinase is the phosphorylation of a hydroxyl group
of a specific substrate. However, when protein kinases act in different organisms, the
function of each kinase differs from organism to organism [2]. In addition, from a
physiological perspective, the functions of kinases also depend on signaling pathways
because the kinases may be part of those pathways. Therefore, we should define the
aspects of function before annotating the functions of proteins.
We therefore need to define a functional annotation format that accommodates a
variety of biological aspects. This format should also be suitable for both automated
computational annotation and human-readable annotation. Among the many types of
annotation, the GO (Gene Ontology) annotation serves as one of the most dominant
machine-legible annotations [2]. GO annotation contains terms covering three aspects:
molecular function, biological process, and cellular location [71]. The terms are
connected in a DAG (Directed Acyclic Graph). Nodes represent annotation terms, and
they are arranged from general meanings to specific meanings in the graph. Because each
node may have more than one parent, this graph can describe functions that are involved
in more than a single biological process, cellular compartment, or molecular function
[71].
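
The effect of multiple parents can be seen in a small sketch. The graph below is a toy
example whose terms and links are invented for illustration; it is not copied from the
actual Gene Ontology, and the function name ancestors is hypothetical.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


# Toy DAG (not the real GO structure): each term maps to its list of parents.
toy_parents = {
    "root": [],
    "binding": ["root"],
    "catalytic activity": ["root"],
    "nucleic acid binding": ["binding"],
    "RNA binding": ["nucleic acid binding"],
    "RNA helicase (toy)": ["RNA binding", "catalytic activity"],
}
helicase_ancestors = ancestors("RNA helicase (toy)", toy_parents)
```

Because the most specific toy term has two parents, annotating a protein with it
implies terms on both branches, which a strict tree could not express.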
Based on the concept of GO, we can develop an automated functional annotation
system that uses phylogenetic profiles for proteins instead of gene products. Specifically,
we would implement a new annotation system, built on my computational assay, that can
annotate proteins with multiple functions in organisms. To test this idea, we first
collected 6760 yeast proteins and 38540 human proteins from proteome databases. Using
100 RRM domain profiles, we discovered 525 candidate RRM-containing proteins that
scored above the threshold in my computational assay. After adding these proteins to the
RRM control sequences, we generated residue-based phylogenetic profiles from them.
Based on these phylogenetic profiles, we built a functional dendrogram using hierarchical
clustering and Pearson correlation values. Among the clusters in the dendrogram, we
first investigated the proteins that may contain a UHM.
In Figure 6-1, orange boxes mark the UHM control sequences, and the
correlation values between the control sequences and the new sequences tend to be high.
We then annotated the properties of the sequences in three aspects (molecular function,
biological process, and cellular location) on the basis of the annotations from the
NCBI database.
Comparing the annotations and correlations of these proteins, we selected 13
potential UHM proteins. After calculating the pair-wise sequence identities between
these candidate proteins and the control sequences using a local alignment tool, we
classified them into closely related and distantly related groups (Figure 6-2 (a)). To
verify our annotations, we checked the annotation of each protein in NCBI and found
that two sequences, NP_061862 and NP_060316, bind other proteins in addition to
RNAs. Using the same methodology for other functional annotations, we can predict
novel functional properties of many proteins.
Figure 6-1: The functional dendrogram for the prediction of UHM proteins. Orange
boxes indicate the UHM control sequences.
Figure 6-2: The prediction of UHM candidate proteins. (a) The domain architectures of
13 UHM candidate proteins, classified into closely related and distantly related classes.
(b) Support from NCBI for the functional predictions of two proteins.
The next example is the functional annotation of a protein, NP_005769. A general
database usually annotates the functional properties of a whole protein but does not
annotate the function of each domain in the protein. Using the dendrogram from GDDA-
BLAST, we can predict the likely function of each RRM in this protein on the basis of
the functional annotations of adjoining sequences. As shown in Figure 6-3, we count how
often each annotation occurs among the adjoining sequences and calculate its percentage.
Then, we can infer the functions of each RRM from these statistical distributions. For
example, as shown in Figure 6-3 (a), RRM1 might bind RNAs, be involved in splicing,
and be located in the nucleus in terms of molecular function, biological process, and
cellular location, respectively. Based on these inferences, we can predict new functional
characteristics of each RRM in the protein (Figure 6-3 (b)). To verify these predictions,
we searched the existing annotations of this protein in NCBI. According to these
annotations, NP_005769 is an RNA binding motif protein produced from the human
RBM5 gene. This protein binds DNAs, RNAs, nucleotides, and proteins with a metal ion
or zinc ion. It would be involved in RNA processing, negative regulation of the cell
cycle, and nuclear mRNA splicing via the spliceosome. In addition, it would be an
intracellular or nuclear component. When we compared these annotations with our new
annotations, all of the new annotations matched those of NCBI, and some were supported
by the literature. If this annotation method is applied to study the properties of an
unknown protein, we may be able to predict new functional characteristics of the protein.
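
The counting step in this example can be sketched as follows. The neighbor annotations
are made up, the function names are hypothetical, and the sketch assumes that every
adjoining sequence contributes exactly one annotation per aspect and that the set of
adjoining sequences has already been chosen from the dendrogram.

```python
from collections import Counter


def annotation_percentages(neighbor_annotations):
    """Percentage of adjoining sequences that carry each annotation."""
    counts = Counter(neighbor_annotations)
    total = sum(counts.values())
    return {term: 100.0 * count / total for term, count in counts.items()}


def most_supported(neighbor_annotations):
    """Return the most frequent annotation and its percentage."""
    percentages = annotation_percentages(neighbor_annotations)
    term = max(percentages, key=percentages.get)
    return term, percentages[term]


# Made-up molecular-function annotations of ten adjoining sequences.
neighbors = ["RNA binding"] * 6 + ["protein binding"] * 3 + ["DNA binding"]
percentages = annotation_percentages(neighbors)
best_term, best_share = most_supported(neighbors)
```

The same counting would be repeated for each aspect (molecular function, biological
process, and cellular location), giving one distribution per aspect for each domain.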
These results indicate that new functions of a protein may be predicted from the
annotations of adjoining sequences in the functional dendrogram generated by a
quantitative functional measurement. To infer accurate functional annotations from the
adjoining-sequence annotations, we need to develop methods for extracting reliable
features from a sequence and statistical methods for validating new annotations.
Figure 6-3: The inference of new annotations from the reference annotation of NP_005769.
(a) The statistical distribution of functional annotations from proteins closely related to
NP_005769. (b) The domain architecture of NP_005769 and the new functional
annotations of the RRM domains in the protein.
Bibliography
1. Marchler-Bauer A., Panchenko A.R., Benjamin A.S., Thiessen P.A., Geer Y.G.
and Bryant, S.H. CDD: a database of conserved domain alignments with links to
domain three-dimensional structure Nucleic Acids Research, vol.30. no.1. 281-
283 2002
2. Iddo Friedberg, Automated protein function prediction—the genomic challenge
Briefings in Bioinformatics 7(3), 225-242 2006.
3. Yona G., Levitt M., Within the twilight zone: a sensitive profile-profile
comparison tool based on information theory. J Mol Biol. 315(5): 1257-1275
2002.
4. Rychlewski L., Jaroszewski L., Li W. and Godzik A., Comparison of sequence
profiles. Strategies for structural predictions using sequence information. Protein
Sci., 9: 232–241 2000.
5. Altschul S.F., et al , Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res. 25(17): 3389-3402 1997.
6. Karplus K., et al Predicting protein structure using hidden Markov models.
Proteins: Struct. Funct. Genet. 1: 134-139 1997.
7. Jaroszewski L., Rychlewski L., Li Z., Li W., Godzik A., FFAS03: a server for
profile-profile sequence alignments. Nucleic Acids Res. 33(Web Server issue):
W284-8 2005.
8. Burkhard Rost , Sean I. O'Donoghue , and Chris Sander, Midnight zone of protein
structure evolution, CUBIC(Web Server issue) 1998.
9. Lianyi Han, Juan Cui, Honghuang Lin, Zhiliang Ji, Zhiwei Cao, Yixue Li and
Yuzong Chen, Recent progresses in the application of machine learning approach
for predicting protein functional class independent of sequence similarity,
Proteomics 6: 4023–4037 2006.
10. Pazos, F., Ranea, J. A., Juan, D., and Sternberg, M. J., Assessing protein co-
evolution in the context of the tree of life assists in the prediction of the
interactome, J. Mol. Biol. 352(4): 1002–1015 2005.
11. Zhenran Jiang, Protein Function Predictions Based on the Phylogenetic Profile
Method, Critical Reviews in Biotechnology, 28:233–238 2008.
12. Ko K.D., Hong Y.H., Chang G.S., Bhardwaj G., Rossum D., and Patterson R.L.,
Phylogenetic Profiles as a Unified Framework for Measuring Protein Structure,
Function and Evolution, Phys Arch arXiv:0806.239 2008.
13. Chang G.S, Hong Y.H, Ko K.D., Bhardwaj G., Holmes E.C., Patterson R.L. and
Rossum D., Phylogenetic profiles reveal evolutionary relationships within the
“twilight zone” of sequence similarity, Proc Natl Acad Sci USA 105(36): 13474-
13479 2008.
14. Su Yun Chung and S. Subbiah, A structural explanation for the twilight zone of
protein sequence homology, Structure, 15(4): 1123–1127 1996.
15. Russ W.P., Lowery D.M., Mishra P., Yaffe M.B., and Ranganathan R., Natural-
like function in artificial WW domains, Nature 437: 579-583 2005.
16. Park J., Karplus K., Barrett C., Hughey R., Haussler D., Hubbard T., and Chothia
C., Sequence comparisons using multiple sequences detect three times as many
remote homologues as pairwise methods, J Mol Biol, 284: 1201-1210 1998
17. Blake J.D. and Cohen F.E., Pairwise sequence alignment below the twilight zone,
J Mol Biol, 307:721–735 2001.
18. Taylor W.R., Identification of protein sequence homology by consensus template
alignment, J Mol Biol, 188:233–258 1986.
19. Yi T.M. and Lander E.S., Recognition of related proteins by iterative template
refinement (ITR). Protein Sci, 3:1315–1328 1994.
20. Gribskov M., McLachlan A.D., Eisenberg D., Profile analysis: Detection of
distantly related proteins. Proc Natl Acad Sci USA, 84:4355–4358 1987.
21. Luthy R., Xenarios I., and Bucher P., Improving the sensitivity of the sequence
profile method, Protein Sci, 3:139–146 1994.
22. Baldi P., Chauvin Y., Hunkapiller T., and McClure M.A., Hidden Markov models
of biological primary sequence information. Proc Natl Acad Sci USA, 91:1059–
1063 1994.
23. Sonnhammer E.L., Eddy S.R., Durbin R., Pfam: A comprehensive database of
protein domain families based on seed alignments, Proteins, 28:405–420 1997.
24. Karplus K., Barrett C., and Hughey R., Hidden Markov models for detecting
remote protein homologies, Bioinformatics, 14(10):846-856 1998.
25. David T. Jones, GenTHREADER: An Efficient and Reliable Protein Fold
Recognition Method for Genomic Sequences, J. Mol. Biol., 287; 797-815 1999.
26. Kim Y. and Subramaniam S., Locally defined protein phylogenetic profiles reveal
previously missed protein interactions and functional relationships, Proteins, 62:
1115-1124 2006.
27. Kim Y., Koyuturk M., Topkara U., Grama A., and Subramaniam S., Inferring
functional information from domain co-evolution, Bioinformatics, 22: 40-49 2006.
28. Lishko P.V., Procko E., Jin X., Phelps C.B., and Gaudet R., The ankyrin repeats
of TRPV1 bind multiple ligands and modulate channel sensitivity, Neuron, 54:
905-918 2007.
29. Batrukova M.A., Betin V.L., Rubtsov A.M., Lopina O.D., Ankyrin: structure,
properties, and functions, Biochemistry (Mosc ), 65: 395-408 2000.
30. Marchler-Bauer A et al , (2005) CDD: a Conserved Domain Database for protein
classification. Nucleic Acids Res, 33 Database Issue: D192-D196.
31. Tamura K, Dudley J, Nei M, and Kumar S (2007) MEGA4: Molecular
Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol 24:
1596-1599 2007.
32. Chaumont F., Barrieu F., Wojcik E., Chrispeels M.J., and Jung R., Aquaporins
constitute a large and highly divergent protein family in maize. Plant Physiol 125:
1206-1215 2001.
33. Mio K., Ogura T., Kiyonaka S., Hiroaki Y., Tanimura Y., Fujiyoshi Y., Mori Y.,
and Sato C., The TRPC3 channel has a large internal chamber surrounded by
signal sensing antennas. J Mol Biol 367: 373-383 2007.
34. Letunic I., Copley R.R., Schmidt S., Ciccarelli F.D., Doerks T., Schultz J.,
Ponting C.P., Bork P., SMART 4.0: towards genomic data integration, Nucleic
Acids Res, 32 Database issue: D142-D144 2004.
35. Sonnhammer E.L., von H.G., Krogh A., A hidden Markov model for predicting
transmembrane helices in protein sequences, Proc Int Conf Intell Syst Mol Biol, 6:
175-182 1998.
36. Murzin A. G., Brenner S. E., Hubbard T., and Chothia C., SCOP: a structural
classification of proteins database for the investigation of sequences and
structures, J. Mol. Biol. 247, 536-540, 1995.
37. Chen, Y.C. and Lim, C., Predicting RNA-binding sites from the protein structure
based on electrostatics, evolution and geometry, Nucleic Acids Research, 36(5),
e29 2008.
38. Shazman, S. and Mandel-Gutfreund, Y., Classifying RNA-Binding Proteins
Based on Electrostatic Properties, PLoS Computational Biology, 4(8), e1000146
2008.
39. Chen Y. and Varani G., Protein families and RNA recognition, FEBS Journal,
272:2088–2097 2005.
40. Puglisi J.D., Chen L., Blanchard S. and Frankel A.D., Solution structure of a
bovine immunodeficiency virus Tat-TAR peptide-RNA complex, Science, 270:
1200–1203 1995.
41. Ye X., Kumar R.A. and Patel D.J., Molecular recognition in the bovine
immunodeficiency virus Tat peptide-TAR RNA complex, Chem Biol, 2:827–840
1995.
42. Varani G. and Nagai K., RNA recognition by RNP proteins during RNA
processing and maturation, Ann Rev Biophys Biomol Struct, 27:407–445 1998.
43. Antson A.A., Dodson E.J., Dodson G., Greaves R.B., Chen X.P. and Gollnick P.,
Structure of the trp RNA binding attenuation protein, TRAP, bound to RNA,
Nature 401:235–242 1999.
44. Wang X., McLachlan J., Zamore P.D. and Tanaka-Hall T.M. , Modular
recognition of RNA by a human Pumilio-homology domain, Cell 110, 501–512
2002.
45. Lu D., Searles M.A. and Klug A., Crystal structure of a zinc-finger-RNA complex
reveals two modes of molecular recognition, Nature, 426:96–100 2003.
46. Hudson B.P., Martinez-Yamout M.A., Dyson H.J. and Wright P.E., Recognition
of the mRNA AU-rich element by the zinc finger domain of TIS11d, Nat Struc
Mol Biol, 11:257–264 2004.
47. Predki P.F., Nayak L.M., Gottlieb M.B.C. & Regan L., Dissecting RNA–protein
interactions: RNA-RNA Recognition by Rop, Cell 80:41–50 1995.
48. Blaszczyk J., Tropea J.E., Bubunenko M., Routzahn K.M., Waugh D.S., Court
D.L. and Ji X., Crystallographic and modeling studies of RNase III suggest a
mechanism for double-stranded RNA cleavage, Structure, 9:1225–1236 2001.
49. Garner M.M. and Revzin A., A gel electrophoresis method for quantifying the
binding of proteins to specific DNA regions: application to components of the
Escherichia coli lactose operon regulatory system, Nuc. Acids. Res., 9:3047–60
1981.
50. Promega, Protein interaction guide, 24-26.
51. Dubchak I., Muchnik I., Holbrook S.R., and Kim S.H., Prediction of protein
folding class using global description of amino acid sequence, Proc Natl Acad Sci
USA, 92:8700-8704 1995.
52. Gribskov, M., Mclachlan, A.D, and David, E. Profile analysis: Detection of
distantly related proteins Proc. Natl. Acad. Sci. USA, vol. 84. 4355-4358 1987
53. Sander, C. and Schneider, R., Database of homology-derived protein structures
and the structural meaning of sequence alignment, Proteins: Struct. Funct. Genet.,
9:56-68 1991.
54. Hilbert, M., Bohm, G. & Jaenicke, R., Structural relationships of homologous
proteins as a fundamental principle in homology modeling, Proteins: Struct.
Funct. Genet., 17:138-151 1993.
55. Chris Sander and Reinhard Schneider, Database of Homology-Derived Structures
and the Structural Meaning of Sequence Alignment, PROTEINS: Structure,
Function, and Genetics, 9:56-68 1991.
56. Russ, W.P., Lowery, D.M., Mishra, P., Yaffe, M.B. and Ranganathan, R.,
Natural-like function in artificial WW domains, Nature, 437: 579-583 2005.
57. Shah, I. and Hunter, L., Predicting enzyme function from sequence: a systematic
appraisal, Proc Int Conf Intell Syst Mol Biol, 5:276–83 1997.
58. Shah, I. and Hunter, L., Identification of divergent functions in homologous
proteins by induction over conserved modules, Proc Int Conf Intell Syst Mol Biol,
6:157–64 1998.
59. Tian, W. and Skolnick, J., How well is enzyme function conserved as a function
of pairwise sequence identity?, J Mol Biol, 333:863–82 2003.
60. Doolittle, R.F. and Bork, P., Evolutionarily mobile modules in proteins, Sci Am,
269:50–6 1993.
61. Doolittle, R.F., The multiplicity of domains in proteins, Annu Rev Biochem,
64:287–314 1995.
62. Ranea, J.A., Yeats, C., Grant, A., and Orengo, C.A., Predicting protein function
with hierarchical phylogenetic profiles: the Gene3D Phylo-Tuner method applied to
eukaryotic genomes, PLoS Comput. Biol., doi:10.1371/journal.pcbi.0030237
2007.
63. Suh, B.C. and Hille, B., Regulation of ion channels by phosphatidylinositol 4,5-
bisphosphate, Curr Opin Neurobiol, 15: 370-378 2005.
64. Inna Dubchak, Ilya Muchnik, Stephen R. Holbrook, and Sung-Hou Kim,
Prediction of protein folding class using global description of amino acid
sequence, Proc. Natl. Acad. Sci. USA, 92:8700-8704 1995.
65. Alexander, P.A., He, Y., Chen, Y., Orban, J., and Bryan, P.N., The design and
characterization of two proteins with 88% sequence identity but different structure
and function, Proc Natl Acad Sci USA, 104:11963-11968 2007.
66. Bock, J.R. and Gough, D.A., Predicting protein–protein interactions from primary
structure, Bioinformatics, 17:455–460 2001.
67. Yu, X., Cao, J., Cai, Y., Shi, T., and Li, Y., Predicting rRNA-, RNA-, and DNA-
binding proteins from primary structure with support vector machines, J. Theor.
Biol., 240:175–184 2006.
68. Corsini, L., Bonnal, S., Basquin, J., Hothorn, M., Scheffzek, K., Valcárcel, J. and
Sattler, M., U2AF-homology motif interactions are required for alternative
splicing regulation by SPF45, Nature Structural & Molecular Biology, 14:260-269
2007.
69. Kielkopf C.L., Lücke S., and Green M.R., U2AF homology motifs: protein
recognition in the RRM world, Genes Dev., 18(13):1513–1526 2004.
70. Rost, B., Liu, J., Nair, R., et al., Automatic prediction of protein function, Cell
Mol Life Sci, 60:2637–50 2003.
71. Ashburner, M., Ball, C.A., Blake, J.A., et al., Gene ontology: tool for the
unification of biology. The gene ontology consortium, Nat Genet, 25:25–9 2000.
VITA
Kyung Dae Ko
EDUCATION
Penn State University, University Park, PA
Ph.D. in Bioinformatics & Genomics, IBIOS program (August 2009)
Dissertation topic: “Quantitative functional measurement of a protein using phylogenetic profiles”
Master in Computer Science and Engineering, Spring 2005
Thesis topic: “Designing the Gestalt Detection Domain Algorithm (GDDA) for Detection of Hidden Domains”
Master in Electrical Engineering, 2003
Thesis topic: “A design of Directive Photonic-Band-Gap Antennas for a Dual Band operation using CFDTD”
PUBLICATION
*Kyung Dae Ko, Gaurav Bhardwaj, Yoojin Hong, Gue Su Chang, Kirill Kiselyov, Damian B. van Rossum and Randen L. Patterson, “Phylogenetic profiles reveal structural/functional determinants of TRPC3 signal-sensing antennae”, Communicative & Integrative Biology, Vol. 2, issue 2, March/April 2009.
*G.S. Chang, *Y.H. Hong, K.D. Ko, G. Bhardwaj, E.C. Holmes, R.L. Patterson and D. Rossum, “Phylogenetic profiles reveal evolutionary relationships within the twilight zone of sequence similarity”, Proc Natl Acad Sci USA, Sept. 2008.
*K.D. Ko, *Y.H. Hong, *G.S. Chang, G. Bhardwaj, D. Rossum, and R.L. Patterson, “Phylogenetic Profiles as a Unified Framework for Measuring Protein Structure, Function and Evolution,” Physics Archives, June 2008.
Young Ju Lee, Junho Yeo, Kyoung Dae Ko, Raj Mittra, Yoonjae Lee, and Wee Sang Park, “A Novel Design Technique For Control of Defect Frequencies of An Electromagnetic Bandgap (EBG) Superstrate For Dual-Band Directivity Enhancement,” Microwave and Optical Technology Letters, Vol. 42, No. 1, July 5 2004.
Y.J. Lee, J. Yeo, K.D. Ko, R. Mittra, Y. Lee, and S. Park, “Techniques for Controlling the Defect Frequencies of Electromagnetic Bandgap (EBG) Superstrates for Dual-band Directivity Enhancement of a Patch Antenna,” IEEE Antennas & Propagation Society International Symposium/URSI, Monterey, California, Volume 2, 20-25 June 2004.