Protein Multiple Alignment Incorporating Primary and
Secondary Structure Information
Nak-Kyeong Kim1 and Jun Xie2,∗
1National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
8600 Rockville Pike, Building 38A
Bethesda, MD 20894-6075
E-mail: [email protected]
2Department of Statistics
Purdue University
150 N. University Street
West Lafayette, IN 47907-2067
E-mail: [email protected]
April 26, 2006
Running Heads: Sequence alignment with secondary structures
KEY WORDS: Gibbs sampling; Likelihood function; Protein sequence motifs; Secondary
structures; Segment overlap.
Abstract
Identifying common local segments, also called motifs, in multiple protein sequences
plays an important role in establishing homology between proteins. Homology is easy
to establish when sequences are similar (sharing an identity > 25%). However, for dis-
tant proteins, it is much more difficult to align motifs that are not similar in sequences
but still share common structures or functions. This paper is a first attempt to align
multiple protein sequences using both primary and secondary structure information.
A new sequence model is proposed so that the model assigns high probabilities not
only to motifs that contain conserved amino acids but also to motifs that present common
secondary structures. The proposed method is tested on the structural alignment
database BAliBASE. We show that information brought by the predicted secondary
structures greatly improves motif identification. A website of this program is available
at http://www.stat.purdue.edu/∼junxie/2ndmodel/sov.html.
1 Introduction
Genome sequencing projects produce enormous sequence data. The interpretation of these
data, however, is an ongoing challenge and highly depends on efficient computational ap-
proaches. Statistical methods and probability models have been successfully used to analyze
biological sequences. In this paper, we are interested in aligning common motifs in multiple
proteins. The observed data are protein amino acid sequences, which are also called the pri-
mary structure of the proteins. Protein motifs here are referred to as local segments (10-50
amino acids) that are critical for protein structures and functions. Multiple sequence align-
ments help to characterize protein structures and functions by common sequence patterns.
Numerous multiple sequence alignment programs have been proposed. Thompson et al. (1999b)
provided a comprehensive comparison of ten programs, some of which were highly ranked as
evaluated by the BAliBASE (Thompson et al. 1999a) benchmark alignment database. To list a
few, ClustalW (Thompson et al. 1994) is a widely used progressive alignment method. A multiple
alignment is built up gradually by aligning the closest sequences first and successively
adding in the more distant ones. Dialign (Morgenstern et al. 1996) is a local alignment
approach, which constructs multiple alignments based on segment-to-segment comparisons
rather than residue-to-residue comparisons. The PRRP program (Gotoh 1996) optimizes a
progressive alignment by iteratively dividing the sequences into two groups and realigning
the groups. These three programs will be compared to our proposed alignment method in
Section 4.
In addition to the alignment programs, motifs are often modeled by the position specific
score matrix (PSSM), which corresponds to a product of multinomial distributions of amino
acids. Based on the PSSM model, Lawrence and Reilly (1990) treated the starting positions
of motifs as missing data and proposed an EM algorithm (Dempster et al. 1977) for motif
detection. The EM algorithm is known to converge slowly, and it often converges
to a local maximum. Lawrence et al. (1993) and Liu et al. (1995) developed a Bayesian model
and a Gibbs sampling algorithm to find the motifs under the same missing-data formulation.
The method has a better chance to escape a local maximum because of its stochastic nature.
Xie et al. (2004) extended the Bayesian model by allowing insertions and deletions within
the motifs. Eddy (1998) developed a hidden Markov model to describe motifs, also allowing
gaps inside motif patterns. Considering insertions and deletions often results in intensive
computation and the program may suffer from lack of convergence. Despite their strengths, all
the above methods use only information of protein primary structures. They have limitations
in finding weak motif patterns that have a low level of similarity between sequences.
Besides sequence, protein structure provides significant information for protein function.
It is assumed that 3-dimensional (3D) structures evolve more slowly than sequences and the
function of a protein is highly influenced by its 3D structure (Silberberg 2000). However,
due to the slow and expensive experimental processes to determine protein 3D structures,
only a limited number of proteins have known 3D coordinates. Predicting 3D structure from
the sequence is one of the biggest challenges in computational biology.
Secondary structure is a simplified characteristic of a protein’s 3D structure. All success-
ful methods in the field of 3D fold recognition make use of secondary structure predictions,
showing that secondary structure is a valuable way to establish structural relationship be-
tween proteins. Three state descriptions of protein secondary structure are commonly used:
helix (which includes all helical types), strand (which includes the beta sheet), and coil
(which includes everything else, e.g. bend and turn). Many secondary structure prediction
algorithms have been proposed, for instance, score-based methods (Chou and Fasman 1974;
Garnier et al. 1978), nearest neighbor methods (Salamov and Solovyev 1995), and neural
networks (Rost and Sander 1993, Jones 1999). Several competing methods reached around
70 - 78% accuracy (fraction of correctly predicted three states), with the PSI-PRED (Jones
1999, Bryson et al. 2005) server, a neural network based algorithm, as one of the most accu-
rate tools. We will use PSI-PRED in this paper. Figure 1 shows PSI-PRED prediction for a
short protein UBIQ_HUMAN (SWISS-PROT P02248), which belongs to one of our example data
sets 1ubi in Section 4. The contiguous segments of secondary structures are given, where H,
E, and C represent helix, strand, and coil, respectively.
[Figure 1 about here.]
A family of structurally similar proteins may have divergent amino acid compositions
because 3D structures are not affected too much by substitutions of certain amino acids.
The 3D structures, however, should be conserved to perform a certain function. If the 3D
structures are conserved, it is likely that secondary structures are conserved. Geourjon et
al. (2001) introduced the idea of using the predicted secondary structure in identifying
related proteins with weak sequence similarity. They collected distantly-related sequences
with 10-30% sequence identity and calculated the secondary structure similarity of each pair
of sequences using the SOV (Segment Overlap) measure (Zemla et al. 1999). Sequence
homology was established only when the SOV was greater than a threshold. However, this
approach is limited to pairwise protein sequence comparisons. Errami et al. (2003) used
the predicted secondary structures in multiple protein sequences. They validated existing
multiple alignments by discarding unrelated sequences. Relationship was measured by SOV
calculated for all pairs of sequences in a given multiple alignment. This approach gives gen-
eral and vague guidelines in verifying existing multiple alignments, but it does not construct
multiple alignments.
In this paper, we propose a new statistical method that models protein motifs using
both primary and secondary structure information. Segment overlap (SOV) is generalized
to measure the similarity of secondary structures for a group of multiple sequences. A mul-
tiple alignment method is proposed to maximize both amino acid and secondary structure
conservation. Section 2 defines the data structure and presents SOV measurements. Section
3 shows the probabilistic models of motifs using the predicted secondary structures. A Gibbs
sampling algorithm is derived for model inference. Convergence is studied by multiple sim-
ulations and a proposed alignment score. Section 4 evaluates the models using the database
of structural multiple alignment BAliBASE (Thompson et al. 1999a). Section 5 concludes
with a discussion.
2 Data structure and SOV
2.1 Data structure
A given set of protein sequences can be represented as
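\[
\text{Data } R:\quad
\begin{array}{ll}
\text{sequence } R_1: & r_{1,1}\; r_{1,2}\; \ldots\; r_{1,L_1} \\
\text{sequence } R_2: & r_{2,1}\; r_{2,2}\; \ldots\; r_{2,L_2} \\
\qquad\vdots & \qquad\vdots \\
\text{sequence } R_K: & r_{K,1}\; r_{K,2}\; \ldots\; r_{K,L_K}
\end{array}
\]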
where the residue rk,l takes values from an alphabet with 20 different letters, and Lk rep-
resents the length of the kth sequence. We seek segments of length J from each sequence,
which resemble each other as much as possible. The segments are called motifs. The motif
width J can be determined by either the user or a heuristic algorithm (Xie and Kim 2005).
Let A = {ak, k = 1, . . . , K} denote the starting positions of the motif for the K sequences.
The alignment can be represented by a matrix, R{A}:
\[
R_{\{A\}} =
\begin{pmatrix}
r_{1,a_1} & \cdots & r_{1,a_1+J-1} \\
\vdots & \ddots & \vdots \\
r_{K,a_K} & \cdots & r_{K,a_K+J-1}
\end{pmatrix}
\tag{1}
\]
When the motif has conserved amino acids, the matrix (1) is well represented by a PSSM and
the existing motif-finding algorithms would work well. When the motif sequences are not
conserved, the motif 3D structure may still be preserved. Therefore, adding the predicted
secondary structures would enhance the motif signal.
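The alignment matrix (1) can be illustrated with a short sketch. The function name `motif_block` and the toy sequences are assumptions made for illustration only; start positions are 1-based, as in the text.

```python
from typing import List


def motif_block(seqs: List[str], starts: List[int], width: int) -> List[str]:
    """Return the K x J ungapped block R{A}; starts are 1-based as in the text."""
    block = []
    for seq, a in zip(seqs, starts):
        if a < 1 or a + width - 1 > len(seq):
            raise ValueError("motif window falls outside the sequence")
        # Row k holds residues r_{k,a_k} ... r_{k,a_k+J-1}.
        block.append(seq[a - 1 : a - 1 + width])
    return block


# Toy sequences (hypothetical, not taken from the paper's data sets).
seqs = ["MKTAYIAKQR", "GSHMKSAYLAK", "AYIAKQRQIS"]
print(motif_block(seqs, [3, 5, 1], 5))  # ['TAYIA', 'KSAYL', 'AYIAK']
```

Each row of the returned block is one motif segment, and each column corresponds to one motif position j.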
2.2 Secondary structure similarity measurement SOV
The three states for the secondary structure are helix (H), strand (E), and coil (C). Secondary
structure similarity can be measured by the Q3 measure, defined as a fraction of residues
correctly matched in the three conformational states. However, the Q3 measurement some-
times gives inappropriate values. For example, predicting the entire myoglobin chain as one
big helix gives a Q3 value of about 80%, which outperforms most of the existing prediction
methods. Alternatively, a better measurement is the Segment Overlap (SOV) by Zemla et
al. (1999). SOV considers natural variations in the boundaries of segments among homolo-
gous protein structures. It is a measure based on secondary structure segments rather than
individual residues.
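The weakness of Q3 can be seen in a minimal sketch. The function `q3` and the toy structure string below are illustrative assumptions (the string is not the real myoglobin assignment); the point is only that an all-helix prediction scores well on a mostly helical protein.

```python
def q3(observed: str, predicted: str) -> float:
    """Fraction of residues whose three-state label (H, E, C) matches."""
    if len(observed) != len(predicted):
        raise ValueError("structure strings must have equal length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)


# Toy, mostly helical structure (hypothetical).
observed = "CHHHHHHHHCCHHHHHHHHC"
all_helix = "H" * len(observed)
print(q3(observed, all_helix))  # 0.8, although every segment boundary is wrong
```

Because Q3 counts residues independently, it cannot penalize a prediction that merges all segments into one.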
[Figure 2 about here.]
Let s1 and s2 denote any two segments of secondary structure in conformational state i
(i.e. H, E, or C). Let (s1, s2) denote a pair of overlapping segments. For example, (β1, β2)
in Figure 2 is a pair of overlapping segments with strand (E). Let S(i) denote the set of all
overlapping pairs of segments (s1, s2) in state i, and let S′(i) denote the set of segments s1
for which there is no overlapping segment s2 in state i, i.e.:
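\[
\begin{aligned}
S(i) &= \{(s_1, s_2) : s_1 \cap s_2 \neq \emptyset,\ s_1 \text{ and } s_2 \text{ are both in the conformational state } i\},\\
S'(i) &= \{s_1 : \forall s_2,\ s_1 \cap s_2 = \emptyset,\ s_1 \text{ and } s_2 \text{ are both in the conformational state } i\}.
\end{aligned}
\]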
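Define SOVo for state i as:
\[
\mathrm{SOV}_o(i) = \frac{1}{N(i)} \sum_{(s_1,s_2)\in S(i)}
\left[ \frac{\mathrm{minov}(s_1,s_2) + \delta(s_1,s_2)}{\mathrm{maxov}(s_1,s_2)} \times \mathrm{len}(s_1) \right],
\]
where
\[
N(i) = \sum_{(s_1,s_2)\in S(i)} \mathrm{len}(s_1) + \sum_{s_1\in S'(i)} \mathrm{len}(s_1),
\]
\[
\delta(s_1,s_2) = \min\{\mathrm{maxov}(s_1,s_2) - \mathrm{minov}(s_1,s_2);\ \mathrm{minov}(s_1,s_2);\ \mathrm{int}(\mathrm{len}(s_1)/2);\ \mathrm{int}(\mathrm{len}(s_2)/2)\}.
\]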
In the formula, len is the segment length, minov is the length of the actual secondary
structure overlap of s1 and s2, maxov is the maximal length of the overlapping structures s1
and s2 (See Figure 2). SOVo of all secondary states is defined as:
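\[
\mathrm{SOV}_o = \frac{1}{N} \sum_{i\in\{H,E,C\}} \sum_{(s_1,s_2)\in S(i)}
\left[ \frac{\mathrm{minov}(s_1,s_2) + \delta(s_1,s_2)}{\mathrm{maxov}(s_1,s_2)} \times \mathrm{len}(s_1) \right],
\]
where $N = \sum_{i\in\{H,E,C\}} N(i)$.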
To illustrate the calculation of SOVo(E), let us consider the two secondary structures in
Figure 2. There are two overlapping pairs for extended sheet (E): (β1, β2) and (β1, β3). For
the first pair, minov(β1, β2) = 2, maxov(β1, β2) = 8, and δ(β1, β2) = min{(8 − 2); 2; 3; 2} = 2.
The second pair can be calculated similarly. Then the value of SOVo(E) is calculated as:
\[
\mathrm{SOV}_o(E) = \frac{1}{6+6} \times \left( \frac{2+2}{8} + \frac{2+1}{7} \right) \times 6 = 0.464
\]
Summing over all 3 states, the overall SOVo of the given structures is evaluated to be
0.629. The SOVo measure ranges from 0 to 1, where 1 is the perfect match and 0 is the
complete mismatch. The value 0.629 can be roughly interpreted as meaning that 63% of the
secondary structures are matched.
SOVo is originally defined for similarity of an observed secondary structure and its pre-
dicted secondary structure. The asymmetric nature of S(i), N(i) and len(s1) makes SOVo
asymmetric between the two sequences s1 and s2. When this measure is used for the two
predicted structures, a symmetric measure can be defined by:
\[
\mathrm{SOV} = \frac{\mathrm{SOV}_o(s_1, s_2) + \mathrm{SOV}_o(s_2, s_1)}{2}.
\]
This definition will be used for our SOV calculations.
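The definitions in this section can be sketched in a few lines of Python. This is a minimal illustration of the formulas as stated here, not the authors' program; the function names and the two toy structure strings are assumptions. The strings are chosen so that their strand segments have the same lengths and overlaps as in the worked example (one strand of length 6 overlapping strands of length 4 and 3 by two residues each); they are not the actual structures of Figure 2.

```python
from itertools import groupby


def segments(ss, state):
    """Half-open (start, end) intervals of maximal runs of `state` in string ss."""
    out, pos = [], 0
    for ch, run in groupby(ss):
        n = len(list(run))
        if ch == state:
            out.append((pos, pos + n))
        pos += n
    return out


def sov_o(ss1, ss2, states="HEC"):
    """Asymmetric SOV_o of ss1 against ss2, following the formulas above."""
    total, norm = 0.0, 0
    for state in states:
        segs2 = segments(ss2, state)
        for b1, e1 in segments(ss1, state):
            len1 = e1 - b1
            partners = [(b2, e2) for b2, e2 in segs2 if b2 < e1 and b1 < e2]
            if not partners:          # s1 belongs to S'(i): counts in N(i) only
                norm += len1
                continue
            for b2, e2 in partners:   # each overlapping pair belongs to S(i)
                minov = min(e1, e2) - max(b1, b2)
                maxov = max(e1, e2) - min(b1, b2)
                delta = min(maxov - minov, minov, len1 // 2, (e2 - b2) // 2)
                total += (minov + delta) / maxov * len1
                norm += len1
    return total / norm if norm else 0.0


def sov(ss1, ss2, states="HEC"):
    """Symmetric SOV: the average of the two asymmetric values."""
    return 0.5 * (sov_o(ss1, ss2, states) + sov_o(ss2, ss1, states))


a = "CCEEEEEEC"   # one strand of length 6 (toy string)
b = "EEEECCEEE"   # strands of length 4 and 3 (toy string)
print(round(sov_o(a, b, states="E"), 3))  # 0.464, as in the worked example
```

Restricting `states` to `"E"` isolates the strand term; with the default `"HEC"` the function sums over all three states before normalizing.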
3 Methods
3.1 Model assumptions
The proposed model consists of two parts, a position-specific score matrix (PSSM) for the
amino acid sequences and a SOV measurement for the secondary structures of the motifs.
Let X = {X1, ..., XK} denote the secondary structure strings for the set of K proteins, where
the secondary structure Xi of protein i is either known or predicted by PSI-PRED. PSI-PRED
employs two feed-forward neural networks which predict the secondary structure of a protein
based on the output obtained from PSI-BLAST (Position Specific Iterated BLAST,
Altschul et al. 1997). For the given protein, PSI-PRED uses all of its homologous proteins from
the NCBI (National Center for Biotechnology Information) protein database. We assume
the predicted secondary structures X are an extra given data set in addition to the protein
set of interest R.
Like many other secondary structure prediction methods, PSI-PRED utilizes sequence
information in multiple alignments obtained by PSI-BLAST. The multiple alignment helps
to infer secondary structure. On the other hand, our goal here is to improve multiple
alignment by the predicted secondary structures. Our development could be considered as
the second step of an iterative scheme that optimizes both the quality of the secondary
structure prediction and that of the multiple alignment.
The motif width J in our approach is chosen based on the method by Xie and Kim
(2005). Starting from a short alignment width (e.g. 10), the method expands the motif to
both sides according to the Kullback-Leibler information divergence. We focus our model on
detection and correct alignment of short similar regions in very long sequences of low overall
similarity. The motif width in our problems is typically 10-20. Therefore, we do not allow
gaps within a motif. The motifs identified by the proposed multiple alignment method are
ungapped blocks, which correspond to core regions in a group of proteins. On the other hand,
the regions outside of motifs are not aligned. There are insertions and deletions between the
aligned core motifs.
For simplicity, we focus on the model that assumes one motif occurring in each sequence.
Once one motif alignment is obtained, there are methods available to extend to multiple
motif alignments. For instance, the search for the next best motif can continue by means
of masking (Xie et al. 2004).
For the amino acid frequencies at each position j in the motif, we denote the frequency
parameters $\theta_j = (\theta_{1,j}, \ldots, \theta_{20,j})^T$, $j = 1, \ldots, J$. Background sequences are assumed to follow
another common multinomial distribution with parameter $\theta_0 = (\theta_{1,0}, \ldots, \theta_{20,0})^T$. Let
$\Theta = (\theta_0, \theta_1, \ldots, \theta_J)$. We denote a counting function h such that $h(R) = (m_1, \ldots, m_{20})^T$, where
$m_i$ is the number of letters of the ith type observed in R. Furthermore, let $R_A(j)$ denote the
jth column in (1), and let $R_{\{A\}^c}$ denote the amino acids outside of the motif. Let $SOV(a_l, a_m)$
denote the SOV measure between two segments with width J starting at position $a_l$ in the
lth sequence and position $a_m$ in the mth sequence.
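The counting function h and the two kinds of counts it produces can be sketched as follows. The alphabet ordering, the helper name `motif_and_background_counts`, and the toy sequences are assumptions for illustration; start positions here are 0-based for convenience.

```python
AMINO = "ACDEFGHIKLMNPQRSTVWY"  # 20-letter alphabet; this ordering is an arbitrary choice


def h(residues):
    """Counting function: number of occurrences of each amino acid type."""
    return [residues.count(a) for a in AMINO]


def motif_and_background_counts(seqs, starts, J):
    """Return (per-column motif counts, background counts) for 0-based starts."""
    # Column j of the block collects residue a_k + j from every sequence k.
    columns = [h("".join(s[a + j] for s, a in zip(seqs, starts))) for j in range(J)]
    # Everything outside the motif windows is treated as background.
    outside = "".join(s[:a] + s[a + J:] for s, a in zip(seqs, starts))
    return columns, h(outside)


cols, bg = motif_and_background_counts(["ACDA", "CCDE"], [1, 0], 2)
print(cols[0][AMINO.index("C")], bg[AMINO.index("A")])  # 2 2
```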
3.2 Probability model
Given the previous notations, the complete likelihood function with motif locations A given
is defined as
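\[
\pi(R, A \mid \Theta, \lambda, X) \propto \theta_0^{h(R_{\{A\}^c})} \prod_{j=1}^{J} \theta_j^{h(R_A(j))}
\exp\Big\{ \lambda \frac{J}{K} \sum_{l<m} SOV(a_l, a_m) \Big\}
\tag{2}
\]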
alignment A. The conjugate prior distribution for Θ is defined. Specifically, the prior for Θ
is a product Dirichlet distribution, denoted by g(Θ). The parameter in the prior distribution
for $\theta_j$ is $\beta_j = (\beta_{1,j}, \ldots, \beta_{20,j})$, $j = 0, \ldots, J$, which is defined at the end of this section. For
notational simplicity, considering vectors $a = (a_1, \ldots, a_{20})^T$ and $b = (b_1, \ldots, b_{20})^T$, we write
$a + b = (a_1 + b_1, \ldots, a_{20} + b_{20})^T$, $a^b = a_1^{b_1} \cdots a_{20}^{b_{20}}$, $|a| = |a_1| + \cdots + |a_{20}|$, and
$\Gamma(a) = \Gamma(a_1) \cdots \Gamma(a_{20})$.
The posterior distribution for A is derived as follows:
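\[
\begin{aligned}
\pi(A \mid R, X, \lambda) &\propto \pi(A, R \mid \lambda, X) = \int \pi(A, R \mid \Theta, \lambda, X)\, g(\Theta)\, d\Theta \\
&\propto \Gamma\big(h(R_{\{A\}^c}) + \beta_0\big) \prod_{j=1}^{J} \Gamma\big(h(R_A(j)) + \beta_j\big)
\times \exp\Big\{ \lambda \frac{J}{K} \sum_{l<m} SOV(a_l, a_m) \Big\}
\end{aligned}
\]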
as constants:
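\[
\begin{aligned}
\pi(a_k \mid A_{[-k]}, R, X, \lambda) &\propto \frac{\pi(A \mid R, X, \lambda)}{\pi(A_{[-k]} \mid R, X, \lambda)} \\
&\propto \frac{\Gamma\big(h(R_{\{A_{[-k]}\}^c}) + \beta_0 - h(R_{\{a_k\}})\big)}{\Gamma\big(h(R_{\{A_{[-k]}\}^c}) + \beta_0\big)}
\times \prod_{j=1}^{J} \frac{\Gamma\big(h(R_{A_{[-k]}}(j)) + \beta_j + h(r_{k,a_k+j-1})\big)}{\Gamma\big(h(R_{A_{[-k]}}(j)) + \beta_j\big)} \\
&\quad \times \exp\Big\{ \lambda \frac{J}{K} \sum_{l:\, l \neq k} SOV(a_l, a_k) \Big\}
\end{aligned}
\]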
By using Stirling’s formula, the (predictive) posterior distribution for ak can be simplified
as:
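\[
\pi(a_k \mid A_{[-k]}, R, X, \lambda) \propto \prod_{j=1}^{J} \left( \frac{\hat{\theta}_{j[k]}}{\hat{\theta}_{0[k]}} \right)^{h(r_{k,a_k+j-1})}
\times \exp\Big\{ \lambda \frac{J}{K} \sum_{l:\, l \neq k} SOV(a_l, a_k) \Big\},
\tag{3}
\]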
where $\hat{\theta}_{j[k]}$ and $\hat{\theta}_{0[k]}$ are the posterior means of $\theta_j$ and $\theta_0$, whose calculations are specified
below. Given the current alignment defined by $A_{[-k]}$, the probability of updating $a_k$ depends
on both the amino acid pattern, i.e., the odds ratio of the motif probability versus the
background probability, and the similarity of the secondary structures, i.e., $SOV(a_l, a_k)$,
$l = 1, \ldots, K$ and $l \neq k$.
The posterior means $\hat{\theta}_{j[k]} = (\hat{\theta}_{1,j[k]}, \ldots, \hat{\theta}_{20,j[k]})^T$, $j = 1, \ldots, J$, are evaluated based
on the current alignment and a pseudo-count correction. Let $f_i$ be the observed relative
frequency of amino acid i in the current alignment except sequence k. Let $p_i$ be the relative
frequency of amino acid i in the background, let $N = K - 1$ be the number of sequences excluding
sequence k, and let B be the weight of the pseudo-count correction. A simple pseudo-count
correction approach estimates the posterior mean by $\hat{\theta}_{i,j[k]} = (N \cdot f_i + B \cdot p_i)/(N + B)$, where
$B \cdot p_i$ corresponds to the Dirichlet prior parameter $\beta_{i,j}$ in our Bayesian model. Alternatively,
a better approach is the Blosum pseudo-count correction method (Altschul et al. 1997).
It replaces $p_i$ in the formula by a frequency that is calculated from a Blosum (Henikoff
and Henikoff 1992) amino acid substitution matrix. Formally, the pseudo-count $B \cdot p_i$ is
multiplied by $\sum_{j=1}^{20} f_j e^{\mu S_{ij}}$, where $S_{ij}$ is the substitution score of amino acid pair (i, j) defined
by a Blosum matrix (e.g. BLOSUM62), and $\mu$ is the scale parameter for the matrix. This
frequency estimate uses the prior knowledge of amino acid relationships embodied in the
substitution matrix $S_{ij}$. Those residues favored by the substitution matrix to align with the
residues actually observed receive high pseudo-count frequencies.
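The simple pseudo-count estimate above can be sketched as follows. Only the simple correction is shown; the Blosum variant is not implemented. The function name, the four-letter toy alphabet, and the value B = 2 are assumptions chosen to keep the example short.

```python
def posterior_mean(counts, background, B):
    """Simple pseudo-count estimate (N*f_i + B*p_i) / (N + B) for one motif column.

    counts[i] is the number of sequences (excluding sequence k) showing amino
    acid i in this column, so counts[i] = N * f_i and sum(counts) = N.
    """
    n = sum(counts)
    return [(c + B * p) / (n + B) for c, p in zip(counts, background)]


# Toy 4-letter alphabet with a uniform background (hypothetical values).
theta = posterior_mean([3, 1, 0, 0], [0.25, 0.25, 0.25, 0.25], B=2.0)
print(theta)
```

Letters that were never observed in the column still receive a positive probability, which keeps the odds ratio in Formula (3) finite.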
3.3 Gibbs sampling algorithm with multiple simulations
A Gibbs sampling procedure is used to generate samples according to Formula (3). The
sampling approach provides a good means to characterize the posterior distribution of motif
locations A. For instance, the mode of the posterior distribution gives an optimal motif
alignment. The Gibbs sampling starts with a random initial value of A, which is chosen
uniformly from all possible locations. Then ak, k = 1, . . . , K, is updated one sequence at a time.
The algorithm has two basic steps:
1. Exclude sequence k and calculate the current parameters θj[k] and θ0[k] using the
Blosum pseudo-count correction method described above. The predicted secondary
structures of the motif segments, except sequence k, are ready to use.
2. The likelihood ratio between the motif model and the background model is calculated
as in Formula (3). The new motif location ak is generated according to the weight (the
likelihood ratio).
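The two steps above can be sketched as a small sampler. This is a simplified illustration under stated assumptions, not the authors' program: a flat pseudo-count replaces the Blosum correction, start positions are 0-based, sequences are assumed to contain only the 20 standard letters, the sketch returns the last state of the chain rather than the most probable sample, and `sov` is a hypothetical callback that returns the SOV between the motif window of sequence l at position al and the window of sequence k at position ak. With the default callback (always 0) the sketch reduces to a PSSM-only sampler.

```python
import math
import random

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def freqs(residues, pseudo=1.0):
    """Posterior-mean frequencies with a flat pseudo-count (simplifying assumption)."""
    total = len(residues) + pseudo * len(AMINO)
    return {a: (residues.count(a) + pseudo) / total for a in AMINO}


def gibbs(seqs, J, lam=1.0, sov=lambda l, al, k, ak: 0.0, sweeps=200, seed=0):
    rng = random.Random(seed)
    K = len(seqs)
    A = [rng.randrange(len(s) - J + 1) for s in seqs]  # random initial starts
    for _ in range(sweeps):
        for k in range(K):
            others = [l for l in range(K) if l != k]
            # Step 1: estimate motif and background parameters without sequence k.
            theta = [freqs([seqs[l][A[l] + j] for l in others]) for j in range(J)]
            bg = freqs([c for l in others
                        for i, c in enumerate(seqs[l]) if not A[l] <= i < A[l] + J])
            # Step 2: weight every candidate start of sequence k as in Formula (3).
            logw = []
            for a in range(len(seqs[k]) - J + 1):
                lw = sum(math.log(theta[j][seqs[k][a + j]] / bg[seqs[k][a + j]])
                         for j in range(J))
                lw += lam * J / K * sum(sov(l, A[l], k, a) for l in others)
                logw.append(lw)
            top = max(logw)  # subtract the maximum for numerical stability
            weights = [math.exp(x - top) for x in logw]
            A[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return A


# Toy sequences (hypothetical) that share the segment "WWWW".
toy = ["ACDEFGHIKLWWWWMNPQ", "WWWWACDEFGHIK", "MNPQRSTWWWWVY"]
print(gibbs(toy, J=4, sweeps=50))
```

Working with log weights and subtracting the maximum before exponentiating avoids underflow when the motif is wide.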
The algorithm iterates the previous two steps for all sequences k = 1, . . . , K, in thou-
sands of iterations. The most probable sample A, obtained in the Gibbs sampling iterations,
corresponds to a mode (typically a local maximum) of the posterior distribution of A. Equiv-
alently, we consider maximizing an alignment score defined as:
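\[
\mathrm{Score} = \sum_{j=1}^{J} \sum_{i=1}^{20} c_{i,j} \log \frac{\hat{\theta}_{i,j}}{\hat{\theta}_{i,0}}
+ \lambda \frac{J}{K} \sum_{l<m} SOV(a_l, a_m),
\]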
where the ci,j’s are amino acid counts from the complete alignment. The first term in the
score formula is similar to the score defined by the standard Gibbs sampling approach with
only amino acid frequency (Jensen et al. 2004). The second term is a new contribution by
secondary structures.
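The alignment score can be sketched as a small function. It assumes that the secondary structure term sums SOV over all pairs of sequences l < m, in line with the pairwise definition of SOV(al, am) in Section 3.1; the function name, the argument layout, and the two-letter toy alphabet are illustrative assumptions.

```python
import math


def alignment_score(counts, theta, theta0, pair_sov, lam, K):
    """Log-odds PSSM term plus lam * (J / K) * sum of pairwise SOV values.

    counts[j][i], theta[j][i]: count and estimated frequency of amino acid i in
    motif column j; theta0[i]: background frequency; pair_sov: {(l, m): SOV}
    with one entry per pair l < m (assumed form of the second term).
    """
    J = len(counts)
    pssm = sum(c * math.log(theta[j][i] / theta0[i])
               for j in range(J) for i, c in enumerate(counts[j]) if c > 0)
    return pssm + lam * J / K * sum(pair_sov.values())


# Toy example: 2-letter alphabet, J = 2 columns, K = 2 sequences (hypothetical).
score = alignment_score(counts=[[2, 0], [1, 1]],
                        theta=[[0.8, 0.2], [0.5, 0.5]],
                        theta0=[0.5, 0.5],
                        pair_sov={(0, 1): 0.5},
                        lam=1.0, K=2)
print(round(score, 4))  # 1.44
```

Setting `lam=0` recovers the amino acid term alone, which is the part shared with the standard Gibbs sampling score.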
Our simulations indicate that, starting from a given random initial location A, the Gibbs sampling
algorithm always converges within a thousand iterations. However, the convergent
results may vary from simulation to simulation with different initial values A. The sampling
result of an individual Markov chain only corresponds to one of many local maxima. We
evaluate the sampling procedure using multiple simulations.
As an ad hoc guideline, we always run Gibbs sampling with several choices of the param-
eter λ, for instance, λ = 0.5, 1, 1.5, 2. In addition, 50-100 Markov chain simulations from
different random initial locations A are used for each λ value. Gelman and Rubin (1992)
noted the importance of running multiple Gibbs sampling chains for obtaining reliable statistical
inferences. Besides obtaining an over-dispersed distribution of the motif alignment A,
running multiple Markov chains solves the difficult problem of setting the unknown parame-
ter λ. Instead of setting a λ value for the given protein data, we consider the best alignment
as the one that has a high probability under several λ values. Therefore, the alignments that
repeat most frequently in these multiple simulations and also have high alignment scores are
reported as the candidate alignments.
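The selection of candidate alignments from multiple chains can be sketched as a frequency count. The tuple layout `(lam, starts, score)`, the tie-breaking by best score, and the toy run results are assumptions; the text specifies only that frequent, high-scoring alignments are reported.

```python
from collections import Counter


def candidate_alignments(runs, top=3):
    """Rank alignments by how often they recur across chains, then by best score.

    runs: list of (lam, starts, score), one entry per Markov chain.
    """
    freq = Counter(tuple(starts) for _, starts, _ in runs)
    best = {}
    for _, starts, score in runs:
        key = tuple(starts)
        best[key] = max(score, best.get(key, float("-inf")))
    ranked = sorted(freq, key=lambda a: (freq[a], best[a]), reverse=True)
    return [(a, freq[a], best[a]) for a in ranked[:top]]


# Toy results from four chains run at different lambda values (hypothetical).
runs = [(0.5, [3, 5, 1], 10.2), (1.0, [3, 5, 1], 12.0),
        (1.5, [2, 5, 1], 15.0), (2.0, [3, 5, 1], 13.1)]
print(candidate_alignments(runs)[0])  # ((3, 5, 1), 3, 13.1)
```

Scores obtained under different λ values are not on the same scale, so frequency is used as the primary key and the score only breaks ties.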
4 Application
To evaluate the proposed alignment method using secondary structure predictions, we com-
pare it with the standard Gibbs sampling (Lawrence et al. 1993; Liu et al. 1995), as
well as the highly ranked multiple alignment programs, including ClustalW (Thompson et
al. 1994), Dialign (Morgenstern et al. 1996), and PRRP (Gotoh 1996). The programs
are tested on reference alignments from the BAliBASE (Thompson et al. 1999a) benchmark
alignment database (http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE), which contains
manually-refined multiple sequence alignments. The aligned regions are defined as core
blocks, whose alignments are validated to ensure functional or structural conservation. Most
data sets in BAliBASE include a few proteins (< 10). For our purposes, we select ten
big data sets, each of which has more than 10 sequences. These data sets are also chosen to
represent the most difficult alignment problems. Specifically, four data sets (1idy, 1r69, 1ubi,
1wit) are selected from BAliBASE Reference 3 containing divergent protein families with av-
erage sequence identity less than 22%. Two data sets (Kinase2 and 1vln) are selected from
BAliBASE Reference 4 containing sequences with large N/C terminal extensions, and four
data sets (1thm1, s51, kinase2, kinase3) are selected from BAliBASE Reference 5 containing
internal insertions.
The names and features of the four data sets from Reference 3 are listed in Table 1.
Notice that instead of using the short sequences provided in BAliBASE, we collect the whole
protein sequences from the SWISS-PROT database (Bairoch and Apweiler 1997). The input
sequences for our alignments are much longer than those in BAliBASE, so the structural core
blocks are expected to be harder to align correctly. Motif widths are determined by
the extension procedure (Xie and Kim 2005), with 22, 19, 19, and 16 for 1idy, 1r69, 1ubi,
and 1wit respectively.
[Table 1 about here.]
To illustrate the impact of using secondary structures, we plot the likelihood function
of motif location ak for the third sequence (RPC2_BPP22) in the set of 1r69. Except for
this sequence, we assume that all the other motif locations are known. Figure 3 (a) shows
the log-likelihood based on only the SOV part, and Figure 3 (b) shows the log-likelihood
of the full model (PSSM + SOV; black) and the log-likelihood based on only the amino
acids part (PSSM; grey). The likelihood function based on only the SOV part gives high
probabilities for a few motif locations, whereas the likelihood based on PSSM alone shows
high probability peaks at many locations. Inference for the true motif location (position
17 for this data) is not an easy job, because the likelihood function based on either PSSM
or SOV alone has no dominant mode. In contrast, combining PSSM with SOV, we obtain
a better-shaped likelihood function. The true motif location at position 17 is clearly the
global mode and the relative difference from the second mode is strong. For this type of
data, the predicted secondary structure enhances the motif pattern, so the true motif
is easier to identify under the new model. As demonstrated in Table 2, the proposed
alignment method with secondary structure information finds the true motif of 1r69 much
more frequently (3.85 times as often) than the standard Gibbs sampling method.
[Figure 3 about here.]
Table 2 shows comparisons of our proposed model with the standard Gibbs sampling
method. For each data set, the alignments obtained by both methods are compared to the
structural alignments in BAliBASE. A good alignment is defined when a large number of
sequences out of the total number in each data set are correctly aligned. The criteria of
determining good alignments are listed in the second column in Table 2. Multiple Markov
chain simulations are used for the proposed method (PSSM+SOV) and the standard Gibbs
sampling, where the proposed method runs 200 Markov chains, 50 runs at each of the four
values λ = 0.5, 1, 1.5, 2, and the standard Gibbs sampling runs 100 Markov chains. The numbers
in the table represent the number of runs that correctly found the structural core blocks
in BAliBASE. Our model (PSSM + SOV) shows better success rates in finding the true
motifs. For example, the success rate for 1idy increases from 0% to 12.5%. The rate for 1r69
increases from 10% to 38.5%.
[Table 2 about here.]
Further comparisons of the proposed method (PSSM+SOV) with ClustalW, Dialign, and
PRRP are displayed in Table 3. The reported alignments from PSSM+SOV are the most
frequent alignments in 200 Markov chain simulations as described previously. Alignments are
measured by the number of correctly aligned sequences out of the total number of sequences
in each data set. For the data set 1ubi, PSSM+SOV performs much better than the other
3 programs. For data sets 1idy and 1r69, PSSM+SOV performs as well as Dialign but
better than the other 2 programs. For the rest of the data sets, all programs work well. In
summary, PSSM+SOV is the best choice among these programs. Plots of the percentages of
correctly aligned sequences for each of the programs in each of the data sets are shown in
Figure 4. The line of PSSM+SOV (dark blue) has high alignment values in all data sets.
[Table 3 about here.]
[Figure 4 about here.]
The comparisons indicate that the proposed method using secondary structure predic-
tions works at least as well as the best alignment programs using amino acid sequence
information alone, and even better in some situations. Studying the structural alignments
of these data sets in BAliBASE, we found that most of the alignments had conserved amino
acids at several positions, except the alignments of 1idy and 1ubi. Our proposed method out-
performs other alignment programs in these two data sets, because the secondary structures
greatly enhance the motif signals in addition to amino acid conservation. As an example,
the structural alignment of 1idy from BAliBASE is shown in Figure 5. The underlined
segments share common core structures and therefore are referred to as the true motif seg-
ments. Table 4 shows the alignment for 1idy by our approach using both PSSM and SOV.
This alignment corresponds to the first and second core structural regions in Figure 5. The
aligned amino acid segments show that there is no strongly conserved amino acid pattern,
except column 17. In contrast, the predicted secondary structures are conserved. The
secondary structure of the motif can be considered as a helix-turn-helix (helix-coil-helix)
structure.
[Table 4 about here.]
[Figure 5 about here.]
5 Discussion
The currently existing methods of identifying protein motifs consider only amino acid fea-
tures of the motifs. The proposed model is the first attempt to utilize the predicted secondary
structures for a probabilistic model of motifs. It is not surprising that information brought
by the predicted secondary structures improves multiple alignments. The SOV values, which
measure the similarity of secondary structures, are defined for the whole motif segments. The
dependence feature of adjacent amino acids is partially modeled in our approach, whereas
all existing models assume that the positions in a motif are independent.
Probability models and Bayesian methods showed great advantages in dealing with high
dimensional complicated sequence features. Our scoring function is in terms of probability,
which is defined to be exponentially proportional to a similarity measurement of secondary
structures. Instead of directly maximizing a score function, the Gibbs sampling method is employed to
simulate samples from the posterior probability, whose modes correspond to alignments of high
scores. Convergence to the global maximum is a major concern in multiple sequence
alignment. We solve this problem by simulating multiple Markov chains from different ran-
dom initial values and under different parameter λ values. The most probable alignment
from multiple simulations is likely to be the true alignment.
The proposed model can be improved by including reliability indices of secondary struc-
ture predictions. PSI-PRED (Jones 1999) assigns a score of confidence level at 0-9 for each
predicted secondary state (H, E or C). The score 9 indicates the most reliable prediction,
whereas score 0 indicates the least reliable prediction. It is known that the reliability in-
dices correlate very well with prediction accuracy. A weighted SOV measurement may be
developed such that the similarity between two segments of secondary structure in a confor-
mational state (i.e. H, E or C) will be weighted by the sum of the confidence indices of the
segments. The weighted SOV can then be substituted into Formulas (2) and (3) for a better
model of secondary structures.
References
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman,
D. J. (1997), “Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs,” Nucleic Acids Research, 25, 3389-3402.
Bairoch, A., and Apweiler, R. (1997), “The SWISS-PROT protein sequence database: its
relevance to human molecular medical research,” Journal of Molecular Medicine, 75, 312-316.
Bryson, K., McGuffin, L. J., Marsden, R. L., Ward, J. J., Sodhi, J. S., and Jones, D. T.
(2005), “Protein structure prediction servers at University College London,” Nucleic Acids
Research, 33 (Web Server issue), W36-38.
Chou, P. Y., and Fasman, G. D. (1974), “Prediction of protein conformation,” Biochem-
istry, 13, 211-215.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum likelihood from
incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Ser. B, 39,
1-38.
Eddy, S. R. (1998), “Profile hidden Markov models,” Bioinformatics, 14, 755-763.
Errami, M., Geourjon, C., and Deléage, G. (2003), “Detection of unrelated proteins in
sequences multiple alignments by using predicted secondary structures,” Bioinformatics, 19,
506-512.
Garnier, J., Osguthorpe, D. J., and Robson, B. (1978), “Analysis of the accuracy and
implications of simple methods for predicting the secondary structure of globular proteins,”
Journal of Molecular Biology, 120, 97-120.
Gelman, A., and Rubin, D. B. (1992), “Inference from iterative simulation using multiple
sequences,” Statistical Science, 7, 457-472.
Geourjon, C., Combet, C., Blanchet, C., and Deléage, G. (2001), “Identification of Re-
lated Proteins with Weak Sequence Identity Using Secondary Structure Information,” Pro-
tein Science, 10, 788-797.
Gotoh, O. (1996), “Significant improvement in accuracy of multiple protein sequence
alignments by iterative refinement as assessed by reference to structural alignments,”
Journal of Molecular Biology, 264, 823-838.
Henikoff, S., and Henikoff, J. G. (1992), “Amino Acid Substitution Matrices from Protein
Blocks,” Proceedings of the National Academy of Sciences, 89, 10915-10919.
Jensen, S. T., Liu, X. S., Zhou, Q., and Liu, J. S. (2004), “Computational Discovery of
Gene Regulatory Binding Motifs: A Bayesian Perspective,” Statistical Science, 19, 188-204.
Jones, D. T. (1999), “Protein secondary structure prediction based on position-specific
scoring matrices,” Journal of Molecular Biology, 292, 195-202.
Kabsch, W., and Sander, C. (1983), “Dictionary of protein secondary structure: pattern
recognition of hydrogen-bonded and geometrical features,” Biopolymers, 22, 2577-2637.
Lawrence, C. E., and Reilly, A. A. (1990), “An Expectation-Maximization (EM) Algo-
rithm for the Identification and Characterization of Common Sites in Biopolymer Sequences,”
Proteins, 7, 41-51.
Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton,
J. C. (1993), “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple
Alignment,” Science, 262, 208-214.
Liu, J. S., Neuwald, A. F., and Lawrence, C. E. (1995), “Bayesian Models for Multi-
ple Local Sequence Alignment and Gibbs Sampling Strategies,” Journal of the American
Statistical Association, 90, 1156-1170.
Morgenstern, B., Dress, A., and Werner, T. (1996), “Multiple DNA and protein sequence
alignment based on segment-to-segment comparison,” Proceedings of the National Academy
of Sciences, 93, 12098-12103.
Rost, B., and Sander, C. (1993), “Prediction of Protein Secondary Structure at Better
than 70% Accuracy,” Journal of Molecular Biology, 232, 584-599.
Salamov, A. A., and Solovyev, V. V. (1995), “Prediction of protein secondary structure
by combining nearest-neighbour algorithms and multiple sequence alignments,” Journal of
Molecular Biology, 247, 11-15.
Silberberg, M. S. (2000), Chemistry: The molecular nature of matter and change (2nd ed.),
Boston, MA: McGraw-Hill.
Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994), “CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice,” Nucleic Acids Research, 22, 4673-4680.
Thompson, J. D., Plewniak, F., and Poch, O. (1999a), “BAliBASE: a benchmark align-
ment database for the evaluation of multiple alignment programs,” Bioinformatics, 15, 87-88.
Thompson, J. D., Plewniak, F., and Poch, O. (1999b), “A comprehensive comparison of
multiple sequence alignment programs,” Nucleic Acids Research, 27, 2682-2690.
Xie, J., Li, K.-C., and Bina, M. (2004), “A Bayesian Insertion/Deletion Algorithm for
Distant Protein Motif Searching via Entropy Filtering,” Journal of the American Statistical
Association, 99, 409-420.
Xie, J., and Kim, N.-K. (2005), “Bayesian Models and Markov Chain Monte Carlo Methods
for Protein Motifs with the Secondary Characteristics,” Journal of Computational Biology,
12, 952-970.
Zemla, A., Venclovas, C., Fidelis, K., and Rost, B. (1999), “A Modified Definition of
Sov, a Segment-Based Measure for Protein Secondary Structure Prediction Assessment,”
Proteins, 34, 220-223.
Dataset  Family name  No. of sequences  Length∗   Motif width  Ave. identity (%)
1idy     DNA binding  25                101-636   22           19
1r69     Repressor    23                71-882    19           18
1ubi     Ubiquitin    22                70-1132   19           20
1wit     Twitchin     19                93-250    16           22

Table 1: Four data sets from BAliBASE Reference 3. ∗The sequence lengths are longer than those in BAliBASE because the whole sequences were collected from the SWISS-PROT database.
Dataset  Correct alignments                    Rate of correct alignments
         (correctly aligned / total # of seq)  PSSM+SOV         Standard Gibbs (PSSM alone)
1idy     16/25 and more                        25/200 (12.5%)   0/100 (0%)
1r69     21/23 and more                        77/200 (38.5%)   10/100 (10%)
1ubi     20/22 and more                        18/200 (9%)      1/100 (1%)
1wit     15/19 and more                        37/200 (18.5%)   4/100 (4%)

Table 2: Comparison of the multiple Markov chain simulation results for the proposed method (PSSM+SOV) and the standard Gibbs sampling method. An alignment is counted as correct when the number of correctly aligned sequences is equal to or larger than the cutoff in the second column. The rates of correct alignments are obtained from multiple Markov chain simulations: 200 Markov chains for the proposed method (PSSM+SOV) and 100 Markov chains for the standard Gibbs sampling (PSSM alone). The numbers of Markov chains that find a correct alignment are reported in the third and fourth columns.
Dataset         PSSM+SOV  ClustalW  Dialign  PRRP
1idy            16/25     10/25     16/25    8/25
1r69            23/23     9/23      23/23    5/23
1ubi            20/22     5/22      4/22     1/22
1wit            16/19     15/19     16/19    18/19
1thm1           11/11     10/11     11/11    11/11
kinase2         16/17     15/17     15/17    15/17
kinase2_insert  11/12     12/12     12/12    11/12
kinase3_insert  19/19     18/19     18/19    18/19
s51             15/15     15/15     15/15    15/15
1vln            13/14     14/14     14/14    14/14

Table 3: Comparison of the rate of correctly aligned sequences for the proposed method (PSSM+SOV) with three highly ranked programs, ClustalW, Dialign, and PRRP. The numbers are the correctly aligned sequences out of the total number of sequences in each data set. The proposed method performs as well as or better than the other programs on seven of the ten data sets; the exceptions are 1wit, kinase2_insert, and 1vln.
Sequence name         Aligned AA Segment      Secondary Structures
sp|P06876|MYB_MOUSE   RIIYQAHKRLGNRWAEIAKLLP  HHHHHHHHHHCCHHHHHHHHHC
sp|P27898|MYBP_MAIZE  DIIIKLHATLGNRWSLIASHLP  HHHHHHHHHCCCCHHHHHHHHC
sp|P20025|MYB3_MAIZE  DLIVKLHSLLGNKWSLIAARLP  HHHHHHHHHCCCHHHHHHHHHC
sp|P27900|GL1_ARATH   DLIIRLHKLLGNRWSLIAKRVP  HHHHHHHHHHCCHHHHHHHHCC
sp|P20027|MYB3_HORVU  DHIVALHQILGNRWSQIASHLP  HHHHHHHHHCCCHHHHHHHHHC
sp|P80073|MYB2_PHYPA  NLILDLHATLGNRWSRIAAQLP  HHHHHHHHHCCCHHHHHHHHHC
sp|P02259|H5_CHICK    AAIRAEKSRGGSSRQSIQKYIK  HHHHHHHHCCCCCHHHHHHHHH
sp|P15870|H1D_STRPU   SALESLKEKKGSSRQAILKYVK  HHHHHHHHCCCCCHHHHHHHHH
sp|P15869|H1B_STRPU   AAITALKERGGSSAQAIRKYIE  HHHHHHHHCCCCCHHHHHHHHH
sp|P35060|H1_TIGCA    AAIKALKERNGSSLPAIKKYIA  HHHHHHHHCCCCCHHHHHHHHH
sp|Q05831|H1L_MYTTR   AAITAMKNRKGSSVQAIRKYIL  HHHHHHHHCCCCCHHHHHHHHH
sp|P02257|H1_ECHCR    AAIAAQKERRGSSVAKIQSYIA  HHHHHHHHCCCCCHHHHHHHHH
sp|P10771|H11_CAEEL   EAIKQLKDRKGASKQAILKFIS  HHHHHHHHCCCCCHHHHHHHHH
sp|P06894|H1A_PLADU   TAILGLKERKGSSMVAIKKYIA  HHHHHHHHCCCCCHHHHHHHHH
sp|P26568|H11_ARATH   DAIVTLKERTGSSQYAIQKFIE  HHHHHHHHCCCCCHHHHHHHHH
sp|P54671|H1_DICDI    TAIAHYKDRTGSSQPAIIKYIE  HHHHHHHHCCCCCHHHHHHHHH
sp|P15282|ARGR_ECOLI  AFKALLKEEKFSSQGEIVAALQ  HHHHHHHHHCHHHHHHHHHHHH
sp|P95721|ARGR_STRCL  RIVDILNRQPVRSQSQLAKLLA  HHHHHHHHHCCCCHHHHHHHHH
sp|P17893|ARGR_BACSU  KIREIITSNEIETQDELVDMLK  HHHHHHHHHCHHHHHHHHHHHH
sp|O31408|ARGR_BACST  KIREIIMSNDIETQDELVDRLR  HHHHHHHHHCHHHHHHHHHHHH
sp|Q54870|ARGR_STRPN  LIKKMITEEKLSTQKEIQDRLE  HHHHHHHHHCHHHHHHHHHHHH
sp|P94992|ARGR_MYCTU  RIVAILSSAQVRSQNELAALLA  HHHHHHHHHCCCCHHHHHHHHH
sp|P03032|TRPR_ECOLI  VRIVEELLRGEMSQRELKNELG  HHHHHHHHHCCCCHHHHHHHCC
sp|P44889|TRPR_HAEIN  LQIVSQLIDKNMPQREIQQNLN  HHHHHHHHHCCCCHHHHHHHHC
sp|P34257|TC3A_CAEEL  VSLHEMSRKISRSRHCIREYLK  CCHHHHHHHHCCCCHHHHHHHH

Table 4: Alignments of the data set 1idy by the proposed method. This alignment corresponds to the first and second core structural regions shown in Figure 6. While there is a clear conservation in the secondary structures for this motif, the aligned amino acid segments show no strongly conserved column, except column 17. The secondary structure of the motif can be considered a helix-turn-helix structure.
Conf: 968887179808999855874388999988874088842028812883616897600047
Pred: CEEEEEECCCEEEEEEECCCCHHHHHHHHHHHHHCCCCCCCEEECCCEEECCCCEEHHCC
  AA: MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN
              10        20        30        40        50        60

Conf: 8999889999763189
Pred: CCCCCEEEEEEEECCC
  AA: IQKESTLHLVLRLRGG
              70

Figure 1: An example of secondary structure prediction by PSI-PRED. The protein is UBIQ_HUMAN (SWISS-PROT ID P02248), which is a sequence in the data set 1ubi. The lines show, in order, the confidence level of the secondary structure prediction, the string of predicted secondary structure, and the original amino acid sequence.
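A listing of the kind shown in Figure 1 can be read into per-residue records with a few lines of code. The sketch below is illustrative only: the `Prediction` type and the `parse_prediction` function are hypothetical names, the parser assumes that every data line starts with one of the labels `Conf:`, `Pred:` or `AA:`, and the cutoff of 5 used to flag low-confidence residues is arbitrary.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    conf: List[int]  # confidence index per residue, 0 (least) to 9 (most reliable)
    pred: str        # predicted state per residue: H, E or C
    aa: str          # amino acid sequence


def parse_prediction(text: str) -> Prediction:
    """Join the Conf/Pred/AA lines of all blocks into full-length records."""
    parts = {"Conf": [], "Pred": [], "AA": []}
    for line in text.splitlines():
        label, sep, value = line.strip().partition(":")
        if sep and label in parts:  # position rulers and blank lines are skipped
            parts[label].append(value.strip())
    conf = [int(c) for c in "".join(parts["Conf"])]
    pred = "".join(parts["Pred"])
    aa = "".join(parts["AA"])
    if not (len(conf) == len(pred) == len(aa)):
        raise ValueError("Conf, Pred and AA lines must have equal total length")
    return Prediction(conf, pred, aa)


# Last block of Figure 1 (residues 61 to 76 of the ubiquitin sequence).
example = """\
Conf: 8999889999763189
Pred: CCCCCEEEEEEEECCC
  AA: IQKESTLHLVLRLRGG
              70
"""
record = parse_prediction(example)
offset = 61  # sequence position of the first residue in this block
low = [offset + i for i, c in enumerate(record.conf) if c < 5]
print(record.aa, low)
```

Keeping the confidence indices as integers alongside the predicted states makes it straightforward to down-weight unreliable positions later.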
                  β1
Structure 1   CCCCEEEEEECCCC
                β2    β3
Structure 2   CCEEEECCEEECCC
minov             --  --
maxov (β1,β2)   ++++++++
maxov (β1,β3)     +++++++

Figure 2: Illustration of minov and maxov in the SOVo(E) calculation. (- -) indicates the minov of (β1, β2) and (β1, β3). The first line of (++) indicates the maxov of (β1, β2) and the second line of (++) indicates the maxov of (β1, β3).
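The two quantities in Figure 2 can be computed directly from segment boundaries. The following sketch uses the two structure strings of the figure; the function names and the 0-based, inclusive segment coordinates are choices made here for illustration.

```python
from typing import Tuple

Segment = Tuple[int, int]  # inclusive (start, end) positions, 0-based


def minov(a: Segment, b: Segment) -> int:
    """Length of the region where both segments overlap (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def maxov(a: Segment, b: Segment) -> int:
    """Total extent from the first to the last residue covered by either segment."""
    return max(a[1], b[1]) - min(a[0], b[0]) + 1


structure1 = "CCCCEEEEEECCCC"
structure2 = "CCEEEECCEEECCC"

beta1 = (structure1.index("E"), structure1.rindex("E"))  # (4, 9)
beta2 = (2, 5)   # first E run in structure 2
beta3 = (8, 10)  # second E run in structure 2

print(minov(beta1, beta2), maxov(beta1, beta2))  # 2 8
print(minov(beta1, beta3), maxov(beta1, beta3))  # 2 7
```

The printed values match the lengths of the (- -) and (++) marks in the figure.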
[Figure 3 plots, panels (a) and (b). x-axis: position (0 to 200); y-axis: relative values of log-likelihood.]

Figure 3: Log-likelihood plot for a sequence, RPC2_BPP22, in the data set 1r69 from BAliBASE. (a) The log-likelihood calculated by the SOV part; (b) the log-likelihood calculated by the proposed model (PSSM + SOV; black) and the log-likelihood calculated by the amino acids part (PSSM; grey).
[Figure 4 plots. Panel (a): data sets 1idy, 1r69, 1ubi, 1wit; series PSSM+SOV, Gibbs sampling (PSSM only), ClustalW, Dialign, PRRP. Panel (b): data sets 1idy, 1r69, 1ubi, 1wit, 1thm1, kinase2, kinase2_insert, kinase3_insert, s51, 1vln; series PSSM+SOV, ClustalW, Dialign, PRRP. y-axis in both panels: percent of correctly aligned sequences (0 to 1.2).]

Figure 4: Comparison of the proposed method with the standard Gibbs sampling method, ClustalW, Dialign, and PRRP. The plots show the proportion of correctly aligned sequences out of the total number of sequences for each program in each data set. The proposed method (PSSM+SOV), shown by the dark blue line, performs best overall among the programs.
1idy         1 mevkktswteeedrILYQAhkrlgnRWAEIAKLLp.........grtdna
mybp_maize   1 advkrgniskeeedIIIKLhatlgnRWSLIASHLp.........grtdne
myb3_maize   1 .dlkrgnftadeddLIVKLhsllgnKWSLIAARLp.........grtdne
gl1_arath    1 .nvnkgnfteqeedLIIRLhkllgnRWSLIAKRVp.........grtdnq
myb3_horvu   1 .dlkrgcfsqqeedHIVALhqilgnRWSQIASHLp.........grtdne
myb2_phypa   1 .dlkrgifseaeenLILDLhatlgnRWSRIAAQLp.........grtdne
1hstA        1 ...shptysemiaaAIRAEksrggsSRQSIQKYIkshykvgh...nadlq
h1d_strpu    1 ...shpkysdmiasALESLkekkgsSRQAILKYVkanftvgd...nanvh
h1b_strpu    1 ...ahpsssemvlaAITALkerggsSAQAIRKYIeknytvdi..kkqaif
h1_tigca     1 ...thpptsvmvmaAIKALkerngsSLPAIKKYIaanykvdv..vknahf
h1l_myttr    1 ....kpstlsmivaAITAMknrkgsSVQAIRKYIlannkgin.tshlgsa
h1_echcr     1 ...ahppvidmitaAIAAQkerrgsSVAKIQSYIaakyrcdi..nalnph
h11_caeel    1 ...ahppyintikeAIKQLkdrkgaSKQAILKFIsqnyklgdnviqinah
h1a_pladu    1 ...ahppvatmvvtAILGLkerkgsSMVAIKKYIaanyrvdv..arlapf
h11_arath    1 ...shptyeemikdAIVTLkertgsSQYAIQKFIeekrkelp..ptfrkl
h1_dicdi     1 ...nhptyqvmistAIAHYkdrtgsSQPAIIKYIeanynvap..dtfktq
1aoy         1 .mrssakqeelvkaFKALLkeekfsSQGEIVAALqeq.gfd...ninqsk
ARGR_STRCL   1 ........marhrrIVDILnrqpvrSQSQLAKLLadn.gls....vtqat
G3273713     1 enlnpvtrtarqalILQILdkqkvtSQVQLSELLlde.gid....itqat
AHRC_BACSU   1 .....mnkgqrhikIREIItsneieTQDELVDMLkqd.gyk....vtqat
ARGR_BACST   1 .....mnkgqrhikIREIImsndieTQDELVDRLrea.gfn....vtqat
ARGR_STRPN   1 .....mrkrdrhqlIKKMIteeklsTQKEIQDRLeah.nvc....vtqtt
ARGR_MYCTU   1 gpevaanragrqarIVAILssaqvrSQNELAALLaae.gie....vtqat
1jhgA        1 .tpderealgtrvrIIEELlrge.mSQRELKNELg..........agiat
TRPR_HAEIN   1 .taderdavglrlqIVSQLidkn.mPQREIQQNLn..........tsaat
G3328572     1 .sfserkdvasryhIIRALlege.lTQREIAEKYg..........vsiaq
1tc3C        1 ....rgsalsdterAQLDVmkllnvSLHEMSRKIs..........rsrhc

1idy        42 IKNHWNSTmrrkv.
mybp_maize  42 IKNYWNSHlsrq..
myb3_maize  41 IKNYWNTHvrrk..
gl1_arath   41 VKNYWNTHlskk..
myb3_horvu  41 IKNFWNSCikkk..
myb2_phypa  41 IKNYWNTRlkkr..
1hstA       45 IKLSIRRLlaagv.
h1d_strpu   45 IKQALKRGvtsgq.
h1b_strpu   46 IKRALITGvekgt.
h1_tigca    46 IKKALKSLvekkk.
h1l_myttr   46 MKLAFAKGlksgv.
h1_echcr    46 IRRALKNQvksga.
h11_caeel   48 HRQALKRGvtska.
h1a_pladu   46 IRKFIRKAvkqtkg
h11_arath   46 LLLNLKRLvasgk.
h1_dicdi    46 LKLALKRLvakgt.
1aoy        46 VSRMLTKFgavrt.
ARGR_STRCL  38 LSRDLDELgavki.
G3273713    46 LSRDLDELgarkv.
AHRC_BACSU  41 VSRDIKELhlvkv.
ARGR_BACST  41 VSRDIKEMqlvkv.
ARGR_STRPN  41 LSRDLREIgltkv.
ARGR_MYCTU  46 LSRDLEELgavkl.
1jhgA       39 ITRGSNSLkaapv.
TRPR_HAEIN  39 ITRGSNMIktmdp.
G3328572    39 ITRGSNALkgldp.
1tc3C       37 IRVYLKDPvsygt.

Figure 5: The structural alignment of the data set 1idy reported in BAliBASE. The underlined segments are core structural regions.