A dynamic programming algorithm for RNA structure prediction including pseudoknots

Elena Rivas and Sean R. Eddy*

Department of Genetics, Washington University, St. Louis, MO 63110, USA

*Corresponding author. E-mail address of the corresponding author: [email protected]

J. Mol. Biol. (1999) 285, 2053–2068. Article No. jmbi.1998.2436, available online at http://www.idealibrary.com. 0022-2836/99/052053–16 $30.00/0. © 1999 Academic Press

We describe a dynamic programming algorithm for predicting optimal RNA secondary structure, including pseudoknots. The algorithm has a worst case complexity of O(N^6) in time and O(N^4) in storage. The description of the algorithm is complex, which led us to adopt a useful graphical representation (Feynman diagrams) borrowed from quantum field theory. We present an implementation of the algorithm that generates the optimal minimum energy structure for a single RNA sequence, using standard RNA folding thermodynamic parameters augmented by a few parameters describing the thermodynamic stability of pseudoknots. We demonstrate the properties of the algorithm by using it to predict structures for several small pseudoknotted and non-pseudoknotted RNAs. Although the time and memory demands of the algorithm are steep, we believe this is the first algorithm to be able to fold optimal (minimum energy) pseudoknotted RNAs with the accepted RNA thermodynamic model.

Keywords: RNA; secondary structure prediction; pseudoknots; dynamic programming; thermodynamic stability

Abbreviations used: MWM, maximum weighted matching; NP, non-deterministic polynomial; IS, irreducible surfaces.

Introduction

Many RNAs fold into structures that are important for regulatory, catalytic, or structural roles in the cell. An RNA's structure is dominated by base-pairing interactions, most of which are Watson-Crick pairs between complementary bases. The base-paired structure of an RNA is called its secondary structure. Because Watson-Crick pairs are such a stereotyped and relatively simple interaction, accurate RNA secondary structure prediction appears to be an achievable goal.

A rather reliable approach for RNA structure prediction is comparative sequence analysis, in which covarying residues (e.g. compensatory mutations) are identified in a multiple sequence alignment of RNAs with similar structures, but different sequences (Woese & Pace, 1993). Covarying residues, particularly pairs which covary to maintain Watson-Crick complementarity, are indicative of conserved base-pairing interactions. The accepted secondary structures of most structural and catalytic RNAs were generated by comparative sequence analysis.

If one has only a single RNA sequence (or a small family of RNAs with little sequence diversity), comparative sequence analysis cannot be applied. Here, the best current approaches are energy minimization algorithms (Schuster et al., 1997). While not as accurate as comparative sequence analysis, these algorithms have still proven to be useful research tools. Thermodynamic parameters are available for predicting the ΔG of a given RNA structure (Freier et al., 1986; Serra & Turner, 1995). The Zuker algorithm, implemented in the programs MFOLD (Zuker, 1989a) and ViennaRNA (Schuster et al., 1994), is an efficient dynamic programming algorithm for identifying the globally minimal energy structure for a sequence, as defined by such a thermodynamic model (Zuker & Stiegler, 1981; Zuker & Sankoff, 1984; Sankoff, 1985). The Zuker algorithm requires O(N^3) time and O(N^2) space for a sequence of length N, and so is reasonably efficient and practical even for large RNA sequences. The Zuker dynamic programming algorithm was subsequently extended to allow experimental constraints, and to sample suboptimal folds (Zuker, 1989b). McCaskill's variant of the Zuker algorithm calculates probabilities (confidence estimates) for particular base-pairs (McCaskill, 1990).


One well-known limitation of the Zuker algorithm is that it is incapable of predicting so-called RNA pseudoknots. This is the problem that we address here.

The thermodynamic model for non-pseudoknotted RNA secondary structure includes some stereotypical interactions, such as stacked base-paired stems, hairpins, bulges, internal loops, and multiloops. Formally, non-pseudoknotted structures obey a "nesting" convention: that for any two base-pairs i, j and k, l (where i < j, k < l and i < k), either i < k < l < j or i < j < k < l. It is precisely this nesting convention that the Zuker dynamic programming algorithm relies upon to recursively calculate the minimal energy structure on progressively longer subsequences. An RNA pseudoknot is defined as a structure containing base-pairs which violate the nesting convention. An example of a simple pseudoknot is shown in Figure 1.

Figure 1. A simple pseudoknot. In a pseudoknot, nucleotides inside a hairpin loop pair with nucleotides outside the stem-loop.

RNA pseudoknots are functionally important in several known RNAs (ten Dam et al., 1992). For example, by comparative analysis, RNA pseudoknots are conserved in ribosomal RNAs, the catalytic core of group I introns, and RNase P RNAs. Plausible pseudoknotted structures have been proposed (Pleij et al., 1985), and recently confirmed (Kolk et al., 1998) for the 3′ end of several plant viral RNAs, where pseudoknots are apparently used to mimic tRNA structure. In vitro RNA evolution (SELEX) experiments have yielded families of RNA structures which appear to share a common pseudoknotted structure, such as RNA ligands selected to bind HIV-1 reverse transcriptase (Tuerk et al., 1992).

Most methods for RNA folding which are capable of folding pseudoknots adopt heuristic search procedures and sacrifice optimality. Examples of these approaches include quasi-Monte Carlo searches (Abrahams et al., 1990) and genetic algorithms (Gultyaev et al., 1995; van Batenburg et al., 1995). These approaches are inherently unable to guarantee that they have found the "best" structure given the thermodynamic model, and consequently unable to say how far a given prediction is from optimality.

A different approach to pseudoknot prediction based on the maximum weighted matching (MWM) algorithm (Edmonds, 1965; Gabow, 1976) was introduced by Cary & Stormo (1995) and Tabaska et al. (1998). Using the MWM algorithm, an optimal structure is found, even in the presence of complicated knotted interactions, in O(N^3) time and O(N^2) space. However, MWM currently seems best suited to folding sequences for which a previous multiple alignment exists, so that scores may be assigned to possible base-pairs by comparative analysis. It is not clear to us that the MWM algorithm will be amenable to folding single sequences using the relatively complicated Turner thermodynamic model. However, we believe that this was the first work that indicated that optimal RNA pseudoknot predictions can be made with polynomial time algorithms. It had been widely believed, but never proven, that pseudoknot prediction would be an NP problem (NP, non-deterministic polynomial; e.g. only solvable by heuristic or brute force approaches).

Here, we describe a dynamic programming algorithm which finds optimal pseudoknotted RNA structures. We describe the algorithm using a diagrammatic representation borrowed from quantum field theory (Feynman diagrams). We implement a version of the algorithm that finds minimal energy RNA structures using the standard RNA secondary structure thermodynamic model (Freier et al., 1986; Serra & Turner, 1995), augmented by a few pseudoknot-specific parameters that are not yet available in the standard folding parameters, and by coaxial stacking energies (Walter et al., 1994) for both pseudoknotted and non-pseudoknotted structures. We demonstrate the properties of the algorithm by testing it on several small RNA structures, including both structures thought to contain pseudoknots and structures thought not to contain pseudoknots.
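The nesting convention stated in the Introduction can be checked mechanically. The following is a minimal sketch in Python, not part of the published implementation: the function names `crosses` and `has_pseudoknot` are hypothetical, base-pairs are assumed to be given as tuples of integer positions, and no position is assumed to take part in more than one pair.

```python
from itertools import combinations


def crosses(pair_a, pair_b):
    """Return True if two base-pairs violate the nesting convention.

    Each pair is a tuple of two sequence positions. The pairs are first
    normalised so that i < j, k < l and i <= k, as in the text.
    """
    (i, j), (k, l) = sorted((tuple(sorted(pair_a)), tuple(sorted(pair_b))))
    # Allowed arrangements are nested (i < k < l < j) or side by side
    # (i < j < k < l). The only remaining arrangement is i < k < j < l.
    return i < k < j < l


def has_pseudoknot(pairs):
    """Return True if any two base-pairs in the structure cross."""
    return any(crosses(a, b) for a, b in combinations(pairs, 2))
```

For instance, the pairs (1, 10) and (2, 9) are nested and (1, 5) and (6, 10) lie side by side, whereas (1, 6) and (4, 10) cross and therefore form a pseudoknot under the definition above.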

Algorithm

Here, we will introduce a diagrammatic way of representing RNA folding algorithms. We will start by describing the Nussinov algorithm (Nussinov et al., 1978), and the Zuker-Sankoff algorithm (Zuker & Sankoff, 1984; Sankoff, 1985) in the context of this representation. Later on we will extend the diagrammatic representation to include pseudoknots and coaxial stackings. The Nussinov and Zuker-Sankoff algorithms can be implemented without the diagrammatic representation, but this representation is essential to manage the complexity introduced by pseudoknots.

Preliminaries

From here on, unless otherwise stated, a flat continuous line will represent the backbone of an RNA sequence with its 5′-end placed in the left-hand side of the segment. N will represent the length (in number of nucleotides) of the RNA.

Secondary interactions will be represented by wavy lines connecting the two interacting positions in the backbone chain, while the backbone itself always remains flat. No more than two bases are allowed to interact at once. This representation does not provide insight about real (three-dimensional) spatial arrangements, but is very convenient for algorithmic purposes. When necessary for clarification, single-stranded regions will be marked by dots, but when unambiguous, dots will be omitted for simplicity. Using this representation (Figure 2), we can describe hairpins, bulges, stems, internal loops and multiloops as simple nested structures; a pseudoknot, on the other hand, corresponds with a non-nested structure.

Figure 2. Diagrammatic representation of the most relevant RNA secondary structures, including a pseudoknot. The nucleotides of the sequence are represented by dots. Single-stranded regions (SS) are not involved in any secondary structure. A hairpin (H) is a sequence of unpaired bases bounded by one base-pair. Stems (S), bulges (B) and internal loops (IL) are all nested structures bounded by two base-pairs. In a stem, the two base-pairs are contiguous at both ends. In a bulge, the two base-pairs are contiguous only at one end. In an internal loop, the two base-pairs are not contiguous at all. Multiloops (M) refer to any structure bounded by three or more base-pairs. Any non-nested structure is referred to as a pseudoknot.

Diagrammatic representation of nested algorithms

In order to describe a nested algorithm we need to introduce two triangular N × N matrices, to be called vx and wx. These matrices are defined in the following way: vx(i, j) is the score of the best folding between positions i and j, provided that i and j are paired to each other; whereas wx(i, j) is the score of the best folding between positions i and j regardless of whether i and j pair to each other or not. These matrices are graphically represented in the form indicated in Figure 3. The filled inner space indicates that we do not know how many interactions (if any) occur for the nucleotides inside, in contrast with a blank inner space which indicates that the fragment inside is known to be unpaired. The wavy line in vx indicates that i and j are definitely paired, and similarly the discontinuous line in wx indicates that the relation between i and j is unknown. Also part of our convention is that for a given fragment, nucleotide i is at the 5′-end, and nucleotide j is at the 3′-end, so that i ≤ j.

Figure 3. The wx and vx matrices.

The purpose of the nested dynamic programming algorithm is to fill the vx and wx matrices with appropriate numerical weights by means of some sort of recursive calculation.

The recursion for vx includes contributions due to: hairpins, bulges, internal loops, and multiloops. But what is special about hairpins, bulges, internal loops, and multiloops in this diagrammatic representation? To answer this question we have to introduce two more definitions: surfaces and irreducible surfaces (IS).

Roughly speaking a surface is any alternating sequence of continuous and wavy lines that closes on itself. An irreducible surface is a surface such that if one of the H-bonds (or secondary interactions) is broken, there is no other surface contained inside, that is, an IS cannot be "reduced" to any other surface. The order, O, of an IS is given by the number of wavy lines (secondary interactions), which is equal to the number of continuous-line intervals. It is easy to see that hairpin loops constitute the IS of O(1); stems, bulges and internal loops are all the IS of O(2), and what are referred to in the literature as "multiloops" are the IS of O > 2.

For nested configurations, our ISs are equivalent to the "k-loops" defined by Sankoff (1985); however, the ISs are more general and also include non-nested structures. A technical report about irreducible surfaces is available from http://www.genetics.wustl.edu/eddy/publications/.

The actual recursion for vx is given in Figure 4, and can be expressed as:

$$
vx(i, j) = \text{optimal}
\begin{cases}
\mathrm{EIS}^{1}(i, j) \\
\mathrm{EIS}^{2}(i, j : k, l) + vx(k, l) \\
\mathrm{EIS}^{3}(i, j : k, l : m, n) + vx(k, l) + vx(m, n) \\
\mathrm{EIS}^{4}(i, j : k, l : m, n : r, s) + vx(k, l) + vx(m, n) + vx(r, s) \\
O(5)
\end{cases}
\qquad (1)
$$

$$
[\forall k, l, m, n, r, s;\ i \le k \le l \le m \le n \le r \le s \le j]
$$

Figure 4. General recursion for vx in the nested algorithm.

Each line gives the formal score of one of the diagrams in Figure 4. The diagram on the left is calculated as the score of the best diagram on the right. The initialization conditions are:

$$
vx(i, i) = +\infty, \quad \forall i,\ 1 \le i \le N \qquad (2)
$$

The recursion (1) for vx is an expansion in ISs of successively higher order.

Here EISn(i1, j1 : i2, j2 : ... : in, jn) represents the scoring function for an IS of order n, in which ik is paired to jk. This general algorithm is quite impractical, because an ISg which has order g, O(g), adds a complexity of O(N^{2(g − 1)}) to the calculation. (An ISg requires us to search through 2g independent segments in the entire sequence of N nucleotides.) To make it useful, we have to truncate the expansion in ISs at some order in the recursion for vx in Figure 4. The symbol O(g) indicates the order of ISg at which we truncate the recursion.

These recursions are equivalent to those proposed by Sankoff (1985) in theorem 2. Notice also that in defining the recursive algorithm we have not yet had to specify anything about the particular manner in which the contributions from different ISs are calculated in order to obtain the optimal folding.

The simplest truncation is to stop at order zero, O(0). In this approximation none of the ISs (hairpin, bulge, internal loop etc.) are given any specialized scores. We only have to provide a specific score for a base-pair, B. The recursion for vx then simplifies to Figure 5, and can be cast into the form:

$$
vx(i, j) = B + wx_I(i + 1, j - 1) \qquad (3)
$$

Figure 5. Recursion for vx truncated at O(0).

If we set B = +1, then we have the Nussinov algorithm (Nussinov et al., 1978). The matrix wxI is similar to wx defined before, with the specification of appearing inside a base-pair. This simple algorithm calculates the folding with the maximum number of base-pairs.

The next order of complexity we explore corresponds with a truncation at ISs of O(2). Hairpin loops, bulges, stems, and internal loops are treated with precision by the scoring functions EIS1 and EIS2. The rest of the ISs, collected under the name of multiloops, which are much less frequent than the previous, are described in an approximate form. The diagrams of this approximation are given in Figure 6, and correspond with:

$$
vx(i, j) = \text{optimal}
\begin{cases}
\mathrm{EIS}^{1}(i, j) & \mathrm{IS}^{1} \\
\mathrm{EIS}^{2}(i, j : k, l) + vx(k, l) & \mathrm{IS}^{2} \\
P_I + M + wx_I(i + 1, k) + wx_I(k + 1, j - 1) & \text{multiloop}
\end{cases}
\qquad (4)
$$

$$
[\forall k, l;\ i \le k \le l \le j]
$$

M stands for the score for generating a multiloop. The Turner thermodynamic rules also penalize an amount for each closing pair in a multiloop. By starting a multiloop we are specifying already one of its closing pairs; this closing-pair score is represented here by PI.

The recursion relations used to fill the wx matrix include: single-stranded nucleotides, external pairs, and bifurcations. The actual recursion is easier to understand by looking at the diagrams involved (given in Figure 7) and the recursion can be expressed as:

$$
wx(i, j) = \text{optimal}
\begin{cases}
P + vx(i, j) & \text{paired} \\
Q + wx(i + 1, j) & \text{single-stranded} \\
Q + wx(i, j - 1) & \text{single-stranded} \\
wx(i, k) + wx(k + 1, j) \quad [\forall k,\ i \le k \le j] & \text{bifurcation}
\end{cases}
\qquad (5)
$$

With the initialization condition:

$$
wx(i, i) = 0, \quad \forall i,\ 1 \le i \le N \qquad (6)
$$
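The order-zero truncation (equation (3) combined with the wx recursion and its initialization) is small enough to sketch directly. The Python below is an illustrative sketch under stated assumptions, not the authors' code: "optimal" is read as a maximum, B = +1, P = Q = 0, a single matrix is used for both wx and wxI, and the set of allowed pairs (Watson-Crick plus G-U wobble) and the function name `max_base_pairs` are choices made for this example only. No minimum hairpin loop size is enforced.

```python
# Assumption of this sketch: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, B=1):
    """Fill wx for the O(0) truncation and return wx over the whole sequence."""
    n = len(seq)
    if n == 0:
        return 0
    # wx[i][i] = 0 for all i, matching the initialization condition.
    wx = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Single-stranded terms with Q = 0.
            best = max(wx[i + 1][j], wx[i][j - 1])
            # Paired term: P + vx(i, j) with P = 0 and vx = B + wx(i+1, j-1).
            if (seq[i], seq[j]) in PAIRS:
                inner = wx[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, B + inner)
            # Bifurcation term: split the fragment into two non-empty pieces.
            for k in range(i, j):
                best = max(best, wx[i][k] + wx[k + 1][j])
            wx[i][j] = best
    return wx[0][n - 1]
```

The three nested loops over `span`, `i` and `k` make the cubic running time of the nested recursion explicit, and the single N × N table corresponds to the quadratic storage quoted for the Zuker algorithm. Calling `max_base_pairs("GGGAAACCC")` fills the table for a nine-nucleotide hairpin whose three G-C pairs can all be nested.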

Figure 6. Recursion for vx truncated at O(2).

Note that we have two independent matrices, wx and wxI, which have structurally identical recursions, but completely different interpretations. The matrix wxI, used to truncate the recursion for vx in equation (4), is used exclusively for diagrams which will be incorporated into multiloops, whereas wx is only used when there are no external base-pairs. Therefore, the parameters controlling these two recursions will, in general, have very different values because they have very different meanings. QI is the penalty for an unpaired nucleotide in a multiloop, and PI is the penalty for a closing base-pair (e.g. per stem) in a multiloop. On the other hand, Q represents the score for a single-stranded nucleotide, and P represents the score for an external base-pair. In Turner's thermodynamic rules both Q and P are approximated by zero.

Figure 7. Recursion for wx in the nested algorithm.

Note also that the recursions for wx and wxI always remain the same, independent of the order of irreducible surface to which the recursion for vx has been truncated.

This is the nested algorithm described by Sankoff (1985) in theorem 3, and is the approximation that MFOLD (Zuker & Stiegler, 1981) and ViennaRNA (Schuster et al., 1994) implement. Higher orders of specificity of the general algorithm are possible, but are certainly more time consuming, and they have not been explored so far. One reason for this relative lack of development is that there is little information about the energetic properties of multiloops. The generalized nested algorithm provides a way to unify the currently available dynamic algorithms for RNA folding. At a given order, the error of the approximation is given by the difference between the assigned score to multiloops and the precise score that one of those higher-order ISs deserves.

Description of the pseudoknot algorithm

Pseudoknots are non-nested configurations and clearly cannot be described with just the wx and vx matrices we introduced in the previous section. The key point of the pseudoknot algorithm is the use of gap matrices in addition to the wx and vx matrices. Looking at the graphical representation of one of the simplest pseudoknots, Figure 8, we can see that we could describe such a configuration by putting together two gap matrices with complementary holes.

The pseudoknot dynamic programming algorithm uses one-hole or gap matrices (Figure 9) as a generalization of the wx and vx matrices (cf. Table 1). Let us define whx(i, j : k, l) as the graph that describes the best folding that connects segments [i, k] with [l, j], i ≤ k ≤ l ≤ j, such that the relation between i and j and k and l is undetermined. Similarly, we define vhx(i, j : k, l) as the graph that describes the best folding that connects segments [i, k] with [l, j], i ≤ k ≤ l ≤ j, such that i and j are base-paired and k and l are also base-paired. For completeness we have to introduce also matrix yhx(i, j : k, l) in which k and l are paired, but the relation between i and j is undetermined, and its counterpart zhx(i, j : k, l) in which i and j are paired, but the relation between k and l is undetermined.

Figure 8. Construction of a simple pseudoknot using two gap matrices.

Figure 9. Representation of the gap matrices used in the algorithm for pseudoknots.

Table 1. Specifications of the matrices used in the pseudoknot algorithm

Matrix (i ≤ k ≤ l ≤ j)    Relationship i, j    Relationship k, l
vx(i, j)                  Paired               -
wx(i, j)                  Undetermined         -
vhx(i, j : k, l)          Paired               Paired
zhx(i, j : k, l)          Paired               Undetermined
yhx(i, j : k, l)          Undetermined         Paired
whx(i, j : k, l)          Undetermined         Undetermined

The non-gap matrices wx, vx are contained as a particular case of the gap matrices. When there is no hole, k = l − 1, then by construction:

$$
\begin{aligned}
whx(i, j : k, k + 1) &= wx(i, j) \\
zhx(i, j : k, k + 1) &= vx(i, j)
\end{aligned}
\qquad \forall k,\ i \le k \le j \qquad (7)
$$

We have introduced the gap matrices as the building blocks of the algorithm, but how do we establish a consistent and complete recursion relation? Here is where the analogy between the gap matrices and the Feynman diagrams of quantum field theory was of great help (Bjorken & Drell, 1965).†

† More precisely, the analogy is more cleanly expressed in terms of Schwinger-Dyson diagrams which in QFT are used to represent full interacting vertices and propagators recursively in terms of elementary interactions.

Let us start with the generalization of the recursions for vx and wx in the presence of gap matrices. A non-gap matrix can be obtained by combining two gap matrices together, therefore the recursions for vx and wx add one more diagram with two gap matrices to recursions (4) and (5). Again the diagrammatic representation (Figures 10 and 11) is more helpful than words in explaining the recursions. (When possible, individual bases are labeled in the diagrams. Otherwise contiguous nucleotides are depicted with dots.) Note that the new term introduced in both recursions involves two gap matrices. In fact, the recursion is an expansion in the number of gap matrices.

Figure 10. Recursion for vx in the pseudoknot algorithm truncated at O(whx + whx + whx). (Contiguous nucleotides are represented with explicit dots.)

The recursion for the non-gap matrix vx is given by (cf. Figure 10):

$$
vx(i, j) = \text{optimal}
\begin{cases}
\mathrm{EIS}^{1}(i, j) & \mathrm{IS}(1) \\
\mathrm{EIS}^{2}(i, j : k, l) + vx(k, l) & \mathrm{IS}(2) \\
P_I + M + wx_I(i + 1, k) + wx_I(k + 1, j - 1) & \text{nested multiloop} \\
\tilde{P}_I + \tilde{M} + Gw_I + whx(i + 1, r : k, l) + whx(k + 1, j - 1 : l - 1, r + 1) & \text{non-nested multiloop}
\end{cases}
\qquad (8)
$$

$$
[\forall i, k, l, r, j;\ i \le k \le l \le r \le j]
$$

The additional parameters for pseudoknots are: P̃I, the score for a pair in a non-nested multiloop; M̃, a generic score for generating a non-nested multiloop; and GwI, the score for generating an internal pseudoknot.
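
The storage cost of the gap matrices follows from their four indices. As a rough worked example (a sketch, with hypothetical function names), the Python below counts the index tuples (i, k, l, j) with 1 ≤ i ≤ k ≤ l ≤ j ≤ N that a matrix such as whx must cover, using the standard count of non-decreasing 4-tuples, C(N + 3, 4), and checks it against direct enumeration.

```python
from math import comb


def gap_matrix_entries(n):
    """Closed-form count of tuples (i, k, l, j) with 1 <= i <= k <= l <= j <= n."""
    return comb(n + 3, 4)


def gap_matrix_entries_brute(n):
    """Same count by direct enumeration; only practical for small n."""
    return sum(
        1
        for i in range(1, n + 1)
        for k in range(i, n + 1)
        for l in range(k, n + 1)
        for j in range(l, n + 1)
    )
```

For N = 100 the closed form gives 4,421,275 entries per gap matrix, against N(N + 1)/2 = 5,050 entries for a triangular matrix such as wx, which is the quartic growth in storage quoted in the abstract.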

Figure 11. Recursion for wx in the pseudoknot algorithm truncated at O(whx + whx + whx). (Contiguous nucleotides are represented with explicit dots.)

Similarly for wx (cf. Figure 11):

$$
wx(i, j) = \text{optimal}
\begin{cases}
P + vx(i, j) & \text{paired} \\
Q + wx(i + 1, j) & \text{single-stranded} \\
Q + wx(i, j - 1) & \text{single-stranded} \\
wx(i, k) + wx(k + 1, j) & \text{nested bifurcation} \\
Gw + whx(i, r : k, l) + whx(k + 1, j : l - 1, r + 1) & \text{non-nested bifurcation}
\end{cases}
\qquad (9)
$$

Where Gw denotes the score for introducing a pseudoknot. We should also remember that the algorithm uses two different wx matrices depending on whether the subset i ... j is free-standing (wx) or appears inside a multiloop (in which case we use wxI). The two recursions are identical apart from having different parameter values as described in Table 2.

Table 2. The parameters for which there is thermodynamic information provided by the Turner group

Symbol    Scoring parameter for                  Value (kcal/mol)
EIS1      Hairpin loops                          Varies
EIS2      Bulges, stems and internal loops       Varies
C         Coaxial stacking                       Varies
P         External pair                          0
Q         Single-stranded base                   0
R, L      Base dangling off an external pair     Dangle + Q
PI        Pair in a nested multiloop             0.1
QI        Non-paired base inside multiloop       0.4
RI, LI    Base dangling off a multiloop pair     Dangle + QI
M         Nested multiloop                       4.6

These parameters are identical with those used in MFOLD (http://www.ibc.wustl.edu/~zuker/rna).

Practical considerations make us truncate the expansion at this stage; we will not include diagrams that require three or more gap matrices. This statement should not mislead one into thinking that we cannot deal with complicated pseudoknots. We define a solvable configuration as one that can be parsed by our algorithm. That is, a solvable configuration can be decomposed into a sum of gap matrices according to the rules provided by our recursions. A non-solvable configuration is one that requires diagrammatic topologies that involve three or more gap matrices. That is, a non-solvable configuration requires us to go to higher orders in the expansion of the pseudoknot algorithm.

Our algorithm can solve "overlapping pseudoknots" (defined as those pseudoknots for which a planar representation does not require crossing lines) such as ABAB, ABACBC, ABACBDCD, etc. The algorithm can also find some "non-planar pseudoknots" (pseudoknots for which a planar representation requires crossing lines) such as ABCABC (the topology present in Escherichia coli α mRNA; Gluick et al., 1994), and others. However, the algorithm is not able to solve all possible knotted configurations, as for instance a parallel β-sheet protein interaction ABCADBECDE (see Figure 12 for some details). For a given configuration we can decide unambiguously whether it is solvable or not by parsing it according to the model. However, we still lack a systematic a priori characterization of the class of configurations that this algorithm can solve.

Note that two approximations are involved in the algorithm. Apart from that just mentioned (truncating the infinite expansion in gap matrices to make the algorithm polynomial), we also use

the approximation previously introduced for the nested algorithm (that ISs of O > 2 or multiloops are described in some approximated form). Despite these limitations, this truncated pseudoknot algorithm seems to be adequate for the currently known pseudoknots in RNA folding.

Figure 12. Top, the non-planar pseudoknot (ABCABC) presented in α mRNA and how to build it with gap matrices. The Roman numbers correspond with the numbering of stems introduced by Gluick et al. (1994). Bottom, an example of a pseudoknot that the algorithm cannot handle; interlaced interactions as seen in proteins in parallel β-sheet (ABCADBECDE). The assembly of this interaction using gap matrices would require us to use four gap matrices at once which is not allowed by the approximation at hand.

The algorithm is not complete until we provide the full recursive expressions to calculate the gap matrices. For a given gap matrix, we have to consider all the different ways that its diagram can be assembled using one or two matrices at a time. (Again, Feynman diagrams are of great use here.) The full description of those diagrams is quite involved and the many technical details will not add to the clarity of this exposition. In order to give the reader a feeling for the kind of topologies the pseudoknot algorithm allows, we provide in the Appendix a simplified version of the recursions for the gap matrices in which coaxial stacking or dangles are excluded (see below).

Coaxial stacking and dangles

It is quite frequent in RNA folding to create a more stable configuration when two independent configurations stack coaxially. This occurs, for instance, when two hairpin loops with their respective stems are contiguous. Then one of them can fall on top of the other, creating a more stable configuration than when the two hairpins just coexist without interaction of any kind.

The algorithm implements coaxial energies for both nested and non-nested structures. We adopt the coaxial energies provided by Walter et al. (1994) for coaxial stacking of nested structures. For coaxial stacking of non-nested structures we multiply these previous energies by an estimated (ad hoc) weighting parameter g < 1.

Using our diagrammatic representation it is possible to be systematic in describing the possible coaxial stacking that can occur. In the general recursion one has to look for contiguous nucleotides, and allow them to be explicitly paired (but not to each other). This is best understood with an example. Consider the recursion for wx in Figure 11, in particular the bifurcation diagram:

$$wx(i,j) \longrightarrow wx(i,k) + wx(k+1,j), \qquad \forall k,\ i \le k \le j \tag{10}$$

In order to allow for the possibility of coaxial stacking, this bifurcation diagram has to be complemented with another one in which the nucleotides of the bifurcation are base-paired:

$$wx(i,j) \longrightarrow vx(i,k) + vx(k+1,j) + C(k,i : k+1,j), \qquad \forall k,\ i \le k \le j \tag{11}$$

This new diagram (Figure 13) indicates that if nucleotides k and k+1 are paired to nucleotides i and j, respectively, that configuration is specially favored by an amount C(k, i : k+1, j) (presumably negative in energy units) because both sub-structures, vx(i, k) and vx(k+1, j), will stack onto each other.

Figure 13. Coaxial stacking. Two base-pair interactions are energetically more favorable when they are contiguous with each other. Here, we indicate how to complement the regular bifurcation diagram in wx (left) with an additional diagram (right) to take into account such a coaxial stacking configuration. The coaxial scoring function depends on both base-pairs. (Coaxial diagrams can be recognized by the empty dots representing the contiguous coaxially stacking nucleotides.)

Similarly, unpaired nucleotides contiguous to a paired base seem to have a different thermodynamic contribution than other unpaired nucleotides. In order to take this fact into account, we have to systematically add dangle diagrams to the various recursions.

For instance, the dangle diagrams that we have to add for the recursion of the wx matrix are given in Figure 14, and correspond with the following terms in the recursion for wx:

$$wx(i,j) \longrightarrow \begin{cases} L^{i}_{i+1,j} + vx(i+1,j) \\ R^{j}_{i,j-1} + vx(i,j-1) \\ L^{i}_{i+1,j-1} + R^{j}_{i+1,j-1} + vx(i+1,j-1) \end{cases} \tag{12}$$

Figure 14. Dangles. The figures represent three types of dangling bases that can contribute to the ungapped matrix wx. The dangle score function associated with each of these diagrams depends both on the dangling bases and the base-pair adjacent to them.

The dangle scoring functions, (R, L), depend both on the dangling bases and the contiguous base-pair. These dangle energies have been well characterized by the Turner group (Freier et al., 1986). Dangling bases can also appear inside multiloop diagrams. Notice also that the coaxial diagram in equation (11) really corresponds with four new diagrams because once we allow pairing, dangling bases also have to be considered, so the full nearest-neighbour interaction is taken into account.

Our pseudoknot algorithm implements both dangles and coaxial stackings. MFOLD currently implements only dangles, but will soon implement coaxials (Mathews et al., 1998). For purposes of clarity we will not explicitly show any of the additional diagrams to be included in the recursions to take care of coaxial stackings and dangles.

† Since the implementation of the pseudoknot algorithm, the Turner group has produced a new complete and more accurate list of parameters (Mathews et al., 1998) which we have not yet implemented.
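A minimal Python sketch of how the bifurcation, coaxial and dangle terms of equations (10) to (12) could be combined for a single cell wx(i, j) is given below. It assumes energy minimization, matrices stored as nested lists, and caller-supplied scoring callables `coax`, `dangle_l` and `dangle_r`. These names and signatures are illustrative assumptions, not the interface of the authors' program, and the remaining terms of the full wx recursion (Figure 11) are omitted.

```python
import math


def wx_extra_terms(i, j, wx, vx, coax, dangle_l, dangle_r):
    """Best (lowest) score among the terms of equations (10)-(12) for wx(i, j).

    wx, vx   : nested lists indexed as m[a][b] (already filled for sub-intervals)
    coax     : hypothetical callable coax(k, i, k1, j) -> coaxial energy C(k, i : k+1, j)
    dangle_l : hypothetical callable dangle_l(base, a, b) -> 5' dangle on pair (a, b)
    dangle_r : hypothetical callable dangle_r(base, a, b) -> 3' dangle on pair (a, b)
    """
    best = math.inf
    # The split point k is kept below j here so that k + 1 stays inside [i, j].
    for k in range(i, j):
        # Equation (10): plain bifurcation into two ungapped pieces.
        best = min(best, wx[i][k] + wx[k + 1][j])
        # Equation (11): k pairs with i, k + 1 pairs with j, and the two
        # helices stack coaxially.
        best = min(best, vx[i][k] + vx[k + 1][j] + coax(k, i, k + 1, j))
    # Equation (12): base i and/or base j dangle off the adjacent pair.
    if i < j:
        best = min(best, dangle_l(i, i + 1, j) + vx[i + 1][j])
        best = min(best, dangle_r(j, i, j - 1) + vx[i][j - 1])
    if i + 1 < j:
        best = min(
            best,
            dangle_l(i, i + 1, j - 1) + dangle_r(j, i + 1, j - 1) + vx[i + 1][j - 1],
        )
    return best


# Toy data: two adjacent helices, (0, 1) and (2, 3), with made-up energies.
n = 4
wx_demo = [[0.0] * n for _ in range(n)]
vx_demo = [[math.inf] * n for _ in range(n)]
vx_demo[0][1] = -2.0
vx_demo[2][3] = -3.0
with_coax = wx_extra_terms(
    0, 3, wx_demo, vx_demo,
    coax=lambda k, i, k1, j: -1.0,
    dangle_l=lambda b, x, y: -0.5,
    dangle_r=lambda b, x, y: -0.5,
)
```

In this toy case the coaxial term at k = 1 wins, because both helices exist and the coaxial energy is negative; with all coaxial energies set to zero the same split would score one unit higher.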

Minimum-energy implementation: thermodynamic parameters

We have implemented the pseudoknot algorithm using thermodynamic parameters in order to fill the scoring matrices, both gapped and ungapped. For the relevant nested structures (hairpin loops, bulges, stems, internal loops and multiloops), we have used the same set of energies as used in MFOLD.† Free energies for coaxial stacking, C, were those obtained by Walter et al. (1994). Table 2 provides a list of the parameters used for nested conformations.

For the non-nested configurations, there is not much thermodynamic information available (Wyatt et al., 1990; Gluick et al., 1994). This is not an unusual situation; there is very little thermodynamic information available for regular multiloops, let alone for pseudoknots. We had to tune by hand the parameters related to pseudoknots. For some non-nested structures we multiplied the nested parameters by an estimated weighting parameter g < 1. It would be very useful, in order to improve the accuracy of this thermodynamic implementation of the pseudoknot algorithm, to have more accurate, experimentally based determinations of these parameters. Table 3 provides a list of the parameters we used for pseudoknot-related conformations.
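The scaling rule for non-nested parameters can be sketched in a few lines of Python. The value 0.83 is the one that appears next to g in Table 3; applying it uniformly, the function name, and the example energy are assumptions made purely for illustration.

```python
G_WEIGHT = 0.83  # weighting parameter g < 1 (value as listed in Table 3)


def non_nested_energy(nested_energy, g=G_WEIGHT):
    """Scale a nested free energy (kcal/mol) for use in a non-nested context."""
    if not 0.0 < g < 1.0:
        raise ValueError("g is expected to lie strictly between 0 and 1")
    return g * nested_energy


# Hypothetical nested coaxial energy of -3.0 kcal/mol (not a measured value).
scaled = non_nested_energy(-3.0)
```

Because g < 1, a stabilizing (negative) nested energy becomes less stabilizing when it is reused for a non-nested configuration.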

Results

The main purpose of this work is to present an algorithm that solves optimal pseudoknotted RNA structures by dynamic programming. RNA structure prediction of single sequences with the nested algorithm already involves some approximation and inaccuracy (Zuker, 1995; Huynen et al., 1997). We expect this inaccuracy to increase in our case, since the algorithm now allows a much larger configuration space. Therefore, our limited objective here is to show that on a few small RNAs that are thought to conserve pseudoknots, our program (a minimal-energy implementation of the pseudoknot algorithm using a thermodynamic model) will actually find the pseudoknots; and for a few small RNAs that do not conserve pseudoknots, our program finds results similar to MFOLD, and does not introduce spurious pseudoknots.

Table 3. The new thermodynamic parameters specific for pseudoknot configurations which we had to estimate

| Symbol | Scoring parameter for | Value (kcal/mol) |
| $\widetilde{EIS2}$ | IS2 in a gap matrix | EIS2 × g (0.83) |
| $\tilde{C}$ | Coaxial stacking in pseudoknots | C × g |
| $\tilde{P}$ | Pair in a pseudoknot | 0.1 |
| $\tilde{P}_I$ | Pair in a non-nested multiloop | $\tilde{P}$ × g |
| $\tilde{Q}$ | Non-paired base in pseudoknot | 0.2 |
| $\tilde{R}$, $\tilde{L}$ | Base dangling off a pseudoknot pair | dangle × g + $\tilde{Q}$ |
| $\tilde{M}$ | Non-nested multiloop | 8.43 |
| $G_w$ | Generating a new pseudoknot | 7.0 |
| $G_{wI}$ | Generating a pseudoknot in a multiloop | 13.0 |
| $G_{wh}$ | Overlapping pseudoknots | 6.0 |

tRNAs

Almost all transfer RNAs share a common cloverleaf structure. We have tested the algorithm on a group of 25 tRNAs selected at random from the Sprinzl tRNA database (Steinberg et al., 1993). The program finds no spurious pseudoknot for any of the tested sequences. All but one (DT5090) of the tRNAs fold into a cloverleaf configuration. Of the 24 cloverleaf foldings, 15 are completely consistent with their proposed structures (that is, each helical region has at least three base-pairs in common with its proposed folding). The remaining nine cloverleaf foldings misplace one (six sequences) or two (three sequences) of the helical regions. On the other hand, MFOLD's lowest energy prediction for the same set of tRNA sequences includes only 19 cloverleaf foldings, of which 14 are completely consistent with their proposed structures. Performance for our program is, therefore, at least comparable with MFOLD; the inaccuracies found are the result of the approximations in the thermodynamic model, not a problem with the pseudoknot algorithm per se. The relevant result in relation to the pseudoknot algorithm is that its implementation predicts no spurious pseudoknots for tRNAs.

One should not think of this result as a trivial one, because when knots are allowed, the configuration space available becomes much larger than the observed class of conformations. This problem is particularly relevant for "maximum-pairing-like" algorithms, such as the MWM algorithm presented by Cary & Stormo (1995) or a Nussinov implementation of our pseudoknot algorithm (Figure 5). In both cases, the result is almost universal pairing because there is enough freedom to be able to coordinate any position with another one in the sequence.

Another important aspect of tRNA folding is coaxial energies. Most tRNAs gain stability by stacking coaxially two of the hairpin loops, and the third one with the acceptor stem. This aspect of tRNA folding is very important and in some cases crucial to determine the right structure. There are situations like tRNA DA0260 in which MFOLD does not assign the lowest energy to the correct structure (the MFOLD 3.0 prediction for DA0260 misses the acceptor stem, and has a free energy of −22.0 kcal/mol). Our algorithm, on the other hand, implements coaxial energies; as a result, the cloverleaf configuration becomes the most stable folding for tRNA DA0260 (ΔG = −24.3 kcal/mol). The implementation of coaxial energies explains why we found more cloverleaf structures for tRNAs than MFOLD does.

HIV-1-RT-ligand RNA pseudoknots

High-affinity ligands of the reverse transcriptase of HIV-1 isolated by a SELEX procedure by Tuerk et al. (1992) seem to have a pseudoknot consensus secondary structure. These oligonucleotides have between 34 and 47 bases, and fold into a simple pseudoknot. Of a total of 63 SELEX-selected pseudoknotted sequences available from Tuerk et al. (1992), we found 54 foldings that agreed exactly with the structures derived by comparative analysis (ΔG = −9 kcal/mol for sequence pattern I (3-2)). As expected, MFOLD predicts only one of the two stems (ΔG = −7.5 kcal/mol for the same sequence).

Viral RNAs

Some virus RNA genomes (such as turnip yellow mosaic virus, TYMV; Guilley et al., 1979) present a tRNA-like structure at their 3′-end that includes a pseudoknot in the aminoacyl acceptor arm very close to the 3′-end (Kolk et al., 1998; Pleij et al., 1985; Dumas et al., 1987). Our program correctly predicts the TYMV tRNA-like structure with its pseudoknot for the last 86 bases at the 3′-end with ΔG = −30.4 kcal/mol (the MFOLD 3.0 prediction for TYMV has a free energy of ΔG = −28.9 kcal/mol). The tRNA-like 3′ terminal structure is conserved among tymoviruses, and also for the tobacco mosaic virus cowpea strain, another valine acceptor. Of the seven valine-acceptor tRNA-like structures proposed to date (Van Belkum et al., 1987), we reproduce six; the exception is kennedya yellow mosaic virus.

Another interesting pseudoknot appears in the last 189 bases of the 3′ terminus of the tobacco mosaic virus (TMV; Van Belkum et al., 1985). TMV also has a tRNA-like pseudoknot structure at the end, but it may have additional upstream pseudoknots, up to a total of five, forming a long quasi-continuous helix. We folded the upstream and downstream regions separately in a piece of 84 nucleotides (the folding requires 47 minutes and 9.8 Mb) and 105 nucleotides (the folding requires 235 minutes and 22.5 Mb), respectively. Our program predicts the 105-nucleotide downstream region exactly with ΔG = −32.5 kcal/mol. For the 84-nucleotide upstream region we find four of the five helical regions with ΔG = −19.0 kcal/mol.

Finally we have considered the recently crystallized ribozymes of the hepatitis delta virus (HDV; Ferré-D'Amaré et al., 1998). Our program predicts correctly the structure of the 91 nt antigenomic HDV ribozyme (ΔG = −36.7 kcal/mol). Our program also predicts the pseudoknot present in the 87 nt genomic ribozyme (ΔG = −43.9 kcal/mol; in this case the prediction misses the short two-stem hairpin between positions 17-30).

Discussion

Here, we present an algorithm able to predict pseudoknots by dynamic programming. This algorithm demonstrates that using certain approximations consistent with the accepted Turner thermodynamic model, the prediction of pseudoknotted structures is a problem of polynomial complexity (although admittedly high). Having an optimal dynamic programming algorithm will enable extending other dynamic programming based methods that rigorously explore the conformational space for RNA folding (McCaskill, 1990; Bonhoeffer et al., 1993) to pseudoknotted structures.

Apart from the usefulness of the algorithm in predicting pseudoknots, we also include coaxial energies (when two stems stack coaxially), a very common feature of RNA folding. We expect MFOLD will also include coaxial energies in the near future (Mathews et al., 1998).

Our algorithm is presented in the context of a general framework in which a generic folding is expressed in terms of its elementary secondary interactions (which we have identified as the irreducible surfaces). This is a further generalization of the results reported by Sankoff (1985). The calculation of an optimal folding becomes an expansion in ISs of increasingly higher order. Our formalization incorporates all current dynamic programming RNA folding algorithms in addition to our pseudoknot algorithm. It also establishes the limitations of each approximation by determining at which order the expansion is truncated.

As for the thermodynamic implementation presented here, one of our major problems is the almost complete lack of thermodynamic information about pseudoknot configurations. The thermodynamic algorithm is also sensitive to the accuracy of the existing thermodynamic parameters. We expect to improve this aspect by implementing the more complete set of parameters provided by the Turner group (Mathews et al., 1998).

The principal drawback is the time and memory constraints imposed by the computational complexity of the algorithm. At this early stage, we cannot analyze sequences much larger than 130-140 bases. For now, the program is adequate for folding small RNAs. A 100 nt RNA takes about four hours and 22.5 Mb to fold on an SGI R10K Origin200.
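As a rough illustration of why 130-140 bases is a practical ceiling, the following Python sketch extrapolates from one measurement reported in this paper (the 105 nt TMV fragment: 235 minutes, 22.5 Mb) using the theoretical exponents, 6 for time and 4 for storage. The helper name `extrapolate` is hypothetical, and pure power-law scaling is an assumption; the empirical time exponent reported in Methods is higher, so the time figure is probably optimistic.

```python
def extrapolate(value, n_from, n_to, exponent):
    """Scale a measured cost from n_from to n_to nucleotides, assuming cost ~ N**exponent."""
    return value * (n_to / n_from) ** exponent


# Reference point taken from the text: 105 nt, 235 minutes, 22.5 Mb.
minutes_140 = extrapolate(235.0, 105, 140, 6)  # time, theoretical O(N^6)
mb_140 = extrapolate(22.5, 105, 140, 4)        # storage, theoretical O(N^4)

print(f"140 nt: about {minutes_140 / 60:.0f} hours and {mb_140:.0f} Mb")
```

Under these assumptions a 140 nt sequence would need roughly 22 hours and about 71 Mb, which is consistent with the size limit quoted above for the hardware of the time.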

Due to practical limitations, at a given point in the recursion we only allow the incorporation of two gap matrices. However, since each of those gap matrices can in turn be assembled from two other such matrices, the algorithm includes in its configuration space a large variety of knotted motifs. The limitations of this truncation appeared when we considered applying this approach to describe pairwise residue interactions in protein folding. A parallel β-sheet configuration in protein structure provides an example of a complicated knotted folding that cannot be handled by the pseudoknot algorithm presented here. However, all known RNA pseudoknots can be handled by the algorithm, which makes the approximation useful enough for RNA secondary structure.

Although we implemented the algorithm for energy minimization, extending MFOLD to pseudoknotted structures, the algorithm is not limited to energy minimization. Our algorithm can be converted into a probabilistic model for pseudoknot-containing RNA folding. Probabilistic models of RNA secondary structure based on "stochastic context free grammar" (SCFG) formalisms (Eddy et al., 1994; Sakakibara et al., 1994; Lefebvre, 1996) have been introduced both for RNA single-sequence folding and for RNA structural alignment and structural similarity searches. The Inside and CYK dynamic programming algorithms used for SCFG-based structural alignment are fundamentally similar to the Zuker algorithm (Durbin et al., 1998), and have consequently also been unable to deal with pseudoknots. Heuristic approaches to applying SCFG-like structural alignment models to pseudoknots have been introduced (Brown, 1996; Notredame et al., 1997), and the maximum weighted matching algorithm has been applied to find optimal alignments (Tabaska & Stormo, 1997). An SCFG-like probabilistic version of our pseudoknot algorithm could be designed to obtain optimal structural alignment of pseudoknot-containing RNAs.

Methods

The algorithm was implemented in ANSI C on a Silicon Graphics Origin200. The algorithm has a theoretical worst-case complexity of $O(N^6)$ in time and $O(N^4)$ in storage. At its present stage, the program is empirically observed to run $O(N^{6.8})$ in time and $O(N^{3.8})$ in memory. For instance, a tRNA of 75 nt takes 20 minutes and uses 6.6 Mb of memory. The 3′-end of tobacco mosaic virus has 105 nucleotides and takes 235 minutes and uses 22.5 Mb. The program empirically scales above the theoretical complexity in time of the algorithm. This effect seems to have to do with the way the machine allocates memory for larger RNAs. The software and parameter sets are available by request from E. Rivas ([email protected]). A technical report giving the full algorithm is available from http://www.genetics.wustl.edu/eddy/publications/.

Acknowledgments

This work was supported by NIH grant HG01363 and by a gift from Eli Lilly. E.R. acknowledges the support of a fellowship by the Sloan Foundation. The idea for the algorithm came from a discussion with Gary Stormo at a meeting at the Aspen Center for Physics. Tim Hubbard suggested parallel β-strands in proteins as an example of a set of pairwise interactions that the algorithm cannot handle. We wish to thank the anonymous reviewers for very useful comments.

References

Abrahams, J. P., van der Berg, M., van Batenburg, E. & Pleij, C. W. A. (1990). Prediction of RNA secondary structure, including pseudoknotting, by computer simulation. Nucl. Acids Res. 18, 3035-3044.

Bjorken, J. D. & Drell, S. D. (1965). Relativistic Quantum Fields, McGraw-Hill, New York, NY.

Bonhoeffer, S., McCaskill, J. S., Stadler, P. F. & Schuster, P. (1993). Statistics of RNA secondary structure. Eur. Biophys. J. (EHU), 22, 13-24.

Brown, M. (1996). RNA pseudoknot modeling using intersections of stochastic context free grammars with applications to database search. Pacific Symposium on Biocomputing 1996.

Cary, R. B. & Stormo, G. D. (1995). Graph-theoretic approach to RNA modeling using comparative data. In ISMB-95 (Rawlings, C., et al., eds), pp. 75-80, AAAI Press.

Dumas, P., Moras, D., Florentz, C., Giegé, R., Verlaan, P., van Belkum, A. & Pleij, C. W. A. (1987). 3-D graphics modeling of the tRNA-like 3′ end of turnip yellow mosaic virus RNA: structural and functional implications. J. Biomol. Struct. Dynam. 4, 707-728.

Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. J. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge UK.

Eddy, S. R. & Durbin, R. (1994). RNA sequence analysis using covariance models. Nucl. Acids Res. 22, 2079-2088.

Edmonds, J. (1965). Maximum matching and polyhedron with 0, 1-vertices. J. Res. Nat. Bur. Stand. 69B, 125-130.

Ferré-D'Amaré, A. R., Zhou, K. & Doudna, J. A. (1998). Crystal structure of a hepatitis delta virus ribozyme. Nature, 395, 567-574.

Freier, S., Kierzek, R., Jaeger, J. A., Sugimoto, N., Caruthers, M. H., Neilson, T. & Turner, D. H. (1986). Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl Acad. Sci. USA, 83, 9373-9377.

Gabow, H. N. (1976). An efficient implementation of Edmonds' algorithm for maximum matching on graphs. J. Asc. Com. Mach. 23, 221-234.

Gluick, T. C. & Draper, D. E. (1994). Thermodynamics of folding a pseudoknotted mRNA fragment. J. Mol. Biol. 241, 246-262.

Guilley, H., Jonard, G., Kukla, B. & Richards, K. E. (1979). Sequence of 1000 nucleotides at the 3′ end of tobacco mosaic virus RNA. Nucl. Acids Res. 6, 1287-1308.

Gultyaev, A. P., van Batenburg, F. H. & Pleij, C. W. A. (1995). The computer simulation of RNA folding pathways using a genetic algorithm. J. Mol. Biol. 250, 37-51.

Huynen, M., Gutell, R. & Konings, D. (1997). Assessing the reliability of RNA folding using statistical mechanics. J. Mol. Biol. 267, 1104-1112.

Kolk, M. H., van der Graff, M., Wijmenga, S. S., Pleij, C. W. A., Heus, H. A. & Hilbers, C. W. (1998). NMR structure of a classical pseudoknot: interplay of single- and double-stranded RNA. Science, 280, 434-438.

Lefebvre, F. (1996). A grammar-based unification of several alignments and folding algorithms. ISMB-96 (Rawlings, C., et al., eds), pp. 143-154, AAAI Press.

Mathews, D. H., Andre, T. C., Kim, J., Turner, D. H. & Zuker, M. (1998). An updated recursive algorithm for RNA secondary structure prediction with improved free energy parameters. In Molecular Modeling of Nucleic Acids (Leontis, N. B. & SantaLucia, J., Jr, eds), American Chemical Society.

McCaskill, J. S. (1990). The equilibrium partition function and base pair bindings probabilities for RNA secondary structure. Biopolymers, 29, 1105-1119.

Notredame, C., O'Brien, E. A. & Higgins, D. G. (1997). RAGA: RNA sequence alignment by genetic algorithm. Nucl. Acids Res. 25, 4570-4580.

Nussinov, R., Pieczenik, G., Griggs, J. R. & Kleitman, D. J. (1978). Algorithms for loop matchings. SIAM J. Appl. Math. 35, 68-82.

Pleij, C. W., Rietveld, K. & Bosch, L. (1985). A new principle of RNA folding based on pseudoknotting. Nucl. Acids Res. 13, 1717-1731.

Sakakibara, Y., Brown, M., Hughey, R., Mian, I. S., Sjölander, K., Underwood, R. C. & Haussler, D. (1994). Stochastic context-free grammars for tRNA modeling. Nucl. Acids Res. 22, 5112-5120.

Sankoff, D. (1985). Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math. 45, 810-825.

Schuster, P., Fontana, W., Stadler, P. F. & Hofacker, I. L. (1994). From sequences to shapes and back: a case study in RNA secondary structure. Proc. Roy. Soc. ser. B, 255, 279-284.

Schuster, P., Fontana, W., Stadler, P. F. & Renner, A. (1997). RNA structures and folding: from conventional to new issues in structure predictions. Curr. Opin. Struct. Biol. 7, 229-235.

Serra, M. J. & Turner, D. H. (1995). Predicting the thermodynamic properties of RNA. Methods Enzymol. 259, 242-261.

Steinberg, S., Misch, A. & Sprinzl, M. (1993). Compilation of RNA sequences and sequences of tRNA genes. Nucl. Acids Res. 21, 3011-3015.

Tabaska, J. E. & Stormo, G. D. (1997). Automated alignment of RNA sequences to pseudoknotted structures. ISMB-97, 5, 311-318.

Tabaska, J. E., Cary, R. B., Gabow, H. N. & Stormo, G. D. (1998). An RNA folding method capable of identifying pseudoknots and base triples. Bioinformatics, 8, 691-699.

ten Dam, E., Pleij, K. & Draper, D. (1992). Structural and functional aspects of RNA pseudoknots. Biochemistry, 31, 11665-11676.

Tuerk, C., MacDougal, S. & Gold, L. (1992). RNA pseudoknots that inhibit human immunodeficiency virus type 1 reverse transcriptase. Proc. Natl Acad. Sci. USA, 89, 6988-6992.

van Batenburg, F. H. D., Gultyaev, A. P. & Pleij, C. W. A. (1995). An APL-programmed genetic algorithm for the prediction of RNA secondary structure. J. Theor. Biol. 174, 269-280.

Van Belkum, A., Abrahams, J. P., Pleij, C. W. A. & Bosch, L. (1985). Five pseudoknots are present at the 204 nucleotides long 3′ non coding region of tobacco mosaic virus RNA. Nucl. Acids Res. 13, 7673-7686.

Van Belkum, A., Bingkun, J., Pleij, C. W. A. & Bosch, L. (1987). Structural similarities among valine-accepting tRNA-like structures in tymoviral RNAs and elongator tRNAs. Biochemistry, 26, 1144-1151.

Walter, A., Turner, D., Kim, J., Lyttle, M., Müller, P., Mathews, D. & Zuker, M. (1994). Coaxial stacking of helixes enhances binding of oligoribonucleotides and improves predictions of RNA folding. Proc. Natl Acad. Sci. USA, 91, 9218-9222.

Woese, C. R. & Pace, N. R. (1993). Probing RNA structure, function, and history by comparative analysis. The RNA World (Gesteland, R. F. & Atkins, J. F., eds), pp. 91-117, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.

Wyatt, J. R., Puglisi, J. D. & Tinoco, I., Jr (1990). RNA pseudoknots: stability and loop size requirements. J. Mol Biol. 214, 455-470.

Zuker, M. (1989a). Computer prediction of RNA structure. Methods Enzymol. 180, 262-288.

Zuker, M. (1989b). On finding all suboptimal foldings of an RNA molecule. Science, 244, 48-52.

Zuker, M. (1995). "Well-determined" regions in RNA secondary structure prediction: analysis of small subunit ribosomal RNA. Nucl. Acids Res. 23, 2791-2798.

Zuker, M. & Sankoff, D. (1984). RNA secondary structure and their prediction. Bull. Math. Biol. 46, 591-621.

Zuker, M. & Stiegler, P. (1981). Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res. 9, 133-148.

Appendix: Recursions for the Gap Matrices in the Pseudoknot Algorithm

Here we provide simplified recursion relations for the gap matrices used in the pseudoknot algorithm, without including dangling and coaxial diagrams. (As before, contiguous nucleotides are given explicit dots in the diagrams.)

Figure A1. Recursion for the vhx matrix.

The recursion for the vhx matrix in the pseudoknot algorithm is given by (Figure A1):

$$vhx(i,j:k,l) = \text{optimal} \begin{cases} \widetilde{EIS2}(i,j:k,l) \\ \widetilde{EIS2}(i,j:r,s) + vhx(r,s:k,l) \\ \widetilde{EIS2}(r,s:k,l) + vhx(i,j:r,s) \\ 2 \times \tilde{P} + \tilde{M} + whx(i+1,j-1:k-1,l+1) \end{cases} \qquad [\forall\, i,r,k,l,s,j;\ i \le r \le k \le l \le s \le j] \tag{A1}$$

Here $\tilde{P}$ is the score for creating a pair in a pseudoknot, and $\tilde{M}$ corresponds to the score given to a non-nested multiloop. $\tilde{P}$ and $\tilde{M}$ could be equal to P and M, the score for a pair in a nested structure and the score assigned to nested multiloops, respectively, but they do not have to be. Similarly, the score for an irreducible surface of O(2), $\widetilde{EIS2}$, could be the same as the score given for nested structures, EIS2, but again, it does not have to be. We found the best fits by giving them values different from those used for nested foldings (cf. Tables 2 and 3).
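To make the four cases of equation (A1) concrete, here is a minimal memoized Python sketch. It takes "optimal" to mean minimum energy, uses the Table 3 values for $\tilde{P}$ and $\tilde{M}$ as defaults, and restricts the intermediate pair to i < r < k and l < s < j so that the recursion terminates. The scoring callables `eis2_tilde` and `whx` are hypothetical stand-ins supplied by the caller, and base-pairing checks are left out, so this is an illustration of the recursion's shape rather than the authors' implementation.

```python
import math
from functools import lru_cache


def make_vhx(eis2_tilde, whx, p_tilde=0.1, m_tilde=8.43):
    """Return a memoized vhx(i, j, k, l) built from the four cases of equation (A1).

    eis2_tilde(i, j, k, l) and whx(i, j, k, l) are caller-supplied scoring
    callables (hypothetical); p_tilde and m_tilde default to the Table 3 values.
    """

    @lru_cache(maxsize=None)
    def vhx(i, j, k, l):
        if not (i < k < l < j):
            return math.inf  # degenerate gap: treated as forbidden
        # Case 1: a single irreducible surface spans both pairs.
        best = eis2_tilde(i, j, k, l)
        # Strict bounds on (r, s) make every recursive call strictly smaller.
        for r in range(i + 1, k):
            for s in range(l + 1, j):
                # Case 2: IS2 next to the outer pair, smaller vhx inside.
                best = min(best, eis2_tilde(i, j, r, s) + vhx(r, s, k, l))
                # Case 3: IS2 next to the inner pair, smaller vhx outside.
                best = min(best, eis2_tilde(r, s, k, l) + vhx(i, j, r, s))
        # Case 4: both pairs close a non-nested multiloop around whx.
        best = min(best, 2 * p_tilde + m_tilde + whx(i + 1, j - 1, k - 1, l + 1))
        return best

    return vhx


# Toy scoring: every IS2 is worth -1 and whx is forbidden, so the optimum
# is the longest chain of stacked pairs between (0, 10) and (3, 7).
vhx_stack = make_vhx(lambda i, j, k, l: -1.0, lambda i, j, k, l: math.inf)
stacked = vhx_stack(0, 10, 3, 7)
```

With these toy scores the best decomposition chains three irreducible surfaces, (0,10) to (1,9) to (2,8) to (3,7). If instead every IS2 is penalized heavily and whx is free, case 4 dominates and the result is 2 × 0.1 + 8.43.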

The recursions for the gap matrices zhx and yhx in the pseudoknot algorithm are complementary and given by (cf. Figures A2 and A3):

Figure A2. Recursion for the zhx matrix.

Figure A3. Recursion for the yhx matrix.

Finally, the recursion for the gap matrix whx appears in Figure A4, and is given by:

Figure A4. Recursion for the whx matrix.

Here $G_{wh}$ stands for the score given for finding overlapping pseudoknots, that is, pseudoknots that appear within already existing pseudoknots. The initialization conditions are:

$$\begin{aligned} whx(i,j:i,j) &= +\infty \\ vhx(i,j:k,k) &= +\infty \\ yhx(i,j:k,k) &= +\infty \\ whx(i,j:k,k) &= whx(i,j:k,k+1) = wx(i,j) \\ zhx(i,j:k,k) &= zhx(i,j:k,k+1) = vx(i,j) \end{aligned} \qquad [\forall\, i,k,j;\ 1 \le i \le k \le j \le N] \tag{A5}$$

Edited by I. Tinoco

(Received 27 July 1998; received in revised form 20 November 1998; accepted 22 November 1998)

http://www.academicpress.com/jmb

Supplementary material comprising 1 pdf file is available from JMB Online