
ISSN 0026-8933, Molecular Biology, 2009, Vol. 43, No. 3, pp. 485–499. © Pleiades Publishing, Inc., 2009. Original Russian Text © K.Yu. Gorbunov, E.V. Lyubetskaya, E.A. Asarin, V.A. Lyubetsky, 2009, published in Molekulyarnaya Biologiya, 2009, Vol. 43, No. 3, pp. 527–541.


PROBLEM STATEMENT

Two problems.

The problem to be considered (hereinafter, problem 1) is to model the ancestral states of noncoding genomic regions, namely, the regions (sites) of RNA-mediated regulation, in a way analogous to how this problem is stated for the modeling of coding genomic regions (structural genes).

This problem is connected with, yet does not coincide with, the known problem (hereinafter, problem 2) of constructing a multiple alignment of a specified set of nucleotide sequences taking into account an assumed (but unknown beforehand) common secondary structure in these sequences. There are algorithms for constructing multiple alignments that do not take into account the common secondary structure and, if the list of conserved helices constituting this presumable common secondary structure is known, then these algorithms or their improved versions can be used for solving problem 2. However, such a set of conserved helices is usually unknown beforehand, and it is natural to connect the determination of this set (at least, this is how we proceed) with solving problem 1: the conserved helices searched for are the helices that are stable during the evolution of a primary structure. In this sense, the solution of problem 1 gives the desired list of conserved helices of the secondary structures in a specified set of extant RNA-mediated regulation sites, thereby leading to the solution of problem 2. This paper deals with problem 1.

A review of the modern state of the problem.

The problem of constructing a phylogenetic tree for a protein family is well known and has been actively studied [1–5]. The associated issues belong to the even wider field of modeling the evolution of species, described in a large body of literature [6–8]. The evolutionary models comprise modeling of molecular-level events, such as gene duplications, originations, horizontal transfer, and losses, as well as constructing and estimating various evolutionary scenarios (for example, see [9–14]). In particular, such modeling can be performed concurrently with determining the lengths of tree branches by maximum likelihood estimation [15] or concurrently with constructing ancestral sequences provided that the initial data have been previously multiply aligned (see also http://evolution.genetics.washington.edu/phylip/software.

Modeling Evolution of the Bacterial Regulatory Signals Involving Secondary Structure

K. Yu. Gorbunov, E. V. Lyubetskaya, E. A. Asarin, and V. A. Lyubetsky

Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, 127994 Russia; e-mail: [email protected]

Received July 8, 2008; accepted for publication October 21, 2008

Abstract: An algorithm for modeling the evolution of the regulatory signals involving the interaction with RNA secondary structure is proposed. The algorithm implies that the species phylogenetic tree is known and is based on the assumption that the considered signals have a conserved secondary structure. The input data are the extant primary structures of a signal for all leaves of the phylogenetic tree; the algorithm computes the signal primary and secondary structures at all the nodes. Concurrently, the algorithm constructs a multiple alignment of the extant (in leaves) sites of a regulatory signal taking into account its secondary structure. The results of successful testing of the algorithm are described for three main types of attenuation regulation in bacteria: classic attenuation (threonine and leucine biosyntheses in Gammaproteobacteria), T-box (in Actinobacteria), and RFN-mediated (in Eubacteria) regulation.

DOI: 10.1134/S0026893309030170

Key words: evolutionary scenario, regulatory signal, evolution along a tree, species tree, secondary structure

UDC 575.852

MATHEMATICAL AND SYSTEM BIOLOGY


serv.html, where the available software for constructing phylogenetic trees and reconstructing ancestral sequences is listed). The corresponding studies are connected with modeling the evolutionary events themselves (for example, see [17–21]).

On the other hand, a considerable part of the bioinformatics research during the last two decades has been connected with the mechanisms involved in the regulation of gene expression, especially in bacteria. The most important among these mechanisms are the regulation via protein–DNA interactions; classical attenuation-based regulation; T-box regulation; and the regulation involving RNA switches (riboswitches), certain special proteins, such as TRAP, or LEU elements. The main problems here are the search for regulatory elements (regulatory sites), functional and evolutionary analyses of the detected signals, and the modeling of regulatory processes (for example, see [22–27]). This problem is elucidated in hundreds of papers. The biological aspects of the evolution of regulatory structures are described by McAdams et al. [28]. As for the algorithmic aspects, ancestral sequences have been modeled until recently without taking into account the secondary structures or only partially taking them into account. Savill et al. [29] took into account the secondary structures via a modification of the standard model of nucleotide substitutions, namely, by adding the event of simultaneous substitution of two complementary nucleotides in connected positions. Note also that genetic algorithms were applied to the simulation of helix evolution for assessing the rate of nucleotide substitutions and the lengths of tree branches [30], and that the evolution of RNA secondary structures was taken into account to construct a more probable phylogenetic tree [31].

Our approach.

We assume that a natural continuation of these two directions is modeling of the evolution of a regulatory signal along the phylogenetic tree of species or proteins (for example, a transcription factor) or along another tree corresponding to the evolution of a particular regulation. We have studied such evolutionary processes for the regulatory signals of amino acid metabolism when the mechanism of these signals involves alternative RNA secondary structures. We are developing two approaches to modeling such evolution. In both approaches, the phylogenetic tree and regulatory sites in its leaves (sites of one type, for example, sites of classical attenuation-based regulation) are given and the ancestral states of these sites are searched for in all the tree nodes, including the secondary structures in ancestral and extant sequences.

Our first approach is detailed in [32] and was presented at several conferences [33, 34] (see also http://lab6.iitp.ru, item 4). We describe it only briefly here for the sake of completeness. A Gibbs functional $H(\sigma)$ is written down in explicit form; its argument $\sigma$ is a configuration, i.e., a function that ascribes a nucleotide sequence (the assumed ancestral state of a regulatory signal in this node) to each inner tree node. Thus, the configuration is a set of ancestral states of the regulatory signal that is specified in the leaves. This functional, $H(\sigma) = H_1(\sigma) + H_2(\sigma)$, comprises two conditions (constraints) for the desired configuration $\sigma$: (1) for each sequence $\sigma(k)$ (i.e., for the value of $\sigma$ in the $k$th tree node) and along each edge leaving this node, at each position $i = 1, \ldots, n$ of this sequence $\sigma(k)$, an independent substitution of letters occurs according to the substitution matrix $R$, as well as insertions and deletions; and (2) the sequences $\sigma(k)$ from configuration $\sigma$, when possible, preserve the secondary structure from the beginning of an edge to its end and along the pathways in the tree (for many generations); in this process, the longer and more numerous the pathways, the smaller the functional. The first condition is represented by the term $H_1(\sigma)$, which describes the standard dynamics of the primary structure of a regulatory site, and the second, by the term $H_2(\sigma)$, which describes the dynamics of the site primary structure that maintains a high degree of secondary structure conservation. The global minimum of the functional $H(\sigma)$ is searched for; according to the first approach, it corresponds to the biological evolution of the regulatory signal specified in the tree leaves by its primary structure only. Thus, $H(\sigma)$ represents the biologically motivated principles of a common evolution (dynamics) of primary structure and, concurrently, secondary structure conservation. This approach has been efficiently implemented as software and tested using the same examples as described below [32].

In this paper, we describe our second approach. It is based on two other biologically motivated requirements: (1) the secondary structure of a site must be conserved as much as possible along the pathways in the tree (from leaves to the root, as in the first approach; see step 1 of the algorithm below) and (2) the primary structure of a regulatory site must require as few evolutionary events as possible (parsimony principle; see step 2 of the algorithm below).

Thus, the first approach is based on a conventional model of nucleotide substitutions and the requirement of conservation, while the second approach is based on the parsimony principle and the same requirement of conservation. These requirements reflect the generally accepted concept of evolution. Correspondingly, these two approaches conceptually differ in their first requirements. However, they fundamentally differ in the implementation of these requirements: in the first approach, this is a minimization by annealing of the functional (exponent of the substitution matrix, indels, and some complex mathematical expressions describing the degree of conservation), and in the second, it is alternating alignments and minimizations of quite a different functional.

These two requirements of the second approach can be implemented in different ways; for example, it is possible to define an integral (for the overall algorithm) functional whose minimum corresponds to the final configuration of ancestral sites, or to reach this final configuration by iterations, as described in the algorithm below.

The purpose of developing several approaches to problem 1 is to compare the results they give. This is a method for testing their adequacy, because experimental data on the ancestral states of bacterial regulatory signals are even more difficult to obtain than the data on ancestral states of genes. Note that these two approaches gave remarkably similar computational results for biological and artificial data.

We have implemented the model and the algorithm as software and tested the algorithm using many examples of classical attenuation-mediated regulation, T-box regulation, riboswitches, LEU regulation, and so on. Note that our model is equally applicable to the evolutionary analysis of these quite different types of regulation.

Multiple alignment of the initial sequences in leaves is, in general, unknown and in no way assumed given in our model. When testing our algorithm, we noticed that the output alignments and secondary structure were close to the known or postulated (according to independent studies) alignments for the sequences in leaves input to the algorithm. In addition, the algorithm outputs the multiple alignment of all the found sites at all the tree nodes taking into account a coevolution of their secondary structures. Actually, the algorithm also outputs as a side product the set of evolutionarily conserved helices, which is a fundamental step in solving problem 2.

As mentioned above, we detail here our second approach to solving problem 1, i.e., constructing the evolution of regulatory signals (with their secondary structure) in bacteria using an iterative procedure with configurations of the following new type. Such a configuration ascribes to each tree node a sequence of distributions rather than a nucleotide sequence (as in our first approach); each distribution specifies the frequencies of all nucleotides and, possibly, of the gap symbol.

The sections Model Description, Algorithm, and Examples of Applying the Algorithm to Biological Data Analysis detail our model of evolution of the regulatory signal with secondary structure, describe the algorithm used for constructing the ancestral states of the signal specified in the leaves without specifying the secondary structure, and demonstrate the application of this algorithm to biological examples of various regulation types.

MODEL DESCRIPTION

As mentioned above, the model is based on two natural requirements: (1) the secondary structure of a site must be conserved as much as possible along the pathways within the tree (from leaves to root; see step 1 below) and (2) the primary structure of a regulatory site must involve as few evolutionary events as possible (see step 2 below).

A specific feature of the proposed model is the following. A sequence is ascribed to each tree node; each position of this sequence holds the frequencies of five symbols: four letters (A, C, T, and G) and a gap symbol ($d$). Thus, ascribed to the nodes are sequences of various lengths composed of vectors (distributions) of length 5; each vector consists of numbers from zero to one that sum to unity, and these numbers correspond to the probabilities of A, C, T, G, and $d$ (hereinafter, the numbers correspond to these symbols in the indicated order). Let this sequence of vectors be named the sequence of distributions. Then the sequence ascribed to each node of the same tree is naturally transformed into an ordinary nucleotide sequence with gaps (step 3, see below). Each sequence characterizes or directly represents the assumed site at a given tree node for one fixed regulation, the mechanism of which requires that the RNA secondary regulatory structure be maintained during evolution. In addition to the sequences of distributions, nucleotide sequences with gaps are ascribed to the tree leaves. The algorithm starts with the initial regulatory sites, which are filled with gaps in the course of its operation.

Designate as $\sigma$ a sequence of distributions of varying length $n$, where $\sigma_i$ is its $i$th term ($1 \le i \le n$), and call $i$ the $i$th position in $\sigma$. Designate as $X(i, \sigma)$ the fraction of letter $X$ at the $i$th position in sequence $\sigma$. Hereinafter, we do not distinguish between a sequence of distributions that has a distribution with one unity and the remaining zeros at each position and the corresponding nucleotide sequence with gaps. Note that for each leaf, the algorithm gives two variants of nucleotide sequence: one of them is designated with the leaf number in the figures with even numbers and the other is designated with the species name. Recall that the initial sequence without gaps is given for each leaf at the beginning of algorithm operation.
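The notation above can be made concrete with a minimal sketch. It assumes a plain list-of-lists representation; the names `SYMBOLS`, `one_hot`, `from_nucleotides`, and `fraction` are hypothetical and are not taken from the authors' software.

```python
from typing import List

# Symbol order used in the text: A, C, T, G, and the gap symbol d.
SYMBOLS = "ACTGd"

Distribution = List[float]  # five frequencies that sum to unity


def one_hot(symbol: str) -> Distribution:
    """Distribution with unity for `symbol` and zeros elsewhere."""
    return [1.0 if s == symbol else 0.0 for s in SYMBOLS]


def from_nucleotides(seq: str) -> List[Distribution]:
    """Turn a nucleotide sequence with gaps ('-' or 'd') into a sequence of distributions."""
    return [one_hot("d" if ch == "-" else ch) for ch in seq]


def fraction(letter: str, i: int, sigma: List[Distribution]) -> float:
    """X(i, sigma): fraction of `letter` at the i-th position (1-based, as in the text)."""
    return sigma[i - 1][SYMBOLS.index(letter)]


# A leaf sequence is identified with its one-hot sequence of distributions.
sigma = from_nucleotides("AC-G")
```

With this representation, an initial leaf sequence and its sequence of distributions carry the same information, which is the identification made in the paragraph above.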

In this model, we consider alignments of the primary structures of the sequences of distributions, i.e., the sequences themselves as words in an infinite alphabet, and of their secondary structures as words composed of “helices” in the sequences of distributions. An alignment is regarded as an insertion of gaps into two words in a certain alphabet. For this purpose, one of the standard alignment algorithms utilizing the determined weights is applied to the words. The new feature is that the secondary structures in the sequences of distributions are taken into account. It is necessary to define the term helix, given above in quotation marks.

Let two positions, $i$ and $j$, in the sequence of distributions $\sigma$ be named complementary if the following sum

$$
\begin{aligned}
&[\min(A(i,\sigma), T(j,\sigma)) + \min(T(i,\sigma), A(j,\sigma))] \\
&\quad + [\min(C(i,\sigma), G(j,\sigma)) + \min(G(i,\sigma), C(j,\sigma))] \\
&\quad + 0.5\,[\min(G(i,\sigma) - \min(G(i,\sigma), C(j,\sigma)),\; T(j,\sigma) - \min(T(j,\sigma), A(i,\sigma))) \\
&\qquad\quad + \min(T(i,\sigma) - \min(T(i,\sigma), A(j,\sigma)),\; G(j,\sigma) - \min(G(j,\sigma), C(i,\sigma)))] \\
&\quad + 0.25\,\min(d(i,\sigma), d(j,\sigma))
\end{aligned}
$$

is larger than a certain threshold, termed the complementarity threshold. It amounts to 0.5 in examples 1–4. The definition of complementarity reflects the interconnection of two positions where the corresponding distributions are located. The definitions of helix and loop in the sequence of distributions are naturally derived from the definition of complementarity (for example, see [26, 35]). For example, a helix is formed by the pairing of the maximally long segments of complementary positions in the sequence of distributions. As usual, we name these segments arms. Thus, the definitions of complementarity and helix for the sequence of distributions are a natural generalization of the common definitions for a nucleotide sequence.
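The complementarity sum can be sketched as a short function. This is only an illustration under the assumption that each distribution is a 5-element list in the order A, C, T, G, $d$; the function names are hypothetical, and the G–T term is read as the minimum of the G and T fractions left over after Watson–Crick pairing.

```python
def complementarity(p, q):
    """Complementarity sum for distributions p (position i) and q (position j).

    Both arguments are 5-element sequences in the order A, C, T, G, d.
    """
    A_i, C_i, T_i, G_i, d_i = p
    A_j, C_j, T_j, G_j, d_j = q
    # Watson-Crick pairs A-T and C-G, counted in both orientations.
    at = min(A_i, T_j) + min(T_i, A_j)
    cg = min(C_i, G_j) + min(G_i, C_j)
    # G-T pairs, formed from the fractions not already used in C-G and A-T pairs.
    gt = (min(G_i - min(G_i, C_j), T_j - min(T_j, A_i))
          + min(T_i - min(T_i, A_j), G_j - min(G_j, C_i)))
    # Two gaps facing each other contribute with a small weight.
    gaps = min(d_i, d_j)
    return at + cg + 0.5 * gt + 0.25 * gaps


def is_complementary(p, q, threshold=0.5):
    """True if the sum is strictly larger than the complementarity threshold."""
    return complementarity(p, q) > threshold
```

Under this reading, a pure A facing a pure T scores 1 and counts as complementary, whereas a pure G facing a pure T scores exactly 0.5 and therefore does not exceed the threshold of 0.5 used in examples 1–4.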

To define the energy of a helix, we generalize a standard definition (for example, see [26]), where the energy is summed over quadruplets of nucleotides (pairs of neighboring paired nucleotides). In our case, the summing is performed over analogous quadruplets of distributions. For such a fixed quadruplet $F$ of distributions, we designate as $X(i)$ the frequency of nucleotide $X$ in the $i$th component of the quadruplet. Then the energy component corresponding to $F$ is

$$\sum_{f = \langle X, Y, Z, V \rangle} \frac{E_f \, X(1)\, Y(2)\, Z(3)\, V(4)}{S},$$

where the summing is performed over all the possible quadruplets $f$ of nucleotides, $E_f$ is the contribution of quadruplet $f$ to the energy calculated in a standard manner, and the normalizing divisor is

$$S = \sum_{f = \langle X, Y, Z, V \rangle} X(1)\, Y(2)\, Z(3)\, V(4).$$

As usual, the energy thus calculated is supplemented with the terms that take into account the inner and outer loops. The loop length is defined as the sum of all nucleotide frequencies in it. The additional terms (entropy) are calculated for all rational arguments by a linear interpolation of the known values from [26] for the integer arguments. For example, if $f(x, y)$ is a standard function determining the entropy from an inner loop with the side lengths of $x = 2.75$ and $y = 3.25$, then

$$
\begin{aligned}
f(2.75, 3.25) &= \tfrac{1}{4} f(2, 3.25) + \tfrac{3}{4} f(3, 3.25) \\
&= \tfrac{1}{4}\left(\tfrac{3}{4} f(2, 3) + \tfrac{1}{4} f(2, 4)\right) + \tfrac{3}{4}\left(\tfrac{3}{4} f(3, 3) + \tfrac{1}{4} f(3, 4)\right) \\
&= \tfrac{3}{16} f(2, 3) + \tfrac{1}{16} f(2, 4) + \tfrac{9}{16} f(3, 3) + \tfrac{3}{16} f(3, 4).
\end{aligned}
$$

The cost of matching two distributions, or a distribution and a gap, in a pairwise alignment of two sequences of distributions is determined as follows. Let $X(1)$ and $X(2)$ be the frequencies of symbol $X$ in the first and second distributions, respectively; the gap is conventionally designated $d$. Then this cost is

$$R\,C - P_1\,|d(1) - d(2)| - P_2\,\bigl(1 - C - \max(d(1), d(2))\bigr),$$

where

$$C = \sum_{X = A, C, T, G} \min(X(1), X(2)),$$

$R$ is a standard bonus for a match of two nucleotides, $P_1$ is a standard penalty for matching a nucleotide and a gap, and $P_2$ is a standard penalty for the mismatch of two nucleotides. In certain more complex situations, we took smaller penalties for matching the nucleotides A and G or C and T, respectively, than for matching another pair of nucleotides.
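The matching cost and the interpolation of loop terms described above can be sketched as follows. The numerical values of the bonus `R` and the penalties `P1` and `P2` are placeholders chosen for illustration only (the text does not state them), the function names are hypothetical, and the difference of gap frequencies is assumed to enter through its absolute value.

```python
import math


def match_cost(p, q, R=1.0, P1=2.0, P2=1.0):
    """Cost of matching two distributions p and q (order A, C, T, G, d).

    R, P1, P2 are illustrative defaults, not the values used by the authors.
    A gap is represented as the distribution with unity in the d component.
    """
    # C: total nucleotide frequency shared by the two distributions.
    C = sum(min(x, y) for x, y in zip(p[:4], q[:4]))
    d1, d2 = p[4], q[4]
    return R * C - P1 * abs(d1 - d2) - P2 * (1.0 - C - max(d1, d2))


def interpolate(f, x, y):
    """Linear interpolation of f(x, y) from its values at integer arguments."""
    x0, y0 = math.floor(x), math.floor(y)
    tx, ty = x - x0, y - y0
    return ((1 - tx) * (1 - ty) * f(x0, y0)
            + (1 - tx) * ty * f(x0, y0 + 1)
            + tx * (1 - ty) * f(x0 + 1, y0)
            + tx * ty * f(x0 + 1, y0 + 1))
```

For one-hot arguments the cost reduces to the usual scoring scheme: $R$ for a match, $-P_2$ for a mismatch, $-P_1$ for a nucleotide against a gap, and zero for two gaps. For $x = 2.75$ and $y = 3.25$ the interpolation weights are 3/16, 1/16, 9/16, and 3/16, as in the worked equation.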


ALGORITHM

Let one initial nucleotide sequence be ascribed to each tree leaf. The iterations consist of a successive alternation of steps 1 and 2. The proposed algorithm ascribes the sequences of distributions to all the tree nodes from leaves to root by the methods described below and concurrently performs the multiple alignment of all these sequences. In this process, it is actually necessary to align only two words at a time, which is a simple task.

Step 1. One sequence of distributions, produced by step 2 of the previous iteration and named the end sequence of distributions, is ascribed to each tree leaf. At the beginning of algorithm operation, the end sequence of distributions in a leaf is specified as a mere copy of the initial nucleotide sequence in this leaf, where each distribution is composed of unity and zeros. We also need the concept of the altered nucleotide sequence; in general, this is the result of step 1 and, at the beginning, a copy of the initial nucleotide sequence. Then, we align these end and altered nucleotide sequences for each leaf without taking into account their secondary structures. In this process, the sequences can acquire a certain number of gaps. At the very beginning, these sequences in each leaf coincide and, correspondingly, their alignment is trivial. The result of step 1, which is conveyed to step 2, is an altered nucleotide sequence in each leaf. The result of step 2, which is conveyed to step 1, is the end sequence of distributions in each leaf. During the overall algorithm operation, the altered nucleotide sequence in a leaf differs from the initial one specified in this leaf only by the addition of gaps.

Suppose that the two children $\nu_1$ and $\nu_2$ of a node $\nu$ have already been ascribed sequences of distributions $\sigma_1$ and $\sigma_2$. The algorithm must determine the sequence of distributions $\sigma$ in $\nu$. For this purpose, it first aligns the secondary structures of these sequences in the following way. From $\sigma_1$ and $\sigma_2$, we determine two sets $\Omega_1$ and $\Omega_2$, respectively, which consist of the helices (in the first and second sequences, respectively) with the energy exceeding a certain threshold, which in examples 1–4 is taken as 10 kcal/mol. We name it the energy threshold.

From the sets $\Omega_1$ and $\Omega_2$, the algorithm proceeds to the linear orderings of the arms of the helices belonging to these sets. More precisely, the algorithm linearly orders the arms: roughly speaking, it arranges the arms in the same order as they are located in the sequence, namely, by the middle of the arm and, if the middle points coincide, by the beginning of the arms. Each arm is included into this ordering as a letter from a certain new alphabet, which reflects the information about the beginning and end of the arm together with its neighborhood of a certain size, the nucleotide composition of the arm and its neighborhood, the number of the helix to which the arm belongs, and the information about whether it is a right or left arm. We name these orderings words in the nodes $\nu_1$ and $\nu_2$, respectively, and also designate them $\Omega_1$ and $\Omega_2$ (it is also possible to take trees as $\Omega_1$ and $\Omega_2$; the obtained result is close to the variant with words). These words are aligned. Then the sequences $\sigma_1$ and $\sigma_2$ themselves are aligned taking into account the obtained alignment of secondary structures in $\sigma_1$ and $\sigma_2$. This procedure is described in detail in [36]. We determine the weights for such alignments: for the matching of two distributions or two arms and for the matching of a distribution or an arm to a gap, which comply with a common model of nucleotide substitutions. Here we do not show the weights; we note only the case of two arms. In this situation, the weight is determined as follows: the quality obtained when aligning the arms with their neighborhoods as nucleotide sequences is added to the bonus for each matching in the alignment of the left and right arms of the same helix; here, only the matching of left to left or right to right arms is allowed.
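The linear ordering of arms can be illustrated with a simplified sketch. Here each arm is reduced to its boundaries, helix number, and side; the richer letters described above (neighborhoods and nucleotide composition) are omitted, and the names `Arm`, `order_arms`, and `as_word` are hypothetical.

```python
from typing import List, NamedTuple


class Arm(NamedTuple):
    start: int   # first position of the arm (1-based)
    end: int     # last position of the arm
    helix: int   # number of the helix to which the arm belongs
    side: str    # "L" for a left arm, "R" for a right arm


def order_arms(arms: List[Arm]) -> List[Arm]:
    """Order arms by their middle point; ties are broken by the arm beginning."""
    return sorted(arms, key=lambda a: ((a.start + a.end) / 2, a.start))


def as_word(arms: List[Arm]) -> List[str]:
    """Simplified 'word': one letter per arm, recording only helix number and side."""
    return [f"{a.helix}{a.side}" for a in order_arms(arms)]
```

For a helix 1 that encloses a helix 2, this ordering produces a word of the form 1L, 2L, 2R, 1R, and two such words can then be aligned with any standard pairwise alignment routine.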

Then for each position $i$, we take as the distribution $\sigma_i$ the weighted mean of the distributions $\sigma_{1i}$ and $\sigma_{2i}$ with the weights that are determined by the ratio of lengths of the two edges from $\nu$ to $\nu_1$ and $\nu_2$, i.e.,

$$\sigma_i = \frac{l(\nu, \nu_2)}{l(\nu, \nu_1) + l(\nu, \nu_2)}\,\sigma_{1i} + \frac{l(\nu, \nu_1)}{l(\nu, \nu_1) + l(\nu, \nu_2)}\,\sigma_{2i},$$

where $l(\nu, \nu_k)$ is the length of the edge coming from node $\nu$ to node $\nu_k$ ($k = 1, 2$).

This gives the sequence of distributions $\sigma$ in the node $\nu$; then the former $\sigma_1$ and $\sigma_2$ in $\nu_1$ and $\nu_2$ are replaced with the sequences obtained by alignment; the latter are also designated as $\sigma_1$ and $\sigma_2$. Evidently, the new $\sigma_1$ and $\sigma_2$ are obtained from the former sequences of distributions by a coordinated addition of a certain number of gaps into the former sequences. Gaps are likewise added to all the sequences of distributions ascribed to the nodes below $\nu$, including the altered and initial nucleotide sequences ascribed to the tree leaves. Now all these descendants of the sequence $\sigma$ have equal length.

When step 1 reaches the root, the altered nucleotide sequences are formed in the leaves; these sequences are the only result of step 1, and it is conveyed to step 2. All the sequences (of distributions and nucleotides) ascribed to the nodes at the end of step 1 have the same length; we name it the length of the zone. Note that both the sequence of distributions and the altered nucleotide sequence with gaps are ascribed to each leaf. During the algorithm operation, the altered nucleotide sequence in any leaf differs from the initial input sequence in this leaf only by the added gaps. Now proceed to step 2.
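The weighted mean used to obtain the parent sequence in step 1 can be sketched as below. The sketch assumes that the two child sequences have already been aligned to equal length; the function name `parent_distributions` is hypothetical.

```python
def parent_distributions(sigma1, sigma2, l1, l2):
    """Position-wise weighted mean of two aligned sequences of distributions.

    l1 and l2 are the lengths of the edges from the parent to the first and
    second child; the child on the shorter edge receives the larger weight.
    """
    if len(sigma1) != len(sigma2):
        raise ValueError("sequences must be aligned to the same length")
    w1 = l2 / (l1 + l2)
    w2 = l1 / (l1 + l2)
    return [[w1 * a + w2 * b for a, b in zip(p, q)]
            for p, q in zip(sigma1, sigma2)]
```

For example, with a pure A in the first child, a pure T in the second, and edge lengths 1 and 3, the parent position receives A with frequency 0.75 and T with frequency 0.25.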

Step 2. The following procedure is performed for each position of the altered nucleotide sequences generated during step 1. Each node $\nu$ of the evolutionary tree is ascribed the set $\delta(\nu)$ of the frequencies of the five possible symbols at the considered position. At step 2, all these values are considered variables, and the functional $F$ described below is minimized over these variables (search for the nearest local minimum).

Let $\rho$ be a certain measure of closeness of vectors (in the simplest case, the sum of squares of the differences between components); for each leaf $\nu$, let $\sigma(\nu)$ denote the constant vector in which unity corresponds to the symbol of the altered nucleotide sequence ascribed to this leaf and the remaining elements are zero; and let $e_b$ and $e_e$ be the ends of edge $e$. The functional $F$ is defined by the following equation (the first summing is performed over all the tree edges and the second, over all its leaves):

$$F = \sum_{e} \rho\bigl(\delta(e_b), \delta(e_e)\bigr) \cdot w(e) + \sum_{\nu} \rho\bigl(\delta(\nu), \sigma(\nu)\bigr) \cdot w(\nu).$$

Here, $w(e)$ and $w(\nu)$ are weight coefficients. The shorter the edge $e$, the larger the weight $w(e)$; the weight $w(\nu)$ is equal to the weight $w(e)$ of the edge $e$ coming from the leaf $\nu$ to its parent multiplied by a special parameter regulating the closeness of the distributions in leaves to the initial data relative to the conservation of distributions in the inner tree nodes. In examples 1–4, this parameter is 1.5.

Minimization (in turn for each position) is performed over all variables, i.e., over the fractions of the five symbols in all tree nodes. The natural constraints are imposed: all the variables are non-negative and the sum of the corresponding five variables at each node is unity. In the mentioned simplest case, the functional is quadratic with only one minimum; thus, this minimum point is easily found by quadratic programming. The end (in leaves) sequences of distributions obtained by this minimization are the only result of step 2, which is conveyed to step 1.
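The functional $F$ for a single position can be evaluated with a short sketch. Two assumptions are made here and are not stated in the text: the edge weight is taken as the reciprocal of the edge length (the text only says that shorter edges get larger weights), and $\rho$ is the simplest sum-of-squares measure. The names `rho` and `functional_F` are hypothetical, and the minimization itself is not shown.

```python
def rho(u, v):
    """Simplest closeness measure: sum of squared component differences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def functional_F(delta, edges, leaf_targets, leaf_factor=1.5):
    """Value of F at one alignment position.

    delta        : dict mapping node -> 5-vector of symbol frequencies
    edges        : list of (parent, child, length) tuples
    leaf_targets : dict mapping leaf -> one-hot vector sigma(leaf)
    leaf_factor  : parameter scaling the leaf weights (1.5 in examples 1-4)
    """
    total = 0.0
    for parent, child, length in edges:
        w_e = 1.0 / length  # assumption: weight inversely proportional to length
        total += rho(delta[parent], delta[child]) * w_e
        if child in leaf_targets:
            # Leaf term: w(leaf) = w(edge to its parent) * leaf_factor.
            total += rho(delta[child], leaf_targets[child]) * w_e * leaf_factor
    return total
```

With the quadratic $\rho$, this value is a convex quadratic function of the frequencies, so under the simplex constraints it could be passed to any quadratic programming routine.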

We considered a more complex variant of this algorithm with an additional step aimed to take into account the mutations of letters according to one of the substitution models as well as insertions and deletions in the primary structure. However, this does not lead to a considerable difference for the classic attenuation regulation. For other regulation types, the situation is more intricate and is beyond our consideration here.

The algorithm terminates the alternation of steps 1 and 2 (at step 1) if the length of the zone stops growing. Such a choice of the criterion for terminating the alternation of steps 1 and 2 results from a heuristic observation that the algorithm cycled and the length of the zone stopped growing when a good multiple alignment of the nucleotide sequences in leaves, taking into account the secondary structure, was found. We do not state that this will be true for any input data; however, this situation was observed for the considered examples. Then the algorithm proceeds to step 3, where it uses only the final multiple alignment of the sequences of distributions in all the tree nodes obtained after steps 1 and 2.
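The alternation of steps 1 and 2 with this stopping rule can be outlined as a generic loop. In this sketch `step1`, `step2`, and `zone_length` are caller-supplied placeholders for the procedures described above, and the iteration cap `max_iterations` is an added safeguard that is not part of the described algorithm.

```python
from typing import Callable, TypeVar

State = TypeVar("State")


def alternate_steps(state: State,
                    step1: Callable[[State], State],
                    step2: Callable[[State], State],
                    zone_length: Callable[[State], int],
                    max_iterations: int = 100) -> State:
    """Alternate step 1 and step 2 until the zone length stops growing.

    The check is made right after step 1, as in the text.
    """
    previous = None
    for _ in range(max_iterations):
        state = step1(state)
        current = zone_length(state)
        if previous is not None and current <= previous:
            break  # the length of the zone stopped growing
        previous = current
        state = step2(state)
    return state
```

The state returned by the loop is the one produced by the last step 1, which corresponds to the final multiple alignment passed on to step 3.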

Step 3. In the final multiple alignment, we proceed from the sequences of distributions to new nucleotide sequences with gaps, which are ascribed to all the tree nodes including leaves. All these new sequences have the same constant length as in the final multiple alignment of the sequences of distributions. These particular sequences are shown in the figures with even numbers, where they are denoted by the numbers of the corresponding tree nodes; the initial nucleotide sequences with the gaps added during the algorithm operation are designated with the name of the corresponding species. The transition is performed in the following way: if a position is not contained in a helix, we place there the symbol (nucleotide or gap) that has the highest frequency in the distribution corresponding to this position and tree node. If this position is contained in one or several helices, we place there the letter that is paired with the highest energy within the maximal number of helices and yet retains a sufficient frequency. To this end, we examine which of the four letters displays the best value of the following characteristic: the sum of the energies of all helices containing this position, where each term is multiplied by the weight equal to the fraction of the frequency of this letter that is involved in this helix; if none of the four letters reaches a certain threshold value of this characteristic, then a gap is put down. The figures with even numbers show the final multiple alignment of new nucleotide sequences obtained after step 3 together with the altered nucleotide sequences with gaps generated at the last step 1 procedure, i.e., the same sequences as the initial ones with the gaps added during the algorithm operation. In the figures, the former sequences are indicated with the numbers of nodes, and the latter are indicated with the names of species.
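The rule for positions outside helices can be sketched as a simple highest-frequency choice. The energy-based rule for positions inside helices is not reproduced here: in this illustration such positions are returned as `None` and left for that rule. The function names are hypothetical, and the gap is written as `-`.

```python
SYMBOLS = "ACTG-"  # order A, C, T, G, gap


def consensus_symbol(dist):
    """Symbol with the highest frequency in a 5-element distribution."""
    best = max(range(len(SYMBOLS)), key=lambda k: dist[k])
    return SYMBOLS[best]


def non_helix_consensus(sigma, helix_positions):
    """Apply the highest-frequency rule to positions outside helices.

    sigma           : sequence of distributions for one tree node
    helix_positions : set of 1-based positions contained in some helix
    Helix positions are returned as None (they need the energy-based rule).
    """
    return [None if i in helix_positions else consensus_symbol(p)
            for i, p in enumerate(sigma, start=1)]
```

If several symbols share the highest frequency, this sketch keeps the first one in the order A, C, T, G, gap; the text does not specify how such ties are resolved.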

Step 4. According to a certain threshold, we select the most conserved helices from the final multiple alignment of the new nucleotide sequences in leaves; according to the alignment, we indicate the corresponding sequences in the altered nucleotide sequences (see the figures mentioned above); we output the helices induced by these in the initial nucleotide sequences. Thus, the evolutionarily induced secondary structure, which was unknown to the algorithm beforehand, is indicated in the initial data. This provides for an independent multiple alignment of the initial nucleotide sequences taking into account this secondary structure. The result is close to that after step 3.

EXAMPLES OF APPLYING THE ALGORITHM TO BIOLOGICAL DATA ANALYSIS

Due to the preliminary character of this paper and limited space, we confine ourselves to four examples (one or two for each main type of regulation based on mRNA secondary structure). For these examples, the known PAML and PAUP programs (see Discussion) gave considerably worse results than our algorithm (data not shown). More comprehensive testing results are available on the site of our laboratory (http://lab6.iitp.ru, item 9).

Example 1. Consider the classic attenuation regulation of threonine biosynthesis in Gammaproteobacteria. The initial sites in leaves are taken from [22] (the algorithm uses neither the information about secondary structure known beforehand nor multiple alignment). A standard species tree (Fig. 1) is taken; in this tree, the nodes are numbered from 1 to 27 (the numbers are under rectangles), and each edge is labeled with its phylogenetic length in arbitrary units (the smaller numbers). The following abbreviations are used: EC for Escherichia coli, TY for Salmonella typhi, KP for Klebsiella pneumoniae, EO for Erwinia carotovora, YP for Yersinia pestis, HI for Haemophilus influenzae, VK for Pasteurella multocida, AB for Actinobacillus actinomycetemcomitans, PQ for Mannheimia haemolytica, VC for Vibrio cholerae, VV for Vibrio vulnificus, VP for Vibrio parahaemolyticus, SON for Shewanella oneidensis, and XCA for Xanthomonas campestris.

The algorithm outputs the final multiple alignment of nucleotide sequences: the assumed regulatory sites and the altered nucleotide sequences (see Fig. 2). The terminator is shown dark gray, and the antiterminator is underlined. The terminator in the altered nucleotide sequences is shown in light gray. The secondary structures obtained in leaves (Fig. 2) differ only slightly from the secondary structures either predicted by other bioinformatics methods or determined experimentally [26].

Fig. 1. Phylogenetic tree for the classic attenuation regulation of threonine biosynthesis in Gammaproteobacteria.

Fig. 2. The multiple alignment (with secondary structure) of assumed regulatory sites generated by the proposed algorithm for the classic attenuation regulation of threonine biosynthesis in Gammaproteobacteria. Row labels, in order: 1, 2, XCA, 3, 4, SON, 5, 6, 7, VP, 8, 9, VV, 10, VC, 11, 12, 13, VK, 14, 15, PQ, 16, 17, HI, 18, AB, 19, 20, KP, 21, 22, 23, EO, 24, YP, 25, 26, EC, 27, TY.

Example 2. Consider the classic attenuation regulation of leucine biosynthesis in Gammaproteobacteria. The initial sites in leaves are taken from [22]. A standard species tree (Fig. 3) is taken.

The algorithm outputs the final multiple alignment of nucleotide sequences: the assumed regulatory sites and the altered nucleotide sequences (see Fig. 4). The designations are the same as in the previous example.

Fig. 3. Phylogenetic tree for the classic attenuation regulation of leucine biosynthesis in Gammaproteobacteria.

Fig. 4. The multiple alignment (with secondary structure) of assumed regulatory sites generated by the proposed algorithm for the classic attenuation regulation of leucine biosynthesis in Gammaproteobacteria. Row labels, in order: 1, 2, SON, 3, 4, 5, VP, 6, 7, VV, 8, VC, 9, 10, 11, VK, 12, 13, PQ, 14, HI, 15, 16, KP, 17, 18, 19, EC, 20, TY, 21, 22, EO, 23, YP.

Example 3. Consider the T-box regulation of the ileS gene in Actinobacteria. The initial sites are taken from [27]. A standard species tree (Fig. 5) is used. Abbreviations: An for Actinomyces naeslundii, Cd for Corynebacterium diphtheriae, Ce for Corynebacterium efficiens, Cg for Corynebacterium glutamicum, Ma for Mycobacterium avium, Ml for Mycobacterium leprae, Mm for Mycobacterium marinum, Mt for Mycobacterium tuberculosis, Nf for Nocardia farcinica, Pa for Propionibacterium acnes, Rx for Rubrobacter xylanophilus, Sa for Streptomyces avermitilis, Sc for Streptomyces coelicolor, and Tf for Thermobifida fusca.

The algorithm outputs the final multiple alignment of nucleotide sequences: the assumed regulatory sites and the altered nucleotide sequences (see Fig. 6). The antisequester is shown dark gray, and the sequester is underlined. Two variants of regulation are observed at node 5; in the alternative variant, the antisequester is italicized and the sequester is shown on a light gray background. An analogous situation is observed at node 21.

Example 4. Consider one more type of regulation: RFN-mediated regulation of the expression of genes involved in riboflavin biosynthesis and transport in Eubacteria (gene ribB in BC, EC, PP, and YP and gene ribD in the remaining species; abbreviations are below). The initial sites are taken from [37]. A standard species tree (Fig. 7) is used. Abbreviations: BQ for Bacillus anthracis, BH for Bacillus halodurans, BS for Bacillus subtilis, BP for Burkholderia pseudomallei, CA for Clostridium acetobutylicum, DF for Clostridium difficile, DR for Deinococcus radiodurans, EC for Escherichia coli, LL for Lactococcus lactis, PP for Pseudomonas putida, SA for Staphylococcus aureus, TM for Thermotoga maritima, and YP for Yersinia pestis.

Fig. 5. Phylogenetic tree for the T-box regulation of the gene ileS in Actinobacteria.

Fig. 6. The multiple alignment (with secondary structure) of assumed regulatory sites generated by the proposed algorithm for the T-box regulation of the gene ileS in Actinobacteria. Row labels, in order: 1, 2, Rx, 3, 4, Pa, 5, 6, 7, Tf, 8, 9, An, 10, 11, Sa, 12, Sc, 13, 14, 15, Mm, 16, 17, Mt, 18, 19, Ma, 20, Ml, 21, 22, Nf, 23, 24, Cd, 25, 26, Ce, 27, Cg.

The RFN structure is represented as a helical stem with four helices in its loop, numbered clockwise (helices 1, 2, 3, and 4). Our algorithm outputs the multiple alignment shown in Fig. 8. The target RFN structure is indicated: the stem is double underlined, helices 1 and 3 are shown on a light gray fine background, helices 2 and 4 are shown on a light gray background, and variable helices are shown on a dark gray or black background. Two helices (1 and 2) can be replaced with one helix; the remaining two helices (3 and 4) can also be replaced with one helix; these alternative helices are underlined. Note that the conserved nucleotides characteristic of the RFN structure were mainly aligned by columns. In ancestor 19, helix 4 has an alternative (italicized), which continues in the descendants.
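The topology described for the RFN element in Example 4 (a stem whose loop contains four helices) can be written down in dot-bracket notation and decomposed into helices programmatically. The sketch below uses a toy structure with arbitrary helix and loop lengths, not the real RFN element; the function names are hypothetical, and the helices are listed in 5' to 3' order, which is only assumed to correspond to the clockwise numbering used above.

```python
def pairs_from_dot_bracket(structure):
    """Return base pairs (i, j), sorted by i, of a pseudoknot-free dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def helices(structure):
    """Group base pairs into helices, i.e. maximal runs of stacked pairs
    (i, j), (i + 1, j - 1), ... without bulges."""
    result = []
    for i, j in pairs_from_dot_bracket(structure):
        if result and result[-1][-1] == (i - 1, j + 1):
            result[-1].append((i, j))   # stacks directly on the previous pair
        else:
            result.append([(i, j)])     # starts a new helix
    return result


# Toy structure: a 4-bp stem enclosing four 3-bp hairpins (lengths are arbitrary).
hairpin = "(((....)))"
toy = "((((" + ".." + (hairpin + "..") * 4 + "))))"
stem, *loop_helices = helices(toy)
```

With this toy input, `stem` holds the four pairs of the enclosing stem and `loop_helices` holds the four helices located in its loop.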

DISCUSSION

The goal of this work was to present a new method for modeling the evolution and multiple alignment of related RNA sequences, taking into account their assumed common and coevolving secondary structure, and to carry out initial testing of this method. The method is based on two assumptions: (1) the initial sequences in leaves have a common coevolving secondary structure and (2) the phylogenetic tree is considered known; the sequences at its leaves may be specified with or without their secondary structure (the structure is not assumed to be known).

Fig. 7. Phylogenetic tree of Eubacteria.

In the examples described here and in other examples of regulation, our algorithm constructs a reasonable secondary structure at ancestral nodes. It is close to the regulatory structure predicted by bioinformatics and determined experimentally. The secondary structures generated by the algorithm from the initial sequences in leaves also practically coincide with the known structures. The multiple alignment of primary structures in leaves and even along the whole tree is of good quality. We tested the model by adding noise to both artificial and biological examples and observed stable operation of the algorithm.
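The kind of noise used in these tests is not specified here. As one possible illustration, the sketch below perturbs leaf sequences with independent random point substitutions at a given per-position rate; the noise model, the rate, and the function name are all assumptions, not the procedure actually used.

```python
import random

ALPHABET = "ACGU"


def add_substitution_noise(seq, rate, rng):
    """Replace each nucleotide, independently with probability `rate`,
    by a different nucleotide chosen uniformly at random (assumed noise model)."""
    out = []
    for ch in seq:
        if ch in ALPHABET and rng.random() < rate:
            out.append(rng.choice([b for b in ALPHABET if b != ch]))
        else:
            out.append(ch)   # gaps and unknown symbols are left untouched
    return "".join(out)


rng = random.Random(0)                      # fixed seed for reproducibility
leaf = "GGCUAAAGCC"                         # toy sequence, not real data
noisy = add_substitution_noise(leaf, 0.1, rng)
```

A robustness check would then rerun the reconstruction on the noisy leaves and compare the resulting helices with those obtained from the original sequences.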

The first approach, mentioned in the Problem Statement section and detailed in [32–34, 38], gave the same or very similar secondary structures for the initial data of examples 1–4, despite the fact that the two approaches are different. We tested the obtained ancestral signals using the model of classic attenuation regulation from [26, 35] and the software available at http://lab6.iitp.ru, item 3. This testing confirmed the functionality of the ancestral signals for this regulation.

Comparison with other methods and programs. To compare the result given by our algorithm with the results of other standard algorithms, we applied well-known programs, such as PAML and PAUP (http://evolution.genetics.washington.edu/phylip/software.serv.html), to the same data. When only the initial primary structures (without alignment) were input, none of these programs was able to reconstruct ancestral regulatory elements of the type present in leaves. When an alignment that took the secondary structure into account (for example, the one given in [22] for classic attenuation regulation) was also input, the result depended on the program used. PAML was unable to predict a secondary structure of the required type in ancestral sequences. PAUP was able to predict such a structure; however, it was not conserved along the tree edges.

Fig. 8. The multiple alignment (with secondary structure) of assumed regulatory sites generated by the proposed algorithm for the RFN-mediated regulation of expression of the genes involved in riboflavin biosynthesis and transport in Eubacteria. Row labels, in order: 1, 2, TM, 3, 4, DR, 5, 6, 7, DF, 8, 9, CA, 10, 11, SA, 12, LL, 13, 14, 15, BS, 16, 17, BH, 18, 19, BQ, 20, YP, 21, 22, EC, 23, 24, PP, 25, BP.

ACKNOWLEDGMENTS

We are grateful to the reviewer, whose profound criticism allowed the paper to be considerably improved.

REFERENCES

1. Nei M., Kumar S. 2000. Molecular Evolution and Phylogenetics. Oxford: Oxford Univ. Press.

2. Gascuel O., Steel M. 2007. Reconstructing Evolution: New Mathematical and Computational Advances. Oxford: Oxford Univ. Press.

3. Page R.D.M., Holmes E.C. 1998. Molecular Evolution: A Phylogenetic Approach. Oxford: Blackwell Publishing.

4. Wolf Y., Rogozin I., Grishin N., Tatusov R., Koonin E. 2001. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol. Biol. 1, 1–22.

5. Durand D., Halldorsson B.V., Vernot B. 2006. A hybrid micro-macroevolutionary approach to gene tree reconstruction. J. Comput. Biol. 13, 320–335.

6. Gascuel O. (Ed.). 2004. Mathematics of Evolution and Phylogeny. Oxford: Oxford Univ. Press.

7. Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA: Sinauer Assoc.

8. Nakhleh L., Warnow T., Linder C.R. 2004. Reconstructing reticulate evolution in species: Theory and practice. In: Proc. 8th Annual Conference on Research in Computational Molecular Biology. ACM, pp. 337–346.

9. Mirkin B.G., Fenner T.I., Galperin M.Y., Koonin E.V. 2003. Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol. Biol. 3, 1–34.

10. Guigo R., Muchnik I., Smith T. 1996. Reconstruction of ancient molecular phylogeny. Mol. Phylog. Evol. 6, 189–213.

11. Page R.D.M., Charleston M.A. 1997. From gene to organismal phylogeny: Reconciled trees and gene tree/species tree problem. Mol. Phylog. Evol. 7, 231–240.

12. Page R.D.M. 1998. GeneTree: Comparing gene and species phylogenies using reconciled trees. Bioinformatics. 14, 819–820.

13. Zmasek C.M., Eddy S.R. 2001. A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics. 17, 821–828.

14. Chauve C., Doyon J.-P., El-Mabrouk N. 2007. Inferring a duplication, speciation and loss history from a gene tree (extended abstract). In: Comparative Genomics, RECOMB 2007 International Workshop. Eds. Tesler G., Durand D., LNCS (LNBI), vol. 4751, Heidelberg: Springer, pp. 45–57.

15. Elias I., Tuller T. 2007. Reconstruction of ancestral genomic sequences using likelihood. J. Comput. Biol. 14, 216–237.

16. Hudek A.K., Brown D.G. 2005. Ancestral sequence alignment under optimal conditions. BMC Bioinformatics. 6, 1–14.

17. Hallett M.T., Lagergren J. 2000. New algorithms for the duplication-loss model. In: Proc. 4th Annual International Conference on Computational Molecular Biology, RECOMB 2000. ACM, pp. 138–146.

18. Berglund A.-C., Lagergren J., Sennblad B. 2004. Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. In: Proc. 8th Annual International Conference on Research in Computational Molecular Biology, RECOMB. Eds. Bourne P.E., Gusfield D. ACM, pp. 326–335.

19. Bonizzoni P., Della Vedova G., Dondi R. 2005. Reconciling a gene tree to a species tree under the duplication cost model. Theor. Comput. Sci. 347, 36–53.

20. Gorecki P., Tiuryn J. 2006. DLS-trees: A model of evolutionary scenarios. Theor. Comput. Sci. 359, 378–399.

21. Lyubetsky V.A., Gorbunov K.Yu., Rusin L.Y., V’yugin V.V. 2005. Algorithms to reconstruct evolutionary events at molecular level and infer species phylogeny. In: Bioinformatics of Genome Regulation and Structure II. Springer Science & Business Media, Inc., pp. 189–204.

22. Vitreschak A.G., Lyubetskaya E.V., Shirshin M.A., Gelfand M.S., Lyubetsky V.A. 2004. Attenuation regulation of amino acid biosynthetic operons in proteobacteria: Comparative genomics analysis. FEMS Microbiol. Lett. 234, 357–370.

23. Gelfand M.S., Gerasimova A.V., Kotelnikova E.A., Laikova O.N., Makeev V.Y., Mironov A.A., Panina E.M., Ravcheev D.A., Rodionov D.A., Vitreschak A.G. 2005. Comparative genomics and evolution of bacterial regulatory systems. In: Bioinformatics of Genome Regulation and Structure II. Springer Science & Business Media, Inc., pp. 111–119.

24. Seliverstov A.V., Putzer H., Gelfand M.S., Lyubetsky V.A. 2005. Comparative analysis of RNA regulatory elements of amino acid metabolism genes in Actinobacteria. BMC Microbiol. 5, 1–14.

25. Seliverstov A.V., Lyubetsky V.A. 2006. Translation regulation of intron containing genes in chloroplasts. J. Bioinform. Comp. Biol. 4, 783–793.

26. Lyubetsky V.A., Pirogov S.A., Rubanov L.I., Seliverstov A.V. 2007. Modeling classic attenuation regulation of gene expression in bacteria. J. Bioinform. Comp. Biol. 5, 155–180.

27. Vitreschak A.G., Mironov A.A., Lyubetsky V.A., Gelfand M.S. 2008. Comparative genomic analysis of T-box regulatory systems in bacteria. RNA. 14, 717–735.

28. McAdams H.H., Srinivasan B., Arkin A.P. 2004. The evolution of genetic regulatory systems in bacteria. Nature Rev. Genet. 5, 169–178.

29. Savill N.J., Hoyle D.C., Higgs P.G. 2001. RNA sequence evolution with secondary structure constraints: Comparison of substitution rate models using maximum-likelihood methods. Genetics. 157, 399–411.

30. Kosakovsky Pond S.L., Mannino F.V., Gravenor M.B., Muse S.V., Frost S.D.W. 2007. Evolutionary model selection with a genetic algorithm: A case study using stem RNA. Mol. Biol. Evol. 24, 159–170.

31. Fischer W., Geard N. Reconstructing phylogeny from RNA secondary structure via simulated evolution. http://www.itee.uq.edu.au/nic/papers/csss-rna.pdf.

32. Lyubetsky V., Zhizhina E., Rubanov L. 2008. Gibbs field approach to the problem of evolution of biological sequences. Probl. Pered. Inform. (in press).

33. Gorbunov K.Yu., Lyubetsky V.A. 2007. Modeling evolution of the nucleotide sequence with secondary structure. In: Proc. Computational Phylogenetics and Molecular Systematics: CPMS'2007. Moscow: KMK Scientific Press, pp. 68–75.

34. Lyubetsky V.A., Seliverstov A.V., Gorbunov K.Yu. 2007. Models of gene expression regulation and evolution of regulatory elements. In: Proc. Computational Phylogenetics and Molecular Systematics: CPMS'2007. Moscow: KMK Scientific Press, pp. 158–165.

35. Asarin E., Cachat Th., Seliverstov A.V., Touili T., Lyubetsky V.A. 2007. Attenuation regulation as a term rewriting system. In: Algebraic Biology. LNCS (LNBI), vol. 4545, Springer, pp. 81–94.

36. Gorbunov K.Yu., Mironov A.A., Lyubetsky V.A. 2003. Search for conserved secondary structures of RNA. Mol. Biol. 37, 850–860.

37. Vitreschak A.G., Rodionov D.A., Mironov A.A., Gelfand M.S. 2002. Regulation of riboflavin biosynthesis and transport genes in bacteria by transcriptional and translational attenuation. Nucleic Acids Res. 30, 3141–3151.

38. Gorbunov K.Yu., Lyubetsky V.A. 2007. Reconstruction of ancestral regulatory signals along a transcription factor tree. Mol. Biol. 41, 918–925.
