The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter...

56
The RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center for Bioinformatics, University of Leipzig RNomics Group, Fraunhofer Institute IZI, Leipzig Institute for Theoretical Chemistry, Univ. of Vienna (external faculty) The Santa Fe Institute (external faculty) FOGA07, Cd. M´ exico, Jan 2007

Transcript of The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter...

Page 1: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

The RNA World, Fitness Landscapes, and all that

Peter F. Stadler

Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center forBioinformatics, University of Leipzig

RNomics Group, Fraunhofer Institute IZI, LeipzigInstitute for Theoretical Chemistry, Univ. of Vienna (external faculty)

The Santa Fe Institute (external faculty)

FOGA07, Cd. Mexico, Jan 2007

Page 2: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Why RNA?

until relatively recently:Central Dogma of Molecular BiologyDNA → RNA → ProteinDNA = “genetic memory”, RNA = working copy, proteins dothe work

around 1980: discovery of catalytic RNAs (Nobelprize for TomCech and Sidney Altman)nevertheless long considered “exotic” remnants from theancient RNA world=⇒ RNA World Hypothesis for the Origin of Life

around 2000: structure of the ribosome showns that theribosome is an “RNA enzyme”

around 2000: microRNAs are discovered as a large class ofregulatory RNAs that inhibit translation of proteins

2006: the ENCODE project shows that human geneexpression is quite different from textbook knowledge

Page 3: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

RNA Bioinformatics

RNA Secondary Structures are an appropriate level of description

explain the thermodynamics of RNA Structures

often highly conserved in evolution

can be computed efficiently

Page 4: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Many Functional RNAs are Structured

(a) Group I intron P4-P6 domain(b) Hammerhead ribozyme(c) HDV ribozyme

(d) Yeast tRNAphe

(e) L1 domain of 23S rRNA

Hermann & Patel, JMB 294, 1999

Page 5: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

The RNA Model

GCGGGAAU

AGCUC

AGUUGG U A

G A G CA

CGA

CC

UU

GC C

AAGGUCGGGGU

CG C G A G

U U CGA

GUCUCGU

UUCCCGC

UC

CA

GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA

Page 6: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Formal Definition

A secondary structure on a sequence s is a collection of pairs (i , j)with i < j such that

Base pairing rules are respected, i.e., (i , j) ∈ Ω implies (si , sj)form an allowed pair (GC, CG, AU, UA, GU, UG)

Each base is involved in at most one pair, i.e., Ω is amatching, (i , j), (i , k) ∈ Ω implies j = k and (i , k), (j , k) ∈ Ωin implies i = j .

(i , j)Ω implies |j − i | > 3 (sterical constraint)

No-crossing rule: (i , j), (k, l) ∈ Ω and i < k implies eitheri < k < l < j or i < j < k < l .This excludes so-called pseudoknots

Page 7: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Let’s count the structures . . .

Counting secondary structures. Given a sequence of length n.Πkl = 1 if sequence positions k, l can form a pair GC, CG, AU,

UA, GU, UG and Πkl = 0 otherwise.Nkl = number of structures of the subsequence from k to l .Basic recursion:

• xxxxxxx +∑

(xxxx)xxxx

Nkl = Nk+1,l +l

j=k+m

ΠkjNk+1,j−1Nj+1,l

Page 8: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

RNA Folding in a nutshell

i jj i i+1 j i i+1 k−1 k k+1|=

Nij = Ni+1,j +∑

k(i,k)pair

Ni+1,k−1Nk+1,j

Eij = min

Ei+1,j + mink

(i,k)pair

(Ei+1,k−1 + Ek+1,j + εik)

Zij = Zi+1,j +∑

k(i,k)pair

Zi+1,k−1Zk+1,j exp(−εik/RT )

Partition function: Z =∑

Ω exp(−E (Ω)/RT )

Page 9: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

A word on the Partition Function

The partition function is the link between the combinatorics of thestructures (in general: states in an ensemble) and thethermodynamic properties of the physical ensemble, e.g.:

Free energy G = −RT lnZ

Expected Energy 〈E 〉 = RT 2 ∂ lnZ∂T

Heat Capacity Cp = −T ∂2G∂T 2

0 20 37 100

0

1

2

3

4

5

CUGUAUUGUUGUAUAGCCCGUGUGGUAAUAUGG

T [C]

C(T

) [k

cal•(

mol

•K)-1

]

Page 10: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Realistic Energy Model

GC

U

A

C

G

closing base pair

A

5

3

3

5

A

5

3

closing base pair

interior base pairsclosing base pair

3

C

U

A

A

U G

C

closing base pair

multi-loop

interior base pair

A U

interior base pair

interior base pair

hairpin loop

bulge

C

G

C U

A

closing base pair

stacking pair

interior loop

G

G

G

3

5A

G

C

CA

5

Parameters from large number of melting experiments by Douglas

Turner, David Matthews, John Santa Lucia, and others

Page 11: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Recursions for Linear RNAs

M1 M1

M1

i u u+1|

MC

= |

= | |

FC

i j i+1 j i

hairpin Cinterior

i j i i k l j

k k+1 j

=

C

F F

i j

M

=i ij j−1

j

j| C

i j

M

ui+1 u+1i j−1 j

j i

M

j−1 ji u u+1|C

Page 12: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Folding Kinetics

RNA molecules may have kinetic traps which prevent them from reachingequilibrium within the lifetime of the molecule. Long molecules are oftentrapped in such meta-stable states during transcription.Possible solutions are

Stochastic folding simulations can predict folding pathways and finalstructures. Computationally expensive, few programs available.

Predicting structures for growing fragments of the sequence canshow whether large scale re-folding will occur during transcription.Cheap but inaccurate.

Analysis of the energy landscape based on complete suboptimalfolding can identify possible traps (local minima).

Page 13: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Kinetic Folding Algorithm

Simulate folding kinetics by a Monte-Carlo type algorithm:Generate all neighbors using the move-setAssign rates to each move, e.g.

Pi = min

1, exp

(

−∆E

kT

)

Select a move with probability proportional toits rateAdvance clock 1/

i Pi .

P4

P3

P5P6

P7

P8

P1 P2

Page 14: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Characterization of Landscapes

A landscape consists of a configuration space V , a move set within thatconfiguration space and an energy function f : V → R.Simplest move set for secondary structure: opening and closing of basepairs.Speed of optimization depends on the roughness of the Landscape.Measures of roughness suggested in the literature:

Number of local optima

Correlation lengths (e.g. along a random walk)

Lengths of adaptive walks

Folding temperature vs. glass temperature Tf /Tg

Energy barriers between the local optima. Especially, themaximum barrier height (“depth” in SA literature)

Page 15: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Ruggedness

Bryce Canyon, UT Capulin Volcano, NMrugged “smooth”

Page 16: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Measures of Ruggedness

Number of Local Minima and Maxima

Length of Adaptive Walks

Correlation Functions:Random walk on V as defined through X : x0, x1, x2, . . .(Markov process on V ).=⇒ “Time Series” f (x0), f (x1), f (x2), . . .=⇒ Autocorrelation Function

r(s) =

(

f (xt+s ) − 〈f 〉) (

f (xt) − 〈f 〉)

t⟨

(

f (xt) − 〈f 〉)2

t

Page 17: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

For simplicity:

f (x) → f (x) − f f =1

|V |

x∈V

f (x)

(Subtract the landscape average)Random Walk = Markov process with transition matrix T

Theorem. r(s) = 〈f ,Ts f 〉/

〈f , f 〉.Theorem. r(s) is an exponential function iff f is an eigenfunctionof T. We say (V , f ,T) is an elementary landscape.For d-regular graphs: Relation with graph Laplacians

−∆ = A − D = d(T − I )

Elementary landscapes are eigenfunctions of the Graph Laplacian

Page 18: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Amplitude Spectra

Let ϕk be an orthonormal basis of eigenvectors of T. We will beinterested in the “Fourier Transform” ak where

f (x) =∑

k

akϕk(x)

Theorem. var[f ] := 1|V |

x f (x)2 =∑

k 6=0 |ak |2.

Collect terms that belong to the same eigenspace:

Bp =1

var[f ]

k:−∆ϕk=Λpϕk

|ak |2

“Amplitude Spectrum of f ”.

Page 19: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Quantities derived from Amplitude Spectra

Autocorrelation Function:

r(s) =∑

p 6=0

Bp(1 − Λp/D)s

Correlation length

ℓ =

∞∑

s=0

r(s) = D∑

p 6=0

Bp/Λp

Elementary landscapes:f is elementary if and only if Bp = 1 for a single p, and 0otherwise

Page 20: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Some Elementary Landscapes

Problem Move Set D Λ ℓ/n

NAES Hamming n 4 1/4p-spin Hamming n 2p 1/(2p)WP Hamming n 4 1/4GC Hamming (α − 1)n 2α (1 − 1/α)/2XY-Hamiltonian Hamming (α − 1)n 2α (1 − 1/α)/2

cyclic 2n 8 sin2(π/α) 1/[4 sin2(π/α)]

GBP Exchange n2/4 2(n − 1) 1/8 · n/(n − 1)symmetric TSP Transposition n(n − 1)/2 2(n − 1) 1/4

Inversions n(n − 1)/2 n 1 − 1/n)/4GMP Transposition n(n − 1)/2 2(n − 1) n/4

NAES = Non-All-Equal-Satisfiability, WP = Weight Partition, GC = Graph Coloring with α colors, GBP = Graph

Bipartitioning, TSP = Traveling Salesman Problem, GMP = Graph Matching Problem XY-Hamiltonian:

P

i<j Jij cos( 2π

α(xi − xj ) )

Page 21: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Amplitude Spectra of RNA Landscapes

5 10 15 20 25Interaction Order p

0

0.1

0.2

0.3

0.4

0.5

Am

plitu

de S

pect

rum

Bp

Ensemble Free Energies

12345678910111213141516171819202122232425

5 10 15 20 250

0.1

0.2

0.3

0.4

0.5A

mpl

itude

Spe

ctru

m B

p

Ground State Energies

5 10 15 20 25Interaction Order p

Distance to Maximal Hairpin5 10 15 20 25

Distance to Open Structure

Page 22: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Amplitude Spectra of RNA Landscapes

6 8 10 12

Interaction Order p

0

0.1

0.2

0.3

0.4

0.5

Am

plitu

de S

pect

rum

B(p

)

Ensemble Free Energies6 8 10 12

0

0.1

0.2

0.3

0.4

0.5A

mpl

itude

Spe

ctru

m

B(p

)

Groundstate Energies

6 8 10 12

Interaction Order p

123456789101112Distance to Maximal Hairpin

6 8 10 12

Distance to Open Structure

Page 23: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Neutrality

Degree of neutrality E[νx ]is the expected number of neutral neighbors of x .Suppose µ0 > 0. Then

E[νx ] =∑

y :y∈∂x

µcx (y)

wherecx(y) =

∣j |θj (x) 6= θj(y)∣

∣ y ∈ ∂x

Similar equations can be obtained for E[νxνy ].=⇒ Correlation length ℓ and degree of neutrality νx can be tunedindependently of each other in short range p-spin models.

Page 24: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Energy barriers

E [s,w ] = min

max[

f (z)∣

∣z ∈ p]

p : path from s to w

,

B(s) = min

E [s,w ] − f (s)∣

∣w : f (w) < f (s)

Depth and Difficulty(borrowed from simulated annealing theory)

D = max

B(s)∣

∣s is not a global minimum

ψ = max

B(s)

f (s) − f (min)

s is not a global minimum

Page 25: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Energy Barriers and Barrier Trees

Some topological definitions:A structure is a

local minimum if its energy is lowerthan the energy of all neighbors

local maximum if its energy ishigher than the energy of all

neighbors

saddle point if there are at leasttwo local minima that can bereached by a downhill walk startingat this point

Page 26: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Calculating barrier trees

M1

M2

M3

S12

S23

M3

M3

M1

M1

M2

M2

S12

S12

S23

S23

M3

M1

M2

M3

M1

M2

S23

M3

M1

M2

S12

S23

The flooding algorithm:Read conformations in energy sortedorder.For each confirmation x we havethree cases:

(a) x is a local minimum if it hasno neighbors we’ve alreadyseen

(b) x belongs to basin B(s), if allknown neighbors belong toB(s)

(c) if x has neighbors in severalbasins B(s1) . . .B(sk) then it’sa saddle point that merges

these basins.Basins B(s1), . . . ,B(sk) arethen united and are assigned tothe deepest of local minimum.

Page 27: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Information from the Barrier Trees

Local minima Saddle points Barrier heights Gradient basins Partition functions and free energies of (gradient) basins Depth and Difficulty of the landscape

N.B.: A gradient basin is the set of all initial points from which a

gradient walk (steepest descent) ends in the same local minimum.

Page 28: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Energy Landscape of a Toy Sequence

G C U A U U A

GC

GC

G

UG

A

CG

UG

CG

U

UU

A

G C U A U UCG

UG

CG

U

UU

A

C

C

G

G

G

UG

A

A G C U A U UCG

UG

CG

U

UU

A

C C G G A

G

UG

A

G C U A U UCG

UG

CG

U

UU

A

C

C

A

G

UGGGA

G C U A G UCG

UG

CG

U

UU

A

G G G

AC

CU

U

A

G C G U G GCG

UG

CG

U

UU

A

G A

AU

C

CU

U

A

G U G G G ACG

UG

CG

U

UU

A

GC

AU

C

CU

U

A

G U

UG

CG

U

UU

A GC

AU

C

CU

U

A

C G G GG

G U

CG

U

UU

A

GC

AU

C

CU

U

A

C G

G GUGG

G U

GC

AU

C

CU

U

A

C G

GU

C GUUUAGGG

8 9 10

1 2 3

4567

A A

G U

GC

AU

C

CU

U

A

C G

GU

C G

UUUAGGG

11

U AA

Steps [arbitrary]

−8

−6

−4

−2

0

2

Ene

rgy

[kca

l/mol

]

1

2

3

4

5

6

7

8

9

10

11

2.45[5]

[9]3.80 1 [11]

1.60 3 [7]

13

141.30 9

1.40 192.10 11

3.80 2 [1][3] 12

1.20 8 [4]

173.60 4

2.20 73.61 5

1.40 161.90 10

1.50 152.00 205.02 6

3.90 18E

Page 29: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Folding Kinetics

Transition rates from x to y :

ryx = r0e−

E6=yx−E(x)

RT for x 6= y

rxx = −∑

y 6=x

ryx

Kinetics as a Markov process:

dpx

dt=

y∈X

rxypy (t) .

Transition states:E 6=

yx = maxE (x),E (y)

or more complex models (Tacker et al 1994, Schmitz et al 1996)

Page 30: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Reduced Description of the Folding Dynamics

Macrostates = Classes of a partition of the state space.Partition function for a macro state:

Zα =∑

x∈α

exp(−E (x)/RT )

Free energy of a macro state:

G(α) = −RT ln Zα

rβα =∑

y∈β

x∈α

ryxProb[x |α] for α 6= β

=1

y∈β

x∈α

ryxe−E(x)/RT

rβα “on flight” while executing the barriers program.Transition state free energy:

G6=βα = −RT ln

y∈β

x∈α

e−E6=xy

RT

Page 31: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

12

3

4

5

6

7

8

910

11

12

13

14

0.0

2.0

4.0

6.0

1.2

1.5

2.1

2.8

1.1

0.9

2.4

2.8

0.5 1.3

4.7

0.9

1.2

0.8

2.0

lillyA simple model sequence

10-1

100

101

102

103

time

0

0.2

0.4

0.6

0.8

1

popu

latio

n pr

obab

ility

mfe23456

10-1

100

101

102

103

time

0

0.2

0.4

0.6

0.8

1

popu

latio

n pr

obab

ility

mfe23456

10-1

100

101

102

103

time

0

0.2

0.4

0.6

0.8

1

popu

latio

n pr

obab

ility

mfe23456

Page 32: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

103

104

105

106

107

108

109

time

0

0.2

0.4

0.6

0.8

1

popu

latio

n pr

obab

ility

101

102

103

104

105

106

107

108

time

0

0.2

0.4

0.6

0.8

1

popu

latio

n pr

obab

ility

mfe255680

Refolding of a tRNA molecule.

Page 33: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Summary I:

RNA structures can be computed efficiently by means ofdynamic programming

Computations are based on a set of carefully measures energyparameters and an additive energy model

Algorithms exist for ground state energy and structure, fullpartition functions, density of states, interacting structures,. . .

The folding kinetics of a given RNA Sequence can also beinvestigated as the level of secondary structures

VIENNA RNA PACKAGE

Page 34: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

PART II: How Do RNAs Evolve

Basic Assumption

Selection Acts on Secondary Structures, Mutations acts on theunderlying sequences⇒ We need to understand the sequence-to-structure map of RNAs(hang on, we’ll discuss the empirical evidence for that a bit later)

Page 35: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Sewall Wright’s Fitness Landscapes

Fitn

ess

Phenotype

How do realistic fitness landscapes look like?

Page 36: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Biological Landscapes

The RNA case is a special case of a very general paradigm:

genotype 7→ phenotype 7→ fitness

What is the relationship between Genotyp and Phenotype?

Central topic in any theory of evolutionbecause:* Selection acts on the Phenotype* Mutation/Recombination acts on the GenotypeBiopolymers as the simplest model:The molecule is both genotype (sequence) and phenotype (structure).

The map from genotype to genotype is determined by physical chemistry:

⇐⇒ folding problem

Page 37: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Computational Analysis of the RNA Map

There are many more sequences than structures.(.)-string: 3-letters (with constraints)

=⇒ less than 3n structures

but 4n sequences.

=⇒ Redundancy

How are sequences folding into the same structure distributed insequence space?Neutral Set S(ψ) = x ∈ Qn

α|f (x) = ψ

Page 38: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Sensitivity and Neutrality

GCGGGAAU

AGCUC

AGUUGG U A

G A G CA

CGA

CC

UU

GC C

AAGGUCGGGGU

CG C G A G

U U CGA

GUCUCGU

UUCCCGC

UC

CA

GCGGGUAUA

GCUCAGU

UGG U A

G A G CA C G

A C CUU G CC A A

G GU

C G G G GU CG C G A G

U U CGA

GUCUCGU

UUCCCGCUCC

A

Effect of a single

point mutation0 100 200 300

Structure Distance

10-4

10-3

10-2

10-1

100

Fre

quen

cy

Distribution of structure distances

Page 39: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

The Random Graph Model

Approach:Model S(ψ) as a random induced subgraph Γ with a given value

λ =〈#neutral neighbors〉

(α− 1)n

Threshold value:

λ∗ = 1 −

(

1

α

)1

α−1

Theorem. [Reidys, Stadler, Schuster]If λ > λ∗ then Γ is a.s. dense and connected,if λ < λ∗ then Γ is a.s. neither dense nor connected

Page 40: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

A complication: Base Pairing Rules

Unpaired bases:Alphabet A = A,U,G,C

Paired bases: 5’ and 3’ side correlated:Alphabet: B = AU,UA,GC,CG,GU,UG, .

Thus consider only the set of compatible sequences C (ψ):S(ψ) ⊆ C (ψ) ≡ Qnu

4 ×Qnp

6 .=⇒ Two neutrality parameters λu and λp

Page 41: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Connected Components of Neutral Networks

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0λu

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

λp

gray many small components red 1 connected componentgreen 2 equal sized components yellow 3 components size 2:1blue 4 equal sized components

Explanation: for this deviation from the random graph model in terms of the energy model. Some structures can

be made only with a significant bias in the G/C ratio.

Page 42: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

x??P (d)

Length of Neutral Path: d !0.0 10.0 20.0 30.0 40.0 50.0 60.0 70.0 80.0 90.0 100.0

0.250.200.150.100.050.00

0 20 40 60 80 100 120

Chain lenght n

0

5

10

15

20

25

30

Cov

erin

g ra

dius

from enumeration from inverse folding lower bound

Distance to Target structure Covering radiuslength neutral paths

Page 43: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Closest Approach

Intersection Theorem. For any two secondary structures φ, andψ holds

C (φ) ∩ C (ψ) 6= ∅

What is the distance of neutral networks

δ(φ,ψ) = mind(x , y)|f (x) = φ and f (y) = ψ

Random graph Theory: If λ > λ∗ then δ(φ,ψ) ≈ 2.Computer simulations: upper bounds on δ(φ,ψ):

n GC AU AUGC

50 5.6 2.6 2.170 9.3 4.6 3.4100 13.0 7.8 5.6

Page 44: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

AccessibilityFontana & Schuster 1998

Idea: The “interface” between two structures is large is they are“similar”.More precisely: Structure ψ is accessible for φ if x ∈ S(φ) is like tohave neighbor (mutant) x ′ ∈ S(ψ).Structural characterization of “easy” (continuous) transitions:

Shorteningof stacks

Elongationof stacks

Opening ofconstrained stacks

Closing ofconstrained stacks

Page 45: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Summary II: Sequence-Structure Map of RNA

1. Redundancy: Many more sequences than structures

2. Sensitivity: Small changes in the sequences may lead to largechanges in the structure

3. Neutrality: A substantial fraction of mutations does not alter thestructure.

4. Isotropy: S(ψ) is “randomly” embedded in C (ψ).

Implications:

1. Neutral Networks: S(ψ) forms a connected “percolating” network insequence space for all “common” structures.

2. Shape Space Covering: Almost all structures can be found in arelatively small neighborhood of almost every sequence.

3. Mutual Accessibility: The neutral networks of any two structuresalmost touch each other somewhere in sequence space.

Page 46: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Simulated Trajectories

0.5 1.0 1.5time (arbitrary units)

5

15

25

35

45

stru

ctur

e di

stan

ce

0 510

40

proj

ectio

n co

ordi

nate

Punctuated equilibra = diffusion of neutral networks +constant rate of innovation +exponential selection of rare mutants

Proc.Natl.Acad.Sci. 93: 397-401 (1996)

Page 47: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Diffusion Constant

. . . can be deduced from Moran model:

D = λ6Anp

3 + 4Np(1 + 1/N) ∼

(3/2)A(n/N) p ≫ 0 orN ≫ 12Anp p ≪ 1

A . . . replication raten . . . sequence lengthN . . . population sizep . . . mutation rateλ . . . neutrality of network

Page 48: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Dynamics of Interacting Replicators

Ik + Ij −→ Il + Ik + Ij

With mutation:

xk = xk

j

Akjxj −∑

i ,j

Aijxixj

+∑

l ,j

(QklAljxjxl − QlkAkjxkxj)

where

Qkl = (1 − p)n−d(k,l)

(

p

α− 1

)d(k,l)

How does this behave in sequence space?

Page 49: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Simplest case: Simplest case: Akl = A0(1 − d(Ik , Il )/n):

0 1000 2000 3000 4000 5000time

0.0

0.2

0.4

0.6

0.8di

vers

ity

0 2×105

4×105

6×105

8×105

time lag τ

0

10

20

30

40

50

60

g(τ)

g(τ) =1

T2 − T1 + 1

T2∑

t=T1

‖p(t + τ) − p(t)‖2

B.M.R. Stadler, Adv. Complex Syst. (2003)

Page 50: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

10-6

10-4

10-2

100

p

10-6

10-4

10-2

100

D

100

101

N/n

10-3

10-2

10-1

100

D

Left: Diffusion coefficient D as a function of the mutation rate for N = 10, 20, 30, 40, 80 and

n = 10, 20, 30, 40, 80 such that N/n = 1 after equilibration for 105 timesteps. Right: Dependence of the ratio

D/p on N/n.

Page 51: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

An RNA-Based Model in the Plane

Target hypercycle with 8 mem-bers.

Model:Hypercyclically coupledspecies, each sequence hasa function that depends onits structure.

Page 52: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Spatial Extension: CA Model

Possible Catalysts Actual CatalystsPossible Replicators

Rules of replication. For each of the neighbors (•) of the empty cell (marked by a bold outline) the replication rate

ρz is computed taking into account their neighbors in the direction of the replication () as potential catalysts.

The neighbor with the largest values of ρz invades the empty position. In this example, for the chosen replicator,

only three of its neighbours are catalysts according to the hypercycle topology.

Page 53: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Spirals formed after 3000 generations in an evolution experimentstarted with 300 random sequences in the absence of parasites.see also Borlijst & Hogeweg (1993)

Page 54: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Diffusion in Sequence Space

2000 4000 6000 8000 10000time lag (tau)

2

4

6

8

10

12

g(ta

u)

0.0005 0.0010 0.0015 0.0020Mutation rate

0.002

0.003

0.004

Diff

usio

n co

nsta

nt

Page 55: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Summary III: Dynamics of RNA Evolution

Neutrality of the Sequence-Structure Map impliesdiffusion/drift-like motion in sequence independent of detailsof the selection/mutation mechanisms and whether spatialextension is taken into account or not.

=⇒ The basic assumption of molecular phylogenetics, namelya dominating influence of drift in sequence evolution, holdstrue even when phenotypic evolution is dominated byinteractions(co-evolution).

TODO Development of a rigorous mathematical theorydescribing the motion in sequence space of a population withstrong interactions.

Page 56: The RNA World, Fitness Landscapes, and all thatThe RNA World, Fitness Landscapes, and all that Peter F. Stadler Bioinformatics Group, Dept. of Computer Science & Interdisciplinary

Acknowledgments: It’s not my fault . . .

Peter Schuster, Walter Fontana, Wolfgang SchnablChristoph Flamm, Ivo L. HofackerChristian M. ReidysCamille Stephan-Otto Attolini, Barbel Stadler