Computational approaches for RNA energy parameter estimation

Mirela Andronescu
Department of Computer Science

Supervisors
• Anne Condon
• Holger Hoos

Committee
• David Mathews
• Kevin Murphy

RNA structure

• RNA sequence: 5’ ACGUAGCGA…3’
• Secondary structure: a set of base pairs (A-U, C-G, G-U)
• Tertiary structure
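
As an illustration of treating a secondary structure as a set of base pairs, the sketch below parses a structure into index pairs and checks that every pair is A-U, C-G or G-U. The dot-bracket input notation, the function names and the example sequence are assumptions made for this sketch; they do not come from the slides.

```python
from typing import Set, Tuple

# The three base pairs named on the slide, in both orientations.
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def base_pairs(dot_bracket: str) -> Set[Tuple[int, int]]:
    """Convert a dot-bracket string (assumed notation) into a set of (i, j) pairs."""
    stack = []
    pairs = set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            pairs.add((stack.pop(), i))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '('")
    return pairs


def is_valid(sequence: str, pairs: Set[Tuple[int, int]]) -> bool:
    """True if every paired position holds an allowed base pair."""
    return all((sequence[i], sequence[j]) in ALLOWED_PAIRS for i, j in pairs)


# Hypothetical example: a 9-nucleotide hairpin.
example_pairs = base_pairs("(((...)))")
print(sorted(example_pairs), is_valid("GGGAAAUCC", example_pairs))
```

A set of index pairs is convenient here because the accuracy measures used later in the talk compare predicted and known base pairs directly.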

Overview

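[Diagram: an input RNA sequence (5’ ACUGCUAGCUGCGUUGC… 3’) is passed to a prediction algorithm that uses an energy model to predict the output structure]

• Energy model + prediction algorithm: 60% accuracy
• New energy model + prediction algorithm: 71% accuracy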

Roles of RNA structures and thermodynamics

• Translation
• Catalysis
• Splicing
• Gene silencing

Determining RNA secondary structure

• Experimentally
  – X-ray crystallography, NMR, chemical & structure probing (expensive)
• Computationally
  – Comparative sequence analysis, given many homologous sequences
  – Thermodynamic approaches, using an energy model

Thermodynamic RNA secondary structure prediction

• Assumption
  – RNAs fold into their minimum free energy structures
• Common approach
  – dynamic programming algorithm, O(n³) [Zuker & Stiegler, 1981; Lyngso et al, 1999]
• Based on an energy model
  – the Turner model [Mathews et al, 1999, 2004]
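
To give a feel for the O(n³) dynamic programming idea, here is a deliberately simplified sketch. It is not the Zuker & Stiegler algorithm and does not use the Turner model: it scores each base pair as one unit and maximizes the number of pairs (a Nussinov-style recurrence). The function name and the minimum hairpin loop size of 3 are assumptions of this sketch.

```python
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in seq (simplified stand-in for MFE folding).

    best[i][j] holds the optimum for the subsequence seq[i..j]. Each of the
    O(n^2) cells tries O(n) partners for position j, hence O(n^3) time.
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            # case 2: j pairs with k, leaving at least min_loop bases between them
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in ALLOWED_PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))
```

A thermodynamic predictor replaces the unit score with loop-dependent free energies and minimizes instead of maximizing, but the table-filling structure is similar.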

The Turner model [Mathews et al, 1999, 2004]

Energy model:
• Features (e.g. stacked pair AG/CU)
• Parameters θ (e.g. -2.1 kcal/mol)
• Energy function ΔG(θ) = cᵀθ

[Figure: 3’ UTR protein-binding RNA from Rfam]
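
The linear energy function can be read as a dot product between a vector of feature counts c (how often each feature occurs in a structure, an interpretation assumed here) and the parameter vector θ. A minimal sketch follows; the feature names are hypothetical, and only the -2.1 kcal/mol stacked-pair value comes from the slide.

```python
from typing import Dict


def free_energy(counts: Dict[str, int], theta: Dict[str, float]) -> float:
    """Linear energy function: sum over features of count * parameter (c^T theta)."""
    return sum(n * theta[feature] for feature, n in counts.items())


# Parameters in kcal/mol. The hairpin value is a made-up placeholder.
theta = {
    "stack_AG/CU": -2.1,       # value shown on the slide
    "hairpin_loop_size_4": 4.5,  # hypothetical, for illustration only
}

# Feature counts for a hypothetical structure: two AG/CU stacks and one hairpin loop.
counts = {"stack_AG/CU": 2, "hairpin_loop_size_4": 1}

print(round(free_energy(counts, theta), 2))  # 2 * (-2.1) + 4.5, about 0.3 kcal/mol
```

Because the function is linear in θ, changing one parameter shifts the energy of every structure by that parameter's count, which is what makes regression-style estimation possible.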

The Turner model [Mathews et al, 1999, 2004]

• Obtained by
  – Linear regression from experimental data
  – Biological knowledge
• Limitations
  – No thorough computational method was used
  – Many parameters have been extrapolated
  – Large amounts of data were not exploited
• Accuracy on our data set: 60%
• Our goal: Improve the RNA energy model

Contributions

• Databases (Ch 3)
• Parameter estimation algorithms (Ch 4)
• Parameter estimation for models without pseudoknots (Ch 5)
• Model selection and feature relationships (Ch 6)
• Parameter estimation for models with pseudoknots (Ch 7)

RNA STRAND

• Structural data from 8 public databases
  – RNA sequences with
    • known secondary structures
    • unknown free energies
  – Determined by
    • comparative sequence analysis
    • X-ray crystallography
    • NMR
  – 4600 RNAs, avg. length 530 nucleotides

[Andronescu et al, BMC Bioinformatics 2008]

RNA THERMO

• Thermodynamic data from 58 papers
  – RNA sequences with
    • known secondary structures
    • measured free energies
  – Determined by
    • optical melting experiments [Turner lab & collaborators]
  – 1300 RNAs, avg. length 17 nucleotides

Outline

✓ Databases: RNA STRAND and RNA THERMO (Ch 3)
• Parameter estimation algorithms (Ch 4)
• Parameter estimation for models without pseudoknots (Ch 5)
• Model selection and feature relationships (Ch 6)
• Parameter estimation for models with pseudoknots (Ch 7)

Parameter estimation problem

• Given
  – A structural set S (seq + str)
  – A thermodynamic set T (seq + str + free energy)
  – A model with
    • a fixed set of features (e.g. Turner99 with 363 features)
    • a free energy function (e.g. linear in the parameters θ)
• Estimate (learn) parameters θ that maximize avg. accuracy when measured on a reference set

Sn = # correctly predicted bp / # true bp
PPV = # correctly predicted bp / # predicted bp
F-measure = harmonic mean (Sn, PPV) = 2*Sn*PPV / (Sn + PPV)
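
The three accuracy measures defined above can be computed directly from sets of base pairs. The sketch below assumes each structure is given as a set of (i, j) index pairs and returns 0 for a measure whose denominator would be zero; both choices are assumptions of this example rather than details from the slides.

```python
from typing import Set, Tuple

Pairs = Set[Tuple[int, int]]


def accuracy(predicted: Pairs, reference: Pairs) -> Tuple[float, float, float]:
    """Return (Sn, PPV, F-measure) for a predicted structure against a reference."""
    correct = len(predicted & reference)  # correctly predicted base pairs
    sn = correct / len(reference) if reference else 0.0
    ppv = correct / len(predicted) if predicted else 0.0
    f_measure = 2 * sn * ppv / (sn + ppv) if sn + ppv > 0 else 0.0
    return sn, ppv, f_measure


# Hypothetical example: one of three true pairs is found, with no false positives.
reference = {(0, 8), (1, 7), (2, 6)}
predicted = {(0, 8)}
print(accuracy(predicted, reference))
```

In this example Sn is 1/3 and PPV is 1, so the harmonic mean (0.5) sits well below the arithmetic mean, which is why the F-measure penalizes lopsided predictions.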

Constraint Generation (CG)

• Idea: for all (x, y_known) in S, y_known should have lower free energy than all other structures y
• Procedure
  – Predict low energy structures with the current θ
  – Solve a constrained quadratic optimization problem:
      min (Σ δ² + Σ (free energy error for T)² + regularizer)
      subject to ΔG(x, y_known, θ) < ΔG(x, y, θ) + δ, for all (x, y_known) in S
  – Repeat until convergence

[Andronescu et al, Bioinformatics 2007]
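
The sketch below only evaluates an objective of the shape shown on this slide for a given θ; it does not solve the quadratic program, which would need a QP solver. It assumes a linear energy over feature-count vectors, takes each slack δ as the smallest non-negative value satisfying its constraint, and uses a plain squared-norm regularizer with a made-up weight. All of these are assumptions for illustration, not the published formulation.

```python
from typing import List, Sequence, Tuple


def energy(counts: Sequence[float], theta: Sequence[float]) -> float:
    """Linear free energy: dot product of feature counts and parameters."""
    return sum(c * t for c, t in zip(counts, theta))


def cg_objective(
    theta: Sequence[float],
    structural: List[Tuple[Sequence[float], List[Sequence[float]]]],
    thermo: List[Tuple[Sequence[float], float]],
    reg_weight: float = 0.1,  # hypothetical regularization weight
) -> float:
    """Sum of squared slacks + squared free energy errors + regularizer.

    structural: (counts of known structure, counts of alternative structures)
    thermo:     (counts of structure, measured free energy)
    """
    slack_term = 0.0
    for known, alternatives in structural:
        for alt in alternatives:
            # Smallest delta >= 0 with dG(known) <= dG(alt) + delta.
            delta = max(0.0, energy(known, theta) - energy(alt, theta))
            slack_term += delta ** 2
    thermo_term = sum((energy(c, theta) - g) ** 2 for c, g in thermo)
    regularizer = reg_weight * sum(t * t for t in theta)  # assumed L2 form
    return slack_term + thermo_term + regularizer


# Hypothetical two-feature example.
structural = [([1, 0], [[0, 1]])]
thermo = [([1, 1], -1.0)]
print(cg_objective([-2.0, 1.0], structural, thermo))
print(cg_objective([1.0, -2.0], structural, thermo))
```

The first parameter vector ranks the known structure below its alternative and pays only the regularizer; the second violates the constraint and is penalized through the squared slack.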

Boltzmann Likelihood (BL)

• The probability of a structure y is a Boltzmann function: [equation not preserved in the transcript]
• Solve a non-linear optimization problem with unique optimum:
    max (P(structural data) · P(thermo data) · regularizer)
  where P(structural data) = [equation not preserved in the transcript]
• Similar approach (CONTRAfold) proposed by [Do et al, 2006]
  – no thermo data was used
  – free energies are not predicted correctly
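
For readers unfamiliar with the term, a Boltzmann distribution in its standard form assigns each structure a weight exp(-ΔG / RT) and normalizes by the sum of all weights. The sketch below assumes that standard form, a temperature of 37 °C, and a small explicitly enumerated list of candidate structures; a real implementation would sum over all structures with dynamic programming. Function names and example energies are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Gas constant in kcal/(mol*K) times 310.15 K (37 degrees C); assumed temperature.
RT_KCAL_PER_MOL = 0.0019872 * 310.15


def boltzmann_probabilities(energies: Sequence[float], rt: float = RT_KCAL_PER_MOL) -> List[float]:
    """Probability of each candidate structure given its free energy in kcal/mol."""
    lowest = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(g - lowest) / rt) for g in energies]
    partition = sum(weights)
    return [w / partition for w in weights]


def structural_log_likelihood(
    examples: Sequence[Tuple[Sequence[float], int]], rt: float = RT_KCAL_PER_MOL
) -> float:
    """Sum of log-probabilities of the known structures.

    Each example is (energies of all candidate structures, index of the known one).
    """
    return sum(
        math.log(boltzmann_probabilities(energies, rt)[known]) for energies, known in examples
    )


# Hypothetical candidates with free energies -3.0, -1.0 and 0.0 kcal/mol.
probs = boltzmann_probabilities([-3.0, -1.0, 0.0])
print([round(p, 3) for p in probs])
```

Maximizing this log-likelihood over θ pushes the known structures toward lower free energy relative to their competitors, which is the intuition behind the BL objective.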

Outline

✓ Databases: RNA STRAND and RNA THERMO (Ch 3)
✓ Parameter estimation algorithms: CG and BL (Ch 4)
• Parameter estimation for models without pseudoknots (Ch 5)
• Model selection and feature relationships (Ch 6)
• Parameter estimation for models with pseudoknots (Ch 7)

Parameter estimation for models without pseudoknots

Sensitivity = # correctly predicted bp / # true bp
PPV = # correctly predicted bp / # predicted bp

Parameter set         Trained on   F      RMSE
BL*                   STrain+T     0.69   1.34
CG*                   STrain+T     0.68   0.98
CONTRAfold 2.0        SProc        0.68   6.02
CONTRAfold 1.1        151Rfam      0.61   9.17
Turner99              -            0.60   1.24
CG 07 [Andr. 2007]    SProc+T      0.65   1.03

BL* gives the highest accuracy on average, an increase of 9% from the Turner99 parameters.

Set from RNA STRAND: # str: 2500, avg len: 330, std len: 500

Runtime analysis

Parameter estimation algorithm   CPU time
Boltzmann Likelihood (BL)        1-8 months
Constraint Generation (CG)       1-3 days

BL is at least 10 times slower than CG, but slightly more accurate.

Reference machine: a 3GHz Intel Xeon CPU (1MB cache and 2GB RAM)

Outline

✓ Databases: RNA STRAND and RNA THERMO (Ch 3)
✓ Parameter estimation algorithms: CG and BL (Ch 4)
✓ Parameter estimation for models without pseudoknots (Ch 5): 9% better F-measure
• Model selection and feature relationships (Ch 6)
• Parameter estimation for models with pseudoknots (Ch 7)

Model selection

• Explore parsimonious and lavish models

• For lavish models, use feature relationships

Model          # features   BL F-measure
Parsimonious   79           0.646
Turner99       363          0.684
Lavish         7802         0.683

Feature relationships

• Link features not covered by thermo set T with those that are covered

BL: max (P(structural data) P(thermo data) regularizer)

Model selection and feature relationships

Parameter set                  Trained on   F      RMSE
BL-FR* (# features = 7726)     STrain+T     0.71   1.51
BL*                            STrain+T     0.69   1.34
CG*                            STrain+T     0.68   0.98
CONTRAfold 2.0                 SProc        0.68   6.02
CONTRAfold 1.1                 151Rfam      0.61   9.17
Turner99                       -            0.60   1.24
CG 07 [Andr. 2007]             SProc+T      0.65   1.03

Modeling feature relationships improves prediction by an additional 1.3% (10.6% from the Turner99 parameters).

Outline

✓ Databases: RNA STRAND and RNA THERMO (Ch 3)
✓ Parameter estimation algorithms: CG and BL (Ch 4)
✓ Parameter estimation for models without pseudoknots (Ch 5): 9% better F-measure
✓ Model selection and feature relationships (Ch 6): 11% better F-measure
• Parameter estimation for models with pseudoknots (Ch 7)

Parameter estimation for models with pseudoknots

• Models (Turner features + additional features for pseudoknots)
  – Dirks & Pierce [Dirks and Pierce, 2003]
  – Cao & Chen [Cao and Chen, 2006]
• Prediction algorithm
  – HotKnots [Ren et al, 2005]
• Parameter estimation algorithm
  – CG modified for this problem
    • BL was much harder to implement

Parameter estimation for models with pseudoknots

               With pknots           Without pknots         All
Params         Short*     Long       Short*      Long
               #str=78    #str=20    #str=261    #str=87    #str=446
               Len=48     Len=170    Len=58      Len=124    Len=74
Initial D&P    0.62       0.51       0.71        0.69       0.68
New D&P        0.80       0.56       0.81        0.68       0.77
Initial C&C    0.77       0.54       0.71        0.68       0.71
New C&C        0.75       0.54       0.81        0.71       0.77

* Short means at most 100 nucleotides

Improvements on average:
• Dirks & Pierce parameters by 9%
• Cao & Chen parameters by 6%

Conclusions

✓ Databases: RNA STRAND and RNA THERMO (Ch 3)
✓ Parameter estimation algorithms: CG and BL (Ch 4)
✓ Parameter estimation for models without pseudoknots (Ch 5): 9% better F-measure
✓ Model selection and feature relationships (Ch 6): 11% better F-measure
✓ Parameter estimation for models with pseudoknots (Ch 7): 9% and 6% better F

Applications

• CG 07 [Andr 2007] is part of RNA Vienna WebSuites

• Many other software packages benefit from this work
  – MFE and suboptimal secondary structure prediction
  – Simulation of folding pathways, sampling and clustering
  – Prediction of hybridization efficiency, target availability of siRNA

Directions for future work

• No single parameter set (or algorithm) results in better accuracy for all structures
  – Combine parameter sets and algorithms
• Explore other models
  – Models for multi-loops are not accurate
• Accuracy of data is questionable
  – Obtain / generate / pre-process data more accurately

Acknowledgments

• Supervisors: Anne Condon, Holger Hoos

• Committee: Dave Mathews, Kevin Murphy

• Collaborators: Vera Bereg, Cristina Pop, Alex Brown

• Members of the BETA lab and CS department

• UBC and IBM Research for funding

Additional slides

RNAs play diverse roles

• Messenger RNA

• Ribosomal RNA

• Transfer RNA

[contexo.info]

RNA structure plays a role in splicing

[Bruce R. Korf, Human Genetics and Genomics]

[Rogic et al, 2008]

RNAs can act as catalysts (ribozymes)

[James & Al-Shamkhani]

RNA hybridization thermodynamics

[Lu and Mathews, 2008]

RNA STRAND

Database (source)   RNA type           No.    Median len
Gutell DB           rRNA, intron       1056   1500
tmRDB               tmRNA              726    360
Sprinzl tRNA DB     tRNA               622    76
RNase P DB          RNase P RNA        454    330
SRP DB              SRP RNA            383    270
Rfam                Various            313    60
PDB, NDB            Various            1112   50
RNA STRAND          All of the above   4666   300

Design of optical melting experiments

• 16% of multi-loops in RNA STRAND have 5 or more branches

• 30% of internal loops have ≥7 unpaired bases

• 13% of internal loops have asymmetry ≥ 3

• Pseudoknots (22 experiments; only 4 of the 11 Dirks & Pierce (DP) features are covered)

Analysis of RNA THERMO

Analysis of RNA THERMO

Schematic representation of data

Other BL results (M363)

Set         Train                  RMSE   S-Test   S-STR
BL* rho=1   S-Full-Train           1.34   0.679    0.694
BL rho=5    S-Full-Train           1.07   0.677    0.687
BL rho=1    S-Full-Train-nopkstr   1.16   0.668    0.679

Accuracy on classes

Class         #      Len    BL-FR*   CG*    CF2    T99
tRNA          582    80     0.79     0.81   0.77   0.60
RNaseP        387    332    0.61     0.60   0.67   0.55
SRP RNA       357    223    0.74     0.69   0.64   0.71
tmRNA         269    363    0.59     0.50   0.52   0.39
16S rRNA      187    1276   0.50     0.48   0.48   0.39
5S rRNA       117    118    0.88     0.78   0.73   0.73
Ham. riboz.   114    52     0.64     0.67   0.66   0.65
GI intron     78     362    0.60     0.61   0.62   0.56
23S rRNA      52     2684   0.55     0.53   0.59   0.47
All           2518   331    0.71     0.68   0.68   0.60

Correlations between parameters

Accuracy vs length, no pseudoknots

Accuracy vs length, no pseudoknots

Correlation accuracies, all

Correlation accuracies, all

Correlation accuracies, 0-200

Correlation accuracies, 200-700

Correlation accuracies, 700-2000

Correlation accuracies, 2000-4000

Sensitivity to the structural set size

Feature relationships

Feature relationships

Feature relationships

Feature relationships

[Davis and Znosko, 2007]

Feature relationships

[Christiansen and Znosko, 2008]: complete set of sequence symmetric tandem mismatches and improved model for predicting sequence asymmetric mismatches

Model selection and feature relationships

[Figure panels: 1/64 of STrain; 1/4 of STrain]

HotKnots predictions

[Figure panels: Initial D&P; Initial C&C]

DP vs CC, new parameters

[Figure panels: With pseudoknots; Without pseudoknots]

DP vs CC, initial parameters

[Figure panels: With pseudoknots; Without pseudoknots]

DP, new vs initial

[Figure panels: With pseudoknots; Without pseudoknots]

CC, new vs initial

[Figure panels: With pseudoknots; Without pseudoknots]

Pseudoknots

Runtime

Runtime

Parameter correlations [Andr 07]

Feature counts [Andr 07]

Accuracy vs iterations [Andr 07]