Computational approaches for RNA energy parameter estimation Mirela Andronescu Department of...
Computational approaches for RNA energy parameter estimation
Mirela Andronescu
Department of Computer Science
Supervisors
• Anne Condon
• Holger Hoos
Committee
• David Mathews
• Kevin Murphy
2
RNA structure
RNA sequence: 5’ ACGUAGCGA… 3’
Secondary structure: a set of base pairs (A-U, C-G, G-U)
Tertiary structure
3
Overview
Input: RNA sequence 5’ ACUGCUAGCUGCGUUGC… 3’
Output: predicted structure
• Energy model + prediction algorithm: 60% accuracy
• New energy model + prediction algorithm: 71% accuracy
4
Roles of RNA structures and thermodynamics
Translation Catalysis Splicing
Gene silencing
5
Determining RNA secondary structure
• Experimentally
– X-ray crystallography, NMR, chemical & structure probing (expensive)
• Computationally
– Comparative sequence analysis, given many homologous sequences
– Thermodynamic approaches, using an energy model
6
Thermodynamic RNA secondary structure prediction
• Assumption
– RNAs fold into their minimum free energy structures
• Common approach
– Dynamic programming algorithm, O(n³) [Zuker & Stiegler, 1981; Lyngso et al, 1999]
• Based on an energy model
– The Turner model [Mathews et al, 1999, 2004]
7
The Turner model [Mathews et al, 1999, 2004]
Energy model:
• Features (e.g. stacked pair AG/CU)
• Parameters θ (e.g. -2.1 kcal/mol)
• Energy function ΔG(θ) = c^T θ, where c is the vector of feature counts
[3’ UTR protein-binding RNA from Rfam]
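The linear energy function can be illustrated with a minimal Python sketch. It assumes a structure has already been decomposed into feature counts c. The feature names and every parameter value other than the -2.1 kcal/mol stacked pair are hypothetical placeholders, not Turner model values.

```python
def free_energy(counts, theta):
    """Linear energy function: sum of count * parameter over features (c^T theta)."""
    return sum(n * theta[feature] for feature, n in counts.items())

# Parameters in kcal/mol. Only the AG/CU stacked pair value comes from the slide;
# the hairpin loop entry is a hypothetical placeholder.
theta = {"stack_AG/CU": -2.1, "hairpin_loop_of_4": 4.5}

# Hypothetical feature counts for one structure.
counts = {"stack_AG/CU": 2, "hairpin_loop_of_4": 1}

energy = free_energy(counts, theta)
print(f"{energy:.1f} kcal/mol")
```

Because the function is linear in θ, changing one parameter shifts the energy of every structure in proportion to how often that feature occurs in it.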
8
The Turner model [Mathews et al, 1999, 2004]
• Obtained by
– Linear regression from experimental data
– Biological knowledge
• Limitations
– No thorough computational method was used
– Many parameters have been extrapolated
– Large amounts of data were not exploited
• Accuracy on our data set: 60%
• Our goal: Improve the RNA energy model
9
Contributions
• Databases (Ch 3)
• Parameter estimation algorithms (Ch 4)
• Parameter estimation for models without pseudoknots (Ch 5)
• Model selection and feature relationships (Ch 6)
• Parameter estimation for models with pseudoknots (Ch 7)
10
RNA STRAND
• Structural data from 8 public databases
– RNA sequences with known secondary structures and unknown free energies
– Determined by comparative sequence analysis, X-ray crystallography, NMR
– 4600 RNAs, avg. length 530 nucleotides
[Andronescu et al, BMC Bioinformatics 2008]
11
RNA THERMO
• Thermodynamic data from 58 papers
– RNA sequences with known secondary structures and measured free energies
– Determined by optical melting experiments [Turner lab & collaborators]
– 1300 RNAs, avg. length 17 nucleotides
12
Outline
✓ Databases: RNA STRAND and RNA THERMO (Ch 3)
• Parameter estimation algorithms (Ch 4)
• Parameter estimation for models without pseudoknots (Ch 5)
• Model selection and feature relationships (Ch 6)
• Parameter estimation for models with pseudoknots (Ch 7)
13
Parameter estimation problem
• Given
– A structural set S (sequence + structure)
– A thermodynamic set T (sequence + structure + free energy)
– A model with a fixed set of features (e.g. Turner99 with 363 features) and a free energy function (e.g. linear in the parameters θ)
• Estimate (learn) parameters θ that maximize the average accuracy measured on a reference set
Sensitivity (Sn) = # correctly predicted bp / # true bp
PPV = # correctly predicted bp / # predicted bp
F-measure = harmonic mean(Sn, PPV) = 2*Sn*PPV / (Sn + PPV)
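The three accuracy measures can be computed directly from sets of base pairs. The sketch below assumes each structure is given as a collection of (i, j) index pairs. The function name and the example pairs are illustrative only.

```python
def accuracy(predicted, reference):
    """Return (sensitivity, PPV, F-measure) for two collections of base pairs."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)  # correctly predicted base pairs
    sn = correct / len(reference) if reference else 0.0
    ppv = correct / len(predicted) if predicted else 0.0
    # Harmonic mean of Sn and PPV; defined as 0 when both are 0.
    f = 2 * sn * ppv / (sn + ppv) if sn + ppv > 0 else 0.0
    return sn, ppv, f

# Illustrative base pairs (hypothetical, not taken from any data set).
reference = [(1, 10), (2, 9), (3, 8), (4, 7)]
predicted = [(1, 10), (2, 9), (3, 7)]

sn, ppv, f = accuracy(predicted, reference)
print(f"Sn={sn:.2f} PPV={ppv:.2f} F={f:.2f}")
```

Here two of the four true pairs are recovered (Sn = 0.5) and two of the three predicted pairs are correct (PPV = 2/3), so the F-measure is 4/7.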
14
Constraint Generation (CG)
• Idea: for all (x, y_known) in S, y_known should have lower free energy than all other structures y
• Predict low energy structures with the current θ
• Solve a constrained quadratic optimization problem
– min (Σ δ² + Σ (free energy error for T)² + regularizer)
– subject to ΔG(x, y_known, θ) < ΔG(x, y, θ) + δ, for all (x, y_known) in S
• Repeat until convergence
[Andronescu et al, Bioinformatics 2007]
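One piece of the constraint generation loop can be sketched in Python: given the current θ and a set of predicted low energy structures, find which constraints ΔG(x, y_known, θ) < ΔG(x, y, θ) are violated. This is a hedged illustration, not the published implementation. It assumes the linear energy function, represents each structure by a hypothetical feature-count vector, and leaves out the quadratic optimization step entirely.

```python
def energy(counts, theta):
    """Linear free energy: dot product of feature counts and parameters."""
    return sum(c * t for c, t in zip(counts, theta))

def violated_constraints(theta, known_counts, candidate_counts):
    """List (candidate index, gap) where the known structure is not strictly lower in energy.

    gap = dG(known) - dG(candidate); a gap >= 0 means the constraint is violated
    and a slack of at least that size would be needed to satisfy it.
    """
    e_known = energy(known_counts, theta)
    violations = []
    for idx, counts in enumerate(candidate_counts):
        gap = e_known - energy(counts, theta)
        if gap >= 0:
            violations.append((idx, gap))
    return violations

# Hypothetical two-feature model and feature counts.
theta = [-2.0, 1.0]
known = [2, 1]                 # energy -3.0
candidates = [[3, 1], [1, 0]]  # energies -5.0 and -2.0

violations = violated_constraints(theta, known, candidates)
print(violations)
```

In this toy case the first candidate has lower energy than the known structure, so its constraint would be added to the optimization problem. The second candidate already satisfies it.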
15
Boltzmann Likelihood (BL)
• The probability of a structure y is a Boltzmann function [equation not recovered]
• Solve a non-linear optimization problem with a unique optimum
– max (P(structural data) P(thermo data) regularizer)
– P(structural data) = [equation not recovered]
• Similar approach (CONTRAfold) proposed by [Do et al, 2006]
– no thermo data was used
– free energies are not predicted correctly
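The standard Boltzmann form gives a structure y the probability exp(-ΔG(y)/RT) divided by the sum of the same term over all structures. The Python sketch below normalizes over an explicit, hypothetical list of candidate energies, which is an assumption made for illustration. A real implementation would sum over all structures with a dynamic programming partition function. The temperature of 37 °C is also an assumption.

```python
import math

# Gas constant in kcal/(mol K) times an assumed temperature of 310.15 K (37 °C).
RT = 0.0019872 * 310.15

def boltzmann_probabilities(energies, rt=RT):
    """Boltzmann probabilities for candidate structures with the given free energies."""
    lowest = min(energies)
    # Subtracting the lowest energy keeps the exponentials numerically stable
    # and cancels out in the normalization.
    weights = [math.exp(-(e - lowest) / rt) for e in energies]
    z = sum(weights)  # partition function restricted to the candidates
    return [w / z for w in weights]

# Hypothetical free energies in kcal/mol.
probs = boltzmann_probabilities([-3.0, -2.0, -2.0])
print([round(p, 3) for p in probs])
```

Lower free energy maps to higher probability, and structures with equal energy receive equal probability.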
16
Outline
✓ Databases: RNA STRAND and RNA THERMO (Ch 3)
✓ Parameter estimation algorithms: CG and BL (Ch 4)
• Parameter estimation for models without pseudoknots (Ch 5)
• Model selection and feature relationships (Ch 6)
• Parameter estimation for models with pseudoknots (Ch 7)
17
Parameter estimation for models without pseudoknots
Sensitivity = # correctly predicted bp / # true bp
PPV = # correctly predicted bp / # predicted bp
Parameter set                             F      RMSE
BL*, trained on STrain+T                  0.69   1.34
CG*, trained on STrain+T                  0.68   0.98
CONTRAfold 2.0, trained on SProc          0.68   6.02
CONTRAfold 1.1, trained on 151Rfam        0.61   9.17
Turner99                                  0.60   1.24
CG 07 [Andr. 2007], trained on SProc+T    0.65   1.03
BL* gives the highest accuracy on average, an increase of 9% (F = 0.60 to 0.69) over the Turner99 parameters.
Set from RNA STRAND: # str: 2500, avg len: 330, std len: 500
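The RMSE values reported with each parameter set can be read with the following Python sketch. It assumes RMSE is the root-mean-square error between predicted and measured free energies on the thermodynamic set T, which the slides do not spell out. The energies below are hypothetical.

```python
import math

def rmse(predicted, measured):
    """Root-mean-square error between two equal-length lists of free energies."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical predicted and measured free energies in kcal/mol.
predicted = [-3.0, -5.0, -1.0]
measured = [-2.0, -5.0, -3.0]

error = rmse(predicted, measured)
print(f"RMSE = {error:.2f} kcal/mol")
```

Under this reading, a parameter set can reach a good F-measure while still predicting free energies poorly, which is the pattern the CONTRAfold rows show.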
18
Runtime analysis
Parameter estimation algorithm CPU time
Boltzmann Likelihood (BL) 1-8 months
Constraint Generation (CG) 1-3 days
BL is at least 10 times slower than CG, but slightly more accurate.
Reference machine: a 3GHz Intel Xeon CPU (1MB cache and 2GB RAM)
19
Outline
✓ Databases: RNA STRAND and RNA THERMO (Ch 3)
✓ Parameter estimation algorithms: CG and BL (Ch 4)
✓ Parameter estimation for models without pseudoknots (Ch 5): 9% better F-measure
• Model selection and feature relationships (Ch 6)
• Parameter estimation for models with pseudoknots (Ch 7)
20
Model selection
• Explore parsimonious and lavish models
• For lavish models, use feature relationships
Model #features BL F-measure
Parsimonious 79 0.646
Turner99 363 0.684
Lavish 7802 0.683
21
Feature relationships
• Link features not covered by thermo set T with those that are covered
BL: max (P(structural data) P(thermo data) regularizer)
22
Model selection and feature relationships
Parameter set                                    F      RMSE
BL-FR*, trained on STrain+T, #features=7726      0.71   1.51
BL*, trained on STrain+T                         0.69   1.34
CG*, trained on STrain+T                         0.68   0.98
CONTRAfold 2.0, trained on SProc                 0.68   6.02
CONTRAfold 1.1, trained on 151Rfam               0.61   9.17
Turner99                                         0.60   1.24
CG 07 [Andr. 2007], trained on SProc+T           0.65   1.03
Modeling feature relationships improves prediction by an additional 1.3% (10.6% from the Turner99 parameters).
23
Outline
✓ Databases: RNA STRAND and RNA THERMO (Ch 3)
✓ Parameter estimation algorithms: CG and BL (Ch 4)
✓ Parameter estimation for models without pseudoknots (Ch 5): 9% better F-measure
✓ Model selection and feature relationships (Ch 6): 11% better F-measure
• Parameter estimation for models with pseudoknots (Ch 7)
24
Parameter estimation for models with pseudoknots
• Models (Turner features + additional features for pseudoknots)
– Dirks & Pierce [Dirks and Pierce, 2003]
– Cao & Chen [Cao and Chen, 2006]
• Prediction algorithm
– HotKnots [Ren et al, 2005]
• Parameter estimation algorithm
– CG modified for this problem
• BL was much harder to implement
25
Parameter estimation for models with pseudoknots
Params        With pknots            Without pknots          All
              Short*     Long        Short*      Long
              #str=78    #str=20     #str=261    #str=87     #str=446
              Len=48     Len=170     Len=58      Len=124     Len=74
Initial D&P   0.62       0.51        0.71        0.69        0.68
New D&P       0.80       0.56        0.81        0.68        0.77
Initial C&C   0.77       0.54        0.71        0.68        0.71
New C&C       0.75       0.54        0.81        0.71        0.77
* Short means at most 100 nucleotides
Improvements on average:
• Dirks & Pierce parameters by 9%
• Cao & Chen parameters by 6%
26
Conclusions
✓ Databases: RNA STRAND and RNA THERMO (Ch 3)
✓ Parameter estimation algorithms: CG and BL (Ch 4)
✓ Parameter estimation for models without pseudoknots (Ch 5): 9% better F-measure
✓ Model selection and feature relationships (Ch 6): 11% better F-measure
✓ Parameter estimation for models with pseudoknots (Ch 7): 9% and 6% better F
27
Applications
• CG 07 [Andr 2007] is part of the Vienna RNA Websuite
• Many other software packages benefit from this work
– MFE and suboptimal secondary structure prediction
– Simulation of folding pathways, sampling and clustering
– Prediction of hybridization efficiency, target availability of siRNA
28
Directions for future work
• No single parameter set (or algorithm) results in better accuracy for all structures
– Combine parameter sets and algorithms
• Explore other models
– Models for multi-loops are not accurate
• Accuracy of data is questionable
– Obtain / generate / pre-process data more accurately
29
Acknowledgments
• Supervisors: Anne Condon, Holger Hoos
• Committee: Dave Mathews, Kevin Murphy
• Collaborators: Vera Bereg, Cristina Pop, Alex Brown
• Members of the BETA lab and CS department
• UBC and IBM Research for funding
30
Additional slides
31
RNAs play diverse roles
• Messenger RNA
• Ribosomal RNA
• Transfer RNA
[contexo.info]
32
RNA structure plays role in splicing
[Bruce R. Korf, Human Genetics and Genomics]
[Rogic et al, 2008]
33
RNAs can act as catalysts (ribozymes)
[James & Al-Shamkhani]
34
RNA hybridization thermodynamics
[Lu and Mathews, 2008]
35
RNA STRAND
Database (source) RNA type No. Median len
Gutell DB rRNA, intron 1056 1500
tmRDB tmRNA 726 360
Sprinzl tRNA DB tRNA 622 76
RNase P DB RNase P RNA 454 330
SRP DB SRP RNA 383 270
Rfam Various 313 60
PDB, NDB Various 1112 50
RNA STRAND All of the above 4666 300
36
Design of optical melting experiments
• 16% of multi-loops in RNA STRAND have 5 or more branches
• 30% of internal loops have ≥7 unpaired bases
• 13% of internal loops have asymmetry ≥ 3
• Pseudoknots (22 experiments, only 4 features out of the 11 DP are covered)
37
Analysis of RNA THERMO
38
Analysis of RNA THERMO
39
Schematic representation of data
40
Other BL results (M363)
Set Train RMSE S-Test S-STR
BL* rho=1 S-Full-Train 1.34 0.679 0.694
BL rho=5 S-Full-Train 1.07 0.677 0.687
BL rho=1 S-Full-Train-nopkstr 1.16 0.668 0.679
41
Accuracy on classes
Class # Len BL-FR* CG* CF2 T99
tRNA 582 80 0.79 0.81 0.77 0.60
RNaseP 387 332 0.61 0.60 0.67 0.55
SRP RNA 357 223 0.74 0.69 0.64 0.71
tmRNA 269 363 0.59 0.50 0.52 0.39
16S rRNA 187 1276 0.50 0.48 0.48 0.39
5S rRNA 117 118 0.88 0.78 0.73 0.73
Ham. riboz. 114 52 0.64 0.67 0.66 0.65
GI intron 78 362 0.60 0.61 0.62 0.56
23S rRNA 52 2684 0.55 0.53 0.59 0.47
All 2518 331 0.71 0.68 0.68 0.60
42
Correlations between parameters
43
Accuracy vs length, no pseudoknots
44
Accuracy vs length, no pseudoknots
45
Correlation accuracies, all
46
Correlation accuracies, all
47
Correlation accuracies, 0-200
48
Correlation accuracies, 200-700
49
Correlation accuracies, 700-2000
50
Correlation accuracies, 2000-4000
51
Sensitivity to the structural set size
52
Feature relationships
53
Feature relationships
54
Feature relationships
55
Feature relationships
[Davis and Znosko, 2007]
56
Feature relationships
[Christiansen and Znosko, 2008]: complete set of sequence symmetric tandem mismatches and improved model for predicting sequence asymmetric mismatches
57
Model selection and feature relationships
1/64 of STrain 1/4 of STrain
58
HotKnots predictions
Initial D&P Initial C&C
59
DP vs CC, new parameters
With pseudoknots Without pseudoknots
60
DP vs CC, initial parameters
With pseudoknots Without pseudoknots
61
DP, new vs initial
With pseudoknots Without pseudoknots
62
CC, new vs initial
With pseudoknots Without pseudoknots
63
Pseudoknots
64
Runtime
65
Runtime
66
Parameter correlations [Andr 07]
67
Feature counts [Andr 07]
68
Accuracy vs iterations [Andr 07]