Finding, Aligning and Analyzing Non Coding RNAs

Cédric Notredame, Comparative Bioinformatics Group, Bioinformatics and Genomics Program


Page 1: Finding, Aligning and Analyzing Non Coding RNAs

Finding, Aligning and Analyzing Non Coding RNAs

Cédric Notredame, Comparative Bioinformatics Group, Bioinformatics and Genomics Program

Page 2: Finding, Aligning and Analyzing Non Coding RNAs

They are Everywhere…

And ENCODE said… “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”

Who Are They?
– tRNA, rRNA, snoRNAs
– microRNAs, siRNAs
– piRNAs
– long ncRNAs (Xist, Evf, Air, CTN, PINK…)

How Many of Them?
– Open question
– 30,000 is a common guess
– Harder to detect than proteins

Page 3: Finding, Aligning and Analyzing Non Coding RNAs

Searching

“…When Looking for a Needle in a Haystack, the Optimist Wears Gloves…”

Page 4: Finding, Aligning and Analyzing Non Coding RNAs

ncRNAs Can Have Different Sequences and Similar Structures

Page 5: Finding, Aligning and Analyzing Non Coding RNAs

ncRNAs Can Evolve Rapidly

CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**

[Figure: the two sequences drawn as hairpins with the same stem-loop shape]

Page 6: Finding, Aligning and Analyzing Non Coding RNAs

ncRNAs are Difficult to Align

Regular Alignment:

--CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG--
   * *   *** * *   ***   *

The same two sequences, aligned without the offset:

CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**

Page 7: Finding, Aligning and Analyzing Non Coding RNAs

ncRNAs are Difficult to Align

Same Structure, Low Sequence Identity

Small Alphabet, Short Sequences: Alignments are often Non-Significant

Page 8: Finding, Aligning and Analyzing Non Coding RNAs

Obtaining the Structure of a ncRNA is difficult

Hard to Align The Sequences Without the Structure

Hard to Predict the Structures Without an Alignment

Page 9: Finding, Aligning and Analyzing Non Coding RNAs

The Holy Grail of RNA Comparison: Sankoff's Algorithm

Page 10: Finding, Aligning and Analyzing Non Coding RNAs

The Holy Grail of RNA Comparison: Sankoff's Algorithm

Simultaneous Folding and Alignment

– Time Complexity: O(L^(3n))
– Space Complexity: O(L^(2n))
(for n sequences of length L)

In Practice, for Two Sequences (length: time, memory):

– 50 nucleotides: 1 min., 6 M.
– 100 nucleotides: 16 min., 256 M.
– 200 nucleotides: 4 hours, 4 G.
– 400 nucleotides: 3 days, 3 T.

Forget about
– Multiple sequence alignments
– Database searches
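
As a rough guide to why these numbers explode: a cost that grows as L^(k·n) is multiplied by 2^(k·n) every time the sequence length L doubles. The Python sketch below only evaluates that factor for exponents of the form 2n and 3n. The function name is made up for illustration and nothing here is a measurement.

```python
def doubling_factor(k: int, n_seqs: int = 2) -> int:
    """Growth factor of a cost L**(k * n_seqs) when L is doubled.

    (2L)**(k*n) / L**(k*n) == 2**(k*n), independent of L.
    """
    return 2 ** (k * n_seqs)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(
            f"{n} sequences: doubling L multiplies an L^(2n) cost by "
            f"{doubling_factor(2, n)} and an L^(3n) cost by {doubling_factor(3, n)}"
        )
```

For two sequences the L^(2n) factor is 16, which is roughly the step between the timings listed above (1 min, 16 min, about 4 hours, about 3 days). Adding a third sequence raises the same factor to 64, which is why multiple alignments and database searches are out of reach.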

Page 11: Finding, Aligning and Analyzing Non Coding RNAs

The next best Thing: Consan

Consan = Sankoff + a few constraints

Use of Stochastic Context Free Grammars

– Tree-shaped HMMs
– Made sparse with constraints

The constraints are derived from the most confident positions of the alignment

Equivalent of Banded DP
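
To make the "banded DP" analogy concrete, here is a minimal Python sketch of banded dynamic programming on plain sequence alignment. It is not Consan and involves no grammar: it is an ordinary Needleman-Wunsch score in which only cells with |i - j| <= band are filled. The scoring values and the function name are illustrative assumptions.

```python
def banded_alignment_score(a, b, band, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells (i, j) with |i - j| <= band.

    Returns (score, number_of_cells_filled).
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")

    # Sparse table: cells outside the band are never created.
    score = {(0, 0): 0}
    for i in range(n + 1):
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                # Diagonal move: the predecessor is always inside the band.
                s = match if a[i - 1] == b[j - 1] else mismatch
                candidates.append(score[(i - 1, j - 1)] + s)
            if (i - 1, j) in score:
                candidates.append(score[(i - 1, j)] + gap)  # gap in b
            if (i, j - 1) in score:
                candidates.append(score[(i, j - 1)] + gap)  # gap in a
            score[(i, j)] = max(candidates)
    return score[(n, m)], len(score)


if __name__ == "__main__":
    s1 = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
    s2 = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"
    for band in (2, 5, len(s1)):
        print(band, banded_alignment_score(s1, s2, band))
```

The narrower the band, the fewer cells are filled, at the risk of missing the optimal path if it leaves the band. Constraints taken from confident alignment positions play the same pruning role in the tree-shaped case.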

Page 12: Finding, Aligning and Analyzing Non Coding RNAs

Consan for Databases: Infernal

Infernal is a Faster version of Consan

For Database Search

Still Very Slow

Receiver operating characteristic (ROC): Comparison of Infernal with BLAST

Page 13: Finding, Aligning and Analyzing Non Coding RNAs

Consan for Databases: Infernal

BLAST: 360 s.

Fast Infernal: 182 000 s.

Slow Infernal: 5 320 000 s.

Page 14: Finding, Aligning and Analyzing Non Coding RNAs

Searching Databases for New RNAs

Page 15: Finding, Aligning and Analyzing Non Coding RNAs

Rfam: In practice

Rfam contains RNA families

– Families → Multiple Sequence Alignments → Models

– Models are like Pfam Profiles
  Use Consan or Cmsearch rather than HMMer
  Much Slower

– Too expensive to search the models
  Models are used to build Rfam
  People usually BLAST Rfam

Page 16: Finding, Aligning and Analyzing Non Coding RNAs

Where do Rfam Families Come From?

Infernal Requires a Model

A Model Requires an MSA

The MSA requires a Family

It all starts with a BlastN

Rfam, Gardner et al. NAR 2008

Page 17: Finding, Aligning and Analyzing Non Coding RNAs

Can we make BlastN more accurate?

BlastN is not very accurate because:

– Poor substitution models for Nucleic Acids
– Low information density (4 symbols)

BlastN assumes
– Equal evolution rates for all nucleotides
– Independence from Neighbors

Page 18: Finding, Aligning and Analyzing Non Coding RNAs

Love Thy Neighbor

Measured Nearest Neighbor Dependencies on Rfam sequences
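
One simple way to look at nearest-neighbor dependence is to compare how often each dinucleotide is observed with how often it would be expected if neighboring positions were independent. The Python sketch below does this for any list of sequences. It is an illustrative assumption, not the procedure used for the Rfam measurements, and the function name is invented.

```python
from collections import Counter


def dinucleotide_odds(seqs):
    """Observed/expected ratio for every dinucleotide seen in seqs.

    Expected frequency assumes independent neighbors: f(x) * f(y).
    A ratio far from 1 indicates dependence between adjacent positions.
    """
    mono = Counter()
    di = Counter()
    for s in seqs:
        s = s.upper()
        mono.update(s)
        # Overlapping dinucleotides: positions (0,1), (1,2), ...
        di.update(s[i:i + 2] for i in range(len(s) - 1))

    n_mono = sum(mono.values())
    n_di = sum(di.values())
    odds = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        odds[pair] = (count / n_di) / expected
    return odds


if __name__ == "__main__":
    example = ["CCAGGCAAGACGGGACGAGAGTTGCCTGG", "CCTCCGTTCAGAGGTGCATAGAACGGAGG"]
    for pair, ratio in sorted(dinucleotide_odds(example).items()):
        print(pair, round(ratio, 2))
```

Two short sequences are far too little data to say anything. The point is only the shape of the calculation.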

Page 19: Finding, Aligning and Analyzing Non Coding RNAs

High Rate of CpG mutations

Page 20: Finding, Aligning and Analyzing Non Coding RNAs

Measuring Di-Nucleotide Evolution

Each Nucleotide can be made more informative

It can incorporate the “name” of its Neighbor
– AA => a
– AG => b
– AC => c
– AT => d
– …

A 16 Letter alphabet can be used to recode all nucleotide sequences

We name these extended Nucleotides
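
The recoding can be sketched in a few lines of Python. Only the first four codes (AA, AG, AC, AT => a, b, c, d) come from the slide. The remaining twelve letters, the use of the right-hand neighbor, and the dropping of the final nucleotide (which has no right-hand neighbor) are assumptions made here for illustration.

```python
from itertools import product
import string

# Base order inferred from the slide's examples: AA, AG, AC, AT.
BASES = "AGCT"

# HYPOTHETICAL table: only a-d are given in the slide; e-p simply
# continue the same pattern and are NOT the authors' actual codes.
ENCODE = {
    x + y: letter
    for (x, y), letter in zip(product(BASES, repeat=2), string.ascii_lowercase)
}


def to_extended(seq: str) -> str:
    """Recode a nucleotide sequence over a 16-letter alphabet.

    Each position is replaced by the code of (itself, right neighbor),
    so a sequence of length n yields n - 1 extended letters.
    """
    seq = seq.upper().replace("U", "T")
    return "".join(ENCODE[seq[i:i + 2]] for i in range(len(seq) - 1))


if __name__ == "__main__":
    print(to_extended("CCAGGCAAGACGGGACGAGAGTTGCCTGG"))
```

Because the dinucleotides overlap, each extended letter constrains the next: a letter standing for "xG" can only be followed by one standing for "Gy".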

Page 21: Finding, Aligning and Analyzing Non Coding RNAs
Page 22: Finding, Aligning and Analyzing Non Coding RNAs

Blosum-R and eRNA

Page 23: Finding, Aligning and Analyzing Non Coding RNAs

Substitutions?

How much does it cost to turn one nucleotide into another one?

Blosum/Pam style matrix

Matrices estimated on Rfam families
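
For readers unfamiliar with how a Blosum/Pam-style matrix is built, the sketch below estimates log-odds scores from aligned columns: count how often two symbols face each other, then compare with what their individual frequencies would predict. This is a generic textbook recipe written in Python, with an assumed pseudocount and no sequence clustering. It is not the procedure used to derive the matrices from Rfam.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(alignment, pseudocount=1.0):
    """Log-odds scores 2*log2(q_xy / (p_x * p_y)) from an alignment.

    alignment: list of equal-length strings, '-' marks a gap.
    Works for any alphabet, including a recoded 16-letter one.
    """
    symbols = sorted({c for row in alignment for c in row if c != "-"})

    # Count aligned pairs in both orders so the matrix is symmetric.
    pair_counts = Counter()
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        for x, y in combinations(residues, 2):
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1

    total = sum(pair_counts.values()) + pseudocount * len(symbols) ** 2
    q = {
        (x, y): (pair_counts[(x, y)] + pseudocount) / total
        for x in symbols
        for y in symbols
    }
    # Background frequency of each symbol = marginal of the pair table.
    p = {x: sum(q[(x, y)] for y in symbols) for x in symbols}
    return {
        (x, y): 2 * math.log2(q[(x, y)] / (p[x] * p[y]))
        for x in symbols
        for y in symbols
    }


if __name__ == "__main__":
    toy = ["AAGG", "AAGG", "AAGC"]
    for pair, score in sorted(log_odds_matrix(toy).items()):
        print(pair, round(score, 2))
```

Positive scores mark substitutions seen more often than chance, negative scores those seen less often.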

Page 24: Finding, Aligning and Analyzing Non Coding RNAs

Blosum-R and eRNA

Page 25: Finding, Aligning and Analyzing Non Coding RNAs

Using BlastR

When Nucleic Acids look like Proteins, They can be aligned with Protein Methods

– BlastN → BlastP

– BlastP with eRNA is BlastR

Page 26: Finding, Aligning and Analyzing Non Coding RNAs

Validating BlastR

Page 27: Finding, Aligning and Analyzing Non Coding RNAs

Benchmarking BlastR

[Figure: benchmark protocol. Labels: Query, Blast, Rfam, E-values, hits marked P or N]

Page 28: Finding, Aligning and Analyzing Non Coding RNAs

Benchmarking BlastR

[Figure: benchmark protocol. Rfam 001, Rfam 002, … are each run through Blast against Rfam 001, Rfam 002, …; the results are summarized as a ROC]

Page 29: Finding, Aligning and Analyzing Non Coding RNAs

Benchmarking BlastR

[Figure: ROC plot. Axes: False Positives, True Positives. Curves labelled Good and Bad]

Page 30: Finding, Aligning and Analyzing Non Coding RNAs

Benchmarking BlastR

[Figure: ROC plot. Axes: False Positives, True Positives. Curves labelled Good and Bad]

Area Under Curve

Small AUC Better
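
A ROC curve and its area can be computed directly from a ranked hit list. The Python sketch below assumes each hit is an (E-value, same-family flag) pair and places false positives on the horizontal axis and true positives on the vertical axis. Both the data layout and that orientation are assumptions: with this orientation a larger area means a better ranking, and swapping the axes reverses that reading.

```python
def roc_points(hits):
    """Cumulative (false_positives, true_positives) along the ranking.

    hits: iterable of (e_value, is_true_positive); best E-value first.
    """
    tp = fp = 0
    points = [(0, 0)]
    for _, is_true in sorted(hits, key=lambda h: h[0]):
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points


def area_under_curve(points):
    """Trapezoidal area, both axes normalized to the range 0..1."""
    max_fp = points[-1][0] or 1
    max_tp = points[-1][1] or 1
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) / max_fp * (y0 + y1) / (2 * max_tp)
    return area


if __name__ == "__main__":
    # Made-up hits for illustration only.
    hits = [(1e-30, True), (1e-20, True), (1e-5, False), (1e-3, True), (1.0, False)]
    pts = roc_points(hits)
    print(pts)
    print(area_under_curve(pts))
```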

Page 31: Finding, Aligning and Analyzing Non Coding RNAs

BlastR vs The World

Page 32: Finding, Aligning and Analyzing Non Coding RNAs

The 3 Components of BlastR

BlastP is better than BlastN

BlosumR makes BlastP a little bit better

Blast: WU-BLAST

Page 33: Finding, Aligning and Analyzing Non Coding RNAs

The 3 Components of BlastR

BlastP is better than BlastN

BlosumR makes BlastP a little bit better

And Faster

Page 34: Finding, Aligning and Analyzing Non Coding RNAs

BlastR and Clustering

Given all Rfam in Bulk

How good is BlastR at reconstituting all the families?

[Figure: plot of Sensitivity against 1-Specificity]
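
The slides do not say how hits are turned into clusters. One common and simple choice is single linkage: put two sequences in the same cluster whenever a hit between them passes an E-value cutoff. The Python sketch below does this with a union-find structure. The method, the cutoff value, the function name, and the (query, subject, E-value) tuple layout are all assumptions for illustration.

```python
def cluster_hits(ids, hits, max_evalue=1e-20):
    """Single-linkage clusters from pairwise hits.

    ids:  iterable of sequence identifiers
    hits: iterable of (query_id, subject_id, e_value)
    """
    ids = list(ids)
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, e_value in hits:
        if e_value <= max_evalue:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())


if __name__ == "__main__":
    # Made-up identifiers and hits for illustration only.
    ids = ["s1", "s2", "s3", "s4"]
    hits = [("s1", "s2", 1e-30), ("s2", "s3", 1e-25), ("s3", "s4", 1e-3)]
    print(cluster_hits(ids, hits))
```

Comparing the resulting clusters with the known families then gives the sensitivity and specificity plotted on this slide.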

Page 35: Finding, Aligning and Analyzing Non Coding RNAs

BlastR and Clustering

Given all Rfam in Bulk

How good is BlastR at reconstituting all the families?

[Figure: plot of Sensitivity against 1-Specificity]

Page 36: Finding, Aligning and Analyzing Non Coding RNAs

BlastR: In Practice

Page 37: Finding, Aligning and Analyzing Non Coding RNAs

BlastR: In Practice

E-Value Threshold: 10^-20

BlastN

BlastR

Page 38: Finding, Aligning and Analyzing Non Coding RNAs

Take Home

Searching Nucleotides is Difficult

BlastN is not a very good algorithm

Simple Adaptations can improve the situation
– Changing the algorithm (BlastP)
– Changing the Scoring Scheme (BlastP-Nuc)
– Changing the alphabet (BlastR)