Finding, Aligning and Analyzing Non Coding RNAs

Post on 04-Feb-2016


Cédric Notredame, Comparative Bioinformatics Group, Bioinformatics and Genomics Program

They are Everywhere…

And ENCODE said… “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”

Who Are They?
– tRNA, rRNA, snoRNAs
– microRNAs, siRNAs
– piRNAs
– long ncRNAs (Xist, Evf, Air, CTN, PINK…)

How Many of Them?
– Open question
– 30,000 is a common guess
– Harder to detect than proteins


Searching

“…When Looking for a Needle in a Haystack, the Optimist Wears Gloves…”

ncRNAs can have different sequences and Similar Structures

ncRNAs Can Evolve Rapidly

CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**

[Figure: the two sequences drawn as stem-loop (hairpin) structures]

ncRNAs are Difficult to Align

--CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG--
   * *   *** * *   ***   *

CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**

Regular Alignment
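To put a number on how little the two example sequences share, here is a minimal sketch that counts identical columns in the two alignments shown on the slides (the ungapped one and the one shifted by two positions). The function name percent_identity and the choice to ignore gapped columns are assumptions made for this illustration.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Return (percent identity, matches, compared columns) for two alignment rows.

    Columns where either row has a gap are ignored (an assumption of this sketch).
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs), matches, len(pairs)


# The two sequences from the slides.
SEQ1 = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
SEQ2 = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"

ungapped = percent_identity(SEQ1, SEQ2)
shifted = percent_identity("--" + SEQ1, SEQ2 + "--")

print("ungapped: %.1f%% (%d/%d)" % ungapped)
print("shifted:  %.1f%% (%d/%d)" % shifted)
```

Either way, well under half of the compared positions are identical, even though the slides present the two sequences as having similar structures.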

Same Structure, Low Sequence Identity

Small Alphabet, Short Sequences: Alignments are often Non-Significant

Obtaining the Structure of a ncRNA is difficult

Hard to Align The Sequences Without the Structure

Hard to Predict the Structures Without an Alignment

The Holy Grail of RNA Comparison: Sankoff's Algorithm

Simultaneous Folding and Alignment

– Time Complexity: O(L^{2n})
– Space Complexity: O(L^{3n})

In Practice, for Two Sequences:

– 50 nucleotides: 1 min, 6 M
– 100 nucleotides: 16 min, 256 M
– 200 nucleotides: 4 hours, 4 G
– 400 nucleotides: 3 days, 3 T
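A quick check of these timings, as a sketch: reading L as the sequence length and n as the number of sequences, a time complexity of L to the power 2n gives L^4 for two sequences, so doubling the length multiplies the running time by 16. The function name and the use of the 50-nucleotide run as the baseline are assumptions made for this illustration.

```python
def sankoff_time_minutes(length, n_seqs=2, base_length=50, base_minutes=1.0):
    """Extrapolate running time from a baseline run, assuming time grows as L^(2n)."""
    exponent = 2 * n_seqs
    return base_minutes * (length / base_length) ** exponent


for length in (50, 100, 200, 400):
    minutes = sankoff_time_minutes(length)
    print("%4d nt: %7.0f min = %6.1f h = %5.2f days"
          % (length, minutes, minutes / 60, minutes / 1440))
```

Starting from 1 minute at 50 nucleotides, this gives 16 minutes, about 4.3 hours and about 2.8 days, in line with the times quoted above.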

Forget about
– Multiple sequence alignments
– Database searches

The next best Thing: Consan

Consan = Sankoff + a few constraints

Use of Stochastic Context Free Grammars

– Tree-shaped HMMs
– Made sparse with constraints

The constraints are derived from the most confident positions of the alignment

Equivalent of Banded DP

Consan for Databases: Infernal

Infernal is a Faster version of Consan

For Database Search

Still Very Slow

[Figure: Receiver operating characteristic (ROC) comparison of Infernal with BLAST]

– BLAST: 360 s
– Fast Infernal: 182 000 s
– Slow Infernal: 5 320 000 s

Searching Databases for New RNAs

Rfam: In practice

Rfam contains RNA families

– Families → Multiple Sequence Alignments → Models

– Models are like Pfam Profiles
  – Use Consan or Cmsearch rather than HMMer
  – Much Slower

– Too expensive to search the models
  – Models are used to build Rfam
  – People usually BLAST Rfam

Where do Rfam Families Come From?

Infernal Requires a Model

A Model requires an MSA

The MSA requires a Family

It all starts with a BlastN

Rfam, Gardner et al. NAR 2008

Can we make BlastN more accurate?

BlastN is not very accurate because:
– Poor substitution models for Nucleic Acids
– Low information density (4 symbols)

BlastN assumes:
– Equal evolution rates for all nucleotides
– Independence from Neighbors

Love Thy Neighbor

Measured Nearest Neighbor Dependencies on Rfam sequences

High Rate of CpG mutations
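The slides do not say how the neighbor dependencies were measured. As a rough illustration only, one simple and widely used proxy is the observed/expected dinucleotide ratio: if neighbors were independent, each dinucleotide would occur at the product of its two single-nucleotide frequencies. This looks at composition, not at substitution rates, and the function name is hypothetical.

```python
from collections import Counter


def dinucleotide_obs_exp(sequences):
    """Observed/expected frequency ratio for every dinucleotide seen in the input.

    Expected frequency assumes independent neighbors: f(X) * f(Y).
    A ratio far from 1 hints at a dependency between adjacent positions.
    """
    mono = Counter()
    di = Counter()
    for seq in sequences:
        seq = seq.upper()
        mono.update(seq)
        di.update(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    ratios = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        ratios[pair] = (count / n_di) / expected
    return ratios


ratios = dinucleotide_obs_exp(["ACAC"])
print(ratios)
```

On real data one would pass all sequences of a family (or of Rfam as a whole) and look at which pairs, CG for instance, deviate most from 1.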

Measuring Di-Nucleotide Evolution

Each Nucleotide can be made more informative

It can incorporate the “name” of its Neighbor
– AA => a
– AG => b
– AC => c
– AT => d
– …

A 16 Letter alphabet can be used to recode all nucleotide sequences

We name these extended Nucleotides
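The recoding can be sketched as follows. Only the first four codes (AA, AG, AC, AT => a, b, c, d) come from the slide; the rest of the table, the use of the following nucleotide as the neighbor, the handling of the last position and the U-to-T conversion are all assumptions made for this illustration, and the names CODE and recode are hypothetical.

```python
from itertools import product

# Second-letter order A, G, C, T follows the slide (AA, AG, AC, AT => a, b, c, d).
# Assumption: the remaining 12 pairs continue in the same order (GA => e, GG => f, ...).
NUCLEOTIDES = "AGCT"
CODE = {
    first + second: chr(ord("a") + index)
    for index, (first, second) in enumerate(product(NUCLEOTIDES, repeat=2))
}


def recode(seq):
    """Recode a nucleotide sequence into the 16-letter extended alphabet.

    Assumptions of this sketch: each nucleotide is paired with the one that
    follows it, so the last nucleotide (which has no neighbor) yields no letter;
    U is treated as T; any other symbol (e.g. N) raises KeyError.
    """
    seq = seq.upper().replace("U", "T")
    return "".join(CODE[seq[i:i + 2]] for i in range(len(seq) - 1))


print(recode("AAGCT"))
```

The recoded string has one letter per overlapping dinucleotide, so it can be handed to tools that expect a larger, protein-like alphabet.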

Blosum-R and eRNA

Substitutions ??

How much does it cost to turn one nucleotide into another one?

Blosum/Pam style matrix

Matrices estimated on Rfam families
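A Blosum-style score is a log-odds ratio: how often two symbols are seen aligned, compared with how often they would be aligned by chance. The sketch below estimates such scores from the columns of one multiple alignment. It is a simplified illustration, without the sequence clustering step of the original BLOSUM procedure, and it is not claimed to be the procedure used to build Blosum-R; the function name is hypothetical. It works on any alphabet, including a 16-letter recoded one.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(alignment, gap="-"):
    """Estimate rounded 2*log2(observed/expected) scores from aligned rows.

    alignment: list of equal-length strings (rows of a multiple alignment).
    Returns a dict keyed by sorted symbol pairs, e.g. ("A", "G").
    """
    pair_counts = Counter()
    for column in zip(*alignment):
        residues = [r for r in column if r != gap]
        for a, b in combinations(residues, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each symbol, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


scores = log_odds_scores(["AAG", "AAG", "AGG"])
print(scores)
```

Pairs that never co-occur in a column get no entry here; a real matrix would need pseudocounts or a floor value for them.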


Using BlastR

When Nucleic Acids look like Proteins, they can be aligned with Protein Methods
– BlastN → BlastP
– BlastP with eRNA is BlastR

Validating Blast-R

Benchmarking BlastR

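[Figure: benchmark pipeline. A Query is searched with Blast against Rfam and the hits are ranked by E-values.]

[Figure: the Blast search is repeated for each family (Rfam 001, Rfam 002, Rfam …) and the results are summarized as a ROC curve.]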

[Figure: ROC curves with axes False Positives and True Positives, contrasting a Good and a Bad curve]

Area Under Curve

Small AUC Better
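How such an area can be computed from a ranked hit list is sketched below. The axis orientation is an assumption read off the slide labels: false positives are accumulated as true positives are recovered, so a perfect ranking gives an area of 0 and smaller is better. With no tied E-values this equals 1 minus the conventional AUC. The function name and the input format (E-value, true family member or not) are hypothetical.

```python
def area_fp_vs_tp(hits):
    """Normalized area under a false-positives-versus-true-positives curve.

    hits: list of (evalue, is_true_member) tuples for one query.
    Returns a value in [0, 1]; 0 means every true member outranks every false hit.
    Ties between E-values are not treated specially in this sketch.
    """
    ranked = sorted(hits, key=lambda hit: hit[0])  # most significant hit first
    n_true = sum(1 for _, is_true in ranked if is_true)
    n_false = len(ranked) - n_true
    if n_true == 0 or n_false == 0:
        raise ValueError("need at least one true and one false hit")

    false_seen = 0
    area = 0
    for _, is_true in ranked:
        if is_true:
            area += false_seen  # false positives accepted before this true positive
        else:
            false_seen += 1
    return area / (n_true * n_false)


perfect = area_fp_vs_tp([(1e-30, True), (1e-20, True), (1e-5, False)])
mixed = area_fp_vs_tp([(1e-30, True), (1e-20, False), (1e-10, True), (1.0, False)])
worst = area_fp_vs_tp([(1e-30, False), (1e-5, True)])
print(perfect, mixed, worst)
```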

BlastR vs The World

The 3 Components of Blast R

BlastP is better than BlastN

BlosumR makes BlastP a little bit better, and faster

Blast: wuBlast

BlastR and Clustering

Given all Rfam in Bulk, how good is BlastR at reconstituting all the families?

[Figure: Sensitivity plotted against 1-Specificity]

BlastR: In Practice

E-Value Threshold: 10^-20

[Figure: BlastN and BlastR panels]

Take Home

Searching Nucleotides is Difficult

BlastN is not a very good algorithm

Simple Adaptations can improve the situation
– Changing the algorithm (BlastP)
– Changing the Scoring Scheme (BlastP-Nuc)
– Changing the alphabet (BlastR)