RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.


Page 1: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

RNA informatics, Unit 12

BIOL221T: Advanced Bioinformatics for Biotechnology

Irene Gabashvili, PhD

Page 2: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Non-coding DNA (98.5% of the human genome)

- Intergenic regions
- Repetitive elements
- Promoters
- Introns
- mRNA untranslated regions (UTRs)

Page 3: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

RNA Molecules

- mRNA
- tRNA
- rRNA
- Other types of RNA:
  - RNase P: trimming the 5’ end of pre-tRNA
  - Telomerase RNA: maintaining the chromosome ends
  - Xist RNA: inactivation of the extra copy of the X chromosome

Page 4: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

What are RNA and mRNA?

Traditional role as messenger molecule (mRNA)

RNA is a polymer of nucleotides A, U, C, and G transcribed from DNA

DNA: GATTACA -> RNA: GAUUACA
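The GATTACA -> GAUUACA example can be reproduced with a minimal sketch. It assumes the DNA is given as the coding strand, so transcription amounts to replacing T with U; the function name `transcribe` is illustrative and not from the slides.

```python
def transcribe(dna: str) -> str:
    """Return the RNA equivalent of a coding-strand DNA sequence (T -> U).

    Assumption: the input is the coding (sense) strand, so no
    reverse-complement step is needed.
    """
    dna = dna.upper()
    invalid = set(dna) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected DNA characters: {sorted(invalid)}")
    return dna.replace("T", "U")


print(transcribe("GATTACA"))  # GAUUACA
```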

Page 5: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Non-coding RNA (RNA genes)

- RNA enzymes: catalytic RNA
- Ribosomal RNA (rRNA)
- Transfer RNA (tRNA)
- RNAi: RNA-mediated gene regulation
  - MicroRNA (miRNA)
  - Short interfering RNA (siRNA)
- Alternative splicing: small nuclear RNA (snRNA)
- Others: snoRNA, eRNA, srpRNA, tmRNA, gRNA

Structure is essential to function for many ncRNAs

Page 6: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Some biological functions of ncRNA

- Nuclear export
- mRNA cellular localization
- Control of mRNA stability
- Control of translation

The function of the RNA molecule depends on its folded structure

Page 7: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

From Sequence to Structure

Most biological molecules contain one-dimensional information that is called “sequence”, which can be treated as a “string” in computer science.

- Molecules with sequence: DNA, RNA and proteins
- Molecules without much sequence information: lipids and polysaccharides

Page 8: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Can all the properties of a macromolecule be predicted by its sequence?

- Three-dimensional structures
- Alternative splicing
- Kinetic properties
- Etc.

Page 9: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

RNA sequence hierarchy

1D: CCAUCUUCUCCUUGGAGAUUUGG

2D: [figure: secondary structure]

3D: [figure: three-dimensional structure]

Page 10: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Control of iron levels by mRNA secondary structure

[Figure: stem-loop of the Iron Responsive Element (IRE), drawn 5’ to 3’, with paired N-N’ positions in the stem and a conserved loop]

Iron Responsive Element (IRE): recognized by IRP1, IRP2

Page 11: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

[Figure: IRP1/2 binding to the IRE on ferritin (F) mRNA and on transferrin receptor (TR) mRNA, each drawn 5’ to 3’]

F: Ferritin = iron storage
TR: Transferrin receptor = iron uptake

Low iron:
- IRE-IRP binding inhibits translation of ferritin
- IRE-IRP binding inhibits degradation of TR mRNA

High iron:
- IRE-IRP binding is off -> ferritin is translated
- Transferrin receptor mRNA is degraded

Page 12: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

RNA Secondary Structure

[Figure: the sequence 5’-GAUCUUGAUC-3’ folded back on itself into a hairpin, with the STEM and the LOOP labelled]

The RNA molecule folds on itself. The base pairing is as follows: G-C, A-U and G-U, each held together by hydrogen bonds.
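As a small illustration of the pairing rules, the sketch below checks whether two arms of a stem can pair position by position. The helper names `can_pair` and `stem_is_valid` are hypothetical, and the check ignores everything except the G-C, A-U and G-U rules listed on the slide (no loop-size or energy considerations).

```python
# Base pairs allowed on the slide: G-C, A-U and the G-U wobble pair.
ALLOWED_PAIRS = {
    ("G", "C"), ("C", "G"),
    ("A", "U"), ("U", "A"),
    ("G", "U"), ("U", "G"),
}


def can_pair(x: str, y: str) -> bool:
    """Return True if bases x and y may form a pair."""
    return (x.upper(), y.upper()) in ALLOWED_PAIRS


def stem_is_valid(five_prime_arm: str, three_prime_arm: str) -> bool:
    """Check that two stem arms, both written 5'->3', pair antiparallel."""
    if len(five_prime_arm) != len(three_prime_arm):
        return False
    return all(
        can_pair(a, b)
        for a, b in zip(five_prime_arm, reversed(three_prime_arm))
    )


# The hairpin drawn on the slide: GAUC (arm) + UU (loop) + GAUC (arm).
print(stem_is_valid("GAUC", "GAUC"))  # True
```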

Page 13: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

RNA Secondary Structure

[Figure: folded RNA, drawn 5’ to 3’, with the following elements labelled: INTERNAL LOOP, HAIRPIN LOOP, BULGE, STEM, DANGLING ENDS]

Page 14: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Examples of known interactions of RNA secondary structural elements

- Pseudo-knot
- Kissing hairpins
- Hairpin-bulge contact

These patterns are excluded from the prediction schemes as their computation is too intensive.

Page 15: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

What is RNA secondary structure/folding?

[Figure: folded RNA with the following elements labelled: bulge loop, helix (stem), hairpin loop, internal loop, multi-branch loop]

Page 16: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

2D: mRNA Regulatory elements

Mini-Rose and Macro-Rose at 37 °C and 42 °C

Page 17: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

RNA secondary structure representation

Legal structure

also:

Page 18: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

RNA 2D structure in Matlab

Page 19: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

3D motifs

Page 20: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

16S rRNA 2° structure can be predicted from 1° structure

Bacteria, Archaea, Eukarya

Page 21: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

How is RNA folding done?

Simple Nussinov Folding Algorithm

- Only scores interactions between paired bases
- Useful for demonstrating general structure of more complex folding algorithms

The algorithm computes the score for the optimal structure from base i to base j. We want the highest scoring fold, so the score is the maximum over several cases:

- Base i is unpaired: consider pairing between i+1 and j
- Base j is unpaired: consider pairing between i and j-1

δ(i, j) = score for a pairing between i and j.

Page 22: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

How is RNA folding done? (cont’d)

Simple Nussinov Folding Algorithm

- Pair i and j. Now consider pairing between i+1 and j-1.

Page 23: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

How is RNA folding done? (cont’d)

Simple Nussinov Folding Algorithm

- i and j begin a bifurcation. Consider every possible bifurcation point k. Sum scores from each folded structure.
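The four cases (i unpaired, j unpaired, i paired with j, bifurcation at k) can be combined into one dynamic-programming table. What follows is a minimal sketch, not the slides' own implementation. It assumes δ(i, j) = 1 for G-C, A-U and G-U pairs and 0 otherwise; the names `nussinov`, `S` and the `min_loop` parameter (minimum number of unpaired bases enclosed by a pair, 0 by default) are illustrative additions.

```python
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def nussinov(seq: str, min_loop: int = 0):
    """Return (best score, one optimal structure in dot-bracket notation).

    S[i][j] = best score for the subsequence i..j, taken as the max of:
      S[i+1][j]                    base i unpaired
      S[i][j-1]                    base j unpaired
      S[i+1][j-1] + delta(i, j)    i pairs with j
      S[i][k] + S[k+1][j]          bifurcation at k, i < k < j
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0, ""

    def delta(i, j):
        # Assumed scoring: 1 per allowed pair, 0 otherwise.
        return 1 if j - i > min_loop and (seq[i], seq[j]) in PAIRS else 0

    # Cells with i >= j stay 0 and act as the base case.
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j], S[i][j - 1])
            if delta(i, j):
                best = max(best, S[i + 1][j - 1] + delta(i, j))
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best

    # Traceback: recover one structure that achieves S[0][n-1].
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if S[i][j] == S[i + 1][j]:
            stack.append((i + 1, j))
        elif S[i][j] == S[i][j - 1]:
            stack.append((i, j - 1))
        elif delta(i, j) and S[i][j] == S[i + 1][j - 1] + delta(i, j):
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if S[i][j] == S[i][k] + S[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return S[0][n - 1], "".join(structure)


score, fold = nussinov("GAUCUUGAUC")
print(score, fold)
```

Because the score only counts pairs, several structures can tie for the maximum; the traceback returns one of them. Energy-based tools replace the pair count with free-energy terms but keep the same table-filling pattern.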

Page 24: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

CONTRAfold

Problem: Given an RNA sequence, predict the most likely secondary structure

AUCCCCGUAUCGAUCAAAAUCCAUGGGUACCCUAGUGAAAGUGUAUAUACGUGCUCUGAUUCUUUACUGAGGAGUCAGUGAACGAACUGA

Page 25: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

How does CONTRAfold work?

CONTRAfold looks at features that indicate a good structure. For example:

- C-G base pairings
- A-U base pairings
- Helices of length 5
- Hairpin loops of size 9
- Bulge loops of size 2
- CG/GC base-pair stacking interactions

These examples are called thermodynamic parameters because they represent free energy values

Page 26: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

What is an RNA regulatory motif?

- Motif: a conserved sequence element
- A regulator binds to a regulatory motif
- RNA regulatory motif: a motif used to regulate translation

[Figure: an RNA beginning GAUUACA... that contains the regulatory motif AUUAC, bound either by a regulatory protein or by a microRNA carrying the complementary bases UAAUG]

Page 27: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

What is an accessible motif?

If a sequence is part of an intramolecular hybridization, it is unlikely to bind to regulators

We define a motif as “accessible” if none of its nucleotides is hybridized as part of the folding
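The accessibility definition translates directly into a filter over motif occurrences. The sketch below assumes the fold is available as a dot-bracket string of the same length as the sequence ("." for an unpaired base, "(" or ")" for a paired one); that representation, the function name `accessible_motif_sites` and the example sequence and structure are illustrative assumptions, not data from the slides.

```python
def accessible_motif_sites(seq: str, structure: str, motif: str) -> list[int]:
    """Return 0-based start positions where `motif` occurs fully unpaired.

    A site counts as accessible only if every nucleotide of the motif
    is marked "." in the dot-bracket structure.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    sites = []
    start = seq.find(motif)
    while start != -1:
        window = structure[start:start + len(motif)]
        if all(symbol == "." for symbol in window):
            sites.append(start)
        start = seq.find(motif, start + 1)  # allow overlapping occurrences
    return sites


# Hypothetical example: the first AUUAC sits in a stem, the second in a tail.
seq = "AUUACGAAAGUAAUCCAUUACC"
structure = "(((((....)))))........"
print(accessible_motif_sites(seq, structure, "AUUAC"))  # [16]
```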

Page 28: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Accessible motifs cont’d

Therefore, only accessible sequences should be scanned for regulatory motifs.

Page 30: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Results: Degradation-Related Motifs

Page 31: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Prediction Tools based on Energy Calculation

Fold, Mfold
- Zuker & Stiegler (1981) Nuc. Acids Res. 9:133-148
- Zuker (1989) Science 244:48-52

RNAfold
- Vienna RNA secondary structure server
- Hofacker (2003) Nuc. Acids Res. 31:3429-3431

Page 32: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Links

- http://rna.tbi.univie.ac.at/
- http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi
- http://frontend.bioinfo.rpi.edu/applications/mfold/
- http://frontend.bioinfo.rpi.edu/applications/mfold/cgi-bin/rna-form1.cgi
- http://bioweb2.pasteur.fr/nucleic/intro-en.html#rna
- http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=mfold

Page 33: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

RNAalifold (Hofacker 2002), from the Vienna RNA package

Predicts the consensus secondary structure for a set of aligned RNA sequences by using a modified dynamic programming algorithm that adds a covariance term to the standard energy model

Improvement in prediction accuracy

Page 34: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Other related programs

- COVE: RNA structure analysis using the covariance model (implementation of the stochastic context-free grammar method)
- QRNA (Rivas and Eddy 2001): searching for conserved RNA structures
- tRNAscan-SE: tRNA detection in genome sequences

Sean Eddy’s Lab, WU: http://www.genetics.wustl.edu/eddy

Page 35: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

RNA families

Rfam: general non-coding RNA database (most of the data is taken from specific databases)

http://www.sanger.ac.uk/Software/Rfam/

Includes many families of non-coding RNAs and functional motifs, as well as their alignments and their secondary structures

Page 36: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Rfam

379 different RNA families or functional motifs from mRNA UTRs etc.

[Figure labels: GENE, INTRON, Cis ELEMENTS]

Page 37: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Scopes of sequence analysis

- Sequences only
  - Sequences of DNA/RNA/proteins for defining transcription units and introns
- Sequence with other kinds of data
  - Microarray
  - 3D data such as comparative modeling
  - Metabolic data

Page 38: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Purposes of Sequence Analysis include

- Identification of coding regions
- Identification of regulatory elements
- Identifying events of genetic recombination
- Identifying the existence of selective pressures
- Searching for homologous sequences
- Identifying shared patterns of a group of sequences
- Modeling secondary or 3D structures (ab initio modeling)

Page 39: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

What can you do with a single sequence?

- Information content in a sequence
- G/C content
- Codon usage
- Synonymous/non-synonymous mutations
- Periodicity
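G/C content is the simplest of these single-sequence measures. A minimal sketch, assuming the sequence is a plain string of nucleotide letters (the function name `gc_content` is illustrative):

```python
def gc_content(seq: str) -> float:
    """Return the fraction of bases in `seq` that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


# GAUUACA has one G and one C among seven bases.
print(round(gc_content("GAUUACA"), 3))  # 0.286
```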

Page 40: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Based on 16S rRNA sequences, [...] 3 [...]s.

Page 41: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Information contained in SSU rRNA genes

- SSU rRNA sequences are used for universal phylogenetic construction
- Its length is ~1500 bp, i.e. ~3000 bits of information
- Life began ~3.5-3.8 billion years ago
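The ~3000-bit figure is consistent with treating each position as one of four equally likely nucleotides, which gives log2(4) = 2 bits per base. The sketch below makes that arithmetic explicit; the uniform-alphabet assumption and the function name `max_information_bits` are illustrative, and real sequences carry less information than this upper bound.

```python
import math


def max_information_bits(length: int, alphabet_size: int = 4) -> float:
    """Upper bound on information in a sequence: length * log2(alphabet size)."""
    return length * math.log2(alphabet_size)


print(max_information_bits(1500))  # 3000.0
```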

Page 42: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Maximum average change of information in SSU rRNA genes

1 bit/million years of evolution

[...] < 1 [...]

Page 43: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Homology vs Similarity

- “Homology” implies common origin.
- Homologous sequences are generally recognized by similarity.
- Similar sequences may not be homologous; the similarity may be due to convergent evolution or just to chance.
- Homology is qualitative, while similarity is quantitative.
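Percent identity is one common way to put a number on similarity. The sketch below assumes the two sequences are already aligned to the same length with "-" marking gaps, and that gapped columns are skipped; both choices, and the name `percent_identity`, are assumptions for illustration. A high value suggests homology but does not prove it.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of compared (ungapped) alignment columns with identical bases."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: columns containing a gap are ignored
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# One mismatch in seven columns.
print(round(percent_identity("GAUUACA", "GACUACA"), 1))  # 85.7
```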

Page 44: RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Orthologous vs Paralogous

- Orthologues usually refer to homologous genes with the same function in different organisms.
- Paralogues usually refer to homologous genes with different functions, usually in the same organism.
- Bioinformatic tools cannot differentiate between the two.