Transcriptomics 101
Nicole Cloonan, Expression Genomics Laboratory (http://www.expressiongenomics.org)
Winter School, 7th July 2009
Page 1: Expression Genomics Laboratory - Bioinformatics (bioinformatics.org.au/ws09/presentations/Day2_NCloonan.pdf)

Transcriptomics 101

Expression Genomics Laboratory
http://www.expressiongenomics.org

Nicole Cloonan

Winter School, 7th July 2009

Page 2:

I want to know the maths

Page 3:

I want the best software package

1 Align tags to the genome
2 Measure gene expression
3 Find mutations
4 Find novel expression
5 Assemble transcripts
6 Win Nobel Prize

[Bar chart: values on a 0 to 25 scale for categories 1 to 5, June 2009]

Page 4:

Presentation Outline

Introduction
• What is a transcriptome?
• What can we learn from studying it?

Transcriptomics
• Genomic tools for transcriptomics.
• Deriving biological insight from transcriptomics.

Sequencing the transcriptome
• What’s old is new again.
• Double stranded protocols.
• Strand specific protocols.

Working with RNA data
• Mapping and quantitation.
• Genomic context of gene expression.
• SNPs, exon-junctions, novel genes.

Working with miRNA data
• The problem of limited information content.
• Known and novel expression.
• IsomiRs.

Page 5:

Single transcript gene: one gene, one mRNA, one protein

[Diagram: genomic DNA with the TSS, ATG and pA marked; below it, the mRNA (AUG to AAA) containing all exons, which encodes the full length protein]

Legend: TSS = transcription start site; AUG = translation start site; pA = polyadenylation signal; AAA = polyadenylation. Symbols: protein coding regions, non-coding regions, genomic DNA, microRNAs, spliced intron.

Page 6:

Alternative splicing: one gene, many mRNAs, many proteins

• All exons: full length protein
• Intron retention: new STOP codon, truncated protein, altered function
• Exon skipping: changed domain content, altered function
• Exon skipping: new STOP codon, truncated protein, altered function

[Diagram: gene model with TSS, ATG and pA, and one transcript (AUG to AAA) per case]

Page 7:

Alternative promoters: expands coding output and gene control

• All exons: full length protein
• Alt TSS: differential control of gene, tissue specific or temporally specific, altered 5’ UTR content
• Alt TSS: altered 5’ UTR content, new ATG codon, expanded protein, altered function

[Diagram: gene model with TSS, ATG and pA, and one transcript (AUG to AAA) per case]

Page 8:

Alternative 3’ exons: can change ORF and 3’UTR content

• All exons: full length protein
• Alternative 3’ exon: different 3’UTR content, can change the ORF
• Alternative pA: different 3’UTR content

[Diagram: gene model with TSS, ATG and pA, and one transcript (AUG to AAA) per case]

Page 9:

Transcriptional complexity

[Diagram: one locus with multiple TSSs, ATGs and pA signals giving rise to many transcripts (ATG to AAA), together with small RNAs: PASR, TASR, miRNA, tiRNA]

Legend: TSS = transcription start site; ATG = translation start site; pA = polyadenylation signal; AAA = polyadenylation. Symbols: protein coding regions, non-coding regions, genomic DNA, microRNAs, spliced intron.

Page 11:

Microarrays

Sample to study → Extract RNA → Label RNA → Hybridize → Scan

Prepare Microarray (Short Probes)

Page 12:

Microarray based profiling

13.5dpc male vs female gonad

[Scatter plot: male gene expression (x-axis) against female gene expression (y-axis), with Wnt4, Sox9 and Amh labelled; inset labelled Female and Male]

Gene expression predates morphology

Page 13:

Microarray based profiling

Gene expression patterns correlate strongly with prognosis

Molecular profiling of human cancer. Nature Reviews Genetics 1; 48-56 (2000)

Page 14:

Limitations of microarrays

• Limited sensitivity
• Limited dynamic range
• Cross-hybridization
• Detection limited by probe design

Page 15:

Using arrays to survey transcriptional complexity

[Diagram: complex locus with multiple TSSs, ATGs and pA signals and its transcripts; approaches shown: microarray, exon arrays, exon-junction arrays]

Page 17:

RNA sequencing

[Diagram: complex locus with multiple TSSs, ATGs and pA signals and its transcripts; approaches shown: 3’ SAGE, MPSS, di-tag/mate-pair, 5’ SAGE]

Page 18:

Shotgun sequencing

[Diagram: complex locus with multiple TSSs, ATGs and pA signals and its transcripts]

Page 19:

SQRL protocol

Step 1: pre-process RNA
Step 2: 1st strand cDNA
Step 3: template switch
Step 4: PCR amplification

[Diagram labels: polyadenylated RNA (AAAAAAAA); random primers NNNNNN carrying the FDV adaptor; CCC at the cDNA end paired with rGGG-RDV; final products flanked by RDV and FDV]

The SQRL protocol generates antisense short-tags

Page 20:

LEGenD protocol

Step 1: pre-process RNA
Step 2: Adaptor Ligation
Step 3: 1st Strand cDNA
Step 4: PCR amplification

[Diagram labels: RNA (AAAAAAAA); FDV and RDV adaptors with NN overhangs ligated to the fragments; RDV and FDV primers for cDNA synthesis and PCR]

The LEGenD protocol generates sense short-tags

Page 21:

Most common RNAseq protocols

Step 1: pre-process RNA
Step 2: 1st and 2nd strand cDNA
Step 3: Adaptor Ligation
Step 4: PCR amplification

[Diagram labels: polyadenylated RNA (AAAAAAAA); random primers NNNNNN; double stranded cDNA with FDV and RDV adaptors at either end]

The RNAseq protocol generates unstranded short-tags

Page 23:

Aligning tags to a reference genome

[Diagram: gene model with TSS, ATG and pA, and its spliced transcript (AUG to AAA)]

The fastest alignment methods are ungapped… but what about junctions?

Page 24:

Random fragmentation of RNA libraries

[Histogram: frequency (0 to 350) against length of captured RNA (1 to 86), with the short-tag length marked. Schematic: captured RNA followed by adaptor]

Page 25:

Random fragmentation of RNA libraries

[Two histograms: frequency (0 to 350) against length of captured RNA (1 to 86), with the short-tag length marked. Schematic: captured RNA followed by adaptor]

Page 26:

Random fragmentation of RNA libraries

[Three histograms: frequency (0 to 350) against length of captured RNA (1 to 86), with the short-tag length marked. Schematic: captured RNA followed by adaptor]

Page 27:

Allowing for errors when mapping

[Diagram: tags compared with the reference DNA. Labels: amplification errors, measurement errors, polymorphisms, allelic specific expression, RNA editing, mapping errors, base changes in RNA sample]

What is the minimum alignment length I should use for my genome?

How many errors should I allow at the mapping length used?
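One rough way to approach these two questions is to estimate how many chance matches a tag would get. The sketch below is not from the talk. It assumes a genome of uniformly random, independent bases (real genomes are repetitive, so it gives a lower bound), searched on both strands, with substitutions as the only mismatch type. Under those assumptions the expected number of chance hits for a tag of length L with up to m mismatches in a genome of size G is 2G · Σ C(L,k)·3^k / 4^L, summed over k = 0..m.

```python
from math import comb


def neighbourhood(length: int, mismatches: int) -> int:
    """Number of distinct sequences within `mismatches` substitutions of one tag."""
    return sum(comb(length, k) * 3 ** k for k in range(mismatches + 1))


def expected_random_hits(genome_size: float, length: int, mismatches: int = 0) -> float:
    """Expected chance matches on both strands of a uniformly random genome."""
    return 2 * genome_size * neighbourhood(length, mismatches) / 4 ** length


if __name__ == "__main__":
    # 3.1e9 is an approximate human genome size, used here only as an illustration.
    for length in (25, 30, 35):
        for mm in range(4):
            hits = expected_random_hits(3.1e9, length, mm)
            print(f"length={length} mismatches={mm} expected chance hits={hits:.2e}")
```

Shorter tags and more permitted mismatches both raise the chance-hit estimate, which is why the two questions on the slide have to be answered together.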

Page 28:

Unique- vs multi-mapping tags

Page 29:

Unique ≠ accurate

[Bar chart: percentage (0% to 100%) for each setting 35.0.0, 35.1.0, 35.1.1, 35.2.0, 35.2.1, 35.3.0, 30.0.0, 30.1.0, 30.1.1, 30.2.0, 30.2.1, 30.3.0, 25.0.0, 25.1.0, 25.1.1, 25.2.0, 25.2.1, 25.3.0. Series: % sim, % mum 5, % mum 10, % sims in known exons]

Page 30:

Unique ≠ accurate

Chr A: tagcgggatctctcgagagctcgcgat
Chr B: tagcgggatctctcgacagctcgcgat

Tag             Chr A   Chr B
tctctcgacagct   1 MM    0 MM
tctctcgagagct   0 MM    1 MM

(MM = mismatches)
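The example on this slide can be checked with a few lines of code. This is a minimal sketch, not part of the talk: it uses ungapped comparison and the two sequences shown for Chr A and Chr B, and the function names are illustrative.

```python
def best_mismatches(tag: str, locus: str) -> int:
    """Fewest mismatches for an ungapped placement of `tag` anywhere in `locus`."""
    return min(
        sum(a != b for a, b in zip(tag, locus[i:i + len(tag)]))
        for i in range(len(locus) - len(tag) + 1)
    )


LOCI = {
    "Chr A": "tagcgggatctctcgagagctcgcgat",
    "Chr B": "tagcgggatctctcgacagctcgcgat",
}


def loci_within(tag: str, max_mm: int) -> list:
    """Names of loci where the tag aligns with at most `max_mm` mismatches."""
    return [name for name, seq in LOCI.items() if best_mismatches(tag, seq) <= max_mm]


if __name__ == "__main__":
    tag = "tctctcgacagct"
    print(loci_within(tag, 0))  # one locus only: the tag looks "unique"
    print(loci_within(tag, 1))  # both loci: the tag is really ambiguous
```

If the tag came from Chr A and carries a single sequencing error, a zero-mismatch search reports it as uniquely mapping to Chr B, which is the point of the slide title.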

Page 31:

Unique ≠ accurate

[Bar chart: percentage (0% to 100%) for each setting 35.0.0, 35.1.0, 35.1.1, 35.2.0, 35.2.1, 35.3.0, 30.0.0, 30.1.0, 30.1.1, 30.2.0, 30.2.1, 30.3.0, 25.0.0, 25.1.0, 25.1.1, 25.2.0, 25.2.1, 25.3.0. Series: % sim, % mum 5, % mum 10, % sims in known exons]

IDEAL: match at the longest possible length

Page 32:

RNA-MATE v1.1
http://www.expressiongenomics.org/RNA-MATE/

[Flowchart: Start; Read Configuration File; check quality? (Yes: Quality Check); Genome/Junction Alignment; tag aligned? (No: Trim Tag; Yes: Select Single Mapping Tags); rescue multimappers? (Yes: Multimapping Tag Rescue); Create Wiggle Plot Files; Create Junction BED Files; End]

• perl/python coded
• unix command line (trialling web interface)
• currently set up for PBS managed cluster
• GNU General Public License v3.0
• junction libraries available

Page 33:

RNA-MATE v1.1
http://www.expressiongenomics.org/RNA-MATE/

[Flowchart: Start; Read Configuration File; check quality? (Yes: Quality Check); Genome/Junction Alignment; tag aligned? (No: Trim Tag; Yes: Select Single Mapping Tags); rescue multimappers? (Yes: Multimapping Tag Rescue); Create Wiggle Plot Files; Create Junction BED Files; End]

Page 34:

Configuration File

RNA-MATE v1.1

tag_length=35,30
num_mismatch=3
mask=11111111111111111111111111111111111
max_multimatch=10
expect_strand=+
rescue_window=10
exp_name=tag_20000_F3
chromosomes=chrM,chr2
chr_path=/data/matching/hg18_fasta/
junction=/data/libraries/hg18_junctions.fasta.cat
junction_index=/data/libraries/hg18_junctions.fasta.index
output_root=/data/cxu/
output_dir=/data/cxu/tag_20000_F3/
raw_qual=/data/raw/tag20000.qual
raw_csfasta=/data/raw/tag20000.csfasta
quality_check=true
script_chr_start=/data/matching/chr_start.pl
script_chr_wig=/data/matching/chr_wig.pl
f2m=/data/matching/f2m.pl
mapreads=/data/matching/mapreads
master_script=/data/matching/rna-mate-v1.0.pl
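A configuration file of this key=value form is straightforward to read programmatically. The sketch below is an illustration only, not RNA-MATE's own parser; `parse_config` is a hypothetical helper, and the interpretation of `tag_length` as a comma-separated list of mapping lengths is an assumption based on the values shown.

```python
def parse_config(text: str) -> dict:
    """Parse key=value lines; blank lines and lines without '=' are skipped."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


EXAMPLE = """\
tag_length=35,30
num_mismatch=3
max_multimatch=10
expect_strand=+
quality_check=true
"""

config = parse_config(EXAMPLE)
# Assumed interpretation: mapping lengths to try, longest first.
tag_lengths = [int(n) for n in config["tag_length"].split(",")]
max_mismatches = int(config["num_mismatch"])
check_quality = config["quality_check"] == "true"

if __name__ == "__main__":
    print(tag_lengths, max_mismatches, check_quality)
```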

Page 35:

Quality Check (optional)

RNA-MATE v1.1

[Diagram: quality values (QV) of the basecalls in a tag, shown for 25mers, 30mers and 35mers. Pass: fewer than 5 basecalls where QV is below 10. Otherwise: Fail]
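The pass/fail rule on this slide can be written as a small filter. This is a sketch under stated assumptions rather than RNA-MATE's implementation: it assumes the rule is applied to the first 25, 30 or 35 basecalls of a tag, and `passes_quality` is a hypothetical name.

```python
def passes_quality(qvs, length, min_qv=10, max_low=5):
    """True if fewer than `max_low` of the first `length` basecalls have QV below `min_qv`."""
    low_calls = sum(1 for qv in qvs[:length] if qv < min_qv)
    return low_calls < max_low


if __name__ == "__main__":
    # Hypothetical tag: good quality for 25 calls, poor quality in the last 10.
    qvs = [20] * 25 + [5] * 10
    for length in (25, 30, 35):
        print(length, passes_quality(qvs, length))
```

Under this assumption a tag whose quality drops towards its 3’ end can fail as a 35mer and still pass as a 25mer.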

Page 36:

Genome Alignment

RNA-MATE v1.1: Recursive mapping strategy

[Diagram labels: Genome, Junction, Size, Discovery Bin, Matched Data]
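A recursive mapping strategy can be sketched as a loop over decreasing tag lengths. The version below is a toy illustration, not the RNA-MATE code: it assumes each tag is tried against the genome and then the junction library at the longest length, trimmed from the 3’ end if it fails, and sent to a discovery bin if nothing matches. Exact substring search stands in for a real aligner, and the sequences are invented.

```python
def recursive_map(tag, genome, junctions, lengths=(35, 30, 25)):
    """Return (target, matched_length) for the longest prefix that matches exactly."""
    for length in lengths:
        if len(tag) < length:
            continue
        prefix = tag[:length]
        if prefix in genome:
            return "genome", length
        if any(prefix in seq for seq in junctions):
            return "junction", length
    return "discovery_bin", None


# Toy data (hypothetical): two exons separated by an intron.
EXON_A, INTRON, EXON_B = "AC" * 20, "G" * 30, "TTCA" * 10
GENOME = EXON_A + INTRON + EXON_B
JUNCTIONS = [EXON_A[-34:] + EXON_B[:34]]

if __name__ == "__main__":
    print(recursive_map(GENOME[3:38], GENOME, JUNCTIONS))                # full-length genomic tag
    print(recursive_map(EXON_A[-18:] + EXON_B[:17], GENOME, JUNCTIONS))  # tag spanning the junction
    print(recursive_map(GENOME[3:33] + "TTTTT", GENOME, JUNCTIONS))      # 30 nt insert plus adaptor read-through
    print(recursive_map("G" * 10 + "T" * 25, GENOME, JUNCTIONS))         # matches nothing
```

Trying the longest length first follows the "match at the longest possible length" ideal stated earlier in the talk.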

Page 37:

Exon-junction libraries

[Diagram: complex locus with multiple TSSs, ATGs and pA signals and its transcripts (ATG to AAA)]

Legend: TSS = transcription start site; ATG = translation start site; pA = polyadenylation signal; AAA = polyadenylation. Symbols: protein coding regions, non-coding regions, genomic DNA, microRNAs, spliced intron.
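One common way to let an ungapped aligner find spliced tags is to map against a library of exon-exon junction sequences. The sketch below shows the general idea only and is not the procedure used to build the libraries mentioned in the talk. It assumes every ordered pair of exons within a gene is joined, with L-1 bases taken from each side for tag length L so that any matching tag must cross the junction. `junction_library` and its input format are hypothetical.

```python
from itertools import combinations


def junction_library(exons, tag_length):
    """Build junction sequences from exon sequences given in transcript order."""
    flank = tag_length - 1  # one base short of a full tag on each side
    library = {}
    for (i, donor), (j, acceptor) in combinations(enumerate(exons), 2):
        library[f"exon{i + 1}_exon{j + 1}"] = donor[-flank:] + acceptor[:flank]
    return library


if __name__ == "__main__":
    toy_exons = ["A" * 10, "C" * 10, "G" * 10]  # invented sequences
    for name, seq in junction_library(toy_exons, tag_length=5).items():
        print(name, seq)
```

Joining all pairs, not just adjacent exons, allows exon skipping events between annotated exons to be detected, at the cost of a larger library.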

Page 38:

Multimapping Rescue (optional)

RNA-MATE v1.1

• Advantages:
  • can add 5-20% more data
  • can interrogate genomic regions previously hidden (genomic “black holes”)
• Disadvantages:
  • memory hungry
  • can slow down analysis

Page 39:

Multimapping Rescue (optional)

[Diagram: a multi-mapping region matching Locus A, Locus B and Locus C on the genomic DNA, shown with exons and with positive and negative strand expression; a user defined window width surrounds each candidate; Locus A is marked as predicted]
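The rescue idea in this diagram can be sketched as weighting each candidate locus by the uniquely mapped tags found nearby. This is an assumption-laden illustration, not the RNA-MATE algorithm: it assumes weights proportional to the number of unique tag starts within the user defined window, and it ignores strand. All names and coordinates are hypothetical.

```python
from bisect import bisect_left, bisect_right


def rescue_weights(candidates, unique_starts, window):
    """Weight each candidate (chrom, pos) by nearby uniquely mapped tag starts.

    unique_starts maps chrom -> sorted list of start positions.
    Returns {} when no candidate has supporting unique tags.
    """
    counts = {}
    for chrom, pos in candidates:
        starts = unique_starts.get(chrom, [])
        lo = bisect_left(starts, pos - window)
        hi = bisect_right(starts, pos + window)
        counts[(chrom, pos)] = hi - lo
    total = sum(counts.values())
    if total == 0:
        return {}
    return {locus: n / total for locus, n in counts.items()}


if __name__ == "__main__":
    unique = {"chrA": [95, 100, 104, 108], "chrB": [500], "chrC": []}
    hits = [("chrA", 100), ("chrB", 495), ("chrC", 20)]
    print(rescue_weights(hits, unique, window=10))
```

In this toy case the locus on chrA receives most of the weight, matching the role of the predicted locus in the diagram.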

Page 40:

BED and bedGraphs

RNA-MATE v1.1

• outputs:
  • strand specific bedGraphs (wiggle plots)
  • strand specific start site bedGraphs (for tag counting applications)
  • “expected strand” junction BED file (for visualization)
  • “unexpected strand” junction BED file (for assessing library directionality)
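For readers unfamiliar with bedGraph, each line holds a chromosome, a 0-based start, an exclusive end and a value. The sketch below shows how per-base coverage for one strand could be collapsed into such lines. It is an illustration, not RNA-MATE's writer, and `coverage_to_bedgraph` is a hypothetical helper.

```python
def coverage_to_bedgraph(chrom, coverage):
    """Collapse per-base coverage into bedGraph lines (0-based, half-open), skipping zeros."""
    lines = []
    start = 0
    for pos in range(1, len(coverage) + 1):
        # Close the current run at the end of the array or when the value changes.
        if pos == len(coverage) or coverage[pos] != coverage[start]:
            if coverage[start] != 0:
                lines.append(f"{chrom}\t{start}\t{pos}\t{coverage[start]}")
            start = pos
    return lines


if __name__ == "__main__":
    for line in coverage_to_bedgraph("chr1", [0, 0, 2, 2, 3, 0, 1]):
        print(line)
```

Writing one such file per strand would give strand specific tracks of the kind listed above.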

Page 41:

Genomic context of expression

Gene Symbol GRB7

[Genome browser view, annotated with:]
• Single nucleotide resolution coverage plot
• Exon-exon junction usage
• Known gene structure (exons and introns)
• Alternative splicing
• Novel exons or novel transcripts

Page 42:

Future Versions

RNA-MATE v1.1

• Web browser interface
• Integration of SNP analysis pipeline for transcriptome
• Allow the integration of other mapping algorithms
• Allow the integration of other exon-junction identification strategies

Page 43:

Novel exon-junction discovery (systematic)

[Diagram: gene model with multiple TSSs, ATGs and pA signals]

Pros: Computationally easy

Cons: Does not find all novel splicing

Page 44:

Novel exon-junction discovery (de novo)

Non-matching tags
Create consensus read
remove adaptor sequence
Blat against genome

aligned reads:

ACGATATGACACGTACAGTCAAATCGT
ACGATATTACACGTACATTCAAGTCGT
ACGATATTACACGCACAGTCAAGTCGT
 CGATATTACACGTCCAGTCAAGTCGTT
   ATATTTCACGTACAGTCAAGTCGTTCG
   ATATTAAACGTACAGTCAAGTCGTTCG
     ATTGCACGTACAGTCAAGTCGTTCGGA
     ATTACACGTACAGTCACGTCGTTCGGA
         CACGTACAGTCAAGTCGTTCGGAACCT
         CACGTACCTTCAAGTCGTTCGGAACCT
ACGATATTACACGTACAGTCAAGTCGTTCGGAACCT  consensus read

Pros: De novo

Cons: Requires high coverage
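The "create consensus read" step can be illustrated with a column-wise majority vote over the aligned reads shown on this slide. This is a sketch, not the method used in the talk: the read offsets are inferred from the sequences themselves and should be treated as an assumption, and a real pipeline would first have to compute the alignment.

```python
from collections import Counter

# (offset, read) pairs; the offsets are an assumption inferred from the slide.
ALIGNED = [
    (0, "ACGATATGACACGTACAGTCAAATCGT"),
    (0, "ACGATATTACACGTACATTCAAGTCGT"),
    (0, "ACGATATTACACGCACAGTCAAGTCGT"),
    (1, "CGATATTACACGTCCAGTCAAGTCGTT"),
    (3, "ATATTTCACGTACAGTCAAGTCGTTCG"),
    (3, "ATATTAAACGTACAGTCAAGTCGTTCG"),
    (5, "ATTGCACGTACAGTCAAGTCGTTCGGA"),
    (5, "ATTACACGTACAGTCACGTCGTTCGGA"),
    (9, "CACGTACAGTCAAGTCGTTCGGAACCT"),
    (9, "CACGTACCTTCAAGTCGTTCGGAACCT"),
]


def consensus(aligned):
    """Majority base in each alignment column; 'N' where no read covers a column."""
    width = max(offset + len(read) for offset, read in aligned)
    columns = [Counter() for _ in range(width)]
    for offset, read in aligned:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


if __name__ == "__main__":
    print(consensus(ALIGNED))
```

Because each read carries at most a few errors, the vote recovers a clean sequence only where several reads overlap, which is one reason the approach requires high coverage.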

Page 45:

Novel exon-junction discovery (TopHat)

http://tophat.cbcb.umd.edu

[Diagram: gene model with multiple TSSs, ATGs and pA signals, and a transcript (ATG to AAA)]

Pros: Very sensitive

Cons: Relies on reference

Page 46:

Substitutions and micro-indels

Dnttip3 Arid4b

Map tags to genome
Align tags to identify SNPs
Annotate SNPs (eg. SNP is non-synonymous in an ORF)
Rank SNPs (eg. polyphen, Canpredict)
Validate SNPs (eg. Sanger Sequencing)

Reference       ACGATATTACACGTACACTCAAGTCGTTCGGAACCT
Aligned Reads   ACGATATTACACGTACATTCAAATCGT
                ACGATATTACACGTACATTCAACTCGT
                ACGATATTACACGCACATTCAAGTCGT
                 CGATATTACACGTACATTCAAGTCGTT
                   ATATTTCACGTACATTCAAGTCGTTCG
                   ATATTAAACGTACATTCAAGTCGTTCG
                     ATTACACGTACATTCAAGTCGTTCGGA
                     ATTACACGTACATTCACGTCGTTCGGA
                         CACGTACATTCAAGTCGTTCGGAACCT
SNP call        -----------------T------------------

Page 47:

“Diagnostic” features

[Diagram: four transcripts (A, B, C, D), each ending in AAA, drawn with protein coding regions, non-coding regions and spliced introns]

Transcripts defined by Aceview (September 2007 release)

Page 48:

“Diagnostic” features

[Diagram: four transcripts (A, B, C, D), each ending in AAA, drawn with protein coding regions, non-coding regions and spliced introns]

92.6% of known transcripts have diagnostic features (covers 99.8% of loci)

217127 diagnostic features covering 160156 individual transcripts from 65254 loci

Accuracy relies on the quality of the gene models used.

Different gene models will give different results from the same data.
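To make the idea concrete, the sketch below finds features that occur in exactly one transcript of a locus. It assumes that this is what "diagnostic" means here, which the slide implies but does not spell out. The transcript models are invented: e1 to e4 stand for exons and j1-2 style labels for junctions.

```python
from collections import Counter


def diagnostic_features(transcripts):
    """Map each transcript to the features found in no other transcript."""
    usage = Counter(f for feats in transcripts.values() for f in feats)
    return {
        name: {f for f in feats if usage[f] == 1}
        for name, feats in transcripts.items()
    }


# Hypothetical gene models for one locus.
MODELS = {
    "A": {"e1", "e2", "e3", "e4", "j1-2", "j2-3", "j3-4"},
    "B": {"e1", "e2", "e4", "j1-2", "j2-4"},
    "C": {"e1", "e3", "e4", "j1-3", "j3-4"},
    "D": {"e1", "e2", "j1-2"},
}

if __name__ == "__main__":
    for name, feats in diagnostic_features(MODELS).items():
        print(name, sorted(feats))
```

Transcript D has no diagnostic feature in this toy locus, and adding or removing a model changes which features count as diagnostic, which is why results depend on the gene models used.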

Page 49:

Differential Gene Expression

[Panels: Microarray, Sequencing]

http://www.bioconductor.org/packages/2.3/bioc/html/edgeR.html

Page 50:

Caution on Shotgun RNAseq analysis

Oshlack and Wakefield, Biol Direct. 2009; 4: 14.

Categories of genes that are enriched for short sequences:

• innate immunity
• cell-cell communication
• signal transduction

Page 52:

MicroRNAs can inhibit translation of mRNAs

[Diagram, nucleus to cytoplasm: pri-miRNA; Drosha Processing; pre-miRNA; Dicer Processing; miRNA duplex; Asymmetrical Unwinding; RNA-Induced Silencing Complex (RISC); RISC-mRNA interactions with the mRNA (5’ AAAAAAAAAAAAAA 3’); outcomes: Translational Inhibition, mRNA sequestration, mRNA degradation]

Page 53:

microRNAs are small

[Histogram: proportion of small RNAs (0% to 60%) against length of small RNAs in the databases (15 to 35); series: miRNAs, piRNAs]

Page 54:

Matches to the Genome

[Two bar charts: percentage (0% to 100%) for 17mers to 27mers; one panel is labelled "1 colourspace mismatch"]

Page 55:

IsomiRs are common

[Plot: number of identical reads along the reference sequence (5’ to 3’), covering the miRNA and its extension]

Red = reads that start from a different location than the Sanger reference

Blue = reads that start at the same location as the Sanger reference

31 miRNAs show the most abundant version starting from a different location than the Sanger reference

Page 56:

Optimizing small RNA mapping

Refining the reference set

Optimizing the mismatches

miR-17-5p : CAAAGUGCUUACAGUGCAGGUAGU
miR-20    : UAAAGUGCUUAUAGUGCAGGUAG-
miR-106a  : AAAAGUGCUUACAGUGCAGGUAGC
miR-106b  : UAAAGUGCUGACAGUGCAGAU---
miR-93    : -AAAGUGCUGUUCGUGCAGGUAG-
miR-18    : UAAGGUGCAUCUAGUGCAGAUA--

miR-19a   : UGUGCAAAUCUAUGCAAAACUGA-
miR-19b-1 : UGUGCAAAUCCAUGCAAAACUGA-
miR-19b-2 : UGUGCAAAUCCAUGCAAAACUGA-
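The miR-19 sequences on this slide show why both the reference set and the mismatch allowance matter. The sketch below is an illustration, not miRNA-MATE code: it collapses entries with identical mature sequences into one reference entry and then counts substitutions between the remaining entries. The helper names are hypothetical.

```python
from itertools import combinations

MATURE = {
    "miR-19a": "UGUGCAAAUCUAUGCAAAACUGA",
    "miR-19b-1": "UGUGCAAAUCCAUGCAAAACUGA",
    "miR-19b-2": "UGUGCAAAUCCAUGCAAAACUGA",
}


def collapse_identical(reference):
    """Merge names that share exactly the same sequence into one entry."""
    groups = {}
    for name, seq in reference.items():
        groups.setdefault(seq, []).append(name)
    return {"/".join(names): seq for seq, names in groups.items()}


def mismatches(a, b):
    """Substitutions between two sequences over their shared length."""
    return sum(x != y for x, y in zip(a, b))


refined = collapse_identical(MATURE)

if __name__ == "__main__":
    for (n1, s1), (n2, s2) in combinations(refined.items(), 2):
        print(n1, n2, mismatches(s1, s2))
```

miR-19b-1 and miR-19b-2 cannot be told apart from their mature sequence, and miR-19a differs from them at a single position, so allowing even one mismatch makes these family members ambiguous.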

Page 57:

Optimizing small RNA mapping

Refining the reference set

Optimizing the matching strategy

Optimizing the matching lengths

Optimizing the mismatches

[Chart: number of matches (0 to 120) against length of tag when matching (15 to 30)]

Page 58:

Optimizing small RNA mapping

Refining the reference set

Optimizing the matching strategy

Optimizing the matching lengths

Optimizing the mismatches

Filter spurious mappings

Page 59:

Comparisons with other platforms

[Two correlation plots: r = 0.81 and r = 0.80]

Page 60:

Recursive or “vector stripping”

miRNA-MATE v1.0

[Flowchart: Start; Read Configuration File; Decode Barcodes; then two routes.
Recursive: Custom Library Alignment; tag aligned? (No: Trim Tag; Yes: 3 Count miRNAs); End.
“Vector stripping”: 4 Identify Adaptor; Custom Library Alignment; tag aligned? (No: Discard Tag; Yes: 6 Translate to base-space); 7 Summarize isomiR usage; 8 Create Sequence Logos; End]
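The "translate to base-space" step relies on the standard two-base colour encoding, in which each colour is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The decoder below is a minimal sketch of that encoding and not miRNA-MATE's implementation. It assumes the read starts with a known primer base followed by colour digits.

```python
BASES = "ACGT"


def decode_colourspace(read: str) -> str:
    """Decode a read written as one primer base followed by colour digits 0-3."""
    current = BASES.index(read[0])
    decoded = []
    for colour in read[1:]:
        current ^= int(colour)  # colour = XOR of adjacent 2-bit base codes
        decoded.append(BASES[current])
    return "".join(decoded)


if __name__ == "__main__":
    print(decode_colourspace("T0102"))
```

Each decoded base depends on every colour before it, so a single colour error changes all downstream bases. That is a reason to align in colour-space and translate only tags that have already been matched.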

Page 61:

miRNA-MATE v1.0 (recursive output)

Page 62:

Reference sequences

• miR and miR* sequences
• miRBase (http://microrna.sanger.ac.uk/)
• The “dominant” miRNA appearing in the databases is determined to be the “functional” miRNA, and the other strand is “a non-functional by product”. (Junk RNA – sound familiar?)
• The miR and miR* sequences can change

Page 63:

Recursive or “vector stripping”

miRNA-MATE v1.0

[Flowchart: Start; Read Configuration File; Decode Barcodes; then two routes.
Recursive: Custom Library Alignment; tag aligned? (No: Trim Tag; Yes: 3 Count miRNAs); End.
“Vector stripping”: 4 Identify Adaptor; Custom Library Alignment; tag aligned? (No: Discard Tag; Yes: 6 Translate to base-space); 7 Summarize isomiR usage; 8 Create Sequence Logos; End]

Page 64:

Adaptor Identification

T010202100202312312333020XXXXXXXXXXXXXX

[Diagram labels on the colour-space read: SREK captured small RNA; transition base; “adaptor sequence”; transition base (cleaved). The adaptor start 33020XXXXXXXXXXXXXX is aligned against the read]

Tags are matched against a reference set of miRNAs that are not ambiguous.
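Adaptor identification on the read shown here can be sketched as a search for the adaptor's leading colours. This is an illustration under assumptions, not miRNA-MATE's method: it treats 33020 as the adaptor signature because that is what the slide aligns against the read, requires an exact match, and does not decide whether the first colour of the signature is the transition colour. `split_at_adaptor` and `min_insert` are hypothetical.

```python
def split_at_adaptor(read, adaptor_start, min_insert=15):
    """Split a colour-space read into (insert colours, adaptor remainder).

    The leading primer base is dropped. Returns None if the adaptor
    signature is not found at or after `min_insert` colours.
    """
    colours = read[1:]
    idx = colours.find(adaptor_start, min_insert)
    if idx == -1:
        return None
    return colours[:idx], colours[idx:]


if __name__ == "__main__":
    slide_read = "T010202100202312312333020XXXXXXXXXXXXXX"
    print(split_at_adaptor(slide_read, "33020"))
```

The insert colours recovered this way would then be matched against the unambiguous miRNA reference set described on the slide.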

Page 65:

Correlation with recursive mapping

r = 0.94

Page 66:

miRNA-MATE v1.0 (isomiR output)

Page 67:

Tissue specific isomiRism

[Panels: Brain, Ovary; hsa-miR-181]

Could be important to know about this for qRT-PCR validation

Changes in the start site could change the “seed” region.

Page 69:

Conclusions

The field is in its infancy; not all challenges have been solved. We need more mathematical and statistical input!

RNAseq is a powerful way to increase the sensitivity and usefulness of global gene expression surveys.

Be cautious with your analysis. Think through and plan your analysis before you get into the lab.

Page 70:

I want a Nature paper

[Picture equation using +, = and ≠]

Rubbish in, rubbish out.

Page 71:

Medical Genomics

Page 72:

The End

Expression Genomics Laboratory
http://grimmond.imb.uq.edu.au