Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data

Alex Zelikovsky Department of Computer Science

Georgia State University

Joint work with Serghei Mangul, Sahar Al Seesi, Adrian Caciula, Dumitru Brinza, Ion Mandoiu

Advances in Next Generation Sequencing

http://www.economist.com/node/16349358

Roche/454 FLX Titanium: 400-600 million bases/run, 400bp avg. length

Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length

SOLiD 4/5500: 1.4-2.4 billion PE reads/run, 35-50bp read length

Ion Proton Sequencer

RNA-Seq

[Figure: RNA-Seq workflow on a gene with exons A-E. Steps: make cDNA and shatter into fragments; sequence fragment ends; map reads. Downstream analyses: gene expression, transcriptome reconstruction (transcripts A-B-C, A-C, D-E), and isoform expression.]

Transcriptome Assembly

• Given partial or incomplete information (short reads sampled from the transcripts), make an informed guess about the missing data: the full-length transcripts that produced the reads.

Transcriptome Assembly Types

• Genome-independent reconstruction (de novo)
– de Bruijn k-mer graph

• Genome-guided reconstruction (ab initio)
– Spliced read mapping
– Exon identification
– Splice graph

• Annotation-guided reconstruction
– Use existing annotation (known transcripts)
– Focus on discovering novel transcripts
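
The de Bruijn k-mer graph named in the first bullet can be illustrated with a minimal sketch. It assumes error-free reads given as plain strings; the function name `de_bruijn_graph` is hypothetical and is not taken from any of the assemblers discussed in this talk.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, one edge per k-mer occurrence.

    Returns a dict mapping each (k-1)-mer prefix to the list of (k-1)-mer
    suffixes that follow it in some read (duplicates kept as multiplicity).
    """
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    print(de_bruijn_graph(["ACGTC", "CGTCA"], k=3))
```

A de novo assembler would then look for paths through this graph; that step is omitted here.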

Previous approaches

• Genome-independent reconstruction
– Trinity (2011), Velvet (2008), TransABySS (2008)

• Genome-guided reconstruction
– Scripture (2010): reports “all” transcripts
– Cufflinks (2010), IsoLasso (2011), SLIDE (2012), CLIIQ (2012), TRIP (2012), Traph (2013): minimize the set of transcripts explaining the reads

• Annotation-guided reconstruction
– RABT (2011), DRUT (2011)

Gene representation

• Pseudo-exons - regions of a gene between consecutive transcriptional or splicing events

• Gene - set of non-overlapping pseudo-exons

[Figure: exons e1-e6 of three transcripts (Tr1, Tr2, Tr3) partitioned into pseudo-exons pse1-pse7; Spse_i and Epse_i mark the start and end of pseudo-exon i.]
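
The pseudo-exon definition above can be sketched as follows. This is an illustration under stated assumptions: exons are half-open `(start, end)` intervals, every exon boundary counts as a transcriptional or splicing event, and the function name `pseudo_exons` is hypothetical.

```python
def pseudo_exons(transcripts):
    """Split a gene into non-overlapping pseudo-exons.

    transcripts: list of transcripts, each a list of (start, end) exons
    given as half-open intervals on the genome.
    """
    exons = [exon for transcript in transcripts for exon in transcript]
    # Every exon start or end is a potential cut point.
    cuts = sorted({pos for start, end in exons for pos in (start, end)})
    result = []
    for a, b in zip(cuts, cuts[1:]):
        # Keep the segment only if some exon covers it; otherwise it is intronic.
        if any(start <= a and b <= end for start, end in exons):
            result.append((a, b))
    return result


if __name__ == "__main__":
    tr1 = [(0, 10), (20, 30)]
    tr2 = [(0, 10), (25, 30)]
    print(pseudo_exons([tr1, tr2]))  # [(0, 10), (20, 25), (25, 30)]
```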

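Splice Graph

[Figure: splice graph over the genome, with pseudo-exons 1-9 as vertices between the transcription start site (TSS) and the transcription end site (TES).]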

MaLTA: Maximum Likelihood Transcriptome Assembly

• Map the RNA-Seq reads to the genome

• Construct splice graph G(V,E)
– V: exons
– E: splicing events

• Candidate transcripts
– depth-first search (DFS)

• Select candidate transcripts
– IsoEM
– greedy algorithm
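
The splice-graph and candidate-enumeration steps above can be sketched as below. This is not the MaLTA implementation: it assumes the splice graph is acyclic, that the splicing events are already available as `(from_exon, to_exon)` pairs, and that the start and end vertices are known; all function names are hypothetical.

```python
from collections import defaultdict


def build_splice_graph(splice_events):
    """Adjacency lists for G(V, E): vertices are exons, edges are splicing events."""
    graph = defaultdict(list)
    for u, v in splice_events:
        graph[u].append(v)
    return graph


def candidate_transcripts(graph, sources, sinks):
    """Enumerate every source-to-sink path by depth-first search (assumes a DAG)."""
    sinks = set(sinks)
    paths = []

    def dfs(node, path):
        path.append(node)
        if node in sinks:
            paths.append(tuple(path))
        # .get avoids creating empty entries in the defaultdict.
        for nxt in graph.get(node, ()):
            dfs(nxt, path)
        path.pop()

    for source in sources:
        dfs(source, [])
    return paths


if __name__ == "__main__":
    g = build_splice_graph([(1, 2), (2, 4), (1, 3), (3, 4)])
    print(candidate_transcripts(g, sources=[1], sinks=[4]))
```

The number of paths can grow exponentially with the number of alternative splicing events, which is one reason a selection step follows.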

How to select?

• Select the smallest set of candidate transcripts covering all transcript variants

Transcript: a set of transcript variants

Sharmistha Pal, Ravi Gupta, Hyunsoo Kim, et al., Alternative transcription exceeds alternative splicing in generating the transcriptome diversity of cerebellar development, Genome Res. 2011 21: 1260-1272

Transcript variant types: alternative first exon, alternative last exon, exon skipping, intron retention, alternative 5' splice junction, alternative 3' splice junction, splice junction

IsoEM: Isoform Expression Level Estimation

• Expectation-Maximization algorithm

• Unified probabilistic model incorporating
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores
– Repeat and hexamer bias correction
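
The expectation-maximization idea can be illustrated with a simplified sketch. It is not the IsoEM code: it assumes each read already comes with a compatibility weight per isoform and that an effective length is known for each isoform, and it leaves out the bias corrections listed above. The function and parameter names are hypothetical.

```python
def em_isoform_frequencies(read_weights, eff_lengths, iterations=200):
    """Estimate relative isoform frequencies with a basic EM loop.

    read_weights: list with one dict per read, mapping isoform -> weight (>= 0).
    eff_lengths:  dict mapping isoform -> effective length (> 0).
    """
    isoforms = list(eff_lengths)
    freq = {i: 1.0 / len(isoforms) for i in isoforms}
    for _ in range(iterations):
        counts = {i: 0.0 for i in isoforms}
        # E-step: split each read among its compatible isoforms.
        for weights in read_weights:
            total = sum(freq[i] * w for i, w in weights.items())
            if total == 0.0:
                continue
            for i, w in weights.items():
                counts[i] += freq[i] * w / total
        # M-step: normalize expected read counts by effective length.
        density = {i: counts[i] / eff_lengths[i] for i in isoforms}
        norm = sum(density.values())
        if norm == 0.0:
            break
        freq = {i: density[i] / norm for i in isoforms}
    return freq


if __name__ == "__main__":
    reads = [{"ABC": 1.0}, {"ABC": 1.0, "AC": 1.0}, {"AC": 1.0}]
    print(em_isoform_frequencies(reads, {"ABC": 300, "AC": 200}))
```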

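Read-isoform compatibility graph

Weight of the edge between read r and isoform i, summed over the alignments a of r to i:

$w_{r,i} = \sum_{a} O_a Q_a F_a$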

Fragment length distribution

[Figure: isoforms A-B-C and A-C with fragment lengths i and j, and plots of the fragment length distribution giving F_a(i) and F_a(j).]

Greedy algorithm

1. Sort transcripts by inferred IsoEM expression levels in decreasing order

2. Traverse transcripts
– Select a transcript if it contains a novel transcript variant
– Continue traversing until all transcript variants are covered
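
The two steps above can be sketched as below. This is an illustration, not the authors' code: it assumes expression levels have already been estimated (for example by IsoEM) and that the transcript variants contained in each candidate are known. The names `greedy_select`, `expression`, and `variants` are hypothetical.

```python
def greedy_select(expression, variants):
    """Pick transcripts in decreasing expression order until all variants are covered.

    expression: dict mapping transcript -> estimated expression level.
    variants:   dict mapping transcript -> set of transcript variants it contains.
    """
    uncovered = set().union(*variants.values())
    selected = []
    # Step 1: sort by expression level, highest first.
    for transcript in sorted(expression, key=expression.get, reverse=True):
        if not uncovered:
            break  # all transcript variants are covered
        # Step 2: keep the transcript only if it contributes a novel variant.
        if variants.get(transcript, set()) & uncovered:
            selected.append(transcript)
            uncovered -= variants[transcript]
    return selected


if __name__ == "__main__":
    expression = {"T1": 10.0, "T2": 5.0, "T3": 1.0}
    variants = {"T1": {"a", "b"}, "T2": {"b"}, "T3": {"c"}}
    print(greedy_select(expression, variants))  # ['T1', 'T3']
```

Like any greedy set cover, this does not guarantee the smallest covering set; it favors highly expressed transcripts.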

Greedy algorithm

[Figure: transcript variants and transcripts sorted by expression levels.]

STOP. All transcript variants are covered.

MaLTA results on GOG-350 dataset

• 4.5M single Ion reads with average read length 121 bp, aligned using TopHat2

• Number of assembled transcripts
– MaLTA: 15385
– Cufflinks: 17378

• Number of transcripts matching annotations
– MaLTA: 4555 (26%)
– Cufflinks: 2031 (13%)

Expression Estimation on Ion Torrent reads

[Figure: bar chart of R² for IsoEM/Cufflinks estimates vs qPCR, with bars for IsoEM HBR, Cufflinks HBR, IsoEM UHR, and Cufflinks UHR; y-axis from 0 to 0.8.]

• Squared correlation
– IsoEM / Cufflinks FPKMs vs qPCR values for 800 genes
– 2 MAQC samples: Human Brain Reference (HBR) and Universal Human Reference (UHR)
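
The squared correlation used in this comparison can be computed as in the sketch below. It assumes R² means the squared Pearson correlation between per-gene estimates and qPCR values; whether the values were log-transformed first is not stated in the slides, so no transform is applied. The function name is hypothetical.

```python
def squared_correlation(estimates, reference):
    """Squared Pearson correlation between two equal-length lists of numbers."""
    n = len(estimates)
    mean_x = sum(estimates) / n
    mean_y = sum(reference) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(estimates, reference))
    sxx = sum((x - mean_x) ** 2 for x in estimates)
    syy = sum((y - mean_y) ** 2 for y in reference)
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    fpkm = [1.0, 2.0, 3.0, 4.0]   # made-up values, for illustration only
    qpcr = [2.1, 3.9, 6.2, 7.8]   # made-up values, for illustration only
    print(squared_correlation(fpkm, qpcr))
```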

Conclusions

• Novel method for transcriptome assembly

• Validated on Ion Torrent RNA-Seq data

• Compared with Cufflinks:
– similar number of assembled transcripts
– 2x more previously annotated transcripts

• Transcript quantification is useful for transcript assembly. Better quantification?
