Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome...

75
Next Generation Genome Annotation Gunnar R¨ atsch Friedrich Miescher Laboratory of the Max Planck Society ubingen, Germany November 10, 2009 The Wellcome Trust Genome Campus, Hinxton

Transcript of Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome...

Page 1: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Next Generation Genome Annotation

Gunnar Ratsch

Friedrich Miescher Laboratoryof the Max Planck Society

Tubingen, Germany

November 10, 2009The Wellcome Trust Genome Campus, Hinxton

Page 2: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

RGASP Overview (Tubingen) fml

Every path corresponds to one submission

GenomeMapper/QPALMA

Read alignment

mGene.ngs

mTim segmentation

Transcript finding

rQuantSplice graph building

Transcript filtering

Quantitation/Filtering

Short Reads

Submission

Alignments

Transcripts

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 1 / 17

Page 3: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Read Alignment – GenomeMapper/QPALMA fmlGenomeMapper for (unspliced) read mapping:

I Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project (Schneeberger et al., 2009a)

I k-mer based index, well suited for smaller genomes with manymismatches/gaps

QPALMA for spliced read alignments:

I GenomeMapper identifies seed regions for spliced alignments

I Alignments are performed using QPALMA (De Bona et al., 2008)

I QPALMA is individually adapted to every SR dataset

Web server available at http://galaxy.tuebingen.mpg.de.

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 2 / 17

Page 4: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Read Alignment – GenomeMapper/QPALMA fmlGenomeMapper for (unspliced) read mapping:

I Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project (Schneeberger et al., 2009a)

I k-mer based index, well suited for smaller genomes with manymismatches/gaps

QPALMA for spliced read alignments:

I GenomeMapper identifies seed regions for spliced alignments

I Alignments are performed using QPALMA (De Bona et al., 2008)

I QPALMA is individually adapted to every SR dataset

Web server available at http://galaxy.tuebingen.mpg.de.

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 2 / 17

Page 5: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Read Alignment – GenomeMapper/QPALMA fmlGenomeMapper for (unspliced) read mapping:

I Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project (Schneeberger et al., 2009a)

I k-mer based index, well suited for smaller genomes with manymismatches/gaps

QPALMA for spliced read alignments:

I GenomeMapper identifies seed regions for spliced alignments

I Alignments are performed using QPALMA (De Bona et al., 2008)

I QPALMA is individually adapted to every SR dataset

Web server available at http://galaxy.tuebingen.mpg.de.

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 2 / 17

Page 6: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Read Alignment – QPALMA fml

Classical scoring f : Σ× Σ→ R

Source of informationI Sequence matches

I Computational splicesite predictions

I Intron length model

I Read qualityinformation

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 3 / 17

Page 7: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Read Alignment – QPALMA fml

Classical scoring f : Σ× Σ→ R

Source of informationI Sequence matches

I Computational splicesite predictions

I Intron length model

I Read qualityinformation

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 3 / 17

Page 8: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Read Alignment – QPALMA fml

Classical scoring f : Σ× Σ→ R

Source of informationI Sequence matches

I Computational splicesite predictions

I Intron length model

I Read qualityinformation

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 3 / 17

Page 9: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Read Alignment – QPALMA fml

Quality scoring f : (Σ× R)× Σ→ R (De Bona et al., 2008)

Source of informationI Sequence matches

I Computational splicesite predictions

I Intron length model

I Read qualityinformation

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 3 / 17

Page 10: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

RNA-Seq Read Alignment – QPALMA fmlGenerate set of artificially spliced reads

I Genomic reads with quality informationI Genome annotation for artificially splicing the readsI Use 10, 000 reads for training and 30, 000 for testing

Alig

nmen

t Er

ror

Rat

e

SmithW14.19%

Intron9.96% 1.94% 1.78%

Intron+Splice

Intron+Splice+Quality (De Bona et al., 2008)

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 4 / 17

Page 11: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

RNA-Seq Read Alignment – QPALMA fmlGenerate set of artificially spliced reads

I Genomic reads with quality informationI Genome annotation for artificially splicing the readsI Use 10, 000 reads for training and 30, 000 for testing

Alig

nmen

t Er

ror

Rat

e

SmithW14.19%

Intron9.96% 1.94% 1.78%

Intron+Splice

Intron+Splice+Quality

Error vs. intron position

(De Bona et al., 2008)

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 4 / 17

Page 12: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

RNA-Seq Read Alignment – QPALMA fmlGenerate set of artificially spliced reads

I Genomic reads with quality informationI Genome annotation for artificially splicing the readsI Use 10, 000 reads for training and 30, 000 for testing

Alig

nmen

t Er

ror

Rat

e

SmithW14.19%

Intron9.96% 1.94% 1.78%

Intron+Splice

Intron+Splice+Quality

8nt12nt12nt8ntError vs. intron position

(De Bona et al., 2008)

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 4 / 17

Page 13: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Step 2: Transcript Prediction fml

1. Extension of mGene gene finding system to use NGS data forprotein coding transcript prediction

2. Coverage segmentation algorithm mTIM for general transcripts(no coding bias/assumption)

3. Splice graph construction by extending splice graph with splicedreads (connecting exons)

Approaches 1 & 2 use read coverages and spliced reads.Approach 3 uses existing transcripts and spliced reads.

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 5 / 17

Page 14: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

mGene-based Transcript Prediction I fml

acc

don

tss

tis

stop

True gene model 2 3 4 5

STEP 1: SVM Signal Predictions

genomic position

genomic position

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 6 / 17

Page 15: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

mGene-based Transcript Prediction I fml

acc

don

tss

tis

stop

True gene model 2 3 4 5

F(x,y)

transform features

STEP 1: SVM Signal Predictions

STEP 2: Integration

genomic position

genomic position

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 6 / 17

Page 16: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

mGene-based Transcript Prediction I fml

acc

don

tss

tis

stop

True gene model 2 3 4 5

Wrong gene model

large margin

F(x,y)

transform features

STEP 1: SVM Signal Predictions

STEP 2: Integration

genomic position

genomic position

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 6 / 17

Page 17: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

mGene-based Transcript Prediction I fml

acc

don

tss

tis

stop

True gene model 2 3 4 5

Wrong gene model

large margin

F(x,y)

transform features

STEP 1: SVM Signal Predictions

STEP 2: Integration

Tiling Array Data

genomic position

genomic position

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 6 / 17

Page 18: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

mGene-based Transcript Prediction II fmlmGene with RNA-Seq (Behr et al., unpublished; Schweikert et al., 2009a,b)

I Use transcriptome measurements to enhance recognition ofexonic regions

Results for A. thaliana: (Comparison with known gene models)

transcript level (SN + SP)/2

1. mGene (ab initio) . . . 73.3%

2. . . . with tiling arrays (11 tissues) 82.1%

3. . . . with mRNA-seq (1 tissue) 81.1%

Similar observations for RGASP predictions.

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 7 / 17

Page 19: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

mGene-based Transcript Prediction II fmlmGene with RNA-Seq (Behr et al., unpublished; Schweikert et al., 2009a,b)

I Use transcriptome measurements to enhance recognition ofexonic regions

Results for A. thaliana: (Comparison with known gene models)

transcript level (SN + SP)/2

1. mGene (ab initio) . . . 73.3%

2. . . . with tiling arrays (11 tissues) 82.1%

3. . . . with mRNA-seq (1 tissue) 81.1%

Similar observations for RGASP predictions.

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 7 / 17

Page 20: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

mGene-based Transcript Prediction II fmlmGene with RNA-Seq (Behr et al., unpublished; Schweikert et al., 2009a,b)

I Use transcriptome measurements to enhance recognition ofexonic regions

Results for A. thaliana: (Comparison with known gene models)

transcript level (SN + SP)/2

1. mGene (ab initio) . . . 73.3%

2. . . . with tiling arrays (11 tissues) 82.1%

3. . . . with mRNA-seq (1 tissue) 81.1%

Similar observations for RGASP predictions.

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 7 / 17

Page 21: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

mGene-based Transcript Prediction II fmlmGene with RNA-Seq (Behr et al., unpublished; Schweikert et al., 2009a,b)

I Use transcriptome measurements to enhance recognition ofexonic regions

Results for A. thaliana: (Comparison with known gene models)

transcript level (SN + SP)/2

1. mGene (ab initio) . . . 73.3%

2. . . . with tiling arrays (11 tissues) 82.1%

3. . . . with mRNA-seq (1 tissue) 81.1%

Similar observations for RGASP predictions.

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 7 / 17

Page 22: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Tiling Array/Read Coverage Segmentationfml

Goal: Characterize each “probe” as either intergenic, exonic or intronic

0

5

10

Log-

inte

nsity

observed intensity

annotated exonic

annotated intronic

annotated intergenic

Novel segmentation method (“mSTAD”/“mTIM”)

I accounts for spliced transcripts

I provides very accurate predictions

(Zeller et al., 2008a)Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 8 / 17

Page 23: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Tiling Array/Read Coverage Segmentationfml

Goal: Characterize each “probe” as either intergenic, exonic or intronic

0

5

10

Log-

inte

nsity

observed intensity

annotated exonic

annotated intronic

annotated intergenic

Novel segmentation method (“mSTAD”/“mTIM”)

I accounts for spliced transcripts

I provides very accurate predictions

(Zeller et al., 2008a)Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 8 / 17

Page 24: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Tiling Array/Read Coverage Segmentationfml

Goal: Characterize each “probe” as either intergenic, exonic or intronic

0

5

10

Log-

inte

nsity

observed intensity

annotated exonic

annotated intronic

annotated intergenic

transcript

Novel segmentation method (“mSTAD”/“mTIM”)

I accounts for spliced transcripts

I provides very accurate predictions

(Zeller et al., 2008a)Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 8 / 17

Page 25: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Tiling Array/Read Coverage Segmentationfml

Goal: Characterize each “probe” as either intergenic, exonic or intronic

0

5

10

Log-

inte

nsity

observed intensity

annotated exonic

annotated intronic

annotated intergenic

transcript

Novel segmentation method (“mSTAD”/“mTIM”)

I accounts for spliced transcripts

I provides very accurate predictions (Zeller et al., 2008a)Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 8 / 17

Page 26: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

The mSTAD/mTIM Approachfml

intergenic exonic intronic

E IS

I Learn to associate a state with each probe given its hybridizationsignal and local context

I For mTIM: also score spliced reads and splice sites

I HM-SVM training: Optimize transformations: signal → score

G. Zeller et al., Pac. Symp. Biocomp. (2008)Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17

Page 27: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

The mSTAD/mTIM Approachfml

intergenic exonic intronic

E I

0

0.2

0.4

0.6

0.8

1

5 6 7 8 9 10 11 12 13 14

hybridization signal

0

0.2

0.4

5 6 7 8 9 10 11 12 13 14

hybridization signal

0

0.5

1

feat

ure

scor

e

5 6 7 8 9 10 11 12 13 14

hybridization signal

S

I Learn to associate a state with each probe given its hybridizationsignal and local context

I For mTIM: also score spliced reads and splice sites

I HM-SVM training: Optimize transformations: signal → score

G. Zeller et al., Pac. Symp. Biocomp. (2008)Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17

Page 28: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

The mSTAD/mTIM Approachfml

expression

1

2

Q

. . .

intergenic exonic intronic

S

EQ

E2

E1

IQ

I2

I1

. . .

. . .

I Learn to associate a state with each probe given its hybridizationsignal and local context

I For mTIM: also score spliced reads and splice sites

I HM-SVM training: Optimize transformations: signal → score

G. Zeller et al., Pac. Symp. Biocomp. (2008)Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17

Page 29: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Digestion . . . fml

I mGene and mTIM predict single transcripts (no alternativetranscripts)

I mGene uses more assumptions on structure of transcripts

I mTIM exploits “uniformity” read coverage among exons of sametranscript

I Spliced reads used to generate a more complete splicing graph

I Paths through splicing graph define transcripts for quantitation

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 10 / 17

Page 30: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Digestion . . . fml

I mGene and mTIM predict single transcripts (no alternativetranscripts)

I mGene uses more assumptions on structure of transcripts

I mTIM exploits “uniformity” read coverage among exons of sametranscript

I Spliced reads used to generate a more complete splicing graph

I Paths through splicing graph define transcripts for quantitation

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 10 / 17

Page 31: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Digestion . . . fml

I mGene and mTIM predict single transcripts (no alternativetranscripts)

I mGene uses more assumptions on structure of transcripts

I mTIM exploits “uniformity” read coverage among exons of sametranscript

I Spliced reads used to generate a more complete splicing graph

I Paths through splicing graph define transcripts for quantitation

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 10 / 17

Page 32: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

RNA-Seq Biases and Quantitation fml

Sequencer short sequence reads

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

mRNA

fragmentation

sequence library

RT

fragmen-tation

RT

exonic reads

genome

transcripts

junction reads

aligned reads

Biases due to . . .I cDNA library construction

I Sequencing

I Read mapping

relative transcript position 5’ -> 3’

read

cov

erag

e

0

20

40

60

80

(average over annotated

transcripts of length ≈1kb for the

C. elegans SRX001872 dataset)

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 11 / 17

Page 33: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

RNA-Seq Biases and Quantitation fml

Sequencer short sequence reads

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

mRNA

fragmentation

sequence library

RT

fragmen-tation

RT

exonic reads

genome

transcripts

junction reads

aligned reads

Biases due to . . .I cDNA library construction

I Sequencing

I Read mapping

relative transcript position 5’ -> 3’

read

cov

erag

e 0

20

40

60

80

(average over annotated

transcripts of length ≈1kb for the

C. elegans SRX001872 dataset)

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 11 / 17

Page 34: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

rQuant – Basic Idea fml

A

read

cov

erag

e

Short transcript

relative transcript position 5’ -> 3’

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri)

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 12 / 17

Page 35: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

rQuant – Basic Idea fml

A

B

read

cov

erag

e

Short transcript

relative transcript position 5’ -> 3’

read

cov

erag

e

Long transcript

relative transcript position 5’ -> 3’

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri)

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 12 / 17

Page 36: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

rQuant – Basic Idea fmlre

ad c

over

age

genome position 5’ -> 3’

Short transcript

read

cov

erag

e

genome position 5’ -> 3’

Long transcript

A

B

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri)

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 12 / 17

Page 37: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

rQuant – Basic Idea fml

read

cov

erag

e

genome position 5’ -> 3’

Mixture of transcripts

read

cov

erag

e

genome position 5’ -> 3’

Short transcript

read

cov

erag

e

genome position 5’ -> 3’

Long transcript

MA

B

Mi = wAAi + wBBi

⇒ minwA,wB

∑i ` (Mi , Ri)

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 12 / 17

Page 38: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

rQuant – Basic Idea fml

read

cov

erag

e

genome position 5’ -> 3’

Mixture of transcripts

expected

observed

read

cov

erag

e

genome position 5’ -> 3’

Short transcript

read

cov

erag

e

genome position 5’ -> 3’

Long transcript

MA

B R

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri)

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 12 / 17

Page 39: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

rQuant – Iterative Algorithm fml1. Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)

2. Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3. Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

99,960 100,368 100,776 101,184 101,592

70

60

50

40

30

20

10

0

read

cou

nts

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 13 / 17

Page 40: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

rQuant – Iterative Algorithm fml1. Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)

2. Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3. Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

99,960 100,368 100,776 101,184 101,592

27.20

70

60

50

40

30

20

10

0

read

cou

nts

27.20

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 13 / 17

Page 41: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

rQuant – Iterative Algorithm fml1. Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)

2. Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3. Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

99,960 100,368 100,776 101,184 101,592

4.80

27.20

4.80

70

60

50

40

30

20

10

0

read

cou

nts

27.20

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 13 / 17

Page 42: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

rQuant – Iterative Algorithm fml1. Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)

2. Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3. Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

AT1G01240.3 +1,941

99,960 100,368 100,776 101,184 101,592

6.00

4.80

27.20

6.004.80

70

60

50

40

30

20

10

0

read

cou

nts

27.20

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 13 / 17

Page 43: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

rQuant – Iterative Algorithm fml1. Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)

2. Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3. Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

AT1G01240.3 +1,941

99,960 100,368 100,776 101,184 101,592

6.00

4.80

27.20

6.004.80

70

60

50

40

30

20

10

0

read

cou

nts

27.20

38.00

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 13 / 17

Page 44: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

rQuant – Iterative Algorithm fml1. Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)2. Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)

3. Repeat 1. and 2. until convergence.

distance to transcript start/end

pro!

le w

eigh

t

0 50 200 450 800 1250 1800 2450 "2450 "1800 "1250 "800 "450 "200 "50 0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

AT1G01240.3 +1,941

99,960 100,368 100,776 101,184 101,592

6.00

4.80

27.20

6.004.80

70

60

50

40

30

20

10

0

read

cou

nts

27.20

38.00

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 13 / 17

Page 45: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

rQuant – Iterative Algorithm fml1. Optimise transcript weights: minw

∑i `

(∑t w (t)p

(t)i , Ri

)2. Optimise profile weights: minp

∑i `

(∑t w (t)p

(t)i , Ri

)3. Repeat 1. and 2. until convergence.

distance to transcript start/end

pro!

le w

eigh

t

0 50 200 450 800 1250 1800 2450 "2450 "1800 "1250 "800 "450 "200 "50 0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

AT1G01240.3 +1,941

99,960 100,368 100,776 101,184 101,592

4.95

8.35

23.99

70

60

50

40

30

20

10

0

read

cou

nts

8.354.95

23.99

37.29

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 13 / 17

Page 46: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Preliminary Evaluation I fmlCDS (precision+recall)/2

expression percentiles [%]

mTiMmGene ab initiomGene with RNA-Seq

10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 14 / 17

Page 47: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Preliminary Evaluation II fmlCDS (precision+recall)/2

10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

expression percentiles [%]

mGene w/o graph, w/o !ltermGene w/o graph, w/ !lter

mGene w/ graph, w/o !ltermGene w/ graph, w/ !lter

mTiM w/o graph, w/o !ltermTiM w/o graph, w/ !lter

mTiM w/ graph, w/o !ltermTiM w/ graph, w/ !lter

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 15 / 17

Page 48: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Conclusions fmlI GenomeMapper/QPalma

I Splice site predictions improve alignment performanceI Integrating QPALMA scoring into other read mappers promising

I mGeneI Higher recallI Identifies also non-expressed genes ⇒ good for annotation

I mTIMI Higher precisionI Better for identifying transcripts specific to experimental data

I Adding alternative transcripts increases recall

I rQuant-based filtering improves precision

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 16 / 17

Page 49: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Conclusions fmlI GenomeMapper/QPalma

I Splice site predictions improve alignment performanceI Integrating QPALMA scoring into other read mappers promising

I mGeneI Higher recallI Identifies also non-expressed genes ⇒ good for annotation

I mTIMI Higher precisionI Better for identifying transcripts specific to experimental data

I Adding alternative transcripts increases recall

I rQuant-based filtering improves precision

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 16 / 17

Page 50: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Conclusions fmlI GenomeMapper/QPalma

I Splice site predictions improve alignment performanceI Integrating QPALMA scoring into other read mappers promising

I mGeneI Higher recallI Identifies also non-expressed genes ⇒ good for annotation

I mTIMI Higher precisionI Better for identifying transcripts specific to experimental data

I Adding alternative transcripts increases recall

I rQuant-based filtering improves precision

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 16 / 17

Page 51: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Conclusions fmlI GenomeMapper/QPalma

I Splice site predictions improve alignment performanceI Integrating QPALMA scoring into other read mappers promising

I mGeneI Higher recallI Identifies also non-expressed genes ⇒ good for annotation

I mTIMI Higher precisionI Better for identifying transcripts specific to experimental data

I Adding alternative transcripts increases recall

I rQuant-based filtering improves precision

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 16 / 17

Page 52: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Conclusions fmlI GenomeMapper/QPalma

I Splice site predictions improve alignment performanceI Integrating QPALMA scoring into other read mappers promising

I mGeneI Higher recallI Identifies also non-expressed genes ⇒ good for annotation

I mTIMI Higher precisionI Better for identifying transcripts specific to experimental data

I Adding alternative transcripts increases recall

I rQuant-based filtering improves precision

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 16 / 17

Page 53: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Acknowledgements fmlRGASP Team

I Jonas Behr (FML)

I Georg Zeller (FML & MPI)

I Regina Bohnert (FML)

Funding by DFG &Max Planck Society.

Thank you for your attention.

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 17 / 17

Page 54: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

References fmlFabio De Bona, Stephan Ossowski, Korbinian Schneeberger, and Gunnar Ratsch. Optimal spliced alignments of short sequence

reads. Bioinformatics (Oxford, England), 24(16):i174–180, August 2008.

Hui Jiang and Wing Hung Wong. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics, 25(8):1026–1032,April 2009.

Sam E V Linsen, Elzo de Wit, Georges Janssens, Sheila Heater, Laura Chapman, Rachael K Parkin, Brian Fritz, Stacia KWyman, Ewart de Bruijn, Emile E Voest, Scott Kuersten, Muneesh Tewari, and Edwin Cuppen. Limitations andpossibilities of small RNA digital gene expression profiling. Nature Methods, 6(7):474–476, July 2009.

M. Sammeth. The Flux Capacitor. Website, 2009a. http://flux.sammeth.net/capacitor.html.

M. Sammeth. The Flux Simulator. Website, 2009b. http://flux.sammeth.net/simulator.html.

Korbinian Schneeberger, Jorg Hagmann, Stephan Ossowski, Norman Warthmann, Sandra Gesing, Oliver Kohlbacher, and DetlefWeigel. Simultaneous alignment of short reads against multiple genomes. Genome Biol, 10(9):R98, Jan 2009a. doi:10.1186/gb-2009-10-9-r98. URL http://genomebiology.com/2009/10/9/R98.

Korbinian Schneeberger, Jorg Hagmann, Stephan Ossowski, Norman Warthmann, Sandra Gesing, Oliver Kohlbacher, and DetlefWeigel. Simultaneous alignment of short reads against multiple genomes. Genome Biology, 10(9):R98, 2009b.

Gabriele Schweikert, Jonas Behr, Alexander Zien, Georg Zeller, Cheng Soon Ong, Soren Sonnenburg, and Gunnar Ratsch.mGene.web: a web service for accurate computational gene finding. Nucleic Acids Research, 37(Web Server issue):W312W316, July 2009a.

Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio DeBona, Lisa Hartmann, Anja Bohlen, Nina Kruger, Soren Sonnenburg, and Gunnar Ratsch. mGene: accurate SVM-basedgene finding with an application to nematode genomes. Genome Research, September 2009b.

G. Zeller, S.R. Henz, S. Laubinger, D. Weigel, and G Ratsch. Transcript normalization and segmentation of tiling array data. InProceedings Pac. Symp. on Biocomputing, pages 527–538, 2008a.

Georg Zeller, Stefan R. Henz, Sascha Laubinger, Detlef Weigel, and Gunnar Ratsch. Transcript normalization and segmentationof tiling array data. In Proceedings Pac. Symp. on Biocomputing, pages 527–538, 2008b.

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 1

Page 55: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement

Method Comparisonfml

0 20 40 60 80 1000

20

40

60

80

100

mSTAD (HM-SVM)

Affymetrix TAS

Recall [%]

Prec

isio

n [%

]

Substantially improved exon probe recognitionover the most widely used “transfrag” method

www.fml.mpg.de/raetsch/research/tiling/Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 2

Page 56: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement

Method Comparisonfml

0 20 40 60 80 1000

20

40

60

80

100

mSTAD (HMM)mSTAD (HM-SVM)

Affymetrix TAS

Recall [%]

Prec

isio

n [%

]

Substantially improved exon probe recognitionover the most widely used “transfrag” method

www.fml.mpg.de/raetsch/research/tiling/Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 2

Page 57: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement

Priming and Fragmentation Biases fmlProfile: normalised positional read coverage along the transcript

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

mRNA

sequence library

RT

fragmen-tation

fragmentation

RT

Transcript profiles for different lengths

relative transcript position 5’ -> 3’

read

cov

erag

e 0

20

40

60

801.0 - 1.1 kb

RNA-Seq data (C. elegans SRX001872, R. Waterston Lab, University of Washington)

I Random priming

I Physical cDNA fragmentation

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 3

Page 58: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement

Priming and Fragmentation Biases fmlProfile: normalised positional read coverage along the transcript

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

mRNA

sequence library

RT

fragmen-tation

fragmentation

RT

Transcript profiles for different lengths

relative transcript position 5’ -> 3’

read

cov

erag

e 0

20

40

60

801.0 - 1.1 kb1.1 - 1.4 kb

RNA-Seq data (C. elegans SRX001872, R. Waterston Lab, University of Washington)

I Random priming

I Physical cDNA fragmentation

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 3

Page 59: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement

Priming and Fragmentation Biases fmlProfile: normalised positional read coverage along the transcript

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

mRNA

sequence library

RT

fragmen-tation

fragmentation

RT

Transcript profiles for different lengths

relative transcript position 5’ -> 3’

read

cov

erag

e 0

20

40

60

801.0 - 1.1 kb1.1 - 1.4 kb1.4 - 1.8 kb

RNA-Seq data (C. elegans SRX001872, R. Waterston Lab, University of Washington)

I Random priming

I Physical cDNA fragmentation

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 3

Page 60: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement

Priming and Fragmentation Biases fmlProfile: normalised positional read coverage along the transcript

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

mRNA

sequence library

RT

fragmen-tation

fragmentation

RT

Transcript profiles for different lengths

relative transcript position 5’ -> 3’

read

cov

erag

e 0

20

40

60

801.0 - 1.1 kb1.1 - 1.4 kb1.4 - 1.8 kb1.8 - 2.5 kb

RNA-Seq data (C. elegans SRX001872, R. Waterston Lab, University of Washington)

I Random priming

I Physical cDNA fragmentation

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 3

Page 61: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement

Priming and Fragmentation Biases fmlProfile: normalised positional read coverage along the transcript

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

mRNA

sequence library

RT

fragmen-tation

fragmentation

RT

Transcript profiles for different lengths

relative transcript position 5’ -> 3’

read

cov

erag

e 0

20

40

60

801.0 - 1.1 kb1.1 - 1.4 kb1.4 - 1.8 kb1.8 - 2.5 kb2.5 - 70.0 kb

RNA-Seq data (C. elegans SRX001872, R. Waterston Lab, University of Washington)

I Random priming

I Physical cDNA fragmentation

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 3

Page 62: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement

Priming and Fragmentation Biases fmlProfile: normalised positional read coverage along the transcript

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

mRNA

sequence library

RT

fragmen-tation

fragmentation

RT

Transcript profiles for different lengths

relative transcript position 5’ -> 3’

read

cov

erag

e

0

0.5

1

1.5

1.0 - 1.6 kb1.6 - 1.9 kb1.9 - 2.5 kb2.5 - 17.0 kb

RNA-Seq data (A. thaliana, D. Weigel’s Lab, MPI Tubingen)

I Chemical RNA fragmentation

I Random priming

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 3

Page 63: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement

Sequence Bias fml

sequence library

Sequencer short sequence reads

Mea

n lo

g re

ad c

over

age Percent exonic positions

C+G in %0 10 40 60 80

0123456789

10 20

0

18

8

16

246

101214

Figure provided by Georg Zeller

RNA-Seq data (A. thaliana, D. Weigel’s Lab, MPI Tubingen)

I Exonic GC content

I Dinucleotides at the boundaries (Linsen et al., 2009)

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 4

Page 64: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement

Read Mapping Bias fml

sequence library

Sequencer short sequence reads

position relative to intron

read

cov

erag

e200!200 !100 0 100

0.4

0.6

0.8

1

1.2

RNA-Seq data (A. thaliana, D. Weigel’s Lab, MPI Tubingen)

I Exon boundaries

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 5

Page 65: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement Results

Evaluation I fmlOur method rQuant: Position-wise, with profiles(estimating library and mapping bias)

compared to

I Position-wise, without profiles

I Segment-wise, without profiles (e.g. Jiang and Wong (2009))

I Segment-wise, with profiles (e.g. Flux Capacitor by Sammeth (2009a))

Estimate transcript abundances

I Using simulated data for A. thaliana (Flux Simulator (Sammeth, 2009b))

I Subset of alternatively spliced genes

Evaluation: Spearman correlation between

I Simulated RNA expression level and

I Predicted transcript weights

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 6

Page 66: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement Results

Evaluation I fmlOur method rQuant: Position-wise, with profiles(estimating library and mapping bias)

compared to

I Position-wise, without profiles

I Segment-wise, without profiles (e.g. Jiang and Wong (2009))

I Segment-wise, with profiles (e.g. Flux Capacitor by Sammeth (2009a))

Estimate transcript abundances

I Using simulated data for A. thaliana (Flux Simulator (Sammeth, 2009b))

I Subset of alternatively spliced genes

Evaluation: Spearman correlation between

I Simulated RNA expression level and

I Predicted transcript weights

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 6

Page 67: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement Results

Evaluation I fmlOur method rQuant: Position-wise, with profiles(estimating library and mapping bias)

compared to

I Position-wise, without profiles

I Segment-wise, without profiles (e.g. Jiang and Wong (2009))

I Segment-wise, with profiles (e.g. Flux Capacitor by Sammeth (2009a))

Estimate transcript abundances

I Using simulated data for A. thaliana (Flux Simulator (Sammeth, 2009b))

I Subset of alternatively spliced genes

Evaluation: Spearman correlation between

I Simulated RNA expression level and

I Predicted transcript weights

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 6

Page 68: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement Results

Evaluation I fmlOur method rQuant: Position-wise, with profiles(estimating library and mapping bias)

compared to

I Position-wise, without profiles

I Segment-wise, without profiles (e.g. Jiang and Wong (2009))

I Segment-wise, with profiles (e.g. Flux Capacitor by Sammeth (2009a))

Estimate transcript abundances

I Using simulated data for A. thaliana (Flux Simulator (Sammeth, 2009b))

I Subset of alternatively spliced genes

Evaluation: Spearman correlation between

I Simulated RNA expression level and

I Predicted transcript weights

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 6

Page 69: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement Results

Evaluation I fmlOur method rQuant: Position-wise, with profiles(estimating library and mapping bias)

compared to

I Position-wise, without profiles

I Segment-wise, without profiles (e.g. Jiang and Wong (2009))

I Segment-wise, with profiles (e.g. Flux Capacitor by Sammeth (2009a))

Estimate transcript abundances

I Using simulated data for A. thaliana (Flux Simulator (Sammeth, 2009b))

I Subset of alternatively spliced genes

Evaluation: Spearman correlation between

I Simulated RNA expression level and

I Predicted transcript weights

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 6

Page 70: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement Results

Evaluation I fmlOur method rQuant: Position-wise, with profiles(estimating library and mapping bias)

compared to

I Position-wise, without profiles

I Segment-wise, without profiles (e.g. Jiang and Wong (2009))

I Segment-wise, with profiles (e.g. Flux Capacitor by Sammeth (2009a))

Estimate transcript abundances

I Using simulated data for A. thaliana (Flux Simulator (Sammeth, 2009b))

I Subset of alternatively spliced genes

Evaluation: Spearman correlation between

I Simulated RNA expression level and

I Predicted transcript weights

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 6

Page 71: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement Results

Evaluation I fmlOur method rQuant: Position-wise, with profiles(estimating library and mapping bias)

compared to

I Position-wise, without profiles

I Segment-wise, without profiles (e.g. Jiang and Wong (2009))

I Segment-wise, with profiles (e.g. Flux Capacitor by Sammeth (2009a))

Estimate transcript abundances

I Using simulated data for A. thaliana (Flux Simulator (Sammeth, 2009b))

I Subset of alternatively spliced genes

Evaluation: Spearman correlation between

I Simulated RNA expression level and

I Predicted transcript weights

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 6

Page 72: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement Results

Evaluation II fml

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Spea

rman

Cor

rela

tion

Across transcripts

Within genes (mean)

Segment-wise, w/o pro!les

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 7

Page 73: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement Results

Evaluation II fml

Position-wise, w/o pro!les

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Spea

rman

Cor

rela

tion

Across transcripts

Within genes (mean)

Segment-wise, w/o pro!les

Segment-wise, w/ pro!les

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 7

Page 74: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement Results

Evaluation II fml

rQuantPosition-wise,

w/ pro!lesPosition-wise, w/o pro!les

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Spea

rman

Cor

rela

tion

Across transcripts

Within genes (mean)

Segment-wise, w/o pro!les

Segment-wise, w/ pro!les

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 7

Page 75: Next Generation Genome Annotation · Gunnar R¨atsch (FML, Tubingen)¨ Next Generation Genome Annotation Hinxton, Nov 10, 2009 9 / 17. The mSTAD/mTIM Approach fml expression 1 2 Q.

Supplement Results

Preliminary Evaluation I fmlCDS (precision+recall)/2

10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

expression percentiles [%]

SRX004863 w/o graphSRX004863 w graph

SRX004865 w/o graphSRX004865 w/ graph

SRX004867 w/o graphSRX004867 w/ graph

SRX001872 w/o graphSRX001872 w/ graph

SRX001873 w/o graphSRX001873 w/ graph

SRX001874 w/o graphSRX001874 w/ graph

SRX001875 w/o graphSRX001875 w/o graph

mTiM mGene

Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 8