Next Generation Genome Annotation
Gunnar Rätsch
Friedrich Miescher Laboratory of the Max Planck Society
Tübingen, Germany
November 10, 2009
The Wellcome Trust Genome Campus, Hinxton

RGASP Overview (Tübingen)
[Figure: pipeline diagram. Short Reads → Read alignment (GenomeMapper/QPALMA) → Alignments → Transcript finding (mGene.ngs, mTIM segmentation, splice graph building) → Transcripts → Quantitation/Filtering (rQuant, transcript filtering) → Submission.]
Every path corresponds to one submission.
Gunnar Ratsch (FML, Tubingen) Next Generation Genome Annotation Hinxton, Nov 10, 2009 1 / 17

Read Alignment – GenomeMapper/QPALMA
GenomeMapper for (unspliced) read mapping:
- Alignments based on GenomeMapper, developed in Tübingen for the 1001 plant genome project (Schneeberger et al., 2009a)
- k-mer based index, well suited for smaller genomes with many mismatches/gaps
QPALMA for spliced read alignments:
- GenomeMapper identifies seed regions for spliced alignments
- Alignments are performed using QPALMA (De Bona et al., 2008)
- QPALMA is individually adapted to every short-read (SR) dataset
Web server available at http://galaxy.tuebingen.mpg.de.
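
The idea of a k-mer based index that proposes seed regions can be illustrated with a toy sketch. This is not GenomeMapper's implementation: the function names `build_kmer_index` and `find_seeds`, the choice k = 4, and the simple vote counting per implied read start are all assumptions made for illustration.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_seeds(read, index, k):
    """Count, per implied read start in the genome, how many read k-mers agree."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1  # genome position where the read would start
    # Best-supported candidates first; ties broken by position.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


genome = "ACGTACGTTAGCCGATAGGCTTAACG"
index = build_kmer_index(genome, k=4)
read = "TAGCCGCTAG"  # equals genome[8:18] except for one mismatch
seeds = find_seeds(read, index, k=4)
print(seeds)
```

Even with a mismatch in the read, the k-mers that do not overlap the mismatch still vote for the correct start position, which is why such an index tolerates a moderate number of mismatches.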

Read Alignment – QPALMA
Classical scoring: $f : \Sigma \times \Sigma \to \mathbb{R}$
Quality scoring: $f : (\Sigma \times \mathbb{R}) \times \Sigma \to \mathbb{R}$ (De Bona et al., 2008)
Sources of information:
- Sequence matches
- Computational splice site predictions
- Intron length model
- Read quality information
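
The difference between the two signatures can be made concrete with a small sketch. QPALMA learns its scoring functions from data (De Bona et al., 2008); the fixed match/mismatch values and the Phred-based log-odds formula below are stand-in assumptions, not QPALMA's learned functions, and the names `classical_score` and `quality_score` are hypothetical.

```python
import math


def classical_score(read_base, genome_base, match=1.0, mismatch=-1.0):
    """Classical substitution score f: Sigma x Sigma -> R (values are illustrative)."""
    return match if read_base == genome_base else mismatch


def quality_score(read_base, phred, genome_base):
    """Quality-aware score f: (Sigma x R) x Sigma -> R.

    Assumption: a Phred quality q > 0 implies error probability 10**(-q/10), and
    the score is the log-odds of the observed base against a uniform background.
    """
    p_error = 10.0 ** (-phred / 10.0)
    if read_base == genome_base:
        p_observed = 1.0 - p_error
    else:
        p_observed = p_error / 3.0  # error spread over the three other bases
    return math.log(p_observed / 0.25)


# A mismatch at a low-quality base is penalised less than at a high-quality base.
low_q = quality_score("A", 5, "C")
high_q = quality_score("A", 40, "C")
print(low_q, high_q)
```

The design point is only that the read's quality value becomes an extra argument of the scoring function, so unreliable bases contribute less to the alignment score.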

RNA-Seq Read Alignment – QPALMA
Generate set of artificially spliced reads:
- Genomic reads with quality information
- Genome annotation for artificially splicing the reads
- Use 10,000 reads for training and 30,000 for testing
Alignment error rate (De Bona et al., 2008):
  SmithW                   14.19%
  Intron                    9.96%
  Intron+Splice             1.94%
  Intron+Splice+Quality     1.78%
[Figure: error vs. intron position; axis marks 8 nt, 12 nt, 12 nt, 8 nt.]
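
One way an artificially spliced read could be constructed from an annotation is sketched below. The slide's procedure starts from real genomic reads with quality values, which this sketch omits; the function name `make_spliced_read`, the 0-based end-exclusive coordinates, and the toy genome are assumptions.

```python
def make_spliced_read(genome, exon1, exon2, read_len, left_len):
    """Build a read whose first `left_len` bases end exon1 and whose rest starts exon2.

    Exons are (start, end) pairs, 0-based and end-exclusive (an assumption here).
    Returns the read sequence and the (start, end) of the intron it spans.
    """
    right_len = read_len - left_len
    if left_len > exon1[1] - exon1[0] or right_len > exon2[1] - exon2[0]:
        raise ValueError("read does not fit into the two exons")
    left = genome[exon1[1] - left_len:exon1[1]]
    right = genome[exon2[0]:exon2[0] + right_len]
    return left + right, (exon1[1], exon2[0])


# Toy genome: exon 1, an intron with GT...AG boundaries, exon 2.
genome = "AAACCCGGG" + "GTTTTTTTAG" + "ACGTACGT"
read, intron = make_spliced_read(genome, (0, 9), (19, 27), read_len=8, left_len=4)
print(read, intron)
```

Because the true intron position is known by construction, reads made this way can serve as labelled examples for training and testing a spliced aligner.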

Step 2: Transcript Prediction
1. Extension of the mGene gene finding system to use NGS data for protein-coding transcript prediction
2. Coverage segmentation algorithm mTIM for general transcripts (no coding bias/assumption)
3. Splice graph construction by extending the splice graph with spliced reads (connecting exons)
Approaches 1 & 2 use read coverages and spliced reads. Approach 3 uses existing transcripts and spliced reads.

mGene-based Transcript Prediction I
[Figure: two-step scheme along genomic position. STEP 1: SVM signal predictions for acc, don, tss, tis and stop. STEP 2: Integration, where transformed features enter a scoring function F(x,y) that separates the true gene model from wrong gene models by a large margin. Tiling array data is shown as an additional input track.]

mGene-based Transcript Prediction II
mGene with RNA-Seq (Behr et al., unpublished; Schweikert et al., 2009a,b)
- Use transcriptome measurements to enhance recognition of exonic regions
Results for A. thaliana (comparison with known gene models), transcript level (SN + SP)/2:
1. mGene (ab initio): 73.3%
2. ... with tiling arrays (11 tissues): 82.1%
3. ... with mRNA-seq (1 tissue): 81.1%
Similar observations for RGASP predictions.
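
The transcript-level measure (SN + SP)/2 quoted above can be computed as in the following sketch. The slides do not define SN and SP; the code assumes the common gene-finding convention SN = TP/(TP + FN) and SP = TP/(TP + FP), and counts a transcript as correct only on an exact match of all exons. The function name and the example coordinates are hypothetical.

```python
def transcript_level_accuracy(predicted, annotated):
    """Return (SN, SP, (SN + SP) / 2) for exact transcript matches.

    Assumption: a transcript is a tuple of exon (start, end) pairs and is correct
    only if every exon matches the annotation exactly.
    """
    predicted, annotated = set(predicted), set(annotated)
    true_pos = len(predicted & annotated)
    sn = true_pos / len(annotated) if annotated else 0.0
    sp = true_pos / len(predicted) if predicted else 0.0
    return sn, sp, (sn + sp) / 2


annotated = [((0, 100), (200, 300)), ((500, 650),), ((900, 950), (1000, 1100))]
predicted = [((0, 100), (200, 300)), ((500, 640),)]  # second one has a wrong end
sn, sp, mean = transcript_level_accuracy(predicted, annotated)
print(sn, sp, mean)
```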

Tiling Array/Read Coverage Segmentation
Goal: Characterize each “probe” as either intergenic, exonic or intronic
[Figure: observed log-intensity (0 to 10) along a transcript, with probes marked as annotated exonic, annotated intronic or annotated intergenic.]
Novel segmentation method (“mSTAD”/“mTIM”)
- accounts for spliced transcripts
- provides very accurate predictions (Zeller et al., 2008a)

The mSTAD/mTIM Approach
[Figure: state model with states S (intergenic), E (exonic) and I (intronic); for each state, a learned feature score plotted against the hybridization signal (5 to 14); extended model with expression levels 1, 2, ..., Q and states E1 ... EQ, I1 ... IQ.]
- Learn to associate a state with each probe given its hybridization signal and local context
- For mTIM: also score spliced reads and splice sites
- HM-SVM training: optimize transformations: signal → score
G. Zeller et al., Pac. Symp. Biocomp. (2008)
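
The decoding side of such a state model can be sketched with a plain Viterbi pass over the three states. This is a simplification: in mSTAD/mTIM the signal → score transformations are learned by HM-SVM training, whereas the functions and transition scores below are hand-set assumptions, and the names `state_score` and `segment` are hypothetical.

```python
STATES = ("S", "E", "I")  # intergenic, exonic, intronic
TRANSITION = {  # allowed transitions and their scores (hand-set for illustration)
    ("S", "S"): 0.0, ("S", "E"): -4.0, ("E", "E"): 0.0, ("E", "I"): -1.0,
    ("I", "I"): 0.0, ("I", "E"): -1.0, ("E", "S"): -4.0,
}


def state_score(state, signal):
    """Hand-set stand-in for the learned signal -> score transformations."""
    if state == "E":
        return signal - 8.0        # high intensity favours exons
    return 0.5 * (8.0 - signal)    # low intensity favours intron/intergenic


def segment(signals):
    """Viterbi decoding; the path is assumed to start and end in intergenic state S."""
    neg_inf = float("-inf")
    best = [{s: (state_score(s, signals[0]) if s == "S" else neg_inf) for s in STATES}]
    back = []
    for x in signals[1:]:
        scores, pointers = {}, {}
        for cur in STATES:
            candidates = [(best[-1][prev] + TRANSITION[(prev, cur)], prev)
                          for prev in STATES if (prev, cur) in TRANSITION]
            top, arg = max(candidates)
            scores[cur] = top + state_score(cur, x)
            pointers[cur] = arg
        best.append(scores)
        back.append(pointers)
    # Trace back from the forced final state S.
    state = "S"
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


signals = [6, 6, 12, 12, 12, 6, 6, 12, 12, 12, 6, 6]
labels = segment(signals)
print(labels)
```

In this toy setting the low-intensity stretch between the two exons is labelled intronic rather than intergenic only because leaving and re-entering a transcript is made more costly than splicing; mTIM additionally scores spliced reads and splice sites.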

Digestion ...
- mGene and mTIM predict single transcripts (no alternative transcripts)
- mGene uses more assumptions on the structure of transcripts
- mTIM exploits “uniformity” of read coverage among exons of the same transcript
- Spliced reads are used to generate a more complete splicing graph
- Paths through the splicing graph define transcripts for quantitation
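
How paths through a splicing graph define candidate transcripts can be sketched as a simple path enumeration. The graph representation (a dictionary from exon id to downstream exon ids), the function name `enumerate_transcripts` and the example gene are assumptions for illustration.

```python
def enumerate_transcripts(edges, sources, sinks):
    """List every path from a source exon to a sink exon in an acyclic splice graph.

    `edges` maps an exon id to the exons it can be spliced to.
    """
    paths = []

    def extend(path):
        node = path[-1]
        if node in sinks:
            paths.append(tuple(path))
        for nxt in edges.get(node, ()):
            extend(path + [nxt])

    for src in sources:
        extend([src])
    return paths


# Hypothetical gene: a spliced read connecting e1 to e3 makes e2 a cassette exon.
edges = {"e1": ["e2", "e3"], "e2": ["e3"], "e3": []}
transcripts = enumerate_transcripts(edges, sources=["e1"], sinks={"e3"})
print(transcripts)
```

Each additional spliced-read edge can multiply the number of paths, which is one reason a quantitation and filtering step is applied afterwards.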

RNA-Seq Biases and Quantitation
[Figure: RNA-Seq workflow. mRNA → fragmentation and RT (in either order) → sequence library → sequencer → short sequence reads → aligned reads (exonic reads and junction reads) on genome and transcripts.]
Biases due to ...
- cDNA library construction
- Sequencing
- Read mapping
[Figure: read coverage (0 to 80) versus relative transcript position 5’ -> 3’, averaged over annotated transcripts of length ≈1 kb for the C. elegans SRX001872 dataset.]

rQuant – Basic Idea
[Figure: read coverage versus genome position 5’ -> 3’ for a short transcript (A), a long transcript (B), and the mixture of transcripts, comparing expected coverage (M) with observed coverage (R).]
$M_i = w_A A_i + w_B B_i \;\Rightarrow\; \min_{w_A, w_B} \sum_i \ell(M_i, R_i)$
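
For two transcripts the minimisation above has a closed form if the loss is taken to be the squared difference. The slide leaves the loss unspecified, so squared loss is an assumption here, as are the function name `fit_two_transcripts` and the toy coverage vectors.

```python
def fit_two_transcripts(a, b, r):
    """Weights (w_a, w_b) minimising sum_i (w_a*a_i + w_b*b_i - r_i)**2.

    Solves the 2x2 normal equations directly; squared loss is an assumption.
    """
    saa = sum(x * x for x in a)
    sbb = sum(y * y for y in b)
    sab = sum(x * y for x, y in zip(a, b))
    sar = sum(x * z for x, z in zip(a, r))
    sbr = sum(y * z for y, z in zip(b, r))
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("transcripts are indistinguishable from coverage alone")
    w_a = (sar * sbb - sbr * sab) / det
    w_b = (sbr * saa - sar * sab) / det
    return w_a, w_b


# Indicator profiles over six positions: A skips the middle exon, B includes it.
A = [1, 1, 0, 0, 1, 1]
B = [1, 1, 1, 1, 1, 1]
R = [7, 7, 2, 2, 7, 7]  # observed coverage, here exactly 5*A + 2*B
w_a, w_b = fit_two_transcripts(A, B, R)
print(w_a, w_b)
```

The positions where the two transcripts differ (here the middle exon) are what make the weights identifiable; if A and B covered exactly the same positions, the determinant would vanish.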

rQuant – Iterative Algorithm
1. Optimise transcript weights: $\min_{w} \sum_i \ell\big(\sum_t w^{(t)} p_i^{(t)}, R_i\big)$
2. Optimise profile weights: $\min_{p} \sum_i \ell\big(\sum_t w^{(t)} p_i^{(t)}, R_i\big)$
3. Repeat 1. and 2. until convergence.
[Figure: gene AT1G01240, chromosome 1, forward strand, positions 99,960 to 101,592. Read counts (0 to 70) with transcripts AT1G01240.1, AT1G01240.2 and AT1G01240.3. Values shown next to the transcripts: 27.20, 4.80 and 6.00 (total 38.00) on the earlier overlays; 23.99, 8.35 and 4.95 (total 37.29) on the final overlay.]
[Figure: profile weight (0.2 to 1.8) versus distance to transcript start/end (0 to 2450 from the start, -2450 to 0 from the end).]
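
The alternation between the two steps can be sketched as follows. This is not rQuant's implementation: it assumes squared loss, a single profile shared by all transcripts and indexed by a coarse position bin, and exact coordinate-wise updates clipped at zero. The names `predict`, `update` and `fit` and the toy data are hypothetical.

```python
def predict(weights, profile, bins):
    """Expected coverage M_i = sum_t w^(t) * p_i^(t), with p_i^(t) = profile[bins[t][i]]."""
    n = len(bins[0])
    return [sum(w * profile[b[i]] for w, b in zip(weights, bins) if b[i] is not None)
            for i in range(n)]


def loss(weights, profile, bins, reads):
    return sum((m - r) ** 2 for m, r in zip(predict(weights, profile, bins), reads))


def update(values, columns, current, reads):
    """Exact coordinate-wise minimisation of the squared loss, clipped at zero."""
    for k, col in enumerate(columns):
        norm = sum(c * c for c in col)
        if norm == 0:
            continue
        # Correlate the column with the residual that excludes coordinate k.
        num = sum(c * (r - m + values[k] * c) for c, r, m in zip(col, reads, current))
        new = max(0.0, num / norm)
        current = [m + (new - values[k]) * c for m, c in zip(current, col)]
        values[k] = new
    return current


def fit(bins, reads, n_bins, n_iter=20):
    weights, profile, history = [1.0] * len(bins), [1.0] * n_bins, []
    for _ in range(n_iter):
        # Step 1: transcript weights with the profile held fixed.
        cols = [[profile[b] if b is not None else 0.0 for b in tb] for tb in bins]
        update(weights, cols, predict(weights, profile, bins), reads)
        # Step 2: profile weights with the transcript weights held fixed.
        cols = [[sum(w for w, tb in zip(weights, bins) if tb[i] == k)
                 for i in range(len(reads))] for k in range(n_bins)]
        update(profile, cols, predict(weights, profile, bins), reads)
        history.append(loss(weights, profile, bins, reads))
    return weights, profile, history


# Two transcripts over eight positions; bin 0 = 5' half, bin 1 = 3' half.
bins = [[0, 0, 0, 0, 1, 1, 1, 1],
        [0, 0, None, None, None, None, 1, 1]]  # None: position not in transcript
reads = [3, 3, 2, 2, 6, 6, 9, 9]
weights, profile, history = fit(bins, reads, n_bins=2)
print(history[0], history[-1])
```

Because each coordinate update exactly minimises the loss in that coordinate, the loss cannot increase from one iteration to the next. In this parameterisation the weights and the profile are only determined up to a common scale factor, so a normalisation of the profile would be needed to compare weights across genes.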

Preliminary Evaluation I
[Figure: CDS (precision + recall)/2 (0 to 0.7) versus expression percentiles [%] (10 to 100) for mTIM, mGene ab initio and mGene with RNA-Seq.]

Preliminary Evaluation II
[Figure: CDS (precision + recall)/2 (0 to 0.7) versus expression percentiles [%] (10 to 100) for mGene and mTIM, each w/o or w/ graph and w/o or w/ filter (eight curves).]

Conclusions
- GenomeMapper/QPALMA
  - Splice site predictions improve alignment performance
  - Integrating QPALMA scoring into other read mappers is promising
- mGene
  - Higher recall
  - Identifies also non-expressed genes ⇒ good for annotation
- mTIM
  - Higher precision
  - Better for identifying transcripts specific to experimental data
- Adding alternative transcripts increases recall
- rQuant-based filtering improves precision

Acknowledgements
RGASP Team
- Jonas Behr (FML)
- Georg Zeller (FML & MPI)
- Regina Bohnert (FML)
Funding by DFG & Max Planck Society.
Thank you for your attention.

References
Fabio De Bona, Stephan Ossowski, Korbinian Schneeberger, and Gunnar Rätsch. Optimal spliced alignments of short sequence reads. Bioinformatics (Oxford, England), 24(16):i174–180, August 2008.
Hui Jiang and Wing Hung Wong. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics, 25(8):1026–1032, April 2009.
Sam E V Linsen, Elzo de Wit, Georges Janssens, Sheila Heater, Laura Chapman, Rachael K Parkin, Brian Fritz, Stacia K Wyman, Ewart de Bruijn, Emile E Voest, Scott Kuersten, Muneesh Tewari, and Edwin Cuppen. Limitations and possibilities of small RNA digital gene expression profiling. Nature Methods, 6(7):474–476, July 2009.
M. Sammeth. The Flux Capacitor. Website, 2009a. http://flux.sammeth.net/capacitor.html.
M. Sammeth. The Flux Simulator. Website, 2009b. http://flux.sammeth.net/simulator.html.
Korbinian Schneeberger, Jörg Hagmann, Stephan Ossowski, Norman Warthmann, Sandra Gesing, Oliver Kohlbacher, and Detlef Weigel. Simultaneous alignment of short reads against multiple genomes. Genome Biol, 10(9):R98, Jan 2009a. doi:10.1186/gb-2009-10-9-r98. URL http://genomebiology.com/2009/10/9/R98.
Korbinian Schneeberger, Jörg Hagmann, Stephan Ossowski, Norman Warthmann, Sandra Gesing, Oliver Kohlbacher, and Detlef Weigel. Simultaneous alignment of short reads against multiple genomes. Genome Biology, 10(9):R98, 2009b.
Gabriele Schweikert, Jonas Behr, Alexander Zien, Georg Zeller, Cheng Soon Ong, Sören Sonnenburg, and Gunnar Rätsch. mGene.web: a web service for accurate computational gene finding. Nucleic Acids Research, 37(Web Server issue):W312–W316, July 2009a.
Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Krüger, Sören Sonnenburg, and Gunnar Rätsch. mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Research, September 2009b.
G. Zeller, S.R. Henz, S. Laubinger, D. Weigel, and G. Rätsch. Transcript normalization and segmentation of tiling array data. In Proceedings Pac. Symp. on Biocomputing, pages 527–538, 2008a.
Georg Zeller, Stefan R. Henz, Sascha Laubinger, Detlef Weigel, and Gunnar Rätsch. Transcript normalization and segmentation of tiling array data. In Proceedings Pac. Symp. on Biocomputing, pages 527–538, 2008b.

Supplement
Method Comparison
[Figure: precision [%] versus recall [%] (0 to 100) for mSTAD (HMM), mSTAD (HM-SVM) and Affymetrix TAS.]
Substantially improved exon probe recognition over the most widely used “transfrag” method
www.fml.mpg.de/raetsch/research/tiling/

Supplement
Priming and Fragmentation Biases
Profile: normalised positional read coverage along the transcript
[Figure: library preparation schematic. mRNA → RT → fragmentation, or mRNA → fragmentation → RT, leading to the sequence library.]
Transcript profiles for different lengths
[Figure: read coverage (0 to 80) versus relative transcript position 5’ -> 3’ for transcript length bins 1.0 - 1.1 kb, 1.1 - 1.4 kb, 1.4 - 1.8 kb, 1.8 - 2.5 kb and 2.5 - 70.0 kb.]
RNA-Seq data (C. elegans SRX001872, R. Waterston Lab, University of Washington)
- Random priming
- Physical cDNA fragmentation
[Figure: read coverage (0 to 1.5) versus relative transcript position 5’ -> 3’ for transcript length bins 1.0 - 1.6 kb, 1.6 - 1.9 kb, 1.9 - 2.5 kb and 2.5 - 17.0 kb.]
RNA-Seq data (A. thaliana, D. Weigel’s Lab, MPI Tübingen)
- Chemical RNA fragmentation
- Random priming

Supplement
Sequence Bias
[Figure provided by Georg Zeller: mean log read coverage and percent exonic positions versus C+G content in %.]
RNA-Seq data (A. thaliana, D. Weigel’s Lab, MPI Tübingen)
- Exonic GC content
- Dinucleotides at the boundaries (Linsen et al., 2009)

Supplement
Read Mapping Bias
[Figure: read coverage (0.4 to 1.2) versus position relative to intron (-200 to 200).]
RNA-Seq data (A. thaliana, D. Weigel’s Lab, MPI Tübingen)
- Exon boundaries
Supplement Results
Evaluation I
Our method rQuant: Position-wise, with profiles (estimating library and mapping bias)
compared to
- Position-wise, without profiles
- Segment-wise, without profiles (e.g. Jiang and Wong (2009))
- Segment-wise, with profiles (e.g. Flux Capacitor by Sammeth (2009a))
Estimate transcript abundances
- Using simulated data for A. thaliana (Flux Simulator (Sammeth, 2009b))
- Subset of alternatively spliced genes
Evaluation: Spearman correlation between
- Simulated RNA expression level and
- Predicted transcript weights
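The evaluation measure can be sketched in plain Python. The record layout (gene id, simulated expression, predicted weight) is an assumption for this example, and the two summaries (one correlation over all transcripts, and the mean of per-gene correlations) are one plausible reading of how such a comparison is reported.

```python
import math
from collections import defaultdict


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined for constant input
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(records):
    """records: list of (gene_id, simulated_expression, predicted_weight).
    Returns (correlation across all transcripts, mean within-gene correlation)."""
    across = spearman([r[1] for r in records], [r[2] for r in records])
    by_gene = defaultdict(list)
    for gene, simulated, predicted in records:
        by_gene[gene].append((simulated, predicted))
    within = []
    for pairs in by_gene.values():
        if len(pairs) >= 2:  # a single-transcript gene has no ranking
            rho = spearman([p[0] for p in pairs], [p[1] for p in pairs])
            if not math.isnan(rho):
                within.append(rho)
    mean_within = sum(within) / len(within) if within else float("nan")
    return across, mean_within
```

The within-gene value only rewards getting the relative order of isoforms of the same gene right, which is why restricting the test set to alternatively spliced genes matters for this measure.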
Evaluation II
[Figure: bar chart of Spearman correlation (y-axis, 0 to 0.9), shown across transcripts and within genes (mean), for four methods: segment-wise w/o profiles; segment-wise w/ profiles; position-wise w/o profiles; rQuant (position-wise, w/ profiles)]
Preliminary Evaluation I
[Figure: CDS (precision+recall)/2 (y-axis, 0 to 0.7) against expression percentiles [%] (x-axis, 10 to 100), for mTim and mGene; legend: SRX004863, SRX004865, SRX004867, SRX001872, SRX001873, SRX001874, SRX001875, each w/o graph and w/ graph]
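The y-axis quantity, (precision+recall)/2, can be illustrated with a small sketch. How predicted and annotated CDS are matched is not stated on the slide; exact matching of (chromosome, start, end) tuples is an assumption made here, and the function name is hypothetical.

```python
def mean_precision_recall(predicted, annotated):
    """(precision + recall) / 2 for two collections of CDS segments,
    each segment given as a hashable tuple such as (chrom, start, end).
    A prediction counts as correct only on an exact match (assumption)."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return (precision + recall) / 2.0
```

To obtain a curve over expression percentiles, the same function would be applied to the subset of genes in each percentile range.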