Post on 20-May-2020
New RNA-seq workflowsCharlotte Soneson University of Zurich
Brixen 2016
Wikipedia
7 . . . 13 . . . . . . . . . . . . . . .
Gene AGene B
.
.
.Gene X
The traditional workflow
ALIGNMENT
COUNTING
ANALYSIS
doesn’t account for all uncertainty in abundance estimates
7 . . . 13 . . . . . . . . . . . . . . .
Gene AGene B
.
.
.Gene X
• slow• unnecessary?
The traditional workflow
ALIGNMENT
COUNTING
ANALYSIS
inaccurate?
doesn’t account for all uncertainty in abundance estimates
7 . . . 13 . . . . . . . . . . . . . . .
Gene AGene B
.
.
.Gene X
• slow• unnecessary?
inaccurate?
The traditional workflow
ALIGNMENT
COUNTING
ANALYSIS
length = L
length = 2L
sample 1 sample 2
T1
T2
Abundance quantification
length = L
length = 2L
sample 1 sample 2
T1
T2
Abundance quantification
Gene-level read countslength = L
length = 2L
150 reads
sample 1 sample 2
150 reads
T1
T2
Gene length = 2.6L
Gene S1 S2
Count 150 150
What can we do?
• Consider another abundance unit that better reflects the underlying abundances (“number of transcript molecules”)
• Include “adjustment” of gene counts to reflect underlying isoform composition
• Consider another abundance unit that better reflects the underlying abundances (“number of transcript molecules”)
• Include “adjustment” of gene counts to reflect underlying isoform composition
What can we do?
How can we get such values?
How could such adjustment be done?
Are they any good?
• Consider another abundance unit that better reflects the underlying abundances (“number of transcript molecules”)
• Include “adjustment” of gene counts to reflect underlying isoform composition
What can we do?
How can we get such values?
How could such adjustment be done?
Are they any good?
We need transcript-level
information!
library size
ti =cir
`i
TPMi = 106 · tiPk tk
RPKMi = 109 · ci`iP
k ck= 109 · tiP
k (tk`k)
Abundance units
`i
ciread count for transcript i
length of transcript i
library size
fragment length
TPMi / RPKMiX
i
TPMi = 106
library size
ti =cir
`i
TPMi = 106 · tiPk tk
RPKMi = 109 · ci`iP
k ck= 109 · tiP
k (tk`k)
Abundance units
`i
ciread count for transcript i
length of transcript i
library size
fragment length
TPMi / RPKMiX
i
TPMi = 106
Abundance units
`i
ciread count for transcript i
length of transcript i
library size
fragment length
TPMi / RPKMiX
i
TPMi = 106
ti =cir
`i
TPMi = 106 · tiPk tk
RPKMi = 109 · ci`iP
k ck= 109 · tiP
k (tk`k)
Abundance units
`i
ciread count for transcript i
length of transcript i
library size
fragment length
TPMi / RPKMiX
i
TPMi = 106
ti =cir
`i
TPMi = 106 · tiPk tk
RPKMi = 109 · ci`iP
k ck= 109 · tiP
k (tk`k)
Abundance units
`i
ciread count for transcript i
length of transcript ifragment length
TPMi / RPKMiX
i
TPMi = 106
library size
ti =cir
`i
TPMi = 106 · tiPk tk
RPKMi = 109 · ci`iP
k ck= 109 · tiP
k (tk`k)
• Similar to correction factors for library size, but sample- and gene-specific
• Weighted average of transcript lengths, weighted by estimated abundances (TPMs)
• Average transcript length for gene g in sample s:
Offsets (“average transcript lengths”)
ATLgs =
X
i2g
✓is¯`is,X
i2g
✓is = 1
¯`is = e↵ective length of isoform i (in sample s)✓is = relative abundance of isoform i in sample s
length = L
length = 2L
T1
T2
Average transcript lengths
ATLg1 = 1 · L+ 0 · 2L = L
ATLg2 = 0 · L+ 1 · 2L = 2L
length = L
length = 2L
T1
T2
Average transcript lengths
ATLg2 = 0.5 · L+ 0.5 · 2L = 1.5L
ATLg1 = 0.75 · L+ 0.25 · 2L = 1.25L
7 . . . 13 . . . . . . . . . . . . . . .
Gene AGene B
.
.
.Gene X
The traditional workflow
ALIGNMENT
COUNTING
ANALYSIS
7 . . . 13 . . . . . . . . . . . . . . .
Gene AGene B
.
.
.Gene X
The “modern” workflow
“MAPPING”
ESTIMATION
ANALYSIS
• Does not provide “full” alignment information (i.e., no exact base-by-base alignment).
• Rather, finds all transcripts (and positions) that a read is compatible with.
• Comes in various flavors: • pseudoalignment (kallisto) • lightweight alignment (Salmon) • quasimapping (Sailfish, RapMap)
The “mapping” step
Bray et al. 2016; Patro et al. 2014; Patro et al. 2015; Srivastava et al. 2016
• Input: for each read, the “equivalence class” of compatible transcripts
• Probabilistic modeling of read generation process, with transcript abundance as parameter
• EM algorithm
• Output: estimated abundance of each transcript
The “estimation” step
Step 1: build transcriptome index
kallisto
Salmon
name of index
transcriptome fasta file
name of index
transcriptome fasta file
number of cores
Where to find transcript fasta?www.ensembl.org/info/data/ftp/index.html
Where to find transcript fasta?www.ensembl.org/info/data/ftp/index.html
reference files for alignment-based workflow
Step 2: quantify
kallisto
Salmon
name of index
output folder
number of cores
name of index
input fastq files
# bootstrapsnumber of cores input fastq files
libtype
output folder # bootstraps
Salmon LIBTYPE argumenthttp://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype
output
kallisto
Salmon
output
kallisto
Salmon
[abundance.tsv]
[quant.sf]
Comparison to traditional workflow
Salmon/kallisto…
• … are considerably faster than traditional alignment+counting -> allow bootstrapping
• … provide more highly resolved estimates (transcripts rather than gene) - can be aggregated to gene level
• … can use a larger fraction of the reads
• … don’t give precise alignments (for e.g. visualization in genome browser) - but avoid large alignment files
kallisto and Salmon gene counts overall similar
0 1 2 3 4 5
01
23
45
SRR1039508
kallisto (log(counts + 1))
Salm
on (l
og(c
ount
s +
1))
Gene-level counts mostly similar to traditional approach
0 1 2 3 4 5
01
23
45
SRR1039508
featureCounts (log(counts + 1))
Salm
on (l
og(c
ount
s +
1))
kallisto and Salmon can use slightly more reads
0e+00
1e+07
2e+07
3e+07
SRR
1039
508
SRR
1039
509
SRR
1039
512
SRR
1039
513
SRR
1039
516
SRR
1039
517
SRR
1039
520
SRR
1039
521
Num
ber o
f map
ped
read
s
featureCounts kallisto Salmon
How to get the estimated values into R?
How to get the estimated values into R?
How to get the estimated values into R?
TPMs
counts
“ATL” offsets
• Abundance estimates for lowly expressed transcripts are highly variable and should be interpreted with caution
A word of warning
A B
C
1
Estim
ated
TPM
True TPM
A B
C
1
Soneson, Love & Robinson, F1000 Research 2016
• Problematic when coverage of region defining an isoform is low
A word of warning
Soneson, Love & Robinson, F1000 Research 2016
• When aggregated to the gene level, abundance estimates are less variable
A word of warning
A B
C
1
Estim
ated
TPM
True TPM
A B
C
1
Soneson, Love & Robinson, F1000 Research 2016
A B
C
1
Differential analysis types for RNA-seq
• Has the total output of a gene changed? DGE
• Has the expression of individual transcripts changed? DTE
• Has any isoform of a given gene changed? DTE+G
• Has the isoform composition for a given gene changed? DTU/DEU
- need different abundance quantification of transcriptomic features (genes, transcripts, exons)
• Srivastava et al.: RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq read to transcriptomes. Bioinformatics 32:i192-i200 (2016) - RapMap
• Patro et al.: Accurate, fast, and model-aware transcript expression quantification with Salmon. bioRxiv http://dx.doi.org/10.1101/021592 (2015) - Salmon
• Bray et al.: Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34(5):525-527 (2016) - kallisto • Patro et al.: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.
Nature Biotechnology 32:462-464 (2014) - Sailfish • Pimentel et al.: Differential analysis of RNA-Seq incorporating quantification uncertainty. bioRxiv http://dx.doi.org/
10.1101/058164 (2016) - sleuth • Wagner et al.: Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among
samples. Theory in Biosciences 131:281-285 (2012) - TPM vs FPKM • Soneson et al.: Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.
F1000Research 4:1521 (2016) - ATL offsets (tximport package) • Li et al.: RNA-seq gene expression estimation with read mapping uncertainty. Bioinformatics 26(4):493-500 (2010) -
TPM, RSEM
References