1
Exploring the package TopHat-CuffDiff
Jean-François Taly
Bioinformatics Core Facilities
Group meeting, October 2nd 2012
2
RNAseq expression data analysis
1. TopHat for mapping reads to the reference
   – Reads directionality
2. CuffDiff for the differential enrichment
   – Statistics with version 2.0.0 or 2.0.1
3. Enrichment threshold
   – Which transcripts are present in mitochondria?
3
MitomiR project
Question 1: Are nuclear DNA-encoded miRNAs imported into mitochondria?
[Figure: miRNAs and miRNP outside the mitochondrion, PNPASE, imported miRNAs (?), regulation of mitochondrial translation (mito mRNAs, proteins). Slide from the MitomiR project, MitomiR_EU0183]
4
MitomiR project
Question 2: Do miRNAs exist in the mitochondrial genome?
[Figure: miRNAs and miRNP (?) inside the mitochondrion, regulation of mitochondrial translation (mRNAs, proteins). Slide from the MitomiR project, MitomiR_EU0183]
5
One cell, two DNAs
Mitochondria
- Circular DNA
- Human mitochondrial genome (mtDNA) = 16.6 kb
  - 13 genes for subunits of respiratory complexes I, III, IV and V
  - 22 genes for mitochondrial tRNA
  - 2 genes for rRNA
* One mitochondrion can contain two to ten copies of its DNA
* Exceptions to the universal genetic code (UGC) in mitochondria
From Lung et al., 2006

Nucleus
- 23 chromosome pairs
- Human DNA: 2.9 billion DNA base pairs
- Between 20,000 and 25,000 human protein-coding genes
- "Junk" DNA or non-coding DNA
- Non-coding functional RNA (tRNA, rRNA, miRNA…)
The human genome may encode over 1000 miRNAs, which may target about 60% of mammalian genes
MitomiR_EU0183
6
RNAseq libraries
• Short insert size: searching for miRNAs
  – No poly-A selection
  – No fragmentation
  – Size selected: 18-36 nt
  – Stranded
• Long insert size: searching for lncRNAs
  – No poly-A selection
  – Fragmented
  – Size selected: 200 nt
  – Stranded
7
2 Conditions
• Total fraction (tot)
  – Full cell lysate
• Mitochondrial fraction (mit)
  – RNA extracted from mitochondria
8
RNAseq expression data analysis
1. TopHat for mapping reads to the reference
   – Reads directionality
2. CuffDiff for the differential enrichment
   – Statistics with version 2.0.0 or 2.0.1
3. Enrichment threshold
   – Which transcripts are present in mitochondria?
9
Stranded RNAseq: Vocabulary
[Figure: double-stranded DNA with the forward strand drawn 5’ to 3’ and the reverse strand drawn 3’ to 5’; a coding region is shown on each strand]
• Forward = the strand whose 5’ end is the closest to the centromere in human
• 50% of the genes are coding on the forward strand
• Forward / Reverse = Plus / Minus
• Coding / Template = Sense / Anti-sense
http://www.biostars.org/post/show/3423/forward-and-reverse-strand-conventions/
10
Orientation of reads?
[Figure: coding DNA (5’ to 3’) and template DNA (3’ to 5’); transcription gives the RNA (5’ to 3’); reverse transcription gives the first cDNA strand (3’ to 5’); duplication gives the second strand (5’ to 3’, same orientation as the coding DNA)]
• First strand sequencing: dUTP, NSR, NNSR
• Second strand sequencing: Directional Illumina (Ligation), Standard SOLiD
11
Proper TopHat option?
--library-type:
• fr-unstranded: default, Standard Illumina reads
• fr-firststrand: dUTP, NSR, NNSR
• fr-secondstrand: Directional Illumina (Ligation), Standard SOLiD
We mapped the reads with both fr-unstranded and fr-secondstrand for comparison
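The protocol-to-option table above can be sketched as a small lookup. This is only an illustration: the function name, the index name and the file names are hypothetical, and only the `--library-type` values and the `-o` output option come from TopHat itself.

```python
# Protocol -> TopHat --library-type, as listed on the slide.
LIBRARY_TYPE = {
    "standard illumina": "fr-unstranded",
    "dutp": "fr-firststrand",
    "nsr": "fr-firststrand",
    "nnsr": "fr-firststrand",
    "directional illumina (ligation)": "fr-secondstrand",
    "standard solid": "fr-secondstrand",
}


def tophat_command(protocol, bowtie_index, reads, out_dir):
    """Build a TopHat argument list for a protocol (hypothetical helper)."""
    library_type = LIBRARY_TYPE[protocol.lower()]
    return ["tophat", "--library-type", library_type,
            "-o", out_dir, bowtie_index, reads]


if __name__ == "__main__":
    # Index and file names below are placeholders, not the ones used in the project.
    cmd = tophat_command("Directional Illumina (Ligation)",
                         "genome_index", "reads.fastq", "tophat_secondstrand")
    print(" ".join(cmd))
```

Running the same helper with "Standard Illumina" would give the unstranded mapping used here as the comparison.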
12
How can we evaluate directionality?
• Reads mapping to the forward (F) strand should be aligned with genes coding on F as well.
• Bitwise FLAG of the BAM file:
  – How many reads in forward?
    samtools view -c -F 16 accepted_hits.bam
  – How many reads in reverse?
    samtools view -c -f 16 accepted_hits.bam

--library-type     Total number of reads   Percentage of Forward Mapping (PFM)
fr-secondstrand    173,219,584             55%
default            173,196,005             55%
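The PFM is simply the share of alignments whose FLAG does not have bit 16 (reverse strand) set. A minimal sketch, assuming the FLAG values or the two samtools counts are already available; the function names are illustrative.

```python
REVERSE_FLAG = 0x10  # SAM FLAG bit 16: alignment is on the reverse strand


def percentage_forward_mapping(flags):
    """PFM (in %) from an iterable of SAM FLAG integers."""
    flags = list(flags)
    if not flags:
        raise ValueError("no alignments given")
    forward = sum(1 for flag in flags if not flag & REVERSE_FLAG)
    return 100.0 * forward / len(flags)


def pfm_from_counts(forward, reverse):
    """PFM (in %) from the two counts returned by `samtools view -c`."""
    total = forward + reverse
    if total == 0:
        raise ValueError("no alignments given")
    return 100.0 * forward / total


if __name__ == "__main__":
    # Toy FLAG values: 0 = forward, 16 = reverse, 272 = secondary + reverse.
    print(percentage_forward_mapping([0, 16, 0, 272]))
```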
13
How can we evaluate directionality? (2)
• Gene by gene
  – Bitwise FLAG + gene strand annotation

                                   Transcripts in   Transcripts in   Transcripts in
                                   the (+) strand   the (-) strand   both strands
--library-type fr-secondstrand
  Number of transcripts            82,782           80,648           163,430
  Average PFM                      77%              24%              51%
  Median PFM                       92%              1%               55%
default
  Number of transcripts            82,868           80,693           163,561
  Average PFM                      77%              24%              51%
  Median PFM                       92%              1%               54%

A small number of genes received a huge number of mis-mapped reads!
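The gene-by-gene evaluation can be sketched as follows, assuming each alignment has already been assigned to a transcript and that the annotated strand of each transcript is known. The input structures and function names are hypothetical; the real assignment step (overlapping reads with the annotation) is not shown.

```python
from collections import defaultdict
from statistics import mean, median

REVERSE_FLAG = 0x10  # SAM FLAG bit 16: alignment is on the reverse strand


def pfm_per_transcript(reads):
    """reads: iterable of (transcript_id, flag). Returns {transcript_id: PFM in %}."""
    counts = defaultdict(lambda: [0, 0])  # transcript -> [forward, total]
    for transcript_id, flag in reads:
        counts[transcript_id][1] += 1
        if not flag & REVERSE_FLAG:
            counts[transcript_id][0] += 1
    return {tid: 100.0 * fwd / total for tid, (fwd, total) in counts.items()}


def summarize_by_strand(pfm, strand_of):
    """Average and median PFM for each annotated strand ('+' or '-')."""
    groups = defaultdict(list)
    for transcript_id, value in pfm.items():
        groups[strand_of[transcript_id]].append(value)
    return {strand: (mean(values), median(values))
            for strand, values in groups.items()}


if __name__ == "__main__":
    # Toy data only, not taken from the experiment.
    toy_reads = [("t1", 0), ("t1", 0), ("t1", 16), ("t2", 16), ("t2", 16), ("t3", 0)]
    toy_strands = {"t1": "+", "t2": "-", "t3": "+"}
    print(summarize_by_strand(pfm_per_transcript(toy_reads), toy_strands))
```

A large gap between average and median, as in the table above, is what a few transcripts with many wrongly oriented reads would produce.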
14
Example of misaligned reads
• AC097532.1: chr2:133038647-133038738
  – miRNA automatically annotated in Ensembl 67 (E67) but retired from E68;
  – The CIGAR string of some reads spans 26 kb;
  – 11,000,115 reads mapped (6% of total);
  – 8,205,667 mapped to position 133,038,644;
  – NCBI BLAST of the major sequence:
    • hit on the opposite strand but with 100% coverage and 100% identity to the 28S ribosomal RNA.
15
RNAseq expression data analysis
1. TopHat for mapping reads to the reference
   – Reads directionality
2. CuffDiff for the differential enrichment
   – Statistics with version 2.0.0 or 2.0.1
3. Enrichment threshold
   – Which transcripts are present in mitochondria?
16
CuffDiff needs a special GTF
• CuffDiff needs a GTF with the 2 following tags:
  – tss_id: the ID of this transcript's inferred start site.
  – p_id: the ID of the coding sequence this transcript contains.
• You can produce a compatible GTF with CuffCompare:
  cuffcompare -s /path/to/genome_seqs.fa -CG -r annotation.gtf annotation.gtf
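Before running CuffDiff it can be useful to check that the GTF really carries both tags. A minimal sketch, assuming standard 9-column GTF lines with `key "value";` attributes; the helper names are illustrative, and non-coding transcripts may legitimately lack p_id.

```python
import re

REQUIRED_TAGS = ("tss_id", "p_id")
ATTRIBUTE = re.compile(r'(\w+)\s+"([^"]*)"')  # matches: key "value"


def parse_attributes(gtf_line):
    """Return the attribute column of one GTF line as a dict."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    return dict(ATTRIBUTE.findall(fields[8]))


def missing_tags(gtf_line):
    """List the CuffDiff tags absent from one GTF line."""
    attributes = parse_attributes(gtf_line)
    return [tag for tag in REQUIRED_TAGS if tag not in attributes]


if __name__ == "__main__":
    # Made-up record for illustration only.
    line = ('chrM\tcuffcompare\texon\t1\t100\t.\t+\t.\t'
            'gene_id "G1"; transcript_id "T1"; tss_id "TSS1"; p_id "P1";')
    print(missing_tags(line))
```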
17
Effect of CuffCompare
[Figure: two panels comparing CuffDiff V2.0.2 (one axis) with CuffCompare + CuffDiff V2.0.2 (other axis)]
18
Effect of CuffDiff Version
[Figure: two panels comparing CuffDiff V2.0.1 (one axis) with CuffDiff V2.0.2 (other axis)]
19
Highly sensitive statistics
Reproducibility?
Version effect?
CuffCompare effect?
Genome annotation effect?
From 902 differentially expressed genes with V2.0.1, we went down to 15 with V2.0.2!
20
RNAseq expression data analysis
1. TopHat for mapping reads to the reference
   – Reads directionality
2. CuffDiff for the differential enrichment
   – Statistics with version 2.0.0 or 2.0.1
3. Enrichment threshold
   – Which transcripts are present in mitochondria?
21
Expression data reflects expectations
Ratios of total to mitochondrial fraction (tot/mit):

Ensembl Ids       Gene      Length (shortest)   qPCR(tot)/qPCR(mit) 21-07-2011   qPCR 29-07-2011   RNA seq ShortIS   RNA seq LongIS
ENSG00000198899   MT-ATP6   681                 0.600                            0.500             -                 0.18
ENSG00000198840   MT-ND3    346                 0.400                            0.400             -                 0.21
ENSG00000111640   GAPDH     390                 416.000                          362.000           -                 7.1
ENSG00000089157   RPLP0     402                 611.000                          446.000           -                 8.6

The statistics may not be trustworthy, but the fold change is!
Define an enrichment threshold based on log2(FPKMtot/FPKMmit)
Cytosol / Vicinity of mitochondria / Mitochondrial genes
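The enrichment score itself is a one-line computation. The sketch below uses the long-insert FPKM values of MT-ATP6 from the backup slides (mit = 2357, tot = 431); the cutoff of -1.0 is a placeholder chosen for illustration, not the threshold used in this analysis, and the function names are hypothetical.

```python
import math


def log2_fold_change(fpkm_tot, fpkm_mit):
    """log2(FPKMtot / FPKMmit); negative values mean enrichment in mitochondria."""
    if fpkm_tot <= 0 or fpkm_mit <= 0:
        raise ValueError("FPKM values must be positive")
    return math.log2(fpkm_tot / fpkm_mit)


def is_mito_enriched(fpkm_tot, fpkm_mit, threshold):
    """True when the log2 fold change is at or below the chosen threshold."""
    return log2_fold_change(fpkm_tot, fpkm_mit) <= threshold


if __name__ == "__main__":
    placeholder_threshold = -1.0  # assumption for illustration only
    score = log2_fold_change(431, 2357)  # MT-ATP6, long insert library
    print(round(score, 2), is_mito_enriched(431, 2357, placeholder_threshold))
```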
22
Compartmentalized genes
• Cytosolic genes:
  – UniProt: experimentally observed in cytosol
  – Ensembl: no automatic annotations
• Vicinity of mitochondria:
  – Paper from Kang et al. 2012
• Mitochondrial genes
  – The 37 genes in the chromosome
23
Log2(Fold Change) distributions for the long insert library
24
Summary
                      All      Cyt Ensembl67   Cyt UniProt   Mitochondrial   Kang2012 VicinityMit
ShortIS   DE Mean     1.7      0.41            -             -0.6            -
          DE Median   2.05     0.46            -             -0.65           -
          SeqNumb     2117     9               0             22              0
LongIS    DE Mean     0.46     1.05            0.9           -2.21           1.94
          DE Median   0.5      1.14            0.96          -2.27           2.2
          SeqNumb     21030    1664            127           34              13
25
Significantly enriched genes
Method            Short Insert   Long Insert
CuffDiff V2.0.1   988            908
Threshold         309            714
Intersection      22             99
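The "Intersection" row is the set of genes called by both methods. A minimal sketch, assuming each method yields a list of Ensembl gene IDs; the function name and the toy lists are illustrative and are not results from the analysis.

```python
def intersect_calls(cuffdiff_significant, threshold_enriched):
    """Genes reported both by CuffDiff and by the fold-change threshold."""
    return set(cuffdiff_significant) & set(threshold_enriched)


if __name__ == "__main__":
    # Toy lists only, reusing gene IDs that appear elsewhere in the slides.
    cuffdiff_calls = ["ENSG00000198899", "ENSG00000198840", "ENSG00000111640"]
    threshold_calls = ["ENSG00000198899", "ENSG00000198840", "ENSG00000212907"]
    print(sorted(intersect_calls(cuffdiff_calls, threshold_calls)))
```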
26
Back Up slides
27
Mitochondrial genome
28
Mitochondrial genome – first 3 genes
29
                                       Short insert                        Long insert
Ensembl Ids       Gene        Length   FPKM mit   FPKM tot   log2(tot/mit)   FPKM mit   FPKM tot   log2(tot/mit)
ENSG00000198695   MT-ND6      525      81         23         -1.81           1820       377        -2.27
ENSG00000198712   MT-CO2      684      459        169        -1.44           4063       764        -2.41
ENSG00000198727   MT-CYB      1141     159        144        -0.15           2332       504        -2.21
ENSG00000198763   MT-ND2      1042     172        59         -1.53           1559       285        -2.45
ENSG00000198786   MT-ND5      1812     129        58         -1.15           2153       437        -2.30
ENSG00000198804   MT-CO1      1542     154        58         -1.42           4186       766        -2.45
ENSG00000198840   MT-ND3      346      226        66         -1.77           2890       610        -2.24
ENSG00000198886   MT-ND4      1378     166        56         -1.56           3400       698        -2.28
ENSG00000198888   MT-ND1      956      150        92         -0.71           1183       233        -2.35
ENSG00000198899   MT-ATP6     681      94         26         -1.83           2357       431        -2.45
ENSG00000198938   MT-CO3      784      270        269        -0.01           2037       401        -2.34
ENSG00000209082   J01415.1    75       39041      34034      -0.20           56409      9045       -2.64
ENSG00000210049   J01415.2    71       179164     80467      -1.15           257938     55524      -2.22
ENSG00000210077   J01415.3    69       96298      67810      -0.51           2524440    682409     -1.89
ENSG00000210082   J01415.4    1559     1546       642        -1.27           HIDATA     27286      0.00
ENSG00000210100   J01415.5    69       10163      12512      0.30            63087      11058      -2.51
ENSG00000210107   J01415.6    72       75946      35617      -1.09           2191       455        -2.27
ENSG00000210112   J01415.7    68       171524     97116      -0.82           67897      22503      -1.59
ENSG00000210117   J01415.8    68       11418      7479       -0.61           7944       2424       -1.71
ENSG00000210127   J01415.9    69       1932       1427       -0.44           13615      3971       -1.78
ENSG00000210135   J01415.10   73       20509      12667      -0.70           1864       196        -3.25
ENSG00000210140   J01415.11   66       12550      7616       -0.72           77355      13629      -2.50
ENSG00000210144   J01415.12   66       9804       5234       -0.91           74448      11999      -2.63
ENSG00000210151   J01415.13   69       5078       1809       -1.49           NOTEST     NOTEST     NOTEST
ENSG00000210154   J01415.14   68       5943       3392       -0.81           1800       760        -1.24
ENSG00000210156   J01415.15   70       28619      32650      0.19            1734       345        -2.33
ENSG00000210164   J01415.16   68       5627       3232       -0.80           5572       1972       -1.50
ENSG00000210174   J01415.17   65       7569       10780      0.51            11149      4206       -1.41
ENSG00000210176   J01415.18   69       43092      28770      -0.58           150713     34863      -2.11
ENSG00000210184   J01415.19   59       1175590    395027     -1.57           735380     208681     -1.82
ENSG00000210191   J01415.20   71       67641      36817      -0.88           70081      14281      -2.29
ENSG00000210194   J01415.21   69       157602     115972     -0.44           603010     124182     -2.28
ENSG00000210195   J01415.22   66       71836      77279      0.11            19871      4777       -2.06
ENSG00000210196   J01415.23   68       45761      30983      -0.56           121678     15826      -2.94
ENSG00000211459   J01415.24   954      943        583        -0.69           HIDATA     29151      0.00
ENSG00000212907   MT-ND4L     297      412        141        -1.54           9230       1991       -2.21
ENSG00000228253   J01415.25   207      735        160        -2.20           36590      8531       -2.10
30
Ensembl Ids       Gene        Type             Status   Level
ENSG00000198695   MT-ND6      protein_coding   KNOWN    3
ENSG00000198712   MT-CO2      protein_coding   KNOWN    3
ENSG00000198727   MT-CYB      protein_coding   KNOWN    3
ENSG00000198763   MT-ND2      protein_coding   KNOWN    3
ENSG00000198786   MT-ND5      protein_coding   KNOWN    3
ENSG00000198804   MT-CO1      protein_coding   KNOWN    3
ENSG00000198840   MT-ND3      protein_coding   KNOWN    3
ENSG00000198886   MT-ND4      protein_coding   KNOWN    3
ENSG00000198888   MT-ND1      protein_coding   KNOWN    3
ENSG00000198899   MT-ATP6     protein_coding   KNOWN    3
ENSG00000198938   MT-CO3      protein_coding   KNOWN    3
ENSG00000209082   J01415.1    Mt_tRNA          NOVEL    3
ENSG00000210049   J01415.2    Mt_tRNA          NOVEL    3
ENSG00000210077   J01415.3    Mt_tRNA          NOVEL    3
ENSG00000210082   J01415.4    Mt_rRNA          KNOWN    3
ENSG00000210100   J01415.5    Mt_tRNA          NOVEL    3
ENSG00000210107   J01415.6    Mt_tRNA          NOVEL    3
ENSG00000210112   J01415.7    Mt_tRNA          NOVEL    3
ENSG00000210117   J01415.8    Mt_tRNA          NOVEL    3
ENSG00000210127   J01415.9    Mt_tRNA          NOVEL    3
ENSG00000210135   J01415.10   Mt_tRNA          NOVEL    3
ENSG00000210140   J01415.11   Mt_tRNA          NOVEL    3
ENSG00000210144   J01415.12   Mt_tRNA          KNOWN    3
ENSG00000210151   J01415.13   Mt_tRNA          NOVEL    3
ENSG00000210154   J01415.14   Mt_tRNA          NOVEL    3
ENSG00000210156   J01415.15   Mt_tRNA          NOVEL    3
ENSG00000210164   J01415.16   Mt_tRNA          NOVEL    3
ENSG00000210174   J01415.17   Mt_tRNA          NOVEL    3
ENSG00000210176   J01415.18   Mt_tRNA          NOVEL    3
ENSG00000210184   J01415.19   Mt_tRNA          NOVEL    3
ENSG00000210191   J01415.20   Mt_tRNA          NOVEL    3
ENSG00000210194   J01415.21   Mt_tRNA          KNOWN    3
ENSG00000210195   J01415.22   Mt_tRNA          NOVEL    3
ENSG00000210196   J01415.23   Mt_tRNA          NOVEL    3
ENSG00000211459   J01415.24   Mt_rRNA          KNOWN    3
ENSG00000212907   MT-ND4L     protein_coding   KNOWN    3
ENSG00000228253   J01415.25   protein_coding   KNOWN    3
31
Cellular metabolism regulation (E2C slide)
[Figure: Warburg effect and differentiation. Proliferative, undifferentiated cells: biosynthesis efficiency. Working, differentiated cells: energetic efficiency. First panel labels: Glucose, Pyruvate, Lactate, O2, CO2, Glycolysis (2 ATP), OXPHOS (36 ATP). Mitochondrial dysfunction panel labels: Glucose, Pyruvate, Lactate, Glycolysis (2 ATP), amino acids, nucleotides.]
Cells grown in different metabolic conditions might represent a unique way to distinguish RNA subpopulations expressed in mitochondria (ncRNA and … miRNA?)
MCF7 is a breast cancer cell line able to grow in OXPHOS conditions
32
Experimental design
[Figure: stable MCF-7 cell lines (AGB:CH3854, ATCC:HTB-22) kept for a minimum of 3 weeks in OXPHOS medium (0 mM glucose) or in high-glucose medium, giving the MCF7oxphos and MCF7 High Gluc lines; a Low Glucose label also appears. Shifts between day 0 (J0) and day 1 (J1): MCF7oxphos shifted to high glucose, and MCF7 High Gluc shifted to OXPHOS. Total-cell and mitochondrial extraction for each condition, N = 3 to 4 independent batches, followed by TLDA and RNA-seq.]
TLDA = Microfluidic miRNA qPCR
33
[Figure: Exon 1, Exon 2]