JOBIM 3 July 2012 · jobim2012.inria.fr/sources/slides/s14.pdf · Small non coding RNA...
JOBIM
3 July 2012
Chondrichthyans
Teleostomi
Scyliorhinus canicula (dogfish) genome sequencing
Ongoing project, started with Génoscope
3.5 Gbases,
Illumina paired-end sequencing, 32 x
Draft assembly : 3 449 662 contigs, N50 : 1 292 bp
Draft assembly Callorhinchus milii (elephant shark)
910 Mbases
Sanger + 454, 1.4 x, 633 833 contigs, N50 : 1 466 bp
Draft assembly Leucoraja erinacea (little skate)
3.42 Gbases,
Illumina paired-end, 26 x, 2 962 365 contigs, N50 : 665 bp
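The N50 values quoted for the three draft assemblies can be computed from a list of contig lengths. A minimal sketch, assuming the usual definition (the length of the shortest contig in the smallest set of longest contigs covering at least half of the assembly); the function name and the toy lengths are illustrative, not taken from the project:

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy lengths, for illustration only (not the dogfish assembly).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)
```

A low N50 such as 1 292 bp means that half of the assembled bases sit in contigs no longer than that, which is short compared with the gene-scale context needed around a pre-miRNA hairpin.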
Transcriptome project
Peptisan project
Sequencing done by Génoscope
Libraries for mRNA
Two normalised libraries (Non directional / directional)
Illumina paired-end sequencing (~412 M, ~316 M)
Poster on the transcriptome assembly (Pierre Pericard)
Two Small RNA libraries
Adult and Embryo libraries
Illumina paired-end sequencing, 51 nt reads
Goal : de novo identification of miRNA
Small non coding RNA
post-transcriptional regulators of mRNA transcripts
Discovery of lin-4 in C. elegans in 1993
Pre-miRNA structure
miRNA conservation
miR-143 miRNA * loop miRNA
Zebrafish .....GAUCUACAGUCGUCUGGCCCGCGGUGCAGUGCUGCAUCUCUGGUCAACUGGGAGUCUGAGAUGAAGCACUGUAGCUCGGGAGGACAACACUGUCAGCUC.....
Medaka UGGUUCUGGUCCAUCUCUGCUGCCCAUGGUGCAGUGCUGCAUCUCUGGUCAGUUGAUAGUCUGAGAUGAAGCACUGUAGCUCGGGACGGAGGGCAGGAGUCUCAGUCUG
Xenopus ............UGUCUCCCAGCCCAAGGUGCAGUGCUGCAUCUCUGGUCAGUUGUGAGUCUGAGAUGAAGCACUGUAGCUCGGGAAGGGGGAAU..............
Human .GCGCAGCGCCCUGUCUCCCAGCCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCAGGAAGAGAGAAGUUGUUCUGCAGC..
Mouse ......................CCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCAGG........................
Rat .GCGGAGCGCC.UGUCUCCCAGCCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCAGGAAGGGAGAAGAUGUUCUGCAGC..
Cow ......GCGUCCUGUCUCCCAGCCUGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGGGAGUCUGAGAUGAAGCACUGUAGCUCGGGAAGGGAGAAGUUGUUCUGCAGC..
Pig .............GUCCCCCAGCCGGAGGUGCAGUGCUGCAUCUCUGGUCAGCUGGGAGUCUGAGAUGAAGCACUGUAGCUCGGGAAGGGAGA................
Opossum ......................CCCGAGGUGCAGUGCUGCAUCUCUGGUCAGUUGUGAGUCUGAGAUGAAGCACUGUAGCUCGGG........................
Lizard ...........AUGUCUCCCAGCCCAAGGUGCAGUGCUGCAUCUCUGGUCAGUUGUGAGUCUGAGAUGAAGCACUGUAGCUCGGGAAGGGAGGAAC.............
GAGUAAA UA UA GA U
5’ CCUUG G GCAGCACA AUGGUUUGUG UU U
||||| | |||||||| |||||||||| || G
3’ GGAAC C CGUCGUGU UACCGGACGU AA A
AUAAAAA UC UA GG A
miRNA*
miRNA
Illumina paired-end sequencing : Adult and Embryo libraries
Data cleaning : PRINSEQ, Flash, cutadapt
Discarded : sequences < 17 nt or > 27 nt ; no adaptors ; rRNA, tRNA, ncRNA (Rfam)
Kept : high-quality sequences, 17 – 27 nt
miRDeep2, with the S. canicula draft genome and miRBase 18.0
Putative miRNA : mature, star, pre-miRNA
Validation : MIReNA, CIDmiRNA, Triplet-SVM, miRNAPred, miRNA SVM, MFE, randfold, PHDcleav
Conservation : C. milii genome, R. erinacea genome
Cleaning
Prediction
Validation
@PHOSPHORE_0144:8:1101:1512:2663#GGCUAC/1
UUCCCAAGACUGUGAAACCCUU UGGAAUUCUCGGGUGCCAAGGAACUCCAG
@PHOSPHORE_0144:8:1101:1699:2666#GGCUAC/1
AGGGCCCGGAUAGCUCAGUCGGUAG UGGAAUUCUCGGGUGCCAAGGAACUC
@PHOSPHORE_0144:8:1101:1503:2691#GGCUAC/1
GAAUACCAGGUGCAGUAGGCUU UGGAAUUCUCGGGUGCCAAGGAACUCCAG
@PHOSPHORE_0144:8:1101:1512:2663#GGCUAC/2
AAGGGUUUCACAGUCUUGGGAA GAUCGUCGGACUGUAGAACUCUGAACGUG
@PHOSPHORE_0144:8:1101:1699:2666#GGCUAC/2
CUACCGACUGAGCUAUCCGGGCCCU GAUCGUCGGACUGUAGAACUCUGAAC
@PHOSPHORE_0144:8:1101:1503:2691#GGCUAC/2
AAGCCUACUGCCCCUGGUAUUC GAUCGUCGGACUGUAGAACUCUGAACGUG
UUCCCAAGACUGUGAAACCCUU UGGAAUUCUCGGGUGCCAAGGAACUCCAG
CACGUUCAGAGUUCUACAGUCCGACGAUC UUCCCAAGACUGUGAAACCCUU
AGGGCCCGGAUAGCUCAGUCGGUAG UGGAAUUCUCGGGUGCCAAGGAACUC
GUUCAGAGUUCUACAGUCCGACGAUC AGGGCCCGGAUAGCUCAGUCGGUAG
GAAUACCAGGUGCAGUAGGCUU UGGAAUUCUCGGGUGCCAAGGAACUCCAG
CACGUUCAGAGUUCUACAGUCCGACGAUC GAAUACCAGGGGCAGUAGGCUU
• PRINSEQ (Schmieder and Edwards 2011 Bioinformatics)
• Cutadapt (Martin 2011. EMBnet.journal)
• Flash (Magoč and Salzberg 2011 Bioinformatics)
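The cleaning idea shown on the example reads above (remove the adaptor, check that the two mates agree, keep inserts of 17 – 27 nt) can be sketched in plain Python. This is an illustration only, not how PRINSEQ, Cutadapt or Flash are implemented or invoked: the adaptor strings are read off the constant part of the example reads, and the function names and the 10-nt exact-match rule are assumptions.

```python
# Constant sequence following the insert in the /1 reads (assumed 3' adaptor).
ADAPTER_3P = "UGGAAUUCUCGGGUGCCAAGG"
# Constant sequence following the insert in the /2 reads (assumed adaptor, reverse strand).
ADAPTER_5P_RC = "GAUCGUCGGACUGUAGAACUCUGAAC"


def reverse_complement(seq):
    """Reverse-complement an RNA sequence written with A, C, G, U."""
    return seq.translate(str.maketrans("ACGU", "UGCA"))[::-1]


def trim_adapter(read, adapter, min_overlap=10):
    """Return the insert before the first exact match of the adaptor prefix, or None."""
    start = read.find(adapter[:min_overlap])
    return read[:start] if start != -1 else None


def in_size_range(seq, shortest=17, longest=27):
    """Keep only inserts whose length is compatible with a mature miRNA."""
    return shortest <= len(seq) <= longest


# First read pair from the slide (insert and adaptor joined).
read1 = "UUCCCAAGACUGUGAAACCCUU" "UGGAAUUCUCGGGUGCCAAGGAACUCCAG"
read2 = "AAGGGUUUCACAGUCUUGGGAA" "GAUCGUCGGACUGUAGAACUCUGAACGUG"

insert1 = trim_adapter(read1, ADAPTER_3P)
insert2 = trim_adapter(read2, ADAPTER_5P_RC)
pair_agrees = insert1 == reverse_complement(insert2)
kept = pair_agrees and in_size_range(insert1)
```

An exact-match search is the simplest possible rule; real trimmers tolerate mismatches, which matters for pairs like the third example, where the two mates differ at one position.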
Cleaning
                 Embryo        Adult         All
Initial reads    89,766,100    81,179,402    170,945,502
Cleaned reads    82,325,424    65,651,400    147,976,824
[Figure : y-axis labelled Frequency]
miR-143-3p
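The share of reads surviving the cleaning step follows directly from the initial and cleaned read counts reported for the two libraries. A short sketch of that arithmetic; the variable names are illustrative:

```python
# Read counts as reported on the slide: (initial reads, cleaned reads).
read_counts = {
    "Embryo": (89_766_100, 82_325_424),
    "Adult": (81_179_402, 65_651_400),
    "All": (170_945_502, 147_976_824),
}

# Percentage of reads kept after cleaning, per library.
retained_pct = {
    library: 100.0 * cleaned / initial
    for library, (initial, cleaned) in read_counts.items()
}

for library, pct in retained_pct.items():
    print(f"{library}: {pct:.1f}% of reads kept")
```

The adult library loses a noticeably larger share of its reads during cleaning than the embryo library.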
miRDeep2 : Friedländer et al. 2008 Nature Biotechnology
Prediction
Pre-miRNA Structural information:
miRNA and miRNA* information:
both miRNA and miRNA*
Overexpression of the miRNA vs miRNA*
Overhang (around 2 nt)
Sequence conservation
Modification to miRDeep2
Variability of miRDeep2 results related to randfold
Putative new miRNA
2445 new miRNA with score >= 0
1103 new miRNA with score >= 5 with 10% expected false positives
Conserved miRNA
170 miRNA identified similar to other species
15 rejected after manual inspection (2 with score > 5)
155 good known miRNA (21 with score < 5)
NNNUNNNNNANNNUNNNNNNCUNNNNNNNANNNNGANGNU
GUUNCAGGGNACANUCAACGNNGUCGGUGNGUUUNNUNCNA
|||N|||||N|||N||||||NN|||||||N||||NN|N|
CGANGUUCCNUGUNAGUUGCNNCAGCUACNCAAANNANGNU
NNNUNNNNNANNNUNNNNNN--NNNNNNN-NNNNG-NGNU
contig_452580_14256NNNNNNNNNNNNNNNNNNNNNNNNNNNNNAACAUUCAACGCUGUCGGUGAGUNNNNNNNNNNNNNNNNNACCAUCGACCGUUGAUUGUACC
NNNNNNNNNNNNNNNNNNNNGUUUCAGGGAACAUUCAACGCUGUCGGUGAGUUUGAUGCUAUUGGAGAAACCAUCGACCGUUGAUUGUACCUUGUAGC
GAAUUCUGCUUCGAAUGGUUGCUUCAGUGAACAUUCAACGCUGUCGGUGAGUUUGGAAUUAAAGUAGAAACCAUCGACCGUUGAUUGUACCCUGCGGCAACCACCGUCCU
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNAACAUUCAACGCUGUCGGUGAGUNNNNNNNNNNNNNNNNNACCAUCGACCGUUGAUUGUACC
oan-mir-181a (Ornithorhynch)
GCUU AA U U A U CU A GGAAU
CG UGGUUGCU CAG G ACA UCAACG GUCGGUG GUUU U
|| |||||||| ||| | ||| |||||| ||||||| |||| A
GC ACCAACGG GUC C UGU AGUUGC CAGCUAC CAAA A
UCCU -C C C A U -- - GAUGA
Comparison of conserved miRNA with other species
C. milii (elephant shark) and L. erinacea (little skate)
131 identified in C. milii, 152 identified in L. erinacea, 154 altogether
Previously identified chondrichthyan miRNA (Heimberg et al. 2011)
104 S. canicula miRNA mapped on C. milii scaffolds
All 104 miRNA were identified in S. canicula
miRNA* loop miRNA
sca-mir-301 UGUCGGAGGCUCUGACGAUAUUGCACUACUGUACUCACAGU-UAAGCAGUGCAAUAGUAUUGUCAAAGCGUCAGGCACC
cmi-mir-301 UGUCGGAGGCUCUGACGAUAUUGCACUACUGUCCUCACCGU-UAAGCAGUGCAAUAGUAUUGUCAAAGCGUCAGGCAAC
ler-mir-301 UGUCGGGCGCUCUGACGAUAUUGCACUACUGUCCGCACAGCUAAAGCAGUGCAAUAGUAUUGUCAAAGCGUCAGGCACC
hsa-mir-301a ACUGCUAACGAAUGCUCUGACUUUAUUGCACUACUGUACUUUACAG-CUAGCAGUGCAAUAGUAUUGUCAAAGCAUCUGAAAGCAGG
mmu-mir-301a CCUGCUAACGGCUGCUCUGACUUUAUUGCACUACUGUACUUUACAG-CGAGCAGUGCAAUAGUAUUGUCAAAGCAUCCGCGAGCAGG
pma-mir-301a CUUGCAAGCCCCUGCUGGAGGCUCUGACACCAUUGCACUACUGUACGCAAUGG-UGAGCAGUGCAAUUGUAUUGUCAAAGCUUCCGUCGGUGAGCCCA
G G C --- A GU U
UGUC GA GCU UGACGAUAU UGCACU CU AC C
|||| || ||| ||||||||| |||||| || || A
ACGG CU CGA ACUGUUAUG ACGUGA GA UG C
A G A AUA C AU A
miRBase miRNA not in data set
blastn of all miRBase miRNA against genome assembly
24 potential new conserved miRNA
2 identified by miRDeep2 but not identified as conserved
23444 522851
AAAG-UUCUGUCAUACACUCAGGCU UCAGUGCAUCACAGAACUUUGA
contig_3412856_61753 CUCGAGCUAAAG-UUCUGUCAUACACUCAGGCUGCAGAUACACA-AGGUCAGUGCAUCACAGAACUUUGAUUCGGG
rno-mir-148b UUGAGGUGAAG-UUCUGUUAUACACUCAGGCUGUGGCU-CUGA-AAGUCAGUGCAUCACAGAACUUUGUCUCG
cmi CCCAAGCUGAAG-UUCUGUCAUACACUCAGGCUGUAGCUAAUGG-AAGUCAGUGCAUCACAGAACUUUGACUCGAGAU
ler CUCAAGCCAAAGGUUCUGUCAUACACUUUGGCUCUGUCGCUGGG-AAGUCAGUGCAUGACAGAACUUUG
C C A CA GCAGA
CUCGAG UAAAGUUCUGU AU CACU GGCU U
|||||| ||||||||||| || |||| ||||
GGGCUU GUUUCAAGACA UA GUGA CUGG A
A C C -- AACAC
1425623 19236
UGAGAACUGAAUUCCAUGGGC UCCAUAGUAGACAGUUCUCCAG
contig_2512524_51750 UUCCCAGCUAUGAGAACUGAAUUCCAUGGGCUGGUUGCACACUUUAUUUC-UCAGUCCAUAGUAGACAGUUCUCCAGCUUGGCUGCU
gga-mir-146c-1 UUCCCAGCUCUGAGAACUGAAUUCCAUGGACUGGUUUCAAUUCCAUGCGU-UCAGUCCAUGGUAUUCAGUUCUCUAGCUUGGCUGC
cmi CCAGCUGUGAGAACUGAAUUCCAUGGGCUGGUCACGCAGUUUUCUUCCUCAGUCCAUAGUAGUCAGUUCUUCCGUUUGGCUGCU
ler UUCCUGGCUCUGAGAACUGAAUUCCAUGGGCUGGUUGUUCACAUUAUUUC-UCAGUCCAUAGUAG-CAGUUCUCCGGCUUGGCUGCU
---UUCCCA AU AAUUCC UUGCACA
GCU GAGAACUG AUGGGCUGG C
||| |||||||| |||||||||
CGA CUCUUGAC UACCUGACU U
UCGUCGGUU C- AGAUGA CUUUAUU
Validation
Several potential tools to validate miRNA predictions
MIReNA (Mathelier and Carbone 2010 Bioinformatics)
Microprocessor SVM : prediction of Drosha cleavage site (Helvik et al. 2007 Bioinformatics)
PHDCleav : prediction of Dicer cleavage site (http://www.imtech.res.in/raghava/phdcleav)
Randfold : mono / dinucleotide and markov randomisation (Bonnet et al. 2004, Bioinformatics)
PlantMiRNAPred : ath 82.65%, hsa 85.77% (http://nclab.hit.edu.cn/PlantMiRNAPred)
…
Evaluate tool accuracy
Robust control data set (Ritchie et al. 2012 Bioinformatics)
129 positive controls : M. musculus miRNA with associated publications
682 negative controls : from NGS samples, validated as non-miRNA
Conserved miRNA identified with miRDeep
miRNA validation tools
                         S. canicula                  Control data set
Tool                     Sensitivity  Specificity     Sensitivity  Specificity
miRDeep2                 87.1%        86.7%           77.5%        99.1%
Plant-miRNAPred          94.8%        80.0%           97.7%        75.4%
MIReNA                   91.6%        86.7%           95.3%        92.4%
RNA-fold (MFE)           95.5%        73.3%           96.1%        56.5%
Randfold d 999           94.2%        86.7%           87.6%        96.0%
Randfold m 999           81.3%        93.3%           71.3%        99.9%
Randfold s 999           96.1%        86.7%           95.3%        94.9%
triplet_SVM              92.9%        73.3%           86.8%        91.5%
Microprocessor SVM       57.4%        100.0%          64.3%        98.8%
PHDcleav                 72.9%        86.7%           64.3%        68.9%
Blastn other species     99.4%        46.7%           88.4%        92.8%
CIDmiRNA                 93.5%        86.7%           93.8%        95.2%
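Sensitivity and specificity in the table are the fractions of positive and negative controls classified correctly. A minimal sketch of the two formulas on the control data set (129 positives, 682 negatives): the true-positive and true-negative counts below are hypothetical, chosen only because they are consistent with the percentages reported for miRDeep2; the slides do not give raw counts.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real miRNA that the tool accepts."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-miRNA that the tool rejects."""
    return true_neg / (true_neg + false_pos)


# Control data set sizes from the slide.
N_POSITIVE, N_NEGATIVE = 129, 682

# Hypothetical counts, back-calculated to match the miRDeep2 row (77.5% / 99.1%).
true_pos, true_neg = 100, 676

sens = sensitivity(true_pos, N_POSITIVE - true_pos)
spec = specificity(true_neg, N_NEGATIVE - true_neg)
print(f"sensitivity {sens:.1%}, specificity {spec:.1%}")
```

With only a handful of rejected S. canicula predictions serving as negatives, each misclassified one moves the S. canicula specificity by several points, so the control data set columns are the more stable estimate.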
Combinations of all tools
Conserved miRNA passing all tests : 83 / 155
Which criteria and thresholds to apply ?
Tool                    contig_2184464_47128  contig_1435315_35146  contig_2147172_46625  contig_1446688_35335
miRDeep                 1.9                   50529.5               4.7                   25916.3
Plant-miRNAPred         1                     1                     -1                    1
MIReNA                  -1                    1                     -1                    1
RNAfold MFE             -19.8                 -33.2                 -24.1                 -35.3
randfold d 999          0.90%                 0.10%                 8.70%                 0.10%
randfold m 999          2.20%                 0.30%                 14.80%                0.10%
randfold s 999          0.10%                 0.10%                 1.60%                 0.10%
Triplet SVM             1                     1                     1                     1
microSVM                -0.90                 -0.04                 -1.32                 0.52
PHDcleav                1.28                  0.25                  2.01                  2.37
Blastn other species    1                     1                     1                     1
CIDmiRNA                -1                    1                     -1                    1
46910 1
contig_2147172_46625 UGUGGUGAACUAGCAGCACAUAAUGGUUUGUGAGUUGUAUGGAGAUGCAGGCCACAUUGUGCUGCCACAUGAAC
hsa-miR-15a CCUUGGAGUAAAGUAGCAGCACAUAAUGGUUUGUGGAUUUUGAAAAGGUGCAGGCCAUAUUGUGCUGCCUCAAAAAUACAAGG
GGUGAACUA UAA GA GU GAGUAAA UA UA GA U
GCAGCACA UGGUUUGU GUU A CCUUG G GCAGCACA AUGGUUUGUG UU U
|||||||| |||||||| ||| U ||||| | |||||||| |||||||||| || G
CGUCGUGU ACCGGACG UAG G GGAAC C CGUCGUGU UACCGGACGU AA A
CAAGUACAC UAC -- AG AUAAAAA UC UA GG A
Support Vector Machine
Supervised learning method that analyses data and recognises patterns ; used for
classification and regression analysis
Given a set of input data, it predicts for each input which of two possible
classes it belongs to : a non-probabilistic binary linear classifier.
Parameters : C-SVC type with polynomial kernel
What is the best combination of tools ?
Try all possible combinations of the validation tools / parameters
4095 combinations, 1 optimum with the minimum number of tools.
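The figure of 4095 matches the number of non-empty subsets of twelve tools (2^12 - 1). A sketch of the exhaustive search, assuming that reading; the tool names come from the slides, but `best_subset` and its scoring function are hypothetical stand-ins for whatever criterion was actually optimised (the slides suggest an SVM trained on the tool outputs).

```python
from itertools import combinations

TOOLS = [
    "miRDeep2", "Plant-miRNAPred", "MIReNA", "RNAfold MFE",
    "Randfold d", "Randfold m", "Randfold s", "Triplet-SVM",
    "Microprocessor SVM", "PHDcleav", "Blastn", "CIDmiRNA",
]


def nonempty_subsets(items):
    """Yield every combination of one or more items."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def best_subset(items, score):
    """Highest-scoring subset; ties are broken in favour of fewer tools."""
    return max(nonempty_subsets(items), key=lambda s: (score(s), -len(s)))


n_combinations = sum(1 for _ in nonempty_subsets(TOOLS))

# Toy scoring function on toy labels, only to show the tie-breaking rule.
toy_best = best_subset(["A", "B", "C"], lambda s: ("A" in s) + ("B" in s))
```

Breaking ties towards the smaller subset mirrors the stated goal of an optimum with the minimum number of tools.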
Tools : MIReNA, CIDmiRNA, Triplet-SVM, Blastn, miRNAPred, micro SVM, MFE, Randfold m, PHDcleav, Randfold d, Randfold s, miRDeep2
S. canicula : sensitivity 100.0%, specificity 93.3%
Control data set : sensitivity 96.9%, specificity 99.0%
Supplementary filters
Remapping of the reads on the hairpin with no mismatch
At least 5 sequences corresponding to the mature miRNA
Remove predictions with fragments in the loop or at the 3’ and 5’ ends of the pre-miRNA
968 potential new miRNA
155 conserved miRNA + 24 conserved but not in the data set
Accurate miRNA set for S. canicula
Phylogenetic analysis
Chondrichthyan-specific genes
When the genome is available
Analysis to be redone
Compare with CDS to remove contamination
Target prediction
Differential expression Adult / Embryo
piRNA identification
Transcriptome / small RNA studies were supported by the environmental and
functional genomics CPER research initiative and by PEPTISAN project funding
from the Bretagne region. Thanks to FASTERIS and Genoscope for RNA
library construction and sequencing.
The Scyliorhinus canicula genome sequencing project is done in collaboration
with Genoscope.
Thanks to the organisers of JOBIM
Thanks for your attention