Post on 10-May-2015
Improving and validating the Atlantic Cod genome assembly using error-corrected
as well as raw PacBio reads
Lex Nederbragt, NSC and CEES
lex.nederbragt@ibv.uio.no
@lexnederbragtOK
Acknowledgements
University of Oslo
Sequencing team NSC
Ole Kristian Tøressen
Kjetill Jakobsen
Sissel Jentoft
Cod genome group
Jason Miller, JCVI
Pacific Biosciences
The Atlantic cod genome project
Cod: the genome
850 million bases (Mbp)
Heterozygote
‘Wild-caught’
Cod: phase 1
(Sanger sequencing)
454 sequencing
N50
50% of the genome is in contigs at least as large as the N50 value
Courtesy of Michael Schatz, CSHL
[Figure: N50 example for a 1000 bp genome; values shown: 445, 520, 400, 490; labels: N50, Sum]
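The N50 definition above can be made concrete with a minimal Python sketch. The function name `n50` and the example contig lengths are illustrative assumptions, not values from the slides, and the sketch uses the total assembly length (rather than an estimated genome size) as the denominator.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together contain at least half of the total bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (bp), for illustration only.
example = [300, 100, 45, 45, 30, 20, 15, 15, 10, 10]
print(n50(example))
```

The same function applies to scaffold lengths, which is how a contig N50 and a scaffold N50 can be reported side by side for one assembly.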
Cod: phase 1
(Sanger sequencing)
454 sequencing
Phase 1 assembly
157 887 sequences
753 Mbp of 830 Mbp
Scaffold N50: 460 kbp
Contig N50: 2.8 kbp
[Figure: a scaffold made of contigs separated by gaps]
Cod: phase 1
6467 scaffolds
35% gap bases
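A gap percentage like the one above can be estimated from the scaffold sequences themselves. The sketch below assumes that gaps are written as runs of `N` characters, which is a common convention but is not stated in the slides; the function name and the toy sequences are hypothetical.

```python
def gap_fraction(scaffolds):
    """Fraction of scaffold bases that are gap characters (N or n)."""
    total = sum(len(seq) for seq in scaffolds)
    if total == 0:
        return 0.0
    gaps = sum(seq.count("N") + seq.count("n") for seq in scaffolds)
    return gaps / total


# Toy scaffolds, for illustration only.
toy = ["ACGTNNNNACGT", "ACGTACGT"]
print(f"{gap_fraction(toy):.0%} gap bases")
```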
The causes
Short Tandem Repeats (>20% of gaps)
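To see what such repeats look like in sequence data, here is a small sketch that scans for dinucleotide repeats such as (TG)n with a regular expression. It is an illustration only: the function name, the `min_units` threshold and the test sequence are assumptions, and it is not the method used in the project.

```python
import re


def find_dinucleotide_repeats(seq, min_units=5):
    """Yield (start, end, unit) for runs of a 2-base unit repeated
    at least min_units times (min_units should be 2 or more)."""
    # ([ACGT]{2}) captures a unit; \1{n,} requires n further copies of it.
    pattern = re.compile(r"([ACGT]{2})\1{%d,}" % (min_units - 1))
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        if unit[0] != unit[1]:  # skip homopolymers such as AAAA
            yield m.start(), m.end(), unit


# Toy sequence with a (TG)6 run, for illustration only.
toy_seq = "ACCA" + "TG" * 6 + "CCAT"
print(list(find_dinucleotide_repeats(toy_seq)))
```

The reported unit depends on where the run starts, so a (TG)n run can also be reported as (GT)n when it is preceded by a G.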
The causes
Heterozygosity?
[Figure: contig 1, polymorphic contig 2, polymorphic contig 3, contig 4]
Cod: phase 2
New data
Illumina sequencing
Paired end >200x
Mate pair 5 kb >100x
Improved/new software
23 pseudochromosomes
Below 5% gap bases
Longer contigs
Cod: phase 2 goal
Phase 2 goal
Scaffold N50 1 Mbp
Contig N50 15 kbp
Cod: phase 2 programs
Zhang et al. PLoSOne 2011
Cod phase 2: status
Assembly                 Contig N50   Gaps   Scaffold N50
Goal                     15 kbp       <5%    1.5 Mbp
Celera, 454 + Illumina   9 kbp        5%     too short
Newbler, 454             6 kbp        24%    OK
Enter PacBio
Large Insert Sizes
Sequencing
Aim for looooong insert sizes
Photo: Tore Oldeide Elgvin
147 SMRT Cells
Chemistry   Coverage   Av. raw length
C2          9.2x       3.0 kb
C2-XL       3.2x       4.6 kb
XL-XL       3.5x       5.1 kb
TOTAL       15.9x
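The coverage figures in the table can be checked and converted into an approximate amount of sequence. This is a rough sketch: it assumes the 830 Mbp genome size quoted for the phase 1 assembly (the slides also mention 850 Mbp), and the names used are hypothetical.

```python
GENOME_SIZE_BP = 830_000_000  # assumption: genome size quoted for the phase 1 assembly

# Coverage per chemistry, copied from the table above.
coverage_by_chemistry = {"C2": 9.2, "C2-XL": 3.2, "XL-XL": 3.5}


def total_coverage(coverages):
    """Sum the per-chemistry coverage values."""
    return sum(coverages.values())


def bases_for_coverage(coverage, genome_size_bp):
    """Coverage = sequenced bases / genome size, so bases = coverage * genome size."""
    return coverage * genome_size_bp


total = total_coverage(coverage_by_chemistry)
print(f"total coverage: {total:.1f}x")
print(f"approx. bases sequenced: {bases_for_coverage(total, GENOME_SIZE_BP) / 1e9:.1f} Gbp")
```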
Error-correction
Celera Assembler merTrim: 27x + 234x
PacBioToCA (Koren et al.): 13.7x + 27x
9x (67%) recovered
Using PacBio reads
PacBio reads for cod
                       Error-corrected reads   Raw reads
Assembly improvement   Celera                  PBJelly
Assembly validation    blasr                   blasr, bridgemapper
De novo assembly       Celera
Assembly improvement: corrected reads
Assembly                       N50      Gaps
Goal                           15 kbp   <5%
Celera, 454 reads              9 kbp    5%
+ corrected PacBio + PBJelly   11 kbp   1.5%
Assembly improvement: raw reads
Assembly                 N50      Gaps
Goal                     15 kbp   <5%
Newbler, 454             6 kbp    24%
+ raw PacBio + PBJelly   30 kbp   20%
Assembly improvement: raw reads
Assembly                 N50      Gaps
Goal                     15 kbp   <5%
Celera, 454 + Illumina   9 kbp    5%
+ raw PacBio + PBJelly   46 kbp   1.5%
Too good to be true?
Assembly validation
[Figure tracks: sequence, aligned raw PacBio reads, coverage, aligned corrected PacBio reads]
Assembly validation
[Figure tracks: raw PacBio reads, corrected PacBio reads]
(TG)n repeat, (TG)n repeat
308 bp gap
Newbler scaffold
Assembly validation
[Figure track: raw PacBio reads]
(AG)n repeat
939 bp gap
Newbler scaffold
Heterozygous region
Assembly validation
[Figure track: raw PacBio reads]
Celera scaffold
Misassembly?
Assembly validation: bridgemapper (beta)
Structural variation, misassemblies
Split alignments
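The idea of a split alignment can be sketched in a few lines: a read is flagged when its aligned pieces land on different scaffolds, or far apart on the same one. This is not bridgemapper's actual algorithm. The `Alignment` record, the `max_jump` threshold and the toy data are all hypothetical, and the sketch ignores read coordinates and strand.

```python
from collections import namedtuple

# Hypothetical, simplified record: one aligned piece of a read.
Alignment = namedtuple("Alignment", "read target t_start t_end")


def find_split_reads(alignments, max_jump=1000):
    """Return names of reads whose pieces map to different targets,
    or to positions on one target further apart than max_jump."""
    by_read = {}
    for aln in alignments:
        by_read.setdefault(aln.read, []).append(aln)

    split = set()
    for read, pieces in by_read.items():
        # Order pieces by target and position, then compare neighbours.
        pieces.sort(key=lambda a: (a.target, a.t_start))
        for left, right in zip(pieces, pieces[1:]):
            if left.target != right.target or right.t_start - left.t_end > max_jump:
                split.add(read)
    return sorted(split)


# Toy alignments, for illustration only.
toy_alignments = [
    Alignment("r1", "s1", 0, 500), Alignment("r1", "s1", 520, 900),
    Alignment("r2", "s1", 0, 400), Alignment("r2", "s2", 10, 300),
    Alignment("r3", "s1", 0, 300), Alignment("r3", "s1", 5000, 5400),
]
print(find_split_reads(toy_alignments))
```

Reads flagged this way are only candidates: the same signal can come from real structural variation or from a misassembly, so each case still needs inspection.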
bridgemapper (beta) on E. coli
Positions in the contig are color-coded
Illumina + Velvet
bridgemapper (beta) on cod
s05514
2510 bp gap
Reads point to a 2350 bp scaffold
bridgemapper (beta) on cod
s08737
2145 bp gap
Reads point to a 3 kbp scaffold
Assembly with error-corrected reads
Assembly                            Contig N50   Gaps   Scaffolds
Goal                                15 kbp       <5%
Celera assembly                     9 kbp        5%     too short
CA + corrected PacBio + 454 mates   8 kbp        2%     very short
1.4 times genome size: underassembled
The improved Atlantic cod genome: status
http://en.wikipedia.org
Newbler plus Celera
[Figure: scaffolds made of contigs separated by gaps]
Celera: Long contigs, short scaffolds
Newbler: Short contigs, long scaffolds
Combined: Long contigs, long scaffolds
Slide courtesy of Ole Kristian Tøressen
Adding PacBio
Using PBJelly
[Figure labels: contig, scaffold, PacBio reads, closed gap, reduced gap]
Slide courtesy of Ole Kristian Tøressen
Polishing the assembly
454 and Illumina reads
Slide courtesy of Ole Kristian Tøressen
[Figure labels: contig, scaffold]
Contig N50: 30-40 kbp
Scaffold N50: 1-1.5 Mbp
Image by Mathieu Thouvenin http://www.flickr.com/photos/mathoov/4681491052/
PacBio reads for cod
                       Error-corrected reads   Raw reads
Assembly improvement   PBJelly                 PBJelly
Assembly validation    blasr                   blasr, bridgemapper
De novo assembly       Celera                  Celera
Assembly
Assembly                            Contig N50   Gaps   Scaffolds
Goal                                15 kbp       <5%
CA + corrected PacBio + 454 mates   8 kbp        2%     very short
CA + raw PacBio reads + 454 mates   38 kbp       <1%    very short
1.6 times genome size: underassembled
Lessons learned from PacBio reads
Heterozygous: large polymorphism (100s of bases)
Heterozygous: large indel (100s of bases)
[Figure labels: homozygous, homozygous, homozygous]
Cod genome
Atlantic cod version 2
23 pseudochromosomes
Below 5% gap bases
Longer contigs
New annotation
From observation to insight
Mathias Bigge, Ricordisamoa, others (wikimedia commons)
We need better programs