The first near-complete assembly of the hexaploid bread wheat genome,
Triticum aestivum
Daniela Puiu, Aleksey Zimin, Richard Hall, Sarah Kingan, Bernardo Clavijo, Steven Salzberg
ICG-12, Oct 27 2017
IGC-12The Wheat Genome 2
Sequencing and Assembly of the Ancestral and Common Wheat
Ancestral wheat: Aegilops tauschii ssp. strangulata, accession AL8/78
Common wheat: Chinese Spring variety (CS42, accession Dv418)
2013-2017
History of Wheat
~8,000 years ago: spontaneous hybridization
Emmer Wheat + Goat grass = Bread Wheat (World's 3rd cereal crop)
Triticum turgidum + Aegilops tauschii = Triticum aestivum
AABB + DD = AABBDD
Whole Genome => Assisted Breeding => Improved Yield
The Wheat Genome
One of the most complex genomes!
1) Genome size: over 15 billion bases
2) Allohexaploid: six copies of each chromosome
3) >90% repeats
Multiple past attempts to assemble => assemblies shorter than the estimated genome size.
New vs Previous Assemblies
[Figure: N50 of Triticum 3.1 (232K) compared with previous assemblies]
Data Reduction
Original reads   Number   Sum     Coverage   Accuracy
Illumina         7.06G    1Tb     65x        99.5%
PacBio           55.5M    545Gb   36x        87.5%

Processed seq.   Number   Sum     Coverage   Accuracy
super-reads      95.7M    31Gb    2x         99.95%
mega-reads       57M      278Gb   18x        99.65%

MaSuRCA mega-reads hybrid correction
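The coverage figures on this slide are consistent with dividing the total sequenced bases by the genome size. A minimal sketch of that arithmetic, assuming a genome size of about 15.3 Gb (roughly the Triticum 3.1 total size reported in the Assembly Statistics slide; the value actually used for the slide is not stated). The helper name `coverage` is hypothetical.

```python
# Assumed genome size in bases: roughly the Triticum 3.1 assembly total.
# The exact value behind the slide's coverage column is not stated.
GENOME_SIZE = 15.3e9


def coverage(total_bases: float, genome_size: float = GENOME_SIZE) -> float:
    """Return fold coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


# Total bases per data set, taken from the "Sum" column of the slide.
datasets = {
    "Illumina": 1e12,
    "PacBio": 545e9,
    "super-reads": 31e9,
    "mega-reads": 278e9,
}

for name, bases in datasets.items():
    print(f"{name}: {coverage(bases):.1f}x")
```

Under this assumption the rounded results agree with the slide's 65x, 36x, 2x and 18x.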
MaSuRCA mega-reads Correction
Assembly Pipeline
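Hybrid branch: Illumina + PacBio => MaSuRCA Correction => Mega-reads => Celera WGS Assembler => Triticum 1.0 => Remove Duplicates => Triticum 2.0
PacBio branch: PacBio => FALCON Correction => pReads => FALCON Assembler => FALCON Trit 0.5 => Arrow Polishing => FALCON Trit 1.0
Final: Triticum 2.0 + FALCON Trit 1.0 => k-mer Analysis => Merge => Triticum 3.1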
k-mer Analysis
[Figure: number of k-mers missing from the PacBio assembly only (y-axis, 10M to 50M) plotted against 31-mer frequencies (x-axis)]
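One reading of "k-mers missing from the PacBio assembly only" is a set difference: k-mers found in a reference set (for example the hybrid assembly) but absent from the PacBio assembly. The sketch below illustrates that idea on toy sequences; it is an assumption about the analysis, not the authors' pipeline. The function names are hypothetical, and a genome of this size would need a dedicated k-mer counter rather than in-memory Python sets.

```python
# Translation table for reverse-complementing DNA.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq: str, k: int = 31) -> set[str]:
    """Return the canonical k-mers of seq.

    A canonical k-mer is the lexicographically smaller of the k-mer
    and its reverse complement, so both strands count as the same k-mer.
    """
    seq = seq.upper()
    result = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        revcomp = kmer.translate(_COMPLEMENT)[::-1]
        result.add(min(kmer, revcomp))
    return result


def missing_kmers(reference_seqs, assembly_seqs, k: int = 31) -> set[str]:
    """Return k-mers present in reference_seqs but absent from assembly_seqs."""
    reference = set()
    for seq in reference_seqs:
        reference |= canonical_kmers(seq, k)
    assembly = set()
    for seq in assembly_seqs:
        assembly |= canonical_kmers(seq, k)
    return reference - assembly


# Toy example with k=4; the talk uses 31-mers.
hybrid_contigs = ["ACGTACGGA"]
pacbio_contigs = ["ACGTAC"]
print(sorted(missing_kmers(hybrid_contigs, pacbio_contigs, k=4)))
```

Here the two k-mers covering the extra `GGA` tail of the first sequence are reported as missing from the second.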
Assembly Merge
Merging of the Hybrid and PacBio assemblies
[Figure: a Triticum 2.0 contig aligned against FALCON contig A and FALCON contig B, with segments labeled >5Kb; the merged result is Triticum 3.1]
Assembly Statistics
Assembly          Number of contigs   Total size (bp)   N50 size (bp)
Triticum 2.0      375,328             14,395,027,822    75,599
FALCON Trit 1.0   97,809              12,939,100,857    215,314
Triticum 3.1      279,439             15,344,693,583    232,659
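For reference, N50 is the contig length such that contigs of that length or longer contain at least half of the total assembled bases. A minimal sketch of the standard calculation, using a hypothetical helper `n50` and toy contig lengths (the real contig length lists are not part of these slides):

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running sum reaches half of the total.
    """
    if not lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: running sum always reaches the total")


# Toy example: total is 20, and 6 + 5 = 11 is the first running sum >= 10.
print(n50([2, 3, 4, 5, 6]))
```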
Run Time: 100 CPU years
Main steps   Run time (CPU hrs)   Wall time (months)
MaSuRCA      100K                 1.5
Celera WGS   470K                 5
FALCON       150K                 0.75
ARROW        160K                 0.75
total        880K                 9

100K CPU hrs ≈ 11.5 years
880K CPU hrs ≈ 100 years
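The "100 CPU years" headline can be checked by converting CPU hours to years. A small sketch, assuming a 365-day year (8,760 hours); the per-step hours are the rounded values from the slide and the helper name `cpu_years` is hypothetical:

```python
# Assumption: one CPU year = 24 * 365 hours.
HOURS_PER_YEAR = 24 * 365

# CPU hours per pipeline step, as rounded on the slide.
steps = {
    "MaSuRCA": 100_000,
    "Celera WGS": 470_000,
    "FALCON": 150_000,
    "ARROW": 160_000,
}


def cpu_years(cpu_hours: float) -> float:
    """Convert CPU hours to CPU years."""
    return cpu_hours / HOURS_PER_YEAR


total_hours = sum(steps.values())
for name, hours in steps.items():
    print(f"{name}: {cpu_years(hours):.1f} CPU years")
print(f"total: {total_hours} CPU hrs = {cpu_years(total_hours):.1f} CPU years")
```

With these inputs the step hours sum to 880K, which is just over 100 CPU years.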
Genome Repetitiveness
[Figure: k-mer uniqueness ratios for WHEAT, FLY, COW, RICE, PINE and Ae. tauschii]
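The slide does not define the k-mer uniqueness ratio. One common definition is the fraction of k-mer positions in a genome whose k-mer occurs exactly once, so a highly repetitive genome scores low. The sketch below uses that assumed definition on toy strings; `uniqueness_ratio` is a hypothetical helper, it ignores reverse complements, and it is not meant for genome-scale input.

```python
from collections import Counter


def uniqueness_ratio(genome: str, k: int) -> float:
    """Return the fraction of k-mer positions whose k-mer occurs exactly once.

    Assumed definition; reverse complements are not considered.
    """
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    unique = sum(1 for count in counts.values() if count == 1)
    return unique / total


# A mostly unique toy sequence versus a fully repetitive one.
print(uniqueness_ratio("ACGTACGA", 3))
print(uniqueness_ratio("AAAAAA", 3))
```

Sweeping `k` over a range of values and plotting the ratio per genome would give curves of the kind the figure compares.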
Publication
Conclusions
The most challenging genome (we) assembled!
Learning experience!
Assembly quality vs computational resources?
Share your data!
Acknowledgements
Johns Hopkins University: Steven Salzberg, Aleksey Zimin
UC Davis Plant Sciences: Jan Dvorak, Mingcheng Luo
Earlham Institute: Bernardo Clavijo