NGS de novo assembly: progresses and challenges
Yingrui Li, BGI Shenzhen
Overall sketch of SOAPdenovo
Main issues in NGS de novo assembly
• Efficient graph building and reduction
• Contig construction
• Scaffold construction
• Gap closure (to solve repeats)
• Iterative refining of assemblies
1. Reducing graph complexity
• Eliminate errors in the original raw reads
  o Graph-based
  o Kmer frequency spectrum-based
• Reduce errors beforehand so that the graph can be constructed memory- and time-efficiently
• This also significantly reduces the load in the graph-reduction step
• Improves the reliability of primary contigs, which serve as the data basis for subsequent steps
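The Kmer frequency spectrum idea can be illustrated with a minimal Python sketch. This is not SOAPdenovo's corrector: the function names, the single-substitution search, and the `min_count` threshold are assumptions made for illustration. Kmers seen fewer than `min_count` times are treated as likely errors, and a base is changed only if that makes every Kmer covering it frequent.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmers_covering(read, pos, k):
    """Return all k-mers of `read` that include position `pos`."""
    start = max(0, pos - k + 1)
    end = min(pos, len(read) - k)
    return [read[i:i + k] for i in range(start, end + 1)]

def is_solid(kmers, counts, min_count):
    """True if every k-mer is seen at least `min_count` times."""
    return all(counts[kmer] >= min_count for kmer in kmers)

def correct_read(read, counts, k, min_count):
    """Try single-base substitutions at positions covered by rare k-mers."""
    for pos in range(len(read)):
        if is_solid(kmers_covering(read, pos, k), counts, min_count):
            continue
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if is_solid(kmers_covering(candidate, pos, k), counts, min_count):
                read = candidate  # accept the first substitution that works
                break
    return read
```

A real corrector would also use base qualities, count both strands, and pick the threshold from the observed frequency spectrum rather than a fixed value.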
Recent progress
1) A larger Kmer (up to 27) can be used with acceptable memory and speed.
2) The algorithm is optimized so that more error bases can be corrected.
3) Error correction is combined with merging of PE reads whose insert size is slightly shorter than the sum of the two reads' lengths (e.g., an insert size of 170bp with a read length of 100bp), which further improves the result.
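With a 170bp insert and two 100bp reads, the two ends are expected to overlap by about 2 x 100 - 170 = 30bp. The sketch below shows one simple way such a pair could be merged. It is an assumption-laden illustration, not the SOAPdenovo implementation: `merge_pair`, `min_overlap`, and `max_mismatch_rate` are hypothetical names, and where the overlapping bases disagree it simply keeps the base from read 1 (a real tool would use quality scores).

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def merge_pair(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Merge a PE pair whose ends overlap; return None if no overlap is found.

    read2 is given as sequenced (opposite strand), so it is reverse
    complemented first. Candidate overlaps are tried from longest to shortest.
    """
    r2 = reverse_complement(read2)
    for overlap in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        tail, head = read1[-overlap:], r2[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * overlap:
            # Keep read1 across the overlap, then append the rest of read2.
            return read1 + r2[overlap:]
    return None
```

The merged sequence behaves like one longer read, which is what makes a larger Kmer usable downstream.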
Simulation result of Arabidopsis data using different Kmer sizes

Genome size (without N)      113M
Coverage depth               20
Raw reads                    23,799,530
Raw bases (bp)               2,379,953,000
Raw error bases (bp)         22,952,266
Average error rate           0.964%
Ratio of error-free reads    37.666%

Peak memory (G)  Average memory (G)  Kmer size  Result reads  Result bases   Result error bases  Average error rate  Ratio of error-free reads
2.88             2.85                17         23,681,287    2,287,424,396  503,786             0.0220%             98.338%
3.10             3.08                18         23,739,274    2,299,564,571  322,351             0.0140%             98.901%
3.22             3.20                19         23,742,156    2,300,221,899  252,190             0.0110%             99.137%
3.40             3.37                21         23,741,397    2,299,619,318  203,847             0.0089%             99.311%
3.50             3.48                23         23,733,265    2,297,402,148  178,185             0.0078%             99.407%
3.60             3.58                25         23,761,937    2,304,560,512  157,724             0.0068%             99.475%
3.68             3.66                27         23,755,744    2,303,144,409  142,652             0.0062%             99.533%
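The average error rates in the table follow directly from error bases divided by total bases. As a small check in Python (the helper name `error_rate_percent` is only illustrative):

```python
def error_rate_percent(error_bases, total_bases):
    """Average error rate as a percentage of all bases."""
    return 100.0 * error_bases / total_bases

raw_rate = error_rate_percent(22_952_266, 2_379_953_000)   # raw reads
k17_rate = error_rate_percent(503_786, 2_287_424_396)      # after correction, Kmer size 17
k27_rate = error_rate_percent(142_652, 2_303_144_409)      # after correction, Kmer size 27

print(f"raw: {raw_rate:.3f}%  K=17: {k17_rate:.4f}%  K=27: {k27_rate:.4f}%")
print(f"reduction from raw to K=27: about {raw_rate / k27_rate:.0f}-fold")
```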
Results of different versions for error correction
[Figure: bar chart of average error rate (axis from 0.00% to 0.60%) for correct v1.0, correct v1.1, and correct v1.1 (overlap_cor) on Human, C. gigas (Oyster), Z. mays (Maize), and Rice]
* overlap_cor: combination of error correction and merging of PE reads
2. Contiging
For SOAPdenovo, contiging is the process of finding all unique, unambiguous paths in the complexity-reduced de Bruijn graph.
Progress
1) A larger kmer (up to 127) can be used with merged PE reads, or with the longer reads that are coming out soon, e.g., a read length of 150bp from Illumina.
2) Longer repeats can be resolved using overhung PE reads.
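To make "unambiguous paths" concrete, here is a toy Python sketch of contig extraction from a de Bruijn graph. It is a simplification under stated assumptions, not SOAPdenovo's code: it uses the forward strand only, does no error removal, and `build_graph` and `unambiguous_paths` are hypothetical names. Nodes are (k-1)-mers; a path is extended only through nodes with exactly one incoming and one outgoing edge.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Each k-mer contributes one edge: its (k-1)-prefix -> its (k-1)-suffix."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred

def unambiguous_paths(succ, pred):
    """Spell out every maximal path whose inner nodes have one way in and one way out."""
    def is_simple(node):
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    contigs = []
    for node in sorted(set(succ) | set(pred)):
        if is_simple(node):
            continue  # paths start only at branch points or dead ends
        for nxt in sorted(succ.get(node, ())):
            path, cur = node + nxt[-1], nxt
            while is_simple(cur):
                cur = next(iter(succ[cur]))
                path += cur[-1]
            contigs.append(path)
    # Note: isolated cycles made only of simple nodes are not reported here.
    return contigs
```

For example, the reads `ACGTTA` and `ACGTTC` with k=4 share a prefix and then diverge, so the graph yields one contig up to the branch and one short contig for each branch.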
3. Scaffolding
• Scaffolding links primary contigs into an unambiguous path in the relationship graph
• It is the data basis for gap closure
• It is highly associated with final contig size
• Performance is hyper-sensitive to parameter settings
Progress
1) Repetitive contigs are handled more cautiously.
2) Some algorithmic logic is optimized to make fewer mistakes.
* When one or more contigs in a scaffold are not in the correct position, the scaffold is counted as an error.
Species    Version      Total scaffold number  Total contigs inside  Error scaffold number  Involved contigs number
Fruit fly  old version  26                     7,026                 8                      27
Fruit fly  new version  23                     7,513                 3                      3
Human X    old version  102                    333,834               58                     466
Human X    new version  78                     331,455               23                     87
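The following Python sketch illustrates the general idea of linking contigs through paired-end reads while treating repetitive contigs cautiously. It is only an assumed, simplified model: it ignores contig orientation and gap distance, flags repeats purely by read depth, and the names `reliable_links`, `chain_scaffolds`, `min_links`, and `repeat_factor` are hypothetical rather than SOAPdenovo parameters.

```python
from collections import Counter, defaultdict
from statistics import median

def reliable_links(pair_hits, contig_depth, min_links=3, repeat_factor=1.8):
    """Build a contig relationship graph from read pairs.

    pair_hits: iterable of (contig_a, contig_b), one per read pair whose
               two ends map to different contigs.
    contig_depth: dict of contig name -> mean read depth.
    Contigs much deeper than the median are treated as repeats and masked.
    """
    typical = median(contig_depth.values())
    repeats = {c for c, d in contig_depth.items() if d > repeat_factor * typical}
    support = Counter()
    for a, b in pair_hits:
        if a == b or a in repeats or b in repeats:
            continue
        support[tuple(sorted((a, b)))] += 1
    neighbours = defaultdict(set)
    for (a, b), n in support.items():
        if n >= min_links:  # keep only links backed by enough pairs
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours, repeats

def chain_scaffolds(neighbours):
    """Chain contigs into linear scaffolds, skipping contigs with more than two neighbours."""
    ok = {c for c, ns in neighbours.items() if len(ns) <= 2}

    def usable(contig):
        return [n for n in sorted(neighbours[contig]) if n in ok]

    seen, scaffolds = set(), []
    for start in sorted(ok):
        if start in seen or len(usable(start)) == 2:
            continue  # begin only at chain ends
        chain, cur = [start], start
        seen.add(start)
        while True:
            nxt = [n for n in usable(cur) if n not in seen]
            if not nxt:
                break
            cur = nxt[0]
            seen.add(cur)
            chain.append(cur)
        scaffolds.append(chain)
    return scaffolds
```

Thresholds such as `min_links` are exactly the kind of setting that scaffolding results are sensitive to, so they would need tuning per dataset.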
4. Gap closure
Based on conservatively constructed scaffolds, an attempt is made to fill intra-scaffold gaps (between linked contigs) to form longer contigs (scaftigs):
• Unique regions that did not pass the stringent contiging threshold
• Repeat regions that were cut or not assembled in the original assemblies
• A process that carries a high risk of introducing errors
Progress
1) Overhung PE reads are used to span small gaps and fill them.
2) Gaps are treated specifically according to their characteristics, e.g., gap size, length of nearby contigs, number of reads falling in the gap, whether there is a tandem repeat inside…
3) The local assembly strategy is optimized to make better decisions when encountering conflicts.
Results of different versions for gap filling
Species    Version      Total gap number  Fully filled gap number  Fully filled ratio  Error number ratio*  Scaftig N50
Fruit fly  old version  29,917            26,387                   88.20%              7.35%                160,013
Fruit fly  new version  29,917            28,404                   94.94%              5.39%                180,723
Human X    old version  297,470           258,719                  86.97%              2.66%                12,472
Human X    new version  297,470           293,577                  98.69%              0.95%                86,670
IRGSP      old version  390,563           339,211                  86.85%              5.06%                12,007
IRGSP      new version  390,563           383,078                  98.08%              2.72%                56,666
* When the gap sequence of a fully filled gap is not exactly the same as the reference sequence, there is an error.
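For readers unfamiliar with the metrics in this table, the short Python sketch below computes an N50 from a list of sequence lengths (using the common definition: the length L such that sequences of length L or longer contain at least half of all bases) and reproduces one fully filled ratio from the table. The function names are illustrative only.

```python
def n50(lengths):
    """Length at which the running total of the longest sequences reaches half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 of an empty list is undefined")

def filled_ratio_percent(fully_filled, total_gaps):
    """Share of gaps that were completely filled, as a percentage."""
    return 100.0 * fully_filled / total_gaps

# Fruit fly, new version: 28,404 of 29,917 gaps fully filled.
print(f"{filled_ratio_percent(28_404, 29_917):.2f}%")
```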
5. Post-processing
• Align reads back to the assembly to evaluate the reliability of each locus
• Correct artifacts in the assemblies
• Analyze the possibility of further improvement
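One simple way to evaluate each locus after aligning reads back is to look at per-base read depth and flag stretches with too little support. The Python sketch below assumes alignments are already available as half-open (start, end) intervals on one assembled sequence; the function names and the `min_depth` threshold are hypothetical, and a real evaluation would also consider mismatches and paired-end consistency.

```python
def per_base_depth(assembly_length, alignments):
    """Read depth at every position, from half-open (start, end) alignment intervals."""
    diff = [0] * (assembly_length + 1)
    for start, end in alignments:
        diff[start] += 1   # coverage begins at start
        diff[end] -= 1     # and stops just before end
    depth, running = [], 0
    for delta in diff[:-1]:
        running += delta
        depth.append(running)
    return depth

def low_support_regions(depth, min_depth):
    """Half-open (start, end) regions where depth stays below min_depth."""
    regions, start = [], None
    for pos, d in enumerate(depth):
        if d < min_depth and start is None:
            start = pos
        elif d >= min_depth and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions
```

Regions reported by `low_support_regions` would be candidates for closer inspection or correction.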
6. Computational performance
• A number of low-level optimizations have now been implemented
• One round of assembly takes 1 day for a human genome on a node with 256G of memory
• Cloud-based assembler at dawn (dev code: Hecate)
• Memory footprint cut to <32G; speed scales with the number of nodes used
Issues
• Achieving the theoretical upper limit in contiging
  o Paired-end short reads + insert size ~= long reads
• Mixing up two haploids
• Several key factors affect the quality of WGS assembly
  o Heterozygous rate of the diploid genome
  o Repetitive sequence distribution pattern of the species' genome
  o K-mer size used when de Bruijn graph assembly is applied
Revised Hierarchical Assembly
• Build libraries hierarchically
  o Using Fosmid clones
  o Avoid combining two haploids
• Assemble hierarchically
  o Combines de Bruijn graph & OLC strategies
• Provides an affordable sequencing solution for diploid & complex genomes
Flowchart of Revised Hierarchical Assembly
Revised Hierarchical de novo Assembly on an Asian Genome
Data Production:
• 8x (500k) Fosmids on a human genome
• ~16k index libraries
• Optimally 30 Fosmid clones per pool
• 40x raw data per Fosmid clone
  o 20x 200bp IS
  o 20x 500bp IS
• 320 index libraries per lane
• ~120 Illumina HiSeq lanes
• Total amount of data: 1650G
• Sequenced: 15 lanes
• Produced data: 213G
Expected Outcomes
1. Novel sequences for the gap closure of the reference genome.
2. A comprehensive map of structural variations.
3. Diploid sequences in relatively highly heterogeneous regions.
4. An assembly that is more “real”.
Progress of G10-BGI Species
Total species: 101
[Figure: chart of BGI-G10K progress status: waiting samples 64, sample testing 10, preliminary assembly 16, finished 11]
FINISHED SPECIES
Groups: fish, bird, mammal
Species #  Species                       Common name          Sequencing depth          Detail
18         Cynoglossus semilaevis        Tongue sole          female: 145X, male: 141X  contigN50=37K, scaffoldN50=734K; contigN50=24.5K, scaffoldN50=577K
19         Paralichthys olivaceus        Bastard halibut      119X                      contigN50=20K, scaffoldN50=1.2M
55         Anas platyrhynchos domestica  Peking duck          80X                       contigN50=26K, scaffoldN50=1.2M
74         Ailuropoda melanoleuca        Giant panda          56X                       contigN50=39.9K, scaffoldN50=1.3M
75         Ursus maritimus               Polar bear           102X                      contigN50=32.4K, scaffoldN50=15.9M
78         Bos grunniens                 Domestic yak         119X                      contigN50=20.4K, scaffoldN50=1.5M
79         Pantholops hodgsonii          Chiru                88X                       contigN50=18K, scaffoldN50=2.76M
80         Capra aegagrus hircus         Goat                 93X                       contigN50=18.7K, scaffoldN50=3.06M
81         Ovis aries                    Sheep                80X                       contigN50=17.4K, scaffoldN50=5.67M
83         Camelus dromedarius           Arabian camel        78X                       contigN50=54K, scaffoldN50=4.12M
97         Macaca fascicularis           Crab-eating macaque  54X                       contigN50=12.7K, scaffoldN50=652K
Preliminary assembled species
Groups: mammal, reptile, fish, bird
Species #  Species                        Common name               Sequencing depth  Detail
11         Hypophthalmichthys molitrix    Silver carp               152X              contigN50=19.9K, scaffoldN50=972.8K
17         Pseudosciaena crocea           Large yellow croaker      61X               contigN50=922bp, scaffoldN50=15K
21         Epinephelus coioides           Grouper                   34X               contigN50=20K, scaffoldN50=700K
24         Monopterus albus               Finless eel               55X               contigN50=1.3K, scaffoldN50=21K
39         Alligator sinensis             Chinese alligator         53X               contigN50=5.6K, scaffoldN50=24.7K
48         Trionyx (Pelodiscus) sinensis  Chinese softshell turtle  30X               contigN50=1.1K, scaffoldN50=10K
56         Anser anser domesticus         Domestic goose            47X               contigN50=6.6K, scaffoldN50=23.2K
58         Nipponia nippon                Crested ibis              106X              contigN50=22K, scaffoldN50=5M
60         Falco peregrinus               Peregrine falcon          130X              contigN50=28.6K, scaffoldN50=4.47M
61         Falco cherrug                  Saker falcon              41X               contigN50=9.2K, scaffoldN50=42.7K
66         Pygoscelis adeliae             Adelie penguin            90X               contigN50=19K, scaffoldN50=5M
67         Aptenodytes forsteri           Emperor penguin           67X               contigN50=30K, scaffoldN50=5M
70         Panthera tigris altaica        Amur tiger                39X               contigN50=4.1K, scaffoldN50=27.7K
71         Acinonyx jubatus               Cheetah                   61X               contigN50=30K, scaffoldN50=3M
72         Panthera leo                   Lion                      70X               contigN50=11.6K, scaffoldN50=1.32M
82         Camelus bactrianus             Bactrian camel            62X               contigN50=8.4K, scaffoldN50=61.5K
Sequencing of species
Groups: mammal, reptile, fish, bird
Species #  Species                 Common name           Detail
4          Polypterus senegalus    Bichir                sequencing
9          Aristichthys nobilis    Bighead carp          sequencing
13         Hippocampus comes       Tiger tail seahorse   sequencing
15         Scleropages formosus    Golden arowana        sequencing
25         Mola mola               Sunfish               sequencing
50         Chelonia mydas          Green turtle          sequencing
53         Calypte anna            Anna's hummingbird    sample arrived
68         Struthio camelus        Ostrich               sequencing
84         Elaphurus davidianus    Pere David's deer     sequencing
94         Tachyglossus aculeatus  Short-beaked echidna  sequencing
Straw webhost on genomes
http://climb.genomics.org.cn/g10k/home.jsp
Please advise what kind of functions to include, considering the fact that genomes will be available at different levels of completeness:
• Finished map
• Fine map w/ haploids solved
• Draft map w/ physical map anchored
Thank you!