NGS de novo assembly: progress and challenges

Yingrui Li, BGI Shenzhen

Overall sketch of SOAPdenovo


Main issues in NGS de novo assembly

• Efficient graph building and reduction
• Contig construction
• Scaffold construction
• Gap closure (to solve repeats)
• Iteratively refining assemblies

1. Reducing graph complexity

• Eliminate errors in the original raw reads
  o Graph-based
  o K-mer frequency spectrum-based
• Reduce errors beforehand so that the graph can be constructed memory- and time-efficiently
• This also significantly reduces the load in the graph-reduction step
• Improves the reliability of primary contigs, which serve as the data basis for subsequent steps
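
The k-mer frequency spectrum idea above can be illustrated with a minimal Python sketch. It is not SOAPdenovo's corrector: the function names, the `min_count` threshold, and the restriction to a single substitution per read are all assumptions made for illustration. A k-mer seen at least `min_count` times is treated as trustworthy, and a read is repaired by the first single-base change that makes all of its k-mers trustworthy.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, min_count):
    """Return the read with at most one base substituted.

    A k-mer is 'solid' when it occurs at least min_count times
    (hypothetical threshold). If no single substitution makes every
    k-mer of the read solid, the read is returned unchanged.
    """
    def all_solid(seq):
        return all(counts[seq[i:i + k]] >= min_count
                   for i in range(len(seq) - k + 1))

    if all_solid(read):
        return read
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if all_solid(candidate):
                return candidate
    return read

# Toy data: five error-free copies of a sequence plus one read with an error.
genome = "ACGTTGCAAGGCTA"
bad_read = "ACGTTGCTAGGCTA"  # A -> T at position 7
counts = kmer_counts([genome] * 5 + [bad_read], k=5)
fixed = correct_read(bad_read, counts, k=5, min_count=3)
```

In practice the threshold would be chosen from the observed k-mer frequency distribution rather than fixed by hand.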

Recent progress

1) Larger k-mers (up to 27) can be used with acceptable memory and speed.

2) The algorithm is optimized so that more erroneous bases can be corrected.

3) Error correction is combined with merging of PE reads whose insert size is slightly shorter than the sum of the two reads' lengths (e.g., an insert size of 170 bp with a read length of 100 bp), which further improves the result.
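
With a 170 bp insert and two 100 bp reads, the mates overlap by 100 + 100 - 170 = 30 bp, so the pair can be joined into one 170 bp sequence. The following Python sketch shows one simple way to do this; the function names, `min_overlap`, and `max_mismatch_rate` are illustrative assumptions, not the parameters of the tool described in the slides.

```python
import random

def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def merge_pair(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Merge a read pair whose fragment is shorter than the two reads combined.

    read2 is given as sequenced (opposite strand), so it is
    reverse-complemented first. The longest suffix/prefix overlap with an
    acceptable mismatch rate is used. Returns the merged fragment, or None
    if no acceptable overlap is found.
    """
    mate = reverse_complement(read2)
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        mismatches = sum(x != y for x, y in zip(read1[-olen:], mate[:olen]))
        if mismatches <= max_mismatch_rate * olen:
            return read1 + mate[olen:]
    return None

# Example mirroring the slide: a 170 bp fragment sequenced as 2 x 100 bp.
rng = random.Random(0)
fragment = "".join(rng.choice("ACGT") for _ in range(170))
read1 = fragment[:100]
read2 = reverse_complement(fragment[-100:])
merged = merge_pair(read1, read2)
```

The merged sequence behaves like a single longer read, which is what makes larger k-mer sizes usable downstream.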

Simulation result of Arabidopsis data using different Kmer sizes

Input data:

Genome size (without N)      113M
Coverage depth               20
Raw reads                    23,799,530
Raw bases (bp)               2,379,953,000
Raw error bases (bp)         22,952,266
Average error rate           0.964%
Ratio of error-free reads    37.666%

Results after error correction:

Peak memory (G)  Average memory (G)  Kmer size  Result reads  Result bases   Result error bases  Average error rate  Ratio of error-free reads
2.88             2.85                17         23,681,287    2,287,424,396  503,786             0.0220%             98.338%
3.10             3.08                18         23,739,274    2,299,564,571  322,351             0.0140%             98.901%
3.22             3.20                19         23,742,156    2,300,221,899  252,190             0.0110%             99.137%
3.40             3.37                21         23,741,397    2,299,619,318  203,847             0.0089%             99.311%
3.50             3.48                23         23,733,265    2,297,402,148  178,185             0.0078%             99.407%
3.60             3.58                25         23,761,937    2,304,560,512  157,724             0.0068%             99.475%
3.68             3.66                27         23,755,744    2,303,144,409  142,652             0.0062%             99.533%
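
The average error rates in the table are simply error bases divided by total bases. A short Python check, using only figures from the table (the helper name is ours), reproduces the reported percentages for the raw reads and for Kmer sizes 17 and 27:

```python
def error_rate_percent(error_bases, total_bases):
    """Average error rate as a percentage of all bases."""
    return 100.0 * error_bases / total_bases

raw = error_rate_percent(22_952_266, 2_379_953_000)   # raw reads
k17 = error_rate_percent(503_786, 2_287_424_396)      # Kmer size 17
k27 = error_rate_percent(142_652, 2_303_144_409)      # Kmer size 27

print(f"raw: {raw:.3f}%  k=17: {k17:.4f}%  k=27: {k27:.4f}%")
# prints: raw: 0.964%  k=17: 0.0220%  k=27: 0.0062%
```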

Results of different versions for error correction

[Figure: bar chart of average error rate (y-axis, 0.00% to 0.60%) for Human, C. gigas (oyster), Z. mays (maize), and rice, comparing correct v1.0, correct v1.1, and correct v1.1 (overlap_cor).]

* overlap_cor: combination of error correction and merging of PE reads

2. Contiging

For SOAPdenovo, contiging is the process of finding all unique, unambiguous paths in the complexity-reduced de Bruijn graph.

Progress

1) Larger k-mers, up to 127, can be used with merged PE reads or with the longer reads expected soon, e.g., 150 bp reads from Illumina.

2) Longer repeats can be resolved using overhung PE reads.
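
Finding unambiguous paths in a de Bruijn graph can be sketched as follows. This is a simplified illustration, not SOAPdenovo's implementation: it assumes a single strand (no reverse complements), an already error-free k-mer set, and it ignores isolated cycles. Nodes are (k-1)-mers, each k-mer is an edge, and a path is extended only through nodes with exactly one incoming and one outgoing edge.

```python
from collections import defaultdict

def kmers_of(seq, k):
    """All k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def unambiguous_paths(kmers):
    """Spell out the maximal non-branching paths of the de Bruijn graph."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for kmer in set(kmers):
        succ[kmer[:-1]].add(kmer[1:])
        pred[kmer[1:]].add(kmer[:-1])
    nodes = set(succ) | set(pred)

    def is_inner(node):
        # Exactly one way in and one way out: the path is unambiguous here.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    paths = []
    for start in nodes:
        if is_inner(start):
            continue  # paths begin only at branching or terminal nodes
        for nxt in succ.get(start, ()):
            path = start + nxt[-1]
            while is_inner(nxt):
                (nxt,) = succ[nxt]
                path += nxt[-1]
            paths.append(path)
    return sorted(paths)

# Two toy sequences sharing the k-mer GGCG, which creates branch points.
kmers = kmers_of("ATGGCGT", 4) + kmers_of("CAGGCGA", 4)
contigs = unambiguous_paths(kmers)
# contigs == ['ATGGC', 'CAGGC', 'GCGA', 'GCGT', 'GGCG']
```

The shared region breaks both sequences into short pieces, which is why repeats longer than the k-mer size limit contig length and why larger k-mers help.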

3. Scaffolding

Scaffolding links primary contigs into an unambiguous path in the relationship graph.
• The data basis for gap closure
• Highly associated with final contig size
• Performance is hypersensitive to parameter settings

Progress

1) Repetitive contigs are handled more cautiously.

2) Some algorithmic logic is optimized to make fewer mistakes.
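
As a rough illustration of linking contigs into unambiguous paths, the Python sketch below counts paired-end links between contigs, keeps only well-supported links, and refuses to chain through any contig that has competing links (as a repeat typically would). The data layout, the `min_links` threshold, and the omission of orientation and gap-size estimates are simplifying assumptions, not a description of SOAPdenovo's scaffolder.

```python
from collections import Counter

def build_scaffolds(contigs, pair_links, min_links=3):
    """Chain contigs along unambiguous, well-supported links.

    pair_links holds one (upstream, downstream) tuple per read pair that
    connects two contigs. Contigs lying on a cycle of links are not
    emitted in this sketch.
    """
    support = Counter(pair_links)
    reliable = [edge for edge, n in support.items() if n >= min_links]

    out_deg, in_deg = Counter(), Counter()
    for a, b in reliable:
        out_deg[a] += 1
        in_deg[b] += 1

    nxt, prev = {}, {}
    for a, b in reliable:
        # Keep a link only if neither end has a competing reliable link.
        if out_deg[a] == 1 and in_deg[b] == 1:
            nxt[a] = b
            prev[b] = a

    scaffolds = []
    for contig in contigs:
        if contig in prev:
            continue  # not the first contig of a scaffold
        chain = [contig]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        scaffolds.append(chain)
    return scaffolds

links = ([("c1", "c2")] * 4 + [("c2", "c3")] * 3 + [("c3", "r")] * 5
         + [("c4", "r")] * 5 + [("c1", "c4")])
scaffolds = build_scaffolds(["c1", "c2", "c3", "c4", "r"], links)
# scaffolds == [['c1', 'c2', 'c3'], ['c4'], ['r']]
```

Here the contig `r` is reached from two different contigs, so it is left unlinked rather than risking a misjoin, and the single-pair link from `c1` to `c4` is discarded as unreliable.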

Species    Version      Total scaffold number  Total contigs inside  Error scaffold number*  Involved contigs number
Fruit fly  old version  26                     7,026                 8                       27
Fruit fly  new version  23                     7,513                 3                       3
Human X    old version  102                    333,834               58                      466
Human X    new version  78                     331,455               23                      87

* When one or more contigs in a scaffold are not in the correct position, the scaffold is counted as an error.

4. Gap closure

Based on conservatively constructed scaffolds, intra-scaffold gaps (between linked contigs) are filled where possible to form longer contigs (scaftigs). The gaps are:
• Unique regions that did not pass the stringent contiging threshold
• Repeat regions that were cut or not assembled in the original assemblies

Gap closure is a process with a high risk of introducing errors.

Progress

1) Overhung PE reads are used to span and fill small gaps.

2) Gaps are treated specifically according to their characteristics, e.g., gap size, length of nearby contigs, number of reads falling in the gap, and whether the gap contains tandem repeats.

3) The local assembly strategy is optimized to make better decisions when encountering conflicts.

Results of different versions for gap filling

Species    Version      Total gap number  Fully filled gap number  Fully filled ratio  Error number ratio*  Scaftig N50
Fruit fly  old version  29,917            26,387                   88.20%              7.35%                160,013
Fruit fly  new version  29,917            28,404                   94.94%              5.39%                180,723
Human X    old version  297,470           258,719                  86.97%              2.66%                12,472
Human X    new version  297,470           293,577                  98.69%              0.95%                86,670
IRGSP      old version  390,563           339,211                  86.85%              5.06%                12,007
IRGSP      new version  390,563           383,078                  98.08%              2.72%                56,666

* When the sequence of a fully filled gap is not exactly the same as the reference sequence, it is counted as an error.
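
Two of the table's metrics are easy to reproduce. The fully filled ratio is filled gaps divided by total gaps, and N50 is the length L such that sequences of length L or longer contain at least half of all assembled bases. The Python sketch below uses the standard N50 definition with toy lengths (the slides do not give per-scaftig lengths), plus two ratios taken from the table.

```python
def n50(lengths):
    """Length L such that sequences >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def filled_ratio_percent(filled, total):
    """Share of gaps that were fully filled, as a percentage."""
    return 100.0 * filled / total

toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])             # toy lengths, total 54
fly_old = filled_ratio_percent(26_387, 29_917)          # Fruit fly, old version
human_new = filled_ratio_percent(293_577, 297_470)      # Human X, new version

print(toy_n50, f"{fly_old:.2f}%", f"{human_new:.2f}%")
# prints: 8 88.20% 98.69%
```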

5. Post-processing

• Align reads back to the assembly to evaluate the reliability of each locus
• Correct artifacts in the assemblies
• Analyze the possibility of further improvement
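
One simple way to evaluate per-locus reliability from reads aligned back to the assembly is to compute read depth at every position and flag stretches with little support. The sketch below assumes alignments are already available as half-open `(start, end)` intervals on one assembled sequence; the function names and the `min_depth` threshold are hypothetical, and the slides do not specify how reliability is actually scored.

```python
def per_base_depth(assembly_length, alignments):
    """Read depth at each position, from half-open (start, end) intervals."""
    diff = [0] * (assembly_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for i in range(assembly_length):
        running += diff[i]
        depth.append(running)
    return depth

def low_support_regions(depth, min_depth):
    """Half-open intervals where depth stays below min_depth."""
    regions, start = [], None
    for i, d in enumerate(depth):
        if d < min_depth and start is None:
            start = i
        elif d >= min_depth and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions

depth = per_base_depth(10, [(0, 5), (2, 7), (3, 10)])
suspect = low_support_regions(depth, min_depth=2)
# depth   == [1, 1, 2, 3, 3, 2, 2, 1, 1, 1]
# suspect == [(0, 2), (7, 10)]
```

Flagged regions are candidates for the artifact correction and further-improvement steps listed above.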

6. Computational performance

• A number of low-level optimizations have now been implemented
  o One round of assembly takes 1 day for a human genome on a node with 256G of memory
• Cloud-based assembler at dawn (dev code: Hecate)
  o Memory footprint cut to <32G; speed scales with the number of nodes used

Issues

• Achieving the theoretical upper limit in contiging
  o Paired-end short reads + insert size ~= long reads
• Mixing up two haploids
• Several key factors affect the quality of a WGS assembly
  o Heterozygous rate of the diploid genome
  o Repetitive sequence distribution pattern of the species' genome
  o K-mer size used when de Bruijn graph assembly is applied

Revised Hierarchical Assembly

• Build libraries hierarchically
  o Using Fosmid clones
  o Avoids combining two haploids
• Assemble hierarchically
  o Combines de Bruijn graph and OLC strategies
• Provides an affordable sequencing solution for diploid and complex genomes

[Figure: Flowchart of Revised Hierarchical Assembly]

Revised Hierarchical de novo Assembly on an Asian Genome

Data Production:

• 8x (500k) Fosmids on a human genome
• ~16k index libraries
• Optimally 30 Fosmid clones per pool
• 40x raw data per Fosmid clone
• 20x 200bp IS
• 20x 500bp IS
• 320 index libraries per lane
• ~120 Illumina HiSeq lanes
• Total amount of data: 1650G
• Sequenced: 15 lanes
• Produced data: 213G

Expected Outcomes

1. Novel sequences for gap closure of the reference genome.

2. A comprehensive map of structural variations.

3. Diploid sequences in relatively highly heterogeneous regions.

4. An assembly that is more “real”.

Progress of G10-BGI Species

[Figure: BGI-G10K progress status chart. Total species: 101. Chart values: 64, 10, 16, 11. Legend: waiting samples, sample testing, preliminary assembly, finish.]

FINISHED SPECIES

[Figure legend: fish, bird, mammal]

#   Species                       Common name          Sequencing depth          Detail
18  Cynoglossus semilaevis        Tongue sole          female: 145X, male: 141X  contigN50=37K, scaffoldN50=734K
                                                                                 contigN50=24.5K, scaffoldN50=577K
19  Paralichthys olivaceus        Bastard halibut      119X                      contigN50=20K, scaffoldN50=1.2M
55  Anas platyrhynchos domestica  Peking duck          80X                       contigN50=26K, scaffoldN50=1.2M
74  Ailuropoda melanoleuca        Giant panda          56X                       contigN50=39.9K, scaffoldN50=1.3M
75  Ursus maritimus               Polar bear           102X                      contigN50=32.4K, scaffoldN50=15.9M
78  Bos grunniens                 Domestic yak         119X                      contigN50=20.4K, scaffoldN50=1.5M
79  Pantholops hodgsonii          Chiru                88X                       contigN50=18K, scaffoldN50=2.76M
80  Capra aegagrus hircus         Goat                 93X                       contigN50=18.7K, scaffoldN50=3.06M
81  Ovis aries                    Sheep                80X                       contigN50=17.4K, scaffoldN50=5.67M
83  Camelus dromedarius           Arabian camel        78X                       contigN50=54K, scaffoldN50=4.12M
97  Macaca fascicularis           Crab-eating macaque  54X                       contigN50=12.7K, scaffoldN50=652K

Preliminary assembled species

[Figure legend: mammal, reptile, fish, bird]

#   Species                        Common name               Sequencing depth  Detail
11  Hypophthalmichthys molitrix    Silver carp               152X              contigN50=19.9K, scaffoldN50=972.8K
17  Pseudosciaena crocea           Large yellow croaker      61X               contigN50=922bp, scaffoldN50=15K
21  Epinephelus coioides           Grouper                   34X               contigN50=20K, scaffoldN50=700K
24  Monopterus albus               Finless eel               55X               contigN50=1.3K, scaffoldN50=21K
39  Alligator sinensis             Chinese alligator         53X               contigN50=5.6K, scaffoldN50=24.7K
48  Trionyx (Pelodiscus) sinensis  Chinese softshell turtle  30X               contigN50=1.1K, scaffoldN50=10K
56  Anser anser domesticus         Domestic goose            47X               contigN50=6.6K, scaffoldN50=23.2K
58  Nipponia nippon                Crested ibis              106X              contigN50=22K, scaffoldN50=5M
60  Falco peregrinus               Peregrine falcon          130X              contigN50=28.6K, scaffoldN50=4.47M
61  Falco cherrug                  Saker falcon              41X               contigN50=9.2K, scaffoldN50=42.7K
66  Pygoscelis adeliae             Adelie penguin            90X               contigN50=19K, scaffoldN50=5M
67  Aptenodytes forsteri           Emperor penguin           67X               contigN50=30K, scaffoldN50=5M
70  Panthera tigris altaica        Amur tiger                39X               contigN50=4.1K, scaffoldN50=27.7K
71  Acinonyx jubatus               Cheetah                   61X               contigN50=30K, scaffoldN50=3M
72  Panthera leo                   Lion                      70X               contigN50=11.6K, scaffoldN50=1.32M
82  Camelus bactrianus             Bactrian camel            62X               contigN50=8.4K, scaffoldN50=61.5K

Sequencing of species

[Figure legend: mammal, reptile, fish, bird]

#   Species                 Common name           Detail
4   Polypterus senegalus    Bichir                sequencing
9   Aristichthys nobilis    Bighead carp          sequencing
13  Hippocampus comes       Tiger tail seahorse   sequencing
15  Scleropages formosus    Golden arowana        sequencing
25  Mola mola               Sunfish               sequencing
50  Chelonia mydas          Green turtle          sequencing
53  Calypte anna            Anna's hummingbird    sample arrived
68  Struthio camelus        Ostrich               sequencing
84  Elaphurus davidianus    Pere David's deer     sequencing
94  Tachyglossus aculeatus  Short-beaked echidna  sequencing

Straw webhost on genomes

http://climb.genomics.org.cn/g10k/home.jsp

Please advise what kind of functions to include, considering that genomes will be available at different levels of completeness:
• Finished map
• Fine map w/ haploids solved
• Draft map w/ physical map anchored


Thank you!
