Genome De Novo Assemblies and Applications in NGS Sequencing Zemin Ning The Wellcome Trust Sanger...

Genome De Novo Assemblies and Applications in NGS Sequencing

Zemin Ning

The Wellcome Trust Sanger Institute

Outline of the Talk:

My academic background

Challenges in genome assemblies from pure Illumina reads

The Phusion2 pipeline

The Tasmanian devil genome project

The Devil genome assembly

Other assemblies: human, bamboo, miscanthus, etc.

Powder Simulation

Hair Dynamics

Genetics and Human Hair Structure

AFRICAN   CAUCASIAN   EAST ASIAN

Informatics Projects Involved

SSAHA (Sequence Search and Alignment by the Hashing Algorithm)

ssaha2 – Alignment tool for Solexa, 454, ABI capillary reads

ssahaSNP – SNP/indel detection, mainly for ABI capillary reads

ssahaEST – EST or cDNA alignment

ssaha_SV – Structural variation (CNVs) detection

ssaha_pileup – SNP/indel detection from next-gen data

Phusion & Phusion2

Development and maintenance of the pipeline

Production of WGS assemblies: Mouse, Zebrafish, Human (Venter genome), C. briggsae, Rice, Schisto, Sea Lamprey, Gorilla, Malaria and many bacterial genomes

TraceSearch – Public sequence search facility for all the traces

Fuzzypath – Short read assembler

Challenges in Whole Genome Assembly using Pure Illumina Reads

Short read length: 2x36; 2x54; 2x75; 2x100

Large genome and huge datasets. For human: 100Gb at 30x

Repetitive/duplication structures: Alus, LINEs, SVAs. 30-40% in genomes such as human and mouse; 50-60% in rice and other plant genomes.

Tandem repeats: how many copies do they have?

TATATATATATATATATATATATATATAGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGT
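One way to see why the copy number is hard to recover: when every read is shorter than the repeat array, genomes with different copy numbers yield exactly the same set of read-length substrings. The sketch below is a toy illustration only; the flanking sequences, the copy numbers and the helper name `substrings` are assumptions made up for the example, not taken from the talk.

```python
def substrings(seq, length):
    """Return the set of all substrings of the given length (error-free 'reads')."""
    return {seq[i:i + length] for i in range(len(seq) - length + 1)}

# Two toy genomes that differ only in the copy number of a TA tandem repeat.
left, right = "GATTACCA", "CCGTAGGC"
genome_10 = left + "TA" * 10 + right   # 10 copies
genome_15 = left + "TA" * 15 + right   # 15 copies

# Reads shorter than the repeat array cannot tell the two genomes apart.
short = 12
same_short = substrings(genome_10, short) == substrings(genome_15, short)

# Reads long enough to span the 10-copy array plus both flanks can.
long_ = 2 * 10 + 4
same_long = substrings(genome_10, long_) == substrings(genome_15, long_)

print(same_short, same_long)  # True False
```

In practice paired reads with a known insert size play the role of the longer read here, which is why read pairs matter so much for repeats.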

De Bruijn vs Read overlap

Missing from de Bruijn contigs

Missing sequences

Phusion2 Assembly Pipeline

[Flowchart labels: Solexa Reads; Assembly; Reads Group; Data Process; Long Insert Reads; Supercontig; Contigs; PRono; Fuzzypath; Velvet; Phrap; 2x75 or 2x100; Base Correction; RP_Assemble]

Kmer Word Hashing

Gap-Hash 4x3

Sequence: ATGG CGT GCAG TCC ATGT TCG GATC A

ATGGGCAGATGT
TGGCCAGTTGTT
GGCGAGTCGTTC
GCGTGTCCTTCG

Contiguous Base Hash, K = 12

Sequence: ATGGCGTGCAGTCCATGTTCGGATCA

ATGGCGTGCAGT
TGGCGTGCAGTC
GGCGTGCAGTCC
GCGTGCAGTCCA
CGTGCAGTCCAT
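The two word schemes on this slide can be sketched in a few lines of Python. The gapped layout (three blocks of 4 bases separated by 3 skipped bases) is read off the example words on the slide; the function names and the 2-bit base code (A=0, C=1, G=2, T=3) are assumptions for illustration and are not claimed to match the Phusion2 source.

```python
def contiguous_words(seq, k=12):
    """All overlapping k-base words (contiguous base hash)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def gapped_words(seq, block=4, gap=3, blocks=3):
    """Words made of `blocks` runs of `block` bases, skipping `gap` bases between runs."""
    span = blocks * block + (blocks - 1) * gap   # 18 bases for the 4x3 layout
    step = block + gap
    words = []
    for i in range(len(seq) - span + 1):
        parts = [seq[i + b * step: i + b * step + block] for b in range(blocks)]
        words.append("".join(parts))
    return words

def encode(word):
    """Pack a word into an integer, 2 bits per base (assumed code: A=0, C=1, G=2, T=3)."""
    code = 0
    for base in word:
        code = (code << 2) | "ACGT".index(base)
    return code

seq = "ATGGCGTGCAGTCCATGTTCGGATCA"
print(contiguous_words(seq)[:5])   # the five contiguous words listed on the slide
print(gapped_words(seq)[:4])       # the four gapped words listed on the slide
```

Both schemes give 12-base words, so either fits in 24 bits; the gapped version samples a wider 18-base window per word.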

Word use distribution for the mouse sequence data at ~7.5 fold

[Plot labels: Useful Region; Poisson Curve; Real Data Curve]
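A comparison of this kind can be sketched as follows: count how often each k-mer word occurs in the reads, then set the resulting histogram against a Poisson distribution. This is a minimal sketch under stated assumptions; the function names are hypothetical, and the mean `lam = 7.5` simply reuses the fold coverage quoted on the slide as a stand-in for the mean word depth, which in real data also depends on read length and k.

```python
import math
from collections import Counter

def word_use_histogram(reads, k):
    """Map occurrence count -> number of distinct k-mers seen that many times."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return Counter(counts.values())

def poisson_pmf(n, lam):
    """Probability that a word occurs exactly n times if occurrences are Poisson(lam)."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

lam = 7.5  # hypothetical mean word depth, for illustration only
expected = [poisson_pmf(n, lam) for n in range(31)]

# Toy usage: ACGT occurs twice, the other three 4-mers once each.
print(word_use_histogram(["ACGTACGT"], 4))
```

Words far below the Poisson peak tend to come from sequencing errors and words far above it from repeats, which is one plausible reading of why only a middle range is marked as useful.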

Sorted List of Each k-Mer and Its Read Indices

ACAGAAAAGC 10h06.p1c
ACAGAAAAGC 12a04.q1c
ACAGAAAAGC 13d01.p1c
ACAGAAAAGC 16d01.p1c
ACAGAAAAGC 26g04.p1c
ACAGAAAAGC 33h02.q1c
ACAGAAAAGC 37g12.p1c
ACAGAAAAGC 40d06.p1c
ACAGAAAAGG 16a02.p1c
ACAGAAAAGG 20a10.p1c
ACAGAAAAGG 22a03.p1c
ACAGAAAAGG 26e12.q1c
ACAGAAAAGG 30e12.q1c
ACAGAAAAGG 47a01.p1c

High bits: 64 - 2k   Low bits: 2k
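The slide gives only the two field widths of a 64-bit value. A k-base word needs 2 bits per base, so the 2k-bit field plausibly holds the word and the remaining 64 - 2k bits a read index. The sketch below follows that reading; the field assignment, the A/C/G/T code and the integer read indices (standing in for read names) are all assumptions.

```python
K = 10                       # word length used in the example list
KMER_BITS = 2 * K            # low field: 2 bits per base
KMER_MASK = (1 << KMER_BITS) - 1
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed base code

def pack(kmer, read_index):
    """Store the k-mer in the low 2k bits and the read index in the high 64 - 2k bits."""
    assert len(kmer) == K and 0 <= read_index < 1 << (64 - KMER_BITS)
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return (read_index << KMER_BITS) | value

def unpack(packed):
    """Recover (kmer, read_index) from a packed 64-bit value."""
    value = packed & KMER_MASK
    kmer = "".join("ACGT"[(value >> (2 * i)) & 3] for i in reversed(range(K)))
    return kmer, packed >> KMER_BITS

entries = [pack("ACAGAAAAGG", 16), pack("ACAGAAAAGC", 40), pack("ACAGAAAAGC", 10)]
# Sort by k-mer first, then by read index, so reads sharing a word become adjacent.
entries.sort(key=lambda p: (p & KMER_MASK, p >> KMER_BITS))
print([unpack(p) for p in entries])
# [('ACAGAAAAGC', 10), ('ACAGAAAAGC', 40), ('ACAGAAAAGG', 16)]
```

Once sorted this way, every run of equal k-mers lists the reads that share that word, which is the input needed to count shared words between read pairs.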

Relation Matrix: R(i,j) – number of kmer words shared between read i and read j

i \ j    1    2    3    4    5    6   …   N
1        -   41    0    0    0    0
2       41    -   37    0    0    0
3        0   37    -    0   22    0
4        0    0    0    -    0   27
5        0    0   22    0    -    0
6        0    0    0   27    0    -
…
N

Group 1: (1,2,3,5)

Group 2: (4,6)
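Grouping reads from the relation matrix amounts to finding connected components: two reads land in the same group if a chain of shared-word links joins them. A minimal union-find sketch follows, assuming the six-read example reads as R(1,2)=41, R(2,3)=37, R(3,5)=22 and R(4,6)=27 with all other entries zero. The function name, the zero diagonal and the `min_shared` threshold are illustrative assumptions, not the pipeline's actual parameters.

```python
def group_reads(R, min_shared=1):
    """Group reads (1-based ids) linked by at least `min_shared` shared k-mer words."""
    n = len(R)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if R[i][j] >= min_shared:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i + 1)
    return sorted(groups.values())

R = [
    [0, 41, 0, 0, 0, 0],
    [41, 0, 37, 0, 0, 0],
    [0, 37, 0, 0, 22, 0],
    [0, 0, 0, 0, 0, 27],
    [0, 0, 22, 0, 0, 0],
    [0, 0, 0, 27, 0, 0],
]
print(group_reads(R))  # [[1, 2, 3, 5], [4, 6]]
```

Raising `min_shared` breaks weak links; with a threshold of 30, read 5 would split off from the first group in this example.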

Paired Reads Separated by “NN”

Error Bases Correction

Mis-assembly errors: Contig Breaking

Read Pair Guided Local Assembler

Track read pairs to walk through repetitive regions

Tasmanian devil

[Figure labels: Opossum; Wallaby; Tasmanian devil]

http://z.about.com/d/geography/1/0/f/K/australia.jpg

Tasmanian devil facial tumour disease (DFTD)

Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils

Transmitted by biting

Commonly metastasises

First observed in 1996

Primarily affects adults >1yr

Death in 4 – 6 months

[Map of Tasmania: DFTD samples. Location labels: Forestier (33); Fentonbury (no host); Reedy Marsh; Railton; Mangalore; Frankford; Kempton (2); Mt William (2); Coles Bay; Upper Natone; West Pencil Pine (3); Trowunna (2); Narawntapu; Tarraleah; Bronte Park; Nugent (2); St Mary’s (2); Wisedale (?). Year labels: 2006, 2007, 2008. Other numbers shown: 14, 4, 13. Legend: DFTD originated here c.1996; Area still DFTD free]

[Map of Tasmania: DFTD samples for sequencing. Location labels: Reedy Marsh 2007; Mangalore 2007; Mt William 2007 or 2008; Coles Bay; Upper Natone 2007; Narawntapu 2007; Forestier 2007. Strain legend: Strain 1, tetraploid; Strain 2; Strain 3; Unknown strain; “Evolved”. Legend: DFTD originated here c.1996; Area still DFTD free]

Sequencing T. Devil on Illumina: Strategy

Tumour or normal genomic DNA

Fragments of defined size: 0.5, 2, 5, 7, 8, 10 kb

Sequencing

2x100bp reads short insert

2x50bp mate pairs

Alignment using bwa, ssaha2

Somatic mutations

Germline variants

[Plot: fragment size distribution. x-axis: size (0 to 6000); y-axis: frequency (0 to 180000); series: tumour 2kb, tumour 3kb, tumour 4kb, normal 2kb, normal 3kb, normal 4kb]

Sequencing performed at Illumina

De novo Assembly

Genome Assembly – T. Devil

Solexa reads:
Number of read pairs: 528 Million
Finished genome size: 3.5 GB
Read length: 2x100bp
Estimated read coverage: ~30X
Insert size: 410/50-600 bp
Mate pair data: 2k, 4k, 5k, 6k, 8k, 10k
Number of reads clustered: 458 Million

Assembly features - stats:
                              Contigs      Supercontigs
Total number of contigs:      1,246,970    792,099
Total bases of contigs:       3.22 Gb      3.62 Gb
N50 contig size:              9,642        434,642
Largest contig:               96,919       4,150,712
Average contig size:          2,578        4,564
Contig coverage on genome:    ~92%         >99%
Ratio of placed PE reads:     ~92%         ?
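For readers unfamiliar with the N50 figure quoted in these statistics: it is the contig length L such that contigs of length L or longer contain at least half of all assembled bases. A minimal sketch using the common definition follows; the function name and the toy lengths are illustrative, and the talk does not say which exact variant was used.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input

contigs = [100, 80, 60, 40, 20]   # toy contig lengths, 300 bases in total
print(n50(contigs))  # 80
```

Because N50 weights by bases rather than by contig count, it can be far larger than the average contig size, as in the tables here.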

Monodelphis domestica (Opossum)

Macropus eugenii (Wallaby)

Sminthopsis macroura (Dunnart)

Brown Bear

Dog

Pipeline of Contig Gap Closure

Human Assembly - Yoruba NA18507

Solexa reads:
Number of read pairs: 560 Million
Finished genome size: 3.0 GB
Read length: 2x100bp
Estimated read coverage: ~37X
Insert size: 500/50-700 bp
Number of reads clustered: 499 Million

Assembly features - contig stats:
Total number of contigs: 1,142,077
Total bases of contigs: 2.92 Gb
N50 contig size: 12,875
Largest contig: 140,463
Average contig size: 2,561
Contig coverage over the genome: ~94%
Mis-assembly errors: ?
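The estimated read coverage follows directly from the read counts: total sequenced bases divided by genome size. A small sketch of that arithmetic, using the 560 million read pairs, 2x100bp reads and 3.0 Gb genome quoted for this data set (the function name is an assumption):

```python
def read_coverage(read_pairs, read_length, genome_size):
    """Fold coverage = (2 reads per pair * read length * number of pairs) / genome size."""
    return 2 * read_pairs * read_length / genome_size

# 560 million pairs of 2x100bp reads over a 3.0 Gb genome.
print(round(read_coverage(560e6, 100, 3.0e9), 1))  # 37.3
```

The same calculation gives about 30x for 528 million 2x100bp pairs over 3.5 Gb, in line with the ~30X stated for that data set.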

Bamboo Genome Assembly

Tetraploid

Solexa reads:
Number of read pairs: 359 Million
Finished genome size: 2.0 GB
Read length: 2x120bp
Estimated read coverage: ~43X
Insert size: 500/50-700 bp
Number of reads clustered: 316 Million

Assembly features - contig stats:
Total number of contigs: 733,465
Total bases of contigs: 1.91 Gb
N50 contig size: 8,163
Largest contig: 117,250
Average contig size: 2,592
Contig coverage over the genome: ~92%
Mis-assembly errors: ?

Genome Assembly – Miscanthus

Solexa reads:
Number of read pairs: 502 Million
Finished genome size: 2.0 GB
Read length: 2x76bp
Estimated read coverage: ~35X
Insert size: 410/50-600 bp
Mate pair data: 5Kb
Number of reads clustered: 438 Million

Assembly features - stats:
                              Contigs      Supercontigs
Total number of contigs:      2,241,465    2,090,385
Total bases of contigs:       1.64 Gb      1.92 Gb
N50 contig size:              4,301        29,076
Largest contig:               71,161       730,290
Average contig size:          732          919
Contig coverage on genome:    ~85%         >95%
Ratio of placed PE reads:     ~82%         ?

Melanoma cell line COLO-829

Paul Edwards, Departments of Pathology and Oncology, University of Cambridge

Plots of INDELs/SVs size distribution for all events detected by Pindel at single-base resolution. Left, insertions from 1bp to 60 bp. Right, deletions from 1bp to 1Mb.

Homozygous/Heterozygous Indels

[Figure panels: (a) Insertion, with labels H1, Ref, B; (b) Deletion, with labels H1, Ref, B1, B2]

(a) Insertion: solid lines – reads whose alignment terminates at the breakpoint; dashed line – reads whose alignment crosses the breakpoint. (b) Deletion: solid line – read whose alignment terminates at the breakpoint; dashed lines – reads whose alignment crosses the breakpoint.

Assemblies are used to confirm Pindel predictions: (a) deletion is confirmed by aligning two flanking sequences F1 and F2 to the reference; (b) deletion is not found in the reference with flanking sequences; (c) insertion is confirmed.
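The idea behind case (a) can be illustrated with a toy check: take the two sequences flanking the predicted breakpoint in the assembled contig (F1 and F2), locate both in the reference, and see whether the reference gap between them equals the predicted deletion size. This sketch uses exact string matching on made-up sequences as a stand-in for a real aligner; the function name and all sequences are hypothetical.

```python
def confirm_deletion(reference, f1, f2, deleted_length):
    """Toy check: F1 and F2 both match the reference exactly, in order, and the
    reference bases between them number exactly `deleted_length`."""
    start1 = reference.find(f1)
    if start1 < 0:
        return False
    end1 = start1 + len(f1)
    start2 = reference.find(f2, end1)   # F2 must lie downstream of F1
    if start2 < 0:
        return False
    return start2 - end1 == deleted_length

reference = "ACGTTGCAAGGCTTTTTTCCGATAGCTA"
contig = "ACGTTGCAAGGCCCGATAGCTA"        # assembled sample lacking the six T's
f1, f2 = contig[:12], contig[12:]        # flanks on either side of the breakpoint
print(confirm_deletion(reference, f1, f2, 6))  # True
```

A real pipeline would align the flanks with mismatches allowed and would need to handle repeated flanks, which `str.find` does not.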

Acknowledgements:

Elizabeth Murchison, Erin Pleasance, Mike Stratton

Kai Ye

Dirk Evers, Ole Schulz-Trieglaff

Qi Feng, Bin Han
