Genome De Novo Assemblies and Applications in NGS Sequencing Zemin Ning The Wellcome Trust Sanger...
Genome De Novo Assemblies and Applications in NGS Sequencing
Zemin Ning
The Wellcome Trust Sanger Institute
Outline of the Talk:
My academic background
Challenges in genome assemblies from pure Illumina reads
The Phusion2 pipeline
The Tasmanian devil genome project
The Devil genome assembly
Other assemblies: human, bamboo, miscanthus, etc.
Powder Simulation
Hair Dynamics
Genetics and Human Hair Structure
African, Caucasian, East Asian
Informatics Projects Involved
SSAHA (Sequence Search and Alignment by Hashing Algorithm)
ssaha2 – Alignment tool for Solexa, 454, ABI capillary reads
ssahaSNP – SNP/indel detection, mainly for ABI capillary reads
ssahaEST – EST or cDNA alignment
ssaha_SV – Structural variation (CNVs) detection
ssaha_pileup – SNP/indel detection from next-gen data
Phusion & Phusion2
Development and maintenance of the pipeline
Production of WGS assemblies: Mouse, Zebrafish, Human (Venter genome), C. briggsae, Rice, Schisto, Sea Lamprey, Gorilla, Malaria and many bacterial genomes
TraceSearch – Public sequence search facility for all the traces
Fuzzypath – Short read assembler
Challenges in Whole Genome Assembly using Pure Illumina Reads
Short read length: 2x36; 2x54; 2x75; 2x100
Large genome and huge datasets. For human: 100 Gb at 30x
Repetitive/duplication structures: Alus, LINEs, SVAs. 30-40% of the genome in human and mouse; 50-60% in rice and other plant genomes.
Tandem repeats: how many copies do they have?
TATATATATATATATATATATATATATAGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGT
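The tandem-repeat question can be illustrated with a toy example. The sketch below is not from the talk: the flank sequences, copy numbers and function names are made up for illustration. It builds two toy genomes that differ only in the number of TA copies and compares the sets of error-free reads each would produce. With reads shorter than the repeat, the two read sets are identical, so the copy number cannot be recovered from read content alone.

```python
def read_set(genome, read_len):
    """All distinct error-free reads (substrings) of length read_len."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


def tandem_genome(copies, unit="TA", left="ACGGTCAGCT", right="GCTTGACCGA"):
    """Toy genome: hypothetical unique flanks around `copies` repeat units."""
    return left + unit * copies + right


# 30 copies (60 bp of repeat) versus 40 copies (80 bp of repeat).
genome_a = tandem_genome(30)
genome_b = tandem_genome(40)

# 36 bp reads never span the whole repeat, so both genomes yield the same reads.
same_with_short_reads = read_set(genome_a, 36) == read_set(genome_b, 36)

# 75 bp reads span the 60 bp repeat in genome_a, so the read sets differ.
same_with_long_reads = read_set(genome_a, 75) == read_set(genome_b, 75)

print(same_with_short_reads, same_with_long_reads)
```

In real data, read depth and read pairs give extra evidence about copy number; the sketch only covers what read content alone can resolve.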
De Bruijn vs Read overlap
Missing from de Bruijn contigs
Missing sequences
Phusion2 Assembly Pipeline
[Flowchart labels, in extraction order: Solexa Reads; Assembly; Reads Group; Data Process; Long Insert Reads; Supercontig; Contigs; PRono; Fuzzypath; Velvet; Phrap; 2x75 or 2x100; Base Correction; RP_Assemble]
Kmer Word Hashing
Gap-Hash 4x3
Read: ATGGCGTGCAGTCCATGTTCGGATCA
ATGGGCAGATGT
TGGCCAGTTGTT
GGCGAGTCGTTC
GCGTGTCCTTCG
Contiguous Base Hash, K = 12
Read: ATGGCGTGCAGTCCATGTTCGGATCA
ATGGCGTGCAGT
TGGCGTGCAGTC
GGCGTGCAGTCC
GCGTGCAGTCCA
CGTGCAGTCCAT
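The two hashing schemes can be sketched in a few lines. This is a minimal illustration and not the Phusion2 code. The gapped layout (three blocks of 4 bases separated by 3 skipped bases) is inferred from the words listed on the slide, and the function names and the 2-bit encoding order A, C, G, T are assumptions.

```python
READ = "ATGGCGTGCAGTCCATGTTCGGATCA"  # the 26-base read shown on the slide


def contiguous_words(read, k=12):
    """Every k-base window of the read, shifted one base at a time."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def gapped_words(read, block=4, gap=3, blocks=3):
    """Words built from `blocks` runs of `block` bases, skipping `gap` bases
    between runs (layout inferred from the slide, not confirmed)."""
    step = block + gap
    span = blocks * block + (blocks - 1) * gap
    words = []
    for start in range(len(read) - span + 1):
        parts = [read[start + b * step:start + b * step + block]
                 for b in range(blocks)]
        words.append("".join(parts))
    return words


def encode_2bit(word):
    """Pack a word into an integer using 2 bits per base (A=0, C=1, G=2, T=3)."""
    code = 0
    for base in word:
        code = (code << 2) | "ACGT".index(base)
    return code


print(contiguous_words(READ)[:5])
print(gapped_words(READ)[:4])
```

Both functions produce 12-base words, so either kind of word fits in 24 bits under the 2-bit encoding.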
Word use distribution for the mouse sequence data at ~7.5 fold
[Figure: word use distribution plot with a Poisson Curve, a Real Data Curve, and a marked Useful Region]
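A word use distribution of this kind can be computed by counting how often each k-mer occurs and then tallying how many distinct k-mers share each occurrence count. The sketch below uses toy reads and assumes that the "Useful Region" is used to keep only words whose occurrence lies between a lower and an upper threshold. The thresholds, reads and function names are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(reads, k):
    """Occurrences of every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def occurrence_histogram(counts):
    """Map an occurrence count to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


def poisson_pmf(n, mean):
    """Poisson probability of n occurrences, for comparison with real data."""
    return math.exp(-mean) * mean ** n / math.factorial(n)


def useful_words(counts, low, high):
    """Words whose occurrence falls inside the assumed useful region."""
    return {word for word, c in counts.items() if low <= c <= high}


toy_reads = ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT", "TTTTTTTTTT"]
counts = kmer_counts(toy_reads, k=4)
histogram = occurrence_histogram(counts)
kept = useful_words(counts, low=5, high=6)  # hypothetical thresholds
print(sorted(histogram.items()), sorted(kept))
```

In the toy data the over-represented word TTTT falls outside the chosen range, in the way a repeat-derived word would.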
Sorted List of Each k-Mer and Its Read Indices
ACAGAAAAGC 10h06.p1c
ACAGAAAAGC 12a04.q1c
ACAGAAAAGC 13d01.p1c
ACAGAAAAGC 16d01.p1c
ACAGAAAAGC 26g04.p1c
ACAGAAAAGC 33h02.q1c
ACAGAAAAGC 37g12.p1c
ACAGAAAAGC 40d06.p1c
ACAGAAAAGG 16a02.p1c
ACAGAAAAGG 20a10.p1c
ACAGAAAAGG 22a03.p1c
ACAGAAAAGG 26e12.q1c
ACAGAAAAGG 30e12.q1c
ACAGAAAAGG 47a01.p1c
High bits: 64 - 2k; Low bits: 2k
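One way to read the 64-bit layout is that each (k-mer, read index) pair is packed into a single integer, which makes the list cheap to sort. The sketch below assumes the k-mer occupies the low 2k bits (2 bits per base) and the read index the remaining 64 - 2k high bits. The slide does not state which field goes where, so this assignment, the function names and the toy reads are all assumptions.

```python
K = 10                      # word length used in the slide's example list
KMER_BITS = 2 * K           # 2 bits per base
KMER_MASK = (1 << KMER_BITS) - 1


def encode(kmer):
    """2-bit encoding with A=0, C=1, G=2, T=3."""
    code = 0
    for base in kmer:
        code = (code << 2) | "ACGT".index(base)
    return code


def decode(code, k=K):
    bases = []
    for _ in range(k):
        bases.append("ACGT"[code & 3])
        code >>= 2
    return "".join(reversed(bases))


def pack(read_index, kmer):
    """Read index in the high 64 - 2k bits, k-mer in the low 2k bits."""
    assert read_index < (1 << (64 - KMER_BITS)), "read index does not fit"
    return (read_index << KMER_BITS) | encode(kmer)


def unpack(word):
    return word >> KMER_BITS, decode(word & KMER_MASK)


def sorted_kmer_list(reads, k=K):
    """All packed words, sorted by k-mer first and read index second."""
    words = [pack(idx, read[i:i + k])
             for idx, read in enumerate(reads)
             for i in range(len(read) - k + 1)]
    words.sort(key=lambda w: (w & KMER_MASK, w >> KMER_BITS))
    return [unpack(w) for w in words]


toy_reads = ["ACAGAAAAGCT", "TACAGAAAAGG", "ACAGAAAAGGA"]
for read_index, kmer in sorted_kmer_list(toy_reads):
    print(kmer, read_index)
```

After sorting, reads that share a word sit next to each other in the list, which is what the sorted list on the slide shows.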
Relation Matrix: R(i,j) – number of kmer words shared between read i and read j

i \ j   1    2    3    4    5    6   ...  N
1       -   41    0    0    0    0
2      41    -   37    0    0    0
3       0   37    -    0   22    0
4       0    0    0    -    0   27
5       0    0   22    0    -    0
6       0    0    0   27    0    -
...
N

Group 1: (1,2,3,5)
Group 2: (4,6)
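The grouping step can be sketched as connected components over the relation matrix: two reads land in the same group when a chain of shared-word links connects them. The code below is an illustration using union-find, not the Phusion2 implementation. The matrix values are one reading of the flattened slide figure, and the `min_shared` threshold is a hypothetical parameter.

```python
def group_reads(relation, min_shared=1):
    """Cluster reads (numbered from 1) linked by at least min_shared shared words."""
    n = len(relation)
    parent = list(range(n))

    def find(x):
        # Find the group representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if relation[i][j] >= min_shared:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i + 1)
    return sorted(groups.values())


# R(i,j) for reads 1..6, as read from the slide (diagonal set to 0).
R = [
    [0, 41, 0, 0, 0, 0],
    [41, 0, 37, 0, 0, 0],
    [0, 37, 0, 0, 22, 0],
    [0, 0, 0, 0, 0, 27],
    [0, 0, 22, 0, 0, 0],
    [0, 0, 0, 27, 0, 0],
]

print(group_reads(R))
print(group_reads(R, min_shared=25))
```

Raising the threshold drops the weakest link (22 shared words between reads 3 and 5), so read 5 separates from the first group.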
Paired Reads Separated by “NN”
Error Bases Correction
Mis-assembly errors: Contig Breaking
Read Pair Guided Local Assembler
Track read pairs to walk through repetitive regions
Tasmanian devil
[Figure labels: Opossum, Wallaby, Tasmanian devil]
Tasmanian devil facial tumour disease (DFTD)
Transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils
Transmitted by biting
Commonly metastasises
First observed in 1996
Primarily affects adults >1yr
Death in 4 – 6 months
DFTD samples
[Map labels (sample counts in parentheses): Forestier (33), Fentonbury (no host), Reedy Marsh, Railton, Mangalore, Frankford, Kempton (2), Mt William (2), Coles Bay, Upper Natone, West Pencil Pine (3), Trowunna (2), Narawntapu, Tarraleah, Bronte Park, Nugent (2), St Mary’s (2), Wisedale (?). Legend: 2006, 2007, 2008; 14, 4, 13; DFTD originated here c.1996; area still DFTD free]
DFTD samples for sequencing
[Map labels: Reedy Marsh 2007, Mangalore 2007, Mt William 2007 or 2008, Coles Bay, Upper Natone 2007, Narawntapu 2007, Forestier 2007. Legend: Strain 1, tetraploid; Strain 2; Strain 3; Unknown strain; “Evolved”; DFTD originated here c.1996; area still DFTD free]
Sequencing T. Devil on Illumina: Strategy
Tumour or normal genomic DNA
Fragments of defined size: 0.5, 2, 5, 7, 8, 10 kb
Sequencing
2x100bp reads short insert
2x50bp mate pairs
Alignment using bwa, ssaha2
Somatic mutations
Germline variants
[Figure: fragment size distribution. x-axis: size, 0 to 6000; y-axis: frequency, 0 to 180000. Series: tumour 2kb, tumour 3kb, tumour 4kb, normal 2kb, normal 3kb, normal 4kb]
Sequencing performed at Illumina
De novo Assembly
Genome Assembly – T. Devil
Solexa reads:
Number of read pairs: 528 million
Finished genome size: 3.5 Gb
Read length: 2x100 bp
Estimated read coverage: ~30X
Insert size: 410/50-600 bp
Mate pair data: 2k, 4k, 5k, 6k, 8k, 10k
Number of reads clustered: 458 million
Assembly features - stats:
                              Contigs      Supercontigs
Total number of contigs:      1,246,970    792,099
Total bases of contigs:       3.22 Gb      3.62 Gb
N50 contig size:              9,642        434,642
Largest contig:               96,919       4,150,712
Averaged contig size:         2,578        4,564
Contig coverage on genome:    ~92%         >99%
Ratio of placed PE reads:     ~92%         ?
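The estimated read coverage on these slides can be checked with simple arithmetic: total sequenced bases (read pairs × 2 × read length) divided by genome size. The helper below is a sketch with a made-up function name. The inputs are the figures quoted on the T. devil slide and the Yoruba NA18507 slide.

```python
def read_coverage(read_pairs, read_length, genome_size):
    """Fold coverage = total sequenced bases / genome size."""
    return read_pairs * 2 * read_length / genome_size


# T. devil: 528 million pairs of 2x100 bp reads over a 3.5 Gb genome.
devil_cov = read_coverage(528_000_000, 100, 3_500_000_000)

# Human NA18507: 560 million pairs of 2x100 bp reads over a 3.0 Gb genome.
human_cov = read_coverage(560_000_000, 100, 3_000_000_000)

print(f"T. devil: {devil_cov:.1f}X, human: {human_cov:.1f}X")
```

Both values round to the coverage figures quoted on the slides (~30X and ~37X).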
Monodelphis domestica (Opossum)
Macropus eugenii (Wallaby)
Sminthopsis macroura (Dunnart)
Brown Bear
Dog
Pipeline of Contig Gap Closure
Human Assembly - Yoruba NA18507
Solexa reads:
Number of read pairs: 560 million
Finished genome size: 3.0 Gb
Read length: 2x100 bp
Estimated read coverage: ~37X
Insert size: 500/50-700 bp
Number of reads clustered: 499 million
Assembly features - contig stats:
Total number of contigs: 1,142,077
Total bases of contigs: 2.92 Gb
N50 contig size: 12,875
Largest contig: 140,463
Averaged contig size: 2,561
Contig coverage over the genome: ~94%
Mis-assembly errors: ?
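The N50 and averaged contig size reported in these tables follow from the list of contig lengths. The sketch below uses the common definition of N50: the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the assembled bases. The toy lengths are made up, and the last line checks the slide's averaged contig size against its own totals.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def mean_length(lengths):
    return sum(lengths) / len(lengths)


toy_contigs = [100, 80, 60, 40, 20]    # hypothetical contig lengths
print(n50(toy_contigs), mean_length(toy_contigs))

# Consistency check on the human assembly: total bases / number of contigs.
human_mean = 2.92e9 / 1_142_077
print(round(human_mean))
```

The computed mean is close to the quoted 2,561. The small gap is consistent with the total being rounded to 2.92 Gb.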
Bamboo Genome Assembly (Tetraploid)
Solexa reads:
Number of read pairs: 359 million
Finished genome size: 2.0 Gb
Read length: 2x120 bp
Estimated read coverage: ~43X
Insert size: 500/50-700 bp
Number of reads clustered: 316 million
Assembly features - contig stats:
Total number of contigs: 733,465
Total bases of contigs: 1.91 Gb
N50 contig size: 8,163
Largest contig: 117,250
Averaged contig size: 2,592
Contig coverage over the genome: ~92%
Mis-assembly errors: ?
Genome Assembly – Miscanthus
Solexa reads:
Number of read pairs: 502 million
Finished genome size: 2.0 Gb
Read length: 2x76 bp
Estimated read coverage: ~35X
Insert size: 410/50-600 bp
Mate pair data: 5 kb
Number of reads clustered: 438 million
Assembly features - stats:
                              Contigs      Supercontigs
Total number of contigs:      2,241,465    2,090,385
Total bases of contigs:       1.64 Gb      1.92 Gb
N50 contig size:              4,301        29,076
Largest contig:               71,161       730,290
Averaged contig size:         732          919
Contig coverage on genome:    ~85%         >95%
Ratio of placed PE reads:     ~82%         ?
Melanoma cell line COLO-829
Paul Edwards, Departments of Pathology and Oncology, University of Cambridge
Plots of INDELs/SVs size distribution for all events detected by Pindel at single-base resolution. Left, insertions from 1bp to 60 bp. Right, deletions from 1bp to 1Mb.
Homozygous/Heterozygous Indels
[Figure: panel (a) Insertion, with labels H1, Ref and B; panel (b) Deletion, with labels H1, Ref, B1 and B2]
(a) Insertion: solid lines – reads whose alignment terminates at the breakpoint; dashed line – reads whose alignment crosses the breakpoint. (b) Deletion: solid line – read whose alignment terminates at the breakpoint; dashed lines – reads whose alignment crosses the breakpoint.
Assemblies are used to confirm Pindel predictions: (a) deletion is confirmed by aligning two flanking sequences F1 and F2 to the reference; (b) deletion is not found in the reference with flanking sequences; (c) insertion is confirmed.
Acknowledgements:
Elizabeth Murchison, Erin Pleasance, Mike Stratton
Kai Ye
Dirk Evers, Ole Schulz-Trieglaff
Qi Feng, Bin Han