Bionano genome maps_feb2014
Whole genome restriction maps for nonmodel organisms: genomic resources where there were none.
Sue Brown
Division of Biology
Kansas State University
Tuesday, February 25, 14
Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results

Genomes
• Genomes come in many sizes
• Genome assemblies come in many qualities
• Draft assemblies
▫ Most genomes sequenced today (nonmodel)
• Finished assemblies
▫ Model organisms (lots of resources)
   Human
   Computational, genetic and genomic tools
• Genomic resources increase the value of the genome sequence
▫ Reverse genetic approaches

Many initiatives to sequence genomes
• 1,000 human genomes
▫ To provide a deep catalog of human genetic variation
• Genome 10K: started as an initiative to sequence 10,000 vertebrate genomes. The database currently catalogs specimens from over 16,000 organisms
▫ To understand how complex animal life evolved through changes in DNA, and to use this knowledge to become better stewards of the planet

Letter to Science Announces i5k in 2011
Why sequence 5,000 insect genomes?
• 53% of all living species
• Maintenance and productivity of natural and agricultural ecosystems
• Consume or damage 25% of all agricultural, forestry and livestock production
▫ >$30 billion in annual loss
• Vector plant, animal and human disease
▫ >$50 billion cost worldwide
• Just as human and veterinary medicine now rely on personal or animal genome information, revealing the information stored in insect genomes will transform our ability to manage insects that threaten our health, food supply and economic security
• Improve our lives

Standard Draft Genome Assemblies
• Highly fragmented, even at deep coverage
• Scaffolds terminate in repetitive regions
• Relatively low N50 values
• Example: 7x Sanger-based Tribolium castaneum genome assembly

Tribolium castaneum genomics
• Cot analysis
▫ Genome ~200 Mb
▫ Long stretches of unique sequence
▫ Low methylation
• 9 autosomes, X and Y
Jeff Stuart, Purdue

Standard Draft: minimally filtered or unfiltered data, from any number of different sequencing platforms, that are assembled into contiguous strings of bases (AGTC) with no gaps (contigs).
Science Oct 9, 2009 pp236-237
This is the minimum standard for submission to public databases.
http://compbio.pbworks.com

Molecular linkage map used to anchor scaffolds in chromosome builds (ChLG)
Low X coverage, no Y, marker density varies
T. castaneum assembly stats
• Number of contigs: 8,814
• Contig N50: 43,511
• Number of scaffolds: 481
• Scaffold N50: 975,455
• Total number of chromosomes: 10 (-Y)
• Unmapped scaffolds: 352
• Single contig scaffolds: 1835
• (481 + 1830 = 2321 scaffolds total)
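
The contig and scaffold N50 values above follow the usual definition: sort the sequences from longest to shortest and report the length at which the running total first reaches half of the assembled bases. A minimal sketch, using made-up lengths rather than the Tribolium data (the function name `n50` is ours, not part of any pipeline mentioned in the talk):

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one sequence length")


# Toy example with made-up scaffold lengths (not real data).
toy_scaffolds = [50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

A few very long scaffolds raise N50 quickly, which is why super-scaffolding changes this statistic more than it changes the scaffold count.
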
Scaffold structure of the Tribolium genome assembly
(Figure labels: NW, AAJJ, DS, 300K Ns, ChLG, Unanchored)

Genome assembly improvements
Improving the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules
Nic Herndon (Kansas State University), Jennifer M. Shelton (Kansas State University), Warren Andrews (BioNano Genomics), Weiping Wang (BioNano Genomics), Susan J. Brown (Kansas State University)
Genome assemblies come in all qualities. Most are basically drafts of the genome, but even the most heavily curated assemblies contain misassemblies and truncations or gaps in repetitive regions. The 7x draft assembly of the Tribolium genome is based on paired-end Sanger sequencing of 4-6 Kb insert plasmid libraries, scaffolded with paired-end reads from 40 Kb fosmid and ~130 Kb BAC clones. The total assembled length of ~156 Mb represents roughly three-quarters of the estimated genome (200 Mb) and presumably lacks a significant portion of the repetitive DNA. Superscaffolds, or chromosome builds (ChLG 2-10 and X), were constructed by mapping molecular markers from the genetic recombination map to the assembly scaffolds, anchoring more than 90% of the assembled sequence (ref. 1; Fig. 1). To improve this draft assembly, we constructed physical maps of the T. castaneum genome using the Irys system designed by BioNano Genomics (http://www.bionanogenomics.com/). Ultra-long (megabase-scale) molecules were nicked on one strand with Nt.BspQI and labeled with fluorescent nucleotides. Individual molecules were imaged on a massively parallel scale in nanochannels etched on silicon chips. Consensus maps assembled de novo from the imaged molecules were compared with in silico maps generated from the assembly sequence. Here we report our progress in using these comparisons to validate the assembly in regions where they agree and to reanalyze the assembly in regions where they do not. Additional scaffolds have been anchored to the chromosomes, the order and orientation of scaffolds have been corrected, and scaffolds have been extended by spanning repetitive regions.
1. Nature 2008 452:949-55.
Figure 1 Molecular linkage maps
T. castaneum LGX cannot be ordered using this technique because it has too few markers. Additionally, regions with low recombination and higher marker density can only be randomly ordered. BioNano ultra-long molecules can be assembled to order and orient scaffolds regardless of recombination rates.
Incorporating unplaced scaffolds: An unknown is incorporated into Linkage group 5. (Panel labels: ChLG5, Unknown45)
Incorporating unplaced scaffolds and super-scaffolding difficult alignments: An unknown is incorporated into Linkage group X, and many Linkage group X scaffolds are aligned across potential label-poor repeats. (Panel labels: Unknown18, ChLGX)
Align through repeats to find mis-assemblies: Sequence assemblers often fail around repeats, but BioNano molecules can span long repeats and point to potential mis-assemblies. Here, the alignment of a scaffold from Linkage group 5 to Linkage group 5 stops around a repetitive pattern in the BioNano molecule map, but the unaligned region of the scaffold aligns to Linkage group 7. The putative repeat may be a source of mis-assembly. (Panel labels: ChLG5, Repeat, ChLG7)
Figure Mis-assemblies
Incorporating unplaced scaffolds and identifying mis-assemblies: An unknown is incorporated into Linkage group 10, and a portion of Linkage group 3 appears to have been mis-assembled. (Panel labels: ChLG10, Unknown 11, ChLG3)
Figure Unplaced scaffolds
Validate and expand super-scaffolds: Three scaffolds from Linkage group 3 have been super-scaffolded with captured gaps. (Panel labels: ChLG3)
Figure Assembly validation
T. castaneum 4.0 and gam-ngs
Gam-ngs merged the Illumina assembly and T.cas 4.0, extending several unknowns and an LGX scaffold.
length (Mb): 160.864
scaffolds: 2219
scaffold N50 (Mb): 1.16

T. castaneum 4.0 and gam-ngs plus BioNano maps
Sequence scaffolds were aligned to maps with IrysView; the alignment was filtered and used to create new scaffolds.
length (Mb): 189.629
scaffolds: 2153
scaffold N50 (Mb): 3.31

T. castaneum 4.0
Illumina long-distance jump libraries extended scaffolds into gaps, and gaps were captured with Atlas gap-link and gap-filler.
length (Mb): 160.862
scaffolds: 2219
scaffold N50 (Mb): 1.16

T. castaneum 3.0
Baylor Sanger 7x draft assembly and molecular genetic map
length (Mb): 160.466
scaffolds: 2321
scaffold N50 (Mb): 0.98

Figure 2 Genome refinements
Multicontig scaffolds (figure labels): 481, 411, 411, 341
An independent platform to validate and improve genomes

How to validate a de novo assembly?
• Describe the assembly
▫ # contigs, # scaffolds, total bases, N50 lengths, coverage, # ESTs, # orthologs found
• But is the assembly accurate?
▫ Compare to BAC sequences
▫ If you have the resources
• Need an independent (reasonably priced) method

Genome maps based on landmarks
• BioNano Genomics
▫ San Diego, California
• Imaging ultra-long molecules of DNA
• Labeled at restriction sites

Introducing the Irys system

Labeling schema
Nt.BspQI nicks at
GCTCTTCN
CGAGAAGN
10 sites / 100 Kb
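
The label density quoted above can be estimated for any sequence by searching for the Nt.BspQI recognition motif (GCTCTTC) on both strands. This is a rough sketch of that in silico step, not the vendor's digestion tool: the function names are ours, and the positions reported are motif starts rather than exact nick coordinates.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_label_sites(seq, motif="GCTCTTC"):
    """Return sorted 0-based start positions of the motif on either strand."""
    seq = seq.upper()
    sites = set()
    # Search the motif itself and its reverse complement (the other strand).
    for pattern in {motif, reverse_complement(motif)}:
        start = seq.find(pattern)
        while start != -1:
            sites.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(sites)


def sites_per_100kb(seq, motif="GCTCTTC"):
    """Label density expressed as motif occurrences per 100 Kb of sequence."""
    return len(find_label_sites(seq, motif)) * 100_000 / len(seq)


# Toy sequence (made up): one site on each strand.
toy = "AAGCTCTTCAAGAAGAGCTT"
print(find_label_sites(toy))
```

Running `sites_per_100kb` over each scaffold of a draft assembly is one way to spot label-poor scaffolds before attempting alignment.
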
Chip Design
Samples loaded into 2 flow cells per chip
• 3 lasers, 3 detection channels
• Detect YOYO-1 in DNA backbone
• Fluorescent nucleotides at labeled sites

DNA molecules entering channels
A long repeat in the Tribolium genome
Mapping individual images back to map
Regions flanking repeat are unique
Some sites are polymorphic

Limitations of the Irys system
• Sample prep is very specific
• Requires gram amounts of starting material
• Bacterial cells, tissue culture cells, eukaryotic nuclei
• Less complex tissue is best
▫ Blood
▫ Embryos
• Not applicable to transcriptomics projects
• contig N50 >30 Kb (5 restriction sites)

Assemble images into genomic maps
.tiff
.bnx
.cmap

Align BNG maps to in silico maps (.xmap)
File formats are similar to those generated for sequence data...

Sequence data: image files → basecall → fastq → de novo assemble → fasta → align → sam
BioNano data: image files → call labels → bnx → de novo assemble → cmap → align → xmap

fastq
    @SRR014849.2 EIXKN4201AKDUH/2
    TCAAGTGGTGAACGGCAGAAA
    +
    <=B:==B:=<?6=B;<;=B=)

bnx
    0 21 202146.41 1096.2 8973.8
    QX11 10.0565 11.7966
    QX12 0.0187 0.0604

fasta
    >contig1
    TCAAGTGGTGAACGGCAGAAA

cmap
    #h CMapId ContigLength NumSites SiteID LabelChannel Position StdDev Coverage Occurrence
    #f int float int int int float float int int
    393 225073.2 21 1 1 20.0 0.0 3 3

sam
    HWI-ST330_C0NEHACXX:2:1101:17113:52802#0 69 contig1 2578 0 * = 2578 0 ATTACGGCCCATGGTTCAGAATAATGACGAATAGAAATACTAGTACTATATCCCCTAAAAAA <@CFFFFFHHGFHJHIJJJJJJJJJFJJJFGFHEHIHGHJGIJHIIIJJJJJJJJIJIIJIH YT:Z:UP

xmap
    #h XmapEntryID QryContigID RefcontigID QryStartPos QryEndPos RefStartPos RefEndPos Orientation Confidence HitEnum
    #f int int int float float float float string float string
    1 94 1 444392.7 5839.8 57024.0 550038.8 - 28.87 1M1D2M3I4D1M3I2M1I7M1I1M1I9M1I1M1I2M1I3M1D2M
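
To show how the xmap columns on this slide map onto typed fields, here is a small parsing sketch. It assumes tab-separated data lines with exactly the ten columns in the `#h` header shown here; files from other software versions may carry additional columns, and the function names are ours rather than part of IrysView or the Irys-scaffolding scripts.

```python
import re

# Column names and types taken from the #h / #f header lines on the slide.
XMAP_COLUMNS = [
    ("XmapEntryID", int), ("QryContigID", int), ("RefcontigID", int),
    ("QryStartPos", float), ("QryEndPos", float),
    ("RefStartPos", float), ("RefEndPos", float),
    ("Orientation", str), ("Confidence", float), ("HitEnum", str),
]


def parse_xmap_line(line):
    """Turn one tab-separated xmap data line into a dict of typed values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(XMAP_COLUMNS):
        raise ValueError(f"expected {len(XMAP_COLUMNS)} fields, got {len(fields)}")
    return {name: cast(value) for (name, cast), value in zip(XMAP_COLUMNS, fields)}


def parse_hit_enum(hit_enum):
    """Split a CIGAR-like HitEnum string into (count, operation) pairs."""
    return [(int(count), op) for count, op in re.findall(r"(\d+)([MDI])", hit_enum)]


# The example record from the slide, with tabs restored between fields.
example = ("1\t94\t1\t444392.7\t5839.8\t57024.0\t550038.8\t-\t28.87\t"
           "1M1D2M3I4D1M3I2M1I7M1I1M1I9M1I1M1I2M1I3M1D2M")
record = parse_xmap_line(example)
print(record["QryContigID"], record["Orientation"], record["Confidence"])
print(parse_hit_enum(record["HitEnum"])[:3])
```

Lines beginning with `#` are headers and would be skipped before calling `parse_xmap_line`.
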
Visualizing an xmap
(Figure labels: contig id, sequence-based scaffold, label alignment, BioNano contig map, coverage)

K-INBRE i5K GitHub scripts
Irys scaffolding scripts and manuals were written by Jennifer Shelton and Nic Herndon. The assembly workflow was developed by Ernest Lam (BioNano).
git clone https://github.com/i5K-KINBRE-script-share/Irys-scaffolding

Assembly pipeline
Scripts available at the i5k-KINBRE script share on GitHub: Irys-scaffolding
https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
Developed with Ernest Lam (BioNano)

Filtering alignments
Label density varies throughout the genome, so we created scripts to filter in two passes:
Pass 1: looks for a high confidence score over at least ~30% of the total possible alignment
Pass 2: accepts a lower confidence score, but over the majority of the total possible alignment (~90%)
Pass 1 finds most high-quality alignments. Pass 2 finds high-quality, low-density alignments.
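
The two passes can be read as two alternative acceptance rules applied to each alignment. The sketch below illustrates that logic only; it is not the code in the Irys-scaffolding repository. The ~30% and ~90% fractions come from the slide, while the confidence cutoffs (`high_conf`, `low_conf`) and the function name are hypothetical placeholders to be tuned per genome.

```python
def classify_alignment(aligned_len, possible_len, confidence,
                       high_conf=40.0, low_conf=15.0,
                       first_min_fraction=0.30, second_min_fraction=0.90):
    """Return 'pass1', 'pass2', or None for a single alignment.

    aligned_len  -- length actually aligned
    possible_len -- total possible alignment length for this pair
    confidence   -- alignment confidence score (e.g. the xmap Confidence)
    The two confidence cutoffs are hypothetical values, not from the talk.
    """
    if possible_len <= 0:
        raise ValueError("possible_len must be positive")
    fraction = aligned_len / possible_len
    # Pass 1: high confidence over at least ~30% of the possible alignment.
    if confidence >= high_conf and fraction >= first_min_fraction:
        return "pass1"
    # Pass 2: lower confidence, but over ~90% of the possible alignment.
    if confidence >= low_conf and fraction >= second_min_fraction:
        return "pass2"
    return None


print(classify_alignment(40_000, 100_000, 50.0))
print(classify_alignment(95_000, 100_000, 20.0))
print(classify_alignment(40_000, 100_000, 20.0))
```

Keeping the pass label, rather than a plain yes/no, makes it easy to review the low-density alignments rescued by the second pass separately.
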
Filtering alignments
Super-scaffolded scaffolds are joined in a new reference fasta file.
Overlapping scaffolds have a 30 bp spacing gap between them.
If a scaffold aligns more than once, only the longest alignment is used.
If two alignments have the same length, only the highest-confidence alignment is used.
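
The selection rules above amount to ranking each scaffold's alignments by length and breaking ties on confidence. A minimal sketch under our own assumptions: the dictionary keys and function names are illustrative, and filling the 30 bp spacer with `N` characters is an assumption, since the slide does not say what the gap contains.

```python
def best_alignment_per_scaffold(alignments):
    """Keep one alignment per scaffold: longest first, then highest confidence."""
    best = {}
    for aln in alignments:
        rank = (aln["aligned_len"], aln["confidence"])
        kept = best.get(aln["scaffold"])
        if kept is None or rank > (kept["aligned_len"], kept["confidence"]):
            best[aln["scaffold"]] = aln
    return best


def join_with_spacer(left_seq, right_seq, spacer_len=30):
    """Join two overlapping scaffolds with a fixed spacer (assumed to be Ns)."""
    return left_seq + "N" * spacer_len + right_seq


# Toy alignments (made-up values): s1 aligns twice, s2 has a length tie.
toy = [
    {"scaffold": "s1", "map": 7, "aligned_len": 500_000, "confidence": 20.0},
    {"scaffold": "s1", "map": 9, "aligned_len": 800_000, "confidence": 12.0},
    {"scaffold": "s2", "map": 7, "aligned_len": 300_000, "confidence": 15.0},
    {"scaffold": "s2", "map": 3, "aligned_len": 300_000, "confidence": 31.0},
]
chosen = best_alignment_per_scaffold(toy)
print({name: aln["map"] for name, aln in chosen.items()})
```
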
BNG restriction maps for T. castaneum
• Dual nicked with Nt.BspQI and BbvCI
• 28.6 Gb = ~143x coverage of the 200 Mb Tribolium genome (molecules >150 Kb)
• N contigs: 216
• Total Contig Len (Mb): 200.473
• Avg. Contig Len (Mb): 0.928
• Contig N50 (Mb): 1.350
• Total Ref Len (Mb): 157.186
• Total Contig Len / Ref Len: 1.275
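
The derived figures on this slide follow directly from the reported totals. The short check below recomputes them; the variable names are ours and the inputs are the numbers quoted above.

```python
# Inputs quoted on the slide.
total_molecule_bases = 28.6e9   # 28.6 Gb of imaged molecules
genome_size_bases = 200e6       # estimated 200 Mb genome
n_contigs = 216
total_contig_len_mb = 200.473
total_ref_len_mb = 157.186

# Derived values.
coverage = total_molecule_bases / genome_size_bases    # fold coverage
avg_contig_len_mb = total_contig_len_mb / n_contigs    # mean map contig length
length_ratio = total_contig_len_mb / total_ref_len_mb  # maps vs. sequence

print(f"coverage: {coverage:.0f}x")
print(f"avg contig length: {avg_contig_len_mb:.3f} Mb")
print(f"contig / ref length: {length_ratio:.3f}")
```

A ratio above 1 means the genome maps span more bases than the sequence assembly they are compared against.
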
ChLG X
ChLG X had 13 scaffolds. Alignment to BioNano maps captured gaps and validated order for 11 of 13 scaffolds, incorporated 2 unplaced scaffolds and identified a potentially misplaced scaffold (scaffold 2 aligns with another linkage group).

ChLG 7
Alignment to BioNano maps captured gaps and validated order for 13 of 15 scaffolds. Scaffold 14 needs to be reversed in the super-scaffold.

Additional chromosome linkage groups.
ChLG 3
ChLG 9
ChLG 2
What does it cost?
• 100-500 Mb genome: <$5,000
▫ 70-100x coverage
• 1 Gb genome: <$8,000
▫ 70-100x coverage
• Completely dependent on homogeneity of starting material
• Assembly and analysis software is included in the price

Summary
• Standard draft genomes are highly fragmented
• BNG provides an independent platform
• Whole genome restriction maps
• Validate assembly
• Extend scaffolds / size gaps
• Identify structural variants
• Identify haplotypes
• Comprehensive view of repetitive DNA (HORs)
• A validated genome assembly improves downstream analyses

Thanks to:
• Michelle Gordon
▫ Research Assistant: optimizing sample preps
• Jennifer Shelton
▫ Biologist turned Bioinformaticist
• Nic Herndon
▫ Computer scientist turned Bioinformaticist
• BioNano Genomics
▫ Ernest Lam
▫ Weiping Wang