Bionano genome maps_feb2014

Whole genome restriction maps for nonmodel organisms: genomic resources where there were none. Sue Brown Division of Biology Kansas State University Tuesday, February 25, 14


Transcript of Bionano genome maps_feb2014

Page 1: Bionano genome maps_feb2014

Whole genome restriction maps for nonmodel organisms: genomic resources where there were none.
Sue Brown
Division of Biology
Kansas State University

Tuesday, February 25, 14

Page 2: Bionano genome maps_feb2014

Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results

Page 3: Bionano genome maps_feb2014

Genomes
• Genomes come in many sizes
• Genome assemblies come in many qualities
• Draft Assemblies
▫ Most genomes sequenced today (nonmodel)
• Finished Assemblies
▫ Model organisms (lots of resources)
▫ Human
▫ Computational, genetic and genomic tools
• Genomic resources increase the value of the genome sequence
▫ Reverse genetic approaches

Page 4: Bionano genome maps_feb2014

Many initiatives to sequence genomes
• 1,000 human genomes
▫ To provide a deep catalog of human genetic variation
• Genome 10K: started as an initiative to sequence 10,000 vertebrate genomes. Database currently catalogs specimens from over 16,000 organisms
▫ To understand how complex animal life evolved through changes in DNA and use this knowledge to become better stewards of the planet

Page 5: Bionano genome maps_feb2014

Letter to Science Announces i5k in 2011


Page 6: Bionano genome maps_feb2014

Why sequence 5,000 insect genomes?
• 53% of all living species
• Maintenance and productivity of natural and agricultural ecosystems
• Consume or damage 25% of all agricultural, forestry and livestock production
▫ >$30 Billion in annual loss
• Vector plant, animal and human disease
▫ >$50 Billion cost worldwide
• Just as human and veterinary medicine now rely on personal or animal genome info, revealing info stored in their genomes will transform our ability to manage insects that threaten our health, food supply and economic security
• Improve our lives

Page 7: Bionano genome maps_feb2014

Standard Draft Genome Assemblies

• Highly fragmented, even at deep coverage
• Scaffolds terminate in repetitive regions
• Relatively low N50 values
• Example:
▫ 7x Sanger-based Tribolium castaneum genome assembly

Page 8: Bionano genome maps_feb2014

Tribolium castaneum genomics

• Cot analysis
▫ Genome ~200Mb
▫ Long stretches of unique sequence
▫ Low methylation
• 9 autosomes, X and Y

Jeff Stuart, Purdue

Page 9: Bionano genome maps_feb2014

Standard Draft
Minimally filtered or unfiltered data, from any number of different sequencing platforms, that are assembled into contiguous strings of bases (AGTC) with no gaps (contigs).

Science Oct 9, 2009 pp236-237

This is the minimum standard for submission to public databases.

http://compbio.pbworks.com

Page 10: Bionano genome maps_feb2014

Molecular linkage map used to anchor scaffolds in chromosome builds (ChLG)

Low X coverage, no Y, marker density varies


Page 11: Bionano genome maps_feb2014

Molecular linkage map used to anchor scaffolds in chromosome builds (ChLG)

Low X coverage, no Y, marker density varies


Page 12: Bionano genome maps_feb2014

T. castaneum assembly stats

• Number of contigs: 8,814
• Contig N50: 43,511
• Number of scaffolds: 481
• Scaffold N50: 975,455
• Total number of chromosomes: 10 (-Y)
• Unmapped scaffolds: 352
• Single contig scaffolds: 1835
• (481 + 1830 = 2321 scaffolds total)
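The contig and scaffold N50 values on this slide follow the usual definition: the length L such that sequences of length L or longer hold at least half of the assembled bases. A minimal sketch of that calculation is below; the function name `n50` and the toy lengths are illustrative assumptions, not Tribolium data or code from the talk.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy lengths in bp (not taken from the T. castaneum assembly).
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))
```

Because N50 depends only on the length distribution, it describes contiguity and says nothing about whether the joins are correct, which is the gap the genome maps are meant to fill.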


Page 13: Bionano genome maps_feb2014

Scaffold structure of the Tribolium genome assembly

[Figure: scaffold structure diagram. Labels: ChLG, NW, AAJJ, 300K Ns, DS, Unanchored]

Page 14: Bionano genome maps_feb2014

Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results

Page 15: Bionano genome maps_feb2014

Genome assembly improvements

Improving the Tribolium Draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules
Nic Herndon (Kansas State University), Jennifer M. Shelton (Kansas State University), Warren Andrews (BioNano Genomics), Weiping Wang (BioNano Genomics), Susan J. Brown (Kansas State University)

Genome assemblies come in all qualities. Most are basically drafts of the genome, but even the most heavily curated assemblies contain misassemblies and truncations or gaps in repetitive regions. The 7x draft assembly of the Tribolium genome is based on paired-end Sanger sequencing of 4-6 Kb insert plasmid libraries, scaffolded with paired-end reads from 40 Kb fosmid and ~130 Kb BAC clones. The total assembled length of ~156 Mb represents 75% of the estimated genome (200 Mb) and presumably lacks a significant portion of repetitive DNA. Superscaffolds or chromosome builds (ChLG 2-10 and X) were constructed by mapping molecular markers from the genetic recombination map to the assembly scaffolds, anchoring greater than 90% of the assembled sequence [1] (Fig. 1). To improve this draft assembly, we constructed physical maps of the T. castaneum genome using the Irys system designed by BioNano Genomics (http://www.bionanogenomics.com/). Ultra-long molecules (Mb) were nicked on one strand with Nt.BspQI and labeled with fluorescent nucleotides. Individual molecules were imaged on a massively parallel scale in nanochannels etched on silicon chips. Consensus maps assembled de novo from the imaged molecules were compared with in silico maps generated from the assembly sequence. Here we report our progress on using these comparisons to validate the assembly in regions where they agree and reanalyze the assembly in regions where they do not. Additional scaffolds have been anchored to the chromosomes, order and orientation of scaffolds have been corrected, and scaffolds have been extended by spanning repetitive regions.

1. Nature 2008 452:949-55.

Figure 1 Molecular linkage maps
T. castaneum LGX cannot be ordered using this technique because of too few markers. Additionally, regions with low recombination and higher marker density can only be randomly ordered. BioNano ultra-long molecules can be assembled to order and orient scaffolds regardless of recombination rates.

Incorporating unplaced scaffolds: An unknown is incorporated into Linkage group 5.
[Panel labels: ChLG5, Unknown45]

Incorporating unplaced scaffolds and super-scaffolding difficult alignments: An unknown is incorporated into Linkage group X, and many Linkage group X scaffolds are aligned across potential label-poor repeats.
[Panel labels: Unknown18, ChLGX]

Align through repeats to find mis-assemblies: Sequence assemblers often fail around repeats, but BioNano molecules can span long repeats and point to potential mis-assemblies. Here, the alignment of a scaffold from Linkage group 5 to Linkage group 5 stops around a repetitive pattern in the BioNano molecule map, but the unaligned region of the scaffold aligns to Linkage group 7. The putative repeat may be a source of mis-assembly.
[Panel labels: ChLG5, ChLG7, Repeat]

Figure Mis-assemblies

Incorporating unplaced scaffolds and identifying mis-assemblies: An unknown is incorporated into Linkage group 10, and a portion of Linkage group 3 appears to have been mis-assembled.
[Panel labels: ChLG10, Unknown 11, ChLG3]

Figure Unplaced scaffolds

Validate and expand super-scaffolds: Three scaffolds from Linkage group 3 have been super-scaffolded with captured gaps.
[Panel labels: ChLG3]

Figure Assembly validation

T. castaneum 4.0 and gam-ngs
Gam-ngs merged the Illumina assembly and T.cas 4.0, extending several unknowns and an LGX scaffold.
• length (Mb): 160.864
• scaffolds: 2219
• scaffold N50 (Mb): 1.16

T. castaneum 4.0 and gam-ngs plus BioNano maps
Sequence scaffolds were aligned to maps with IrysView; the alignment was filtered and used to create new scaffolds.
• length (Mb): 189.629
• scaffolds: 2153
• scaffold N50 (Mb): 3.31

T. castaneum 4.0
Illumina long-distance jump libraries extended scaffolds into gaps, and gaps were captured with Atlas gap-link and gap-filler.
• length (Mb): 160.862
• scaffolds: 2219
• scaffold N50 (Mb): 1.16

T. castaneum 3.0
Baylor Sanger 7x draft assembly and molecular genetic map
• length (Mb): 160.466
• scaffolds: 2321
• scaffold N50 (Mb): 0.98

Figure 2 Genome refinements
[Figure labels: multicontig scaffolds 481, 411, 411, 341]

An independent platform to validate and improve genomes

Page 16: Bionano genome maps_feb2014

How to validate a de novo assembly?
• Describe assembly
▫ # contigs, # scaffolds, total bases, N50 lengths, coverage, # ESTs, # orthologs found
• But is the assembly accurate?
▫ Compare to BAC sequences
▫ If you have the resources
• Need independent (reasonably priced) method

Page 17: Bionano genome maps_feb2014

Genome maps based on landmarks

• BioNano Genomics
▫ San Diego, California
• Imaging ultra-long molecules of DNA
• Labeled at restriction sites

Page 18: Bionano genome maps_feb2014

Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results

Page 19: Bionano genome maps_feb2014

Introducing the Irys system

Page 20: Bionano genome maps_feb2014

Labeling schema
Nt.BspQI nicks at
GCTCTTCN
CGAGAAGN

10 sites / 100 Kb
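To compare a sequence assembly with genome maps, the same recognition sites have to be located in the sequence (an in silico map). The sketch below is one simple way to do that, assuming the 7-base motif GCTCTTC from the slide and reporting the start of each motif rather than the exact nick position. The function names and the toy sequence are hypothetical, and this is not the converter used in the talk.

```python
import re

MOTIF = "GCTCTTC"  # recognition sequence from the slide, without the trailing N

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def label_sites(seq, motif=MOTIF):
    """Return sorted 0-based start positions of the motif on either strand."""
    seq = seq.upper()
    sites = set()
    for pattern in (motif, reverse_complement(motif)):
        # A lookahead lets overlapping occurrences be reported as well.
        for match in re.finditer("(?=" + pattern + ")", seq):
            sites.add(match.start())
    return sorted(sites)


def sites_per_100kb(seq, motif=MOTIF):
    """Label density, in sites per 100 kb of sequence."""
    if not seq:
        raise ValueError("empty sequence")
    return len(label_sites(seq, motif)) * 100_000 / len(seq)


# Hypothetical toy sequence with one site on each strand.
toy = "AAGCTCTTCAAGAAGAGCAA"
print(label_sites(toy))
```

Running `sites_per_100kb` over each scaffold is a quick way to spot label-poor regions, where alignments to the genome maps will be weaker.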


Page 21: Bionano genome maps_feb2014

Chip Design


Page 22: Bionano genome maps_feb2014

Samples loaded into 2 flow cells per chip

3 lasers
3 detection channels
Detect YOYO-1 in DNA backbone
Fluorescent nucleotides at labeled sites

Page 23: Bionano genome maps_feb2014

DNA molecules entering channels


Page 24: Bionano genome maps_feb2014

DNA molecules entering channels


Page 25: Bionano genome maps_feb2014

A long repeat in the Tribolium genome


Page 26: Bionano genome maps_feb2014

Mapping individual images back to map

Regions flanking repeat are unique
Some sites are polymorphic

Page 27: Bionano genome maps_feb2014

Limitations of the Irys system
• Sample prep is very specific
• Requires gram amounts of starting material
• Bacterial cells, tissue culture cells, eukaryotic nuclei
• Less complex tissue is best
▫ Blood
▫ Embryos
• Not applicable to transcriptomics projects
• contig N50 >30Kb (5 restriction sites)

Page 28: Bionano genome maps_feb2014

Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results

Page 29: Bionano genome maps_feb2014

Assemble images into genomic maps

.tiff

.bnx

.cmap

Page 30: Bionano genome maps_feb2014

Align BNG maps to in silico maps (.xmap)


Page 31: Bionano genome maps_feb2014

File formats are similar to those for generating sequence data...

Sequence data: image files → fastq → fasta → sam
(basecall, de novo assemble, align)

fastq
@SRR014849.2 EIXKN4201AKDUH/2
TCAAGTGGTGAACGGCAGAAA
+
<=B:==B:=<?6=B;<;=B=)

fasta
>contig1
TCAAGTGGTGAACGGCAGAAA

sam
HWI-ST330_C0NEHACXX:2:1101:17113:52802#0  69  contig1  2578  0  *  =  2578  0  ATTACGGCCCATGGTTCAGAATAATGACGAATAGAAATACTAGTACTATATCCCCTAAAAAA  <@CFFFFFHHGFHJHIJJJJJJJJJFJJJFGFHEHIHGHJGIJHIIIJJJJJJJJIJIIJIH  YT:Z:UP

BNG data: image files → bnx → cmap → xmap
(call labels, de novo assemble, align)

bnx
0  21  202146.41  1096.2  8973.8
QX11  10.0565  11.7966
QX12  0.0187  0.0604

cmap
#h CMapId  ContigLength  NumSites  SiteID  LabelChannel  Position  StdDev  Coverage  Occurrence
#f int  float  int  int  int  float  float  int  int
393  225073.2  21  1  1  20.0  0.0  3  3

xmap
#h XmapEntryID  QryContigID  RefcontigID  QryStartPos  QryEndPos  RefStartPos  RefEndPos  Orientation  Confidence  HitEnum
#f int  int  int  float  float  float  float  string  float  string
1  94  1  444392.7  5839.8  57024.0  550038.8  -  28.87  1M1D2M3I4D1M3I2M1I7M1I1M1I9M1I1M1I2M1I3M1D2M
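Since an xmap is a tab-delimited text file with `#` header lines, it can be read with a few lines of code. The sketch below assumes exactly the ten columns shown on the slide; xmap files from other software versions may carry more columns, and the record and function names here are hypothetical rather than part of any BioNano tool.

```python
from collections import namedtuple

# Field names mirror the "#h" header row shown on the slide.
XmapRecord = namedtuple(
    "XmapRecord",
    [
        "xmap_entry_id", "qry_contig_id", "ref_contig_id",
        "qry_start", "qry_end", "ref_start", "ref_end",
        "orientation", "confidence", "hit_enum",
    ],
)


def parse_xmap(lines):
    """Parse xmap lines into records, skipping '#' header lines and blanks."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        records.append(
            XmapRecord(
                int(f[0]), int(f[1]), int(f[2]),
                float(f[3]), float(f[4]), float(f[5]), float(f[6]),
                f[7], float(f[8]), f[9],
            )
        )
    return records


# The example row from the slide, written with tab separators.
example = [
    "#h XmapEntryID\tQryContigID\tRefcontigID\tQryStartPos\tQryEndPos"
    "\tRefStartPos\tRefEndPos\tOrientation\tConfidence\tHitEnum",
    "1\t94\t1\t444392.7\t5839.8\t57024.0\t550038.8\t-\t28.87"
    "\t1M1D2M3I4D1M3I2M1I7M1I1M1I9M1I1M1I2M1I3M1D2M",
]
record = parse_xmap(example)[0]
print(record.qry_contig_id, record.orientation, record.confidence)
```

In the example row the orientation is `-`, so the query start (444392.7) is larger than the query end (5839.8); any length calculation should take the absolute difference.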


Page 32: Bionano genome maps_feb2014

Visualizing an xmap

contig id

sequence-based scaffold

label alignment

BioNano contig map

coverage


Page 33: Bionano genome maps_feb2014

Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results

Page 34: Bionano genome maps_feb2014

Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results

Page 35: Bionano genome maps_feb2014

K-INBRE i5K GitHub scripts:
Irys Scaffolding scripts and manuals written by Jennifer Shelton and Nic Herndon
Assembly workflow was developed by Ernest Lam (BioNano)

git clone https://github.com/i5K-KINBRE-script-share/Irys-scaffolding

Page 36: Bionano genome maps_feb2014

Assembly pipeline

scripts available at: i5k-KINBRE script share at GitHub: Irys-scaffolding
https://github.com/i5K-KINBRE-script-share/Irys-scaffolding

developed with Ernest Lam (BioNano)

Page 37: Bionano genome maps_feb2014

Filtering alignments
Label density varies throughout the genome, so we created scripts to filter in two passes:

Pass 1: looks for high confidence score over at least ~30% of the total possible alignment

Pass 2: looks for low confidence score over the majority of the total possible alignment (~90%)

Pass 1 finds most high quality alignments. Pass 2 finds high-quality low-density alignments.
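The two passes can be read as two alternative acceptance rules: a strict confidence cutoff with a modest length requirement, or a relaxed confidence cutoff with a demanding one. The sketch below illustrates that logic only. The confidence cutoffs (20.0 and 10.0), the function name, and the idea of passing in a precomputed aligned fraction are assumptions made for illustration; the slide gives only the ~30% and ~90% figures, and the Irys-scaffolding scripts may define these quantities differently.

```python
def passes_filter(confidence, aligned_fraction,
                  high_conf=20.0, low_conf=10.0,
                  min_frac_pass1=0.30, min_frac_pass2=0.90):
    """Return True if an alignment survives either filtering pass.

    confidence       -- alignment confidence score (as in the xmap column)
    aligned_fraction -- aligned length / total possible alignment length (0-1)
    high_conf, low_conf -- hypothetical cutoffs, not values from the talk
    """
    # Pass 1: high confidence over at least ~30% of the possible alignment.
    pass1 = confidence >= high_conf and aligned_fraction >= min_frac_pass1
    # Pass 2: lower confidence, but over ~90% of the possible alignment.
    pass2 = confidence >= low_conf and aligned_fraction >= min_frac_pass2
    return pass1 or pass2


# Hypothetical alignments: (confidence, aligned fraction).
candidates = [(28.87, 0.35), (12.0, 0.95), (12.0, 0.50), (5.0, 0.99)]
kept = [c for c in candidates if passes_filter(*c)]
print(kept)
```

The second rule is what rescues scaffolds from label-poor regions: with few labels the confidence score stays low even when the alignment covers nearly the whole scaffold.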


Page 38: Bionano genome maps_feb2014

Filtering alignments

Super-scaffolded scaffolds are joined in a new reference fasta file.

Overlapping scaffolds have a 30 bp spacing gap between them

If a scaffold aligns more than once only the longest alignment is used

If two alignments have the same length only the highest confidence alignment is used
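The selection rules on this slide amount to a sort key: prefer the longest alignment per scaffold and break ties on confidence. A minimal sketch follows, assuming alignments are held as dictionaries with hypothetical keys `scaffold`, `length` and `confidence`, and assuming the 30 bp spacer is written as `N` characters (the slide does not say which character is used).

```python
SPACER = "N" * 30  # 30 bp spacing gap; the use of 'N' is an assumption


def best_alignments(alignments):
    """Keep one alignment per scaffold: the longest, ties broken by confidence."""
    best = {}
    for aln in alignments:
        current = best.get(aln["scaffold"])
        key = (aln["length"], aln["confidence"])
        if current is None or key > (current["length"], current["confidence"]):
            best[aln["scaffold"]] = aln
    return best


def join_overlapping(left_seq, right_seq):
    """Join two scaffolds whose map alignments overlap, separated by the spacer."""
    return left_seq + SPACER + right_seq


# Hypothetical alignments for two scaffolds.
alignments = [
    {"scaffold": "s1", "length": 100, "confidence": 12.0},
    {"scaffold": "s1", "length": 250, "confidence": 9.0},
    {"scaffold": "s2", "length": 80, "confidence": 15.0},
    {"scaffold": "s2", "length": 80, "confidence": 20.0},
]
chosen = best_alignments(alignments)
print(chosen["s1"]["length"], chosen["s2"]["confidence"])
print(len(join_overlapping("ACGT", "TTGA")))
```

Comparing `(length, confidence)` tuples encodes both rules at once, so the tie-break only matters when two alignments have exactly the same length.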


Page 39: Bionano genome maps_feb2014

Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results

Page 40: Bionano genome maps_feb2014

BNG restriction maps for T. castaneum

• Dual nicked: Nt.BspQI and BbvCI
• 28.6Gb = ~143x coverage of 200Mb Tribolium genome (>150 Kb)

• N contigs: 216
• Total Contig Len (Mb): 200.473
• Avg. Contig Len (Mb): 0.928
• Contig N50 (Mb): 1.350

• Total Ref Len (Mb): 157.186
• Total Contig Len / Ref Len: 1.275
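The derived figures on this slide follow from simple ratios of the reported totals. The short check below recomputes them from the slide's own numbers; the function and variable names are illustrative.

```python
def fold_coverage(total_bases_gb, genome_size_mb):
    """Fold coverage = total bases of imaged molecules / genome size."""
    return total_bases_gb * 1000 / genome_size_mb


# Values as reported on the slide.
molecules_gb = 28.6        # total length of imaged molecules
genome_mb = 200            # estimated Tribolium genome size
n_contigs = 216            # genome map contigs
total_contig_mb = 200.473  # total genome map length
total_ref_mb = 157.186     # total length of the sequence reference

coverage = fold_coverage(molecules_gb, genome_mb)
avg_contig_mb = total_contig_mb / n_contigs
map_to_ref_ratio = total_contig_mb / total_ref_mb

print(round(coverage), round(avg_contig_mb, 3), round(map_to_ref_ratio, 3))
```

A map-to-reference ratio above 1 is in line with the poster's point that the sequence assembly presumably lacks a significant portion of repetitive DNA, which the genome maps can span.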


Page 41: Bionano genome maps_feb2014

ChLG X
ChLGX had 13 scaffolds. Alignment to BioNano maps captured gaps and validated order for 11 of 13 scaffolds, incorporated 2 unplaced scaffolds and identified a potential misplaced scaffold (scaffold 2 aligns with another linkage group).

Page 42: Bionano genome maps_feb2014

ChLG X
ChLGX had 13 scaffolds. Alignment to BioNano maps captured gaps and validated order for 11 of 13 scaffolds, incorporated 2 unplaced scaffolds and identified a potential misplaced scaffold (scaffold 2 aligns with another linkage group).

Page 43: Bionano genome maps_feb2014

ChLG X
ChLGX had 13 scaffolds. Alignment to BioNano maps captured gaps and validated order for 11 of 13 scaffolds, incorporated 2 unplaced scaffolds and identified a potential misplaced scaffold (scaffold 2 aligns with another linkage group).

Page 44: Bionano genome maps_feb2014

ChLG 7
Alignment to BioNano maps captured gaps and validated order for 13 of 15 scaffolds. Scaffold 14 needs to be reversed in the super-scaffold.

Page 45: Bionano genome maps_feb2014

ChLG 7
Alignment to BioNano maps captured gaps and validated order for 13 of 15 scaffolds. Scaffold 14 needs to be reversed in the super-scaffold.

Page 46: Bionano genome maps_feb2014

Additional chromosome linkage groups.


Page 47: Bionano genome maps_feb2014

Additional chromosome linkage groups.

ChLG 3


Page 48: Bionano genome maps_feb2014

Additional chromosome linkage groups.

ChLG 3

ChLG 9


Page 49: Bionano genome maps_feb2014

Additional chromosome linkage groups.

ChLG 3

ChLG 9

ChLG 2


Page 50: Bionano genome maps_feb2014

What does it cost?

• 100-500Mb genome <$5,000
▫ 70-100x coverage
• 1Gb genome <$8,000
▫ 70-100x coverage
• completely dependent on homogeneity of starting material
• assembly and analysis software is included in price

Page 51: Bionano genome maps_feb2014

Summary
• Standard Draft Genomes are highly fragmented
• BNG provides independent platform
• Whole genome restriction maps
• Validate assembly
• Extend scaffolds/Size Gaps
• Identify structural variants
• Identify haplotypes
• Comprehensive view of repetitive DNA (HORs)
• A validated genome assembly improves downstream analyses

Page 52: Bionano genome maps_feb2014

Thanks to:
• Michelle Gordon
▫ Research Assistant: optimizing sample preps
• Jennifer Shelton
▫ Biologist turned Bioinformaticist
• Nic Herndon
▫ Computer scientist turned Bioinformaticist
• BioNano Genomics
▫ Ernest Lam
▫ Weiping Wang