DiploidGenome Assembly and Comprehensive Haplotype ......ARABIDOPSIS THALIANA F1 DIPLOID ASSEMBLY...
Transcript of DiploidGenome Assembly and Comprehensive Haplotype ......ARABIDOPSIS THALIANA F1 DIPLOID ASSEMBLY...
For Research Use Only. Not for use in diagnostics procedures. © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved.
Diploid Genome Assembly and Comprehensive Haplotype Sequence ReconstructionJason Chin, Paul Peluso, David Rank, Fri tz Sedlazeck, Maria Nattestad, Michael Schatz, Greg Concepcion, Al icia Clum, Kerrie Barry, Alex Copeland, Ronan O’Malley
Acknowledgments
-All PacBio Colleagues
-Ronan O’Malley, Chongyuan Luo, Joseph Ecker (HHMI / The Salk
Institute )
-Alicia Clum, Kerrie Barry, Alex Copeland (Joint Genome Institute)
-Maria Nattestad, Fritz Sedlazeck, Michael Schatz (CSHL)
- Open source toolsets-Daligner (https://dazzlerblog.wordpress.com), Gene Myers-BLASR (https://github.com/PacificBiosciences/blasr), Mark Chaisson-Python, NetworkX for rapid algorithm protyping-Gephi, Graphviz for graph visualization-FALCON (https://github.com/PacificBiosciences/falcon,
https://github.com/PacificBiosciences/falcon)
SOLVING THE DIPLOID ASSEMBLY PROBLEM
- Falcon (a polyploid-aware assembler) : generating the contigs through the bubbles- Falcon Unzip: identifying smaller variants and using them to separate the haplotypes
• Bubbles = big variants between the haplotypes
• Collapsed Path = smaller variants between the haplotypes
WHY DO WE SEE BUBBLES?
SNPs SNPs SNPsSVsSVs
SNPsSNPs SNPs
SVsSVs
Haplotype 1
Haplotype 2
Genome Sequences
Assembly Graph
In most OLC assembler design, the overlapper does not catch differences at SNP level but structural variations are naturally segregated.
THE FALCON UNZIP PROCESS
SNPs SNPs SNPsSVsSVs
Associate contig 1(Alternative allele)
Associate contig 2(Alternative allele)
SNPs SNPs SNPsSVsSVs
Primary contig
Augmented with haplotype information of each reads
FALCON
FALCON-Unzip
Updated primary contig + “associate haplotigs”
PHASING READ INTO HAPLOTYPE GROUPS
Haplotype 0
Haplotype 1
Identify het-SNPs
Phase het-SNPs
Group reads with
phased SNPs
Reconstruct haplotypes
Align SMRT reads to the initial primary contig
More het-SNPs in longer reads: 8% to 15% sequence error rate is not an issuesgiven enough long read coverage for phasing.
QUESTION: HOW TO RESOLVE STRUCTURAL VARIATIONS & HET-SNPS PHASING AT ONCE
3 kb – 100 kb
300 b – 10 kb
Structural Variations
het-SNP
ü Overlap-layout process catches SV haplotypes
✗ Collapsed paths when there is no SV
ü Easy to group SNPs/reads into different haplotypes
✗ No phasing information associated with SVs
ü Nearby SVs may be phased automatically
✗ Haplotype-fused paths
ü Haplotype-specific paths✗ More fragmented contigs
Information Sources Pros & Cons Assembly graph features
MERGE HAPLOTYPE INFORMATION AND “UNZIP”
Tiling path of haplotype 0
Tiling path of haplotype 1
Remove edges connectingdifferent haplotypes
PUT EVERYTHING TOGETHER
“Falcon Unzip P
rocess”
~ 4.80 Mb
Add missing haplotype specific nodes & edges
Remove edges that connect different haplotypes
The final graph comprises a primary contig (blue), a major haplotig (red) and other smaller haplotigs.
4 major haplotype phased blocks determined by het-SNPs Un-phased region
POLISHING: ALLELE-SPECIFIC ALIGNMENT FOR FINAL CONSENSUS
“Augmented alignment”: Each read has extra attribute (e.g., contig identifier, phasing block, haplotype phase), an aligner uses those information to place the read to specific reference sequence or regions.
Align the “red” haplotig
Align the “blue” haplotig
Read from same regionbut different haplotypes
CONSTRUCT ARABIDOPSIS THALIANA COL-0 X CVI-0 DIPLOID F1 LINE
Image credits: Pajoro, et al, Trends in plant science 21.1 (2016): 6-8.
Col-0 Cvi-0
Col-0 x Cvi-0
• Two inbred lines sequenced in 2013 (P4 chemistry), assembled as haploid genomes
• F1 line constructed and sequenced in 2015 (P6 chemistry), assembled with FALCON and FALCON-Unzip
Col-0 x Cvi-0 assembly
DIPLOID ASSEMBLY PRIMARY CONTIGS AND HAPLOTIGS
.
Col-0 chromosome
Cvi-0 chromosome
haplotigs
primary contig
• Primary contigs ~ 1n representation of the genome• Haplotigs ~ phased sequences from where the homologuous
chromosomes are distinguishable
ARABIDOPSIS THALIANA F1 DIPLOID ASSEMBLY STATISTICS
Strain InbredCol-0 InbredCvi-0Col-0 xCvi-0
F1
Assembler CA/HGAP CA/HGAPFALCON FALCON-Unzip FALCON-Unzip
primary contigs primary contigs haplotigsAssemblySize(Mb) 126 119 143 140 105#contigs 1325 194 426 172 248N50size(Mb) 6.210 4.79 7.92 7.96 6.92MaxContig size(Mb) 10.25 11.25 13.39 13.32 11.65
126
119
143
57
140
105
0 20 40 60 80 100 120 140 160
Inbred Col-0
Inbred Cvi-0
F1 FALCON p-contigs
F1 FALCON a-contigs
F1 Unzip p-contigs
F1 Unzip haplotigs
Assembly Size (Mb)
6.21
4.79
7.92
0.146
7.96
6.92
0 2 4 6 8 10
N50 size (Mb)
Col-0 x Cvi-0 assembly
EVALUATE THE DIPLOID ASSEMBLY RESULT
.
Col-0 assembly
Cvi-0 assembly
haplotigs
primary contig
Haploid-like contig in the inbred-line assemblies
Many variations
Few or no variations
Few or no variations
Many variations
By aligning the haplotigs to the parental genome assemblies, we can evaluate the haplotigs’ quality, e.g. haplotyping accuracy and CDS prediction consistency.
COMPARE F1 ASSEMBLY TO THE INBRED ASSEMBLIES
- We call the SNP and SVs against the parental inbred assemblies for all primary contigs and haplotigs.
- Most haplotigs can be fully assigned to one of the parental haplotypes.
COMPARE F1 ASSEMBLY TO THE INBRED ASSEMBLIES
- We call the SNP and SVs against the parental inbred assemblies for all primary contigs and haplotigs.
- Most haplotigs can be fully assigned to one of the parental haplotypes.
Cvi-0Col-0
Primary Contigs
Haplotigs
Col-0Cvi-0
ANNOTATION COMPARISION
Predicted Coding Sequences
TAIR 10 Genome &
Predicted Transcripts
Homopolymer Length Distributions
Compare de novo gene prediction (with AUGUSTUS (Stanke 2003)) between different assemblies
Assemblies TAIR10 Col-0 Cvi-0Numberofpredicted
CDS 27,946 30,006 27,393
100%indel-free fulllengthoverlaps
Col-0 3000625,966(92.9%)
Col-0xCvi-0 5677525,865(92.5%)
26,537(88.4%)
27,370(99.9%)
OTHER SMALLER AND LARGER DIPLOID GENOMES
Clavicoronapyxidata
(Coral Fungus)Cabernet
Sauvignon+* Human*Haploid Genome Size: ~ 44 Mb ~ 500 Mb ~ 3 Gb
FALCON-Unzip Results:
Primary contig size 41.9 Mb 591.0 Mb 2.76 GbPrimary contig N50 1.5 Mb 2.2 Mb 22.9 Mb
Haplotig size 25.5 Mb 372.2 Mb 2.0 GbHaplotig N50 872 kb 767 kb 330 kb
+Led by Cantu lab, UC Davis and Cramer lab, UN Reno
*Preliminary results. Fast file system and efficient computational infrastructure are currently needed for large genomes.
SUMMARY
-Single data type for routine diploid assembly-Large genomes are more computationally challenging but it is mostly an
engineering problem now:-Haplotype phasing improvement, incorporate 3rd party phasing code -Develop a sequence aligner for “augmented alignment” for faster
Quiver consensus process - FALCON-Unzip code: (No code, No truth!!) if you like to hack it for now,
email me ([email protected])- Want to attack the algorithm problem for polyploid assembly? Let us
help you!
Thanks for your attention!
For Research Use Only. Not for use in diagnostics procedures. © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx.
All other trademarks are the sole property of their respective owners.
www.pacb.com