Jan2016 pac bio giab
For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved.
NIST Genome in a Bottle (GIAB) Consortium
Workshop at Stanford University
Luke Hickey – Senior Director, Human BioMedical Sciences, PacBio
January 29, 2016
Topics
- PacBio SMRT Sequencing Technology Development
- Human Genome Sequencing with PacBio Systems
- The Role of NIST GIAB Reference Material in PacBio Sequencing Technology Development, Optimization and Demonstration
PacBio SMRT Sequencing Technology
PACIFIC BIOSCIENCES® CONFIDENTIAL
SINGLE MOLECULE, REAL-TIME (SMRT) DNA SEQUENCING
SMRT SEQUENCING DATA CHARACTERISTICS
Long Reads
- Average >10,000 bases
High Consensus Accuracy
- Achieves >99.999% (30x)
Uniform, Unbiased Coverage
- Lack of GC% or sequence complexity bias
DNA Modification Detection
- Epigenome characterization
AREAS OF PACBIO TECHNOLOGY DEVELOPMENT
Library Preparation
- DNA Shearing
- Size Selection
- SMRTbell™ Library Preparation
Sequencing
- Instruments: PacBio® RS II, Sequel™ System
- Consumables: SMRT Cells, Zero-Mode Waveguides, Phospholinked Nucleotides
Data Analysis
- Primary Analysis
  - Base calling
- Secondary & Tertiary Analysis
  - Mapping (daligner/BLASR)
  - Consensus accuracy (Quiver / HGAP)
  - De novo assembly (Falcon / MHAP)
  - SV calling
  - Phasing
  - Epigenetic analysis
PRODUCT RELEASES OVER THE LAST FOUR YEARS
- Feb 2012: C2 Launch
- May 2012: v1.3.1 SW Release – Base Mods
- Aug 2012: v1.3.2 MagBead Release
- Nov 2012: v1.3.3 (Microbial Base Modification, XL Chemistry, Stage Start)
- Jan 2013: SMRT® Cells v3, HGAP/Quiver
- Apr 2013: RS II Product Release (75K to 150K ZMW, 2x Throughput)
- Oct 2013: v2.1 (P5-C3 release, HGAP 2.0)
- Mar 2014: v2.2 (IsoSeq™, HLA-Typing)
- Oct 2014: v2.3 (P6-C4 release)
- Apr 2015: Barcode Support
- Oct 2015: Sequel System
Increased throughput by over 100x
HISTORY OF READ LENGTH PERFORMANCE
[Chart: Average Read Length (bp), axis 0 to 14,000, by year, 2008 to 2015. Chemistry labels: Early PacBio chemistries, LPR, FCR, ECR2, C2–C2, P4–C2, P5–C3, P6–C4. Data labels: 453, 1012, 1734.]
Average Read Length: 10,000 - 15,000 bp
Throughput / SMRT® Cell: 750 Mb – 1.25 Gb
Consensus Accuracy: QV50 @30-fold
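The "QV50" figure can be related to the ">99.999%" consensus accuracy quoted earlier in the deck. The sketch below assumes the standard Phred convention (QV = -10 * log10(error probability)). The slides do not spell out the scale, and the function names are illustrative only.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy.

    Assumes the standard Phred definition: error = 10 ** (-QV / 10).
    """
    return 1.0 - 10 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Inverse conversion: per-base accuracy to a Phred-scaled quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    for qv in (20, 30, 40, 50):
        print(f"QV{qv}: accuracy = {qv_to_accuracy(qv):.5%}")
```

Under this convention QV50 corresponds to one expected error per 100,000 consensus bases.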
NIST GIAB REFERENCE MATERIAL 8398
- Serves as a well characterized control material to facilitate development of novel library preparation and sequencing methods for human genomes at PacBio.
LIBRARY PREPARATION
[Workflow diagram labels: DNA Sample; Sample Preparation; Building of the SMRTbell Template; Fragment DNA; Repair Ends; Ligate Adapters; Purify DNA; Binding]
ASSESSING THE IMPACT OF DNA QUALITY ON READ LENGTH
- Initial QC of human gDNA samples (NIST/Stanford)
Human gDNA samples from NIST GIAB: NA12878: CEPH/Utah Pedigree 1463, Lot K6
Thanks Dave Hsu!
E. coli K12 gDNA is mostly >40 kb (same gel)
Both NA12878 samples show significant degradation
Look similar to Coriell samples
PFGE conditions:
- Bio-Rad CHEF Mapper XA System
- 1% PFG-certified agarose gel in 0.5x TBE
- ~200 ng DNA per lane
- Auto-algorithm program: Low = 5 kb, High = 150 kb
- Markers: 1 kb Extension Ladder (Invitrogen), 5 kb DNA Ladder (Bio-Rad)
- EtBr stained post-electrophoresis
Typhoon imaging:
- Fluorescence mode, EtBr channel
- 100 microns resolution
- +3 mm focal plane
Performance of NIST/NA12878 Libraries and E.coli K12
Metrics from SMRT Portal RS.PreAssembler.2
>15 kb libraries loaded at 25 pM on-chip (OCPW)
>30 and >40 kb libraries loaded at 75 pM on-chip (OCPW)
Sample nReads #Bases Mean RL RL N50
NA12878_15kb 84,969 1,150 Mb 13,533 18,622
K12_15kb 24,941 378 Mb 15,161 21,140
K12_30kb_DDR 60,460 1,031 Mb 17,055 24,745
K12_40kb_DDR 51,679 922 Mb 17,835 26,282
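The "RL N50" column can be illustrated with a short sketch. It assumes the usual N50 definition (the length L such that reads of length L or longer contain at least half of all sequenced bases). SMRT Portal's exact computation is not described in the slides, and the read lengths below are toy values, not data from these runs.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths.

    Assumed definition: walk the lengths from longest to shortest and
    return the first length at which the running sum reaches at least
    half of the total number of bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    toy_read_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy values only
    print("N50:", n50(toy_read_lengths))
```

Because N50 weights each read by its length, it sits above the mean read length for skewed length distributions, which matches the pattern in the table (for example, mean 13,533 vs. N50 18,622 for NA12878_15kb).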
TYPICAL P6-C4 CHEMISTRY READ LENGTH PERFORMANCE ON A HUMAN GENOME
Data per SMRT Cell: 0.5 – 1 Gb
20 kb size-selected human library
4 hour movie
P6-C4 chemistry
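As a rough planning aid, the per-cell yield above can be turned into an estimate of how many SMRT Cells a target coverage would need. This is a back-of-the-envelope sketch. The haploid genome size of about 3.1 Gb and the 30-fold target are assumptions for illustration, and the function name is hypothetical.

```python
import math


def smrt_cells_needed(target_coverage, genome_size_bp, yield_per_cell_bp):
    """Estimate the number of SMRT Cells needed for a target fold-coverage.

    coverage = total sequenced bases / genome size, so the number of cells
    is ceil(coverage * genome size / yield per cell).
    """
    if target_coverage <= 0 or genome_size_bp <= 0 or yield_per_cell_bp <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(target_coverage * genome_size_bp / yield_per_cell_bp)


if __name__ == "__main__":
    GENOME_SIZE_BP = 3_100_000_000  # assumption: approximate haploid human genome
    # Yield range taken from the slide: 0.5 to 1 Gb per SMRT Cell.
    for yield_bp in (500_000_000, 1_000_000_000):
        cells = smrt_cells_needed(30, GENOME_SIZE_BP, yield_bp)
        print(f"{yield_bp / 1e9:.1f} Gb per cell: {cells} cells for 30-fold coverage")
```

The estimate ignores mapping losses and read filtering, so real projects would likely need some margin on top of it.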
NEW LARGE INSERT LIBRARY PREPARATION PROTOCOLS
http://www.pacb.com/wp-content/uploads/2015/09/Unsupported-Preparing-Greater-than-30kb-SMRTbell-Libraries-Megaruptor-Shearing.pdf
http://www.pacb.com/wp-content/uploads/2015/09/Unsupported-Preparing-Greater-than-30kb-SMRTbell-Libraries-Needle_Shearing.pdf
Sequencing Human Genomes
So, you sequenced a human genome … how well did you do?
THE HUMAN GENOME – FEBRUARY 2001
Source: Science. 2001 Feb 16;291(5507):1304-51., Nature. 2001 Feb 15;409(6822):860-921.
THE HUMAN GENOME
- Over 6 billion base pairs
- Organized into 23 chromosomes
- With 2 copies of each
- One maternal, one paternal
- Carrying 20,000 genes
- Each encoding an average of 3 proteins
Source: NHGRI fact sheet
Accessing variation in the human genome enables genetic research.
“Much of the missing heritability (the 'dark matter' of the genome) will probably turn up as the technology advances.”
- Francis Collins
Nature 464, 674-675 (1 April 2010)
TYPES OF INFORMATION COLLECTED FROM PACBIO SEQUENCING OF A HUMAN GENOME
DNA
- Single-Nucleotide Variation (SNPs) ← Illumina “$1000 Genome”
- Structural Variation (SVs) ← Illumina “$1000 Genome”
- Haplotype Phasing ← Cloning/Sanger sequencing
- Epigenetics ← Illumina + bisulfite sequencing
- De Novo Genome Assembly ← Illumina + Hi-C/Dovetail
RNA
- Expression Quantitation ← Illumina
- Isoform Characterization ← PacBio
PacBio Genome
PACBIO SEQUENCING AND ASSEMBLY OF NA12878
“We sequenced NA12878 genomic DNA across 851 Pre P5-C3 and 162 P5-C3 [SMRT Cells] to generate 24× and 22× coverage with aligned mean read lengths of 2,425 and 4,891 base pairs, respectively.”
TABLE 1. NA12878 – PACBIO ASSEMBLY RESULTS
FIGURE 2. TANDEM-REPEAT DETECTION FROM SINGLE MOLECULES PREDICTS A LARGE DIVERGENCE FROM REFERENCE.
REPEAT EXPANSION DISEASES
Sergei M. Mirkin (2007). Expandable DNA repeats and human disease, Nature 447, 932-940
“It is time to stop thinking that merely more DNA sequencing will give us the variants that determine human traits”
“We encourage the use of a range of sequencing technologies to explore highly variable and complex genomic regions in a large number of human samples.”
http://www.nature.com/ng/journal/v47/n9/pdf/ng.3397.pdf
SEPTEMBER 2015
“Full resolution of variation is only guaranteed by complete de novo assembly of a genome.”
“We … emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation.”
VOLUME 16 | NOVEMBER 2015 | 627
Source: www.nature.com/nrg/journal/v16/n11/full/nrg3933.html
COST-PER-GENOME DILEMMA (QUANTITY VS. QUALITY)
[Chart annotations, contig N50 by assembly]
- NCBI-34: 29 Mb
- HuRef: 107 kb
- BGI YH: 7.4 kb
- KB1: 5.5 kb
- NA12878: 24 kb
- CHM1: 144 kb
- RP11: 127 kb
According to NHGRI website, the definition of “sequencing a genome” changed in the year 2008 to refer to “re-sequencing” in lieu of “de novo assembly.”
- Obtaining a de novo human genome that has the same scientific quality standard as the initial HGP work has NOT followed Moore’s law.
Source: NHGRI – Genome Sequencing Costs - http://www.genome.gov/sequencingcosts/
NHGRI GenomeTV: https://www.youtube.com/watch?v=PdVdlzWhaLE
REFERENCE ASSEMBLY QUALITY STANDARDS
Source: McDonnell Genome Institute http://genome.wustl.edu/projects/detail/reference-genomes-improvement/
MGI METHOD FOR IMPROVING REFERENCE GENOMES
Source: McDonnell Genome Institute http://genome.wustl.edu/projects/detail/reference-genomes-improvement/
HUMAN GENOME DE NOVO ASSEMBLIES
Year  Technology            Assembler         Sample     Contig N50 (Mb)
2007  ABI 3730              Celera            HuRef      0.11
2009  Illumina GA           SOAPdenovo        BGI YH     0.007
2010  454 GS Flx Titanium   Newbler           KB1        0.006
2010  Illumina GA           ALLPATHS-LG       NA12878    0.024
2013  454 GS, HiSeq, MiSeq  Newbler           RP11_0.7   0.13
2014  HiSeq, BAC clones     Reference-guided  CHM1       0.14
2014  PacBio RS II          FALCON            CHM1       4.38
2015  PacBio RS II          FALCON            CHM13      12.98
2015  PacBio RS II          FALCON            AK1        7.28
2015  PacBio RS II          FALCON            HuRef      10.38
2015  PacBio RS II          FALCON            PC-9*      3.58
2015  PacBio RS II          FALCON            SK-BR-3*   2.56
*cancer cell lines
26.9 Mb - NCBI: GCA_001297185.1
Data sources:
- HuRef (Venter): http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254
- BGI YH: http://genome.cshlp.org/content/20/2/265.abstract (Table II)
- KB1: http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html
- NA12878: http://www.pnas.org/content/early/2010/12/20/1017351108.abstract (Table 3)
- CHM1 Illumina: http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2/
THE HUMAN GENOME - 2015
http://www.ncbi.nlm.nih.gov/assembly/GCA_001297185.1/
Contig N50 26.9 Mb
TOWARDS PLATINUM GENOMES: PACBIO RELEASES A NEW, HIGHER QUALITY CHM1 ASSEMBLY TO NCBI
Figure 1. The PacBio CHM1 assembly resolves the q arms of chromosomes 2 and 6 into very few contigs, with max contigs 107 Mbp and 109 Mbp long, respectively.
Posted: Friday, October 2, 2015
Source: PacBio blog post, Tuesday September 29, 2015, http://pacb.com/blog
REFERENCE GENOME IMPROVEMENT
Source: MGI http://genome.wustl.edu/projects/detail/reference-genomes-improvement/
NIST GENOME IN A BOTTLE (GIAB) PROJECT
Ashkenazim Trio de novo Genome Sequencing Project
Collaborative project with Icahn School of Medicine at Mt. Sinai, New York City
Sequencing:
• Generated PacBio de novo human sequencing from the GIAB Ashkenazim son-father-mother trio from the Personal Genome Project (HG002, HG003, HG004).
• The AJ genomes are candidate NIST Reference Materials planned for release in 2016.
• PacBio coverage is 69X, 32X, and 30X for HG002, HG003, and HG004, respectively.
• A paper describing these data and other data from GIAB is now on bioRxiv.
Sequencing data publicly posted on NCBI:
• NIST Human HG002 NA24385 (Ashkenazim Trio Son) on the NCBI FTP site.
• NIST Human HG003 NA24149 (Ashkenazim Trio Father) on the NCBI FTP site.
• NIST Human HG004 NA24143 (Ashkenazim Trio Mother) on the NCBI FTP site.
https://github.com/PacificBiosciences/DevNet/wiki/Genome-in-a-Bottle-Ashkenazim-Trio
GIAB PacBio Assembly Summary with SV calls derived from de novo assemblies
Mount Sinai: Ali Bashir, Matthew Pendleton, Ryan Neff
Pacific Biosciences: Jason Chin
Reed College: Anna Ritz
Overview
• Steps for SV calling
  – De novo Falcon assembly
  – Reference-based comparison
    • Mapping with BLASR and Nucmer
    – Secondary refined using HMM
  – Re-examination of potential deviations in the reference with raw-reads
• Currently extending MultiBreak-SV
PacBio Falcon Assembly Stats Trio
Sample  Contigs  Average  N50     Max      Total Size
HG002   13231    230 kb   4.1 Mb  31.6 Mb  3.04 Gb
HG003   17873    172 kb   4.6 Mb  21.5 Mb  3.08 Gb
HG004   16487    185 kb   5.3 Mb  22.6 Mb  3.05 Gb
Both high/low coverage AJ assemblies highly consistent with GRCh38
[Plots for HG002, HG003 and HG004; log x-scale and log y-scale]
PacBio Assembly Based SV Calls
Sample  Deletion  Insertion  Other  Total
HG002   9237      12489      2534   24260
HG003   9356      12299      2580   24235
HG004   9189      12290      2589   24068
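The SV counts can be sanity-checked and summarized with a few lines of code. This is a minimal sketch: the counts are copied from the table above, while the dictionary layout and function names are illustrative assumptions, not part of any PacBio or Mount Sinai pipeline.

```python
# Counts copied from the "PacBio Assembly Based SV Calls" table.
sv_calls = {
    "HG002": {"deletion": 9237, "insertion": 12489, "other": 2534, "total": 24260},
    "HG003": {"deletion": 9356, "insertion": 12299, "other": 2580, "total": 24235},
    "HG004": {"deletion": 9189, "insertion": 12290, "other": 2589, "total": 24068},
}


def inconsistent_samples(calls):
    """Return samples whose category counts do not add up to the reported total."""
    return [
        sample
        for sample, c in calls.items()
        if c["deletion"] + c["insertion"] + c["other"] != c["total"]
    ]


def insertion_fraction(counts):
    """Fraction of all calls in one sample that are insertions."""
    return counts["insertion"] / counts["total"]


if __name__ == "__main__":
    print("Inconsistent rows:", inconsistent_samples(sv_calls))
    for sample, counts in sv_calls.items():
        print(f"{sample}: {insertion_fraction(counts):.1%} insertions")
```

The category counts add up to the reported totals for all three samples, and insertions make up slightly more than half of the calls in each.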
Note: Log x-scale to show full event sizes
SV calls consistent between assembly approaches (Falcon vs. Celera)
[Plot panels: Insertion, Deletion, Other]
Ongoing
• Refining raw read-based analysis:
  – Build new calls
  – Mark false-positives
  – Identifying discrepancies between two assemblies
  – Force calling trios
• Improving heterozygous calls missed via local assembly
• Refining “other” categories
  – e.g. splitting out simple and complex inversions
• Merging BioNano/10X calls with PacBio data
ROLE OF NIST GIAB AJ TRIO PROJECT AND REFERENCE MATERIAL IN PACBIO TECHNOLOGY DEVELOPMENT
- PacBio characterization data serves as a public resource for data analysis methods development by community:
- Structural variation
- SNV calling
- De novo assembly
- Phasing & haplotype reconstruction
- Methylation / Epigenetic analysis
- Analytical data from multiple-platforms serves as validation for algorithm development
- Characterization data and reference material provide a benchmark for development of novel methods
- New chemistry development to increase read-length and accuracy (e.g., library prep methods, polymerase, etc.)
- Scaffolding using novel library preparation methods
- Rare variant calling with dilution analysis
- Well-characterized RM will serve as a resource for future use in internal quality testing
- Consumables
- Instruments
- Analysis methods
1000+ PUBLICATIONS TO DATE FEATURING PACBIO SEQUENCING
[Chart: number of publications (axis 0 to 800) by year, 2011 to 2015; categories: Human Biomedical, Plant & Animal, Microbiology]
Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.
www.pacb.com
PACBIO RS II
150+ PLACEMENTS
Some pins represent multiple placements