Badalamenti PacBio tutorial 12-10-2014
upcoming tutorials
• Today, December 10 – 2:30 PM
• Friday, December 12 – 1:00 PM
• Wednesday, December 17 – 1:00 PM
• Wednesday, January 7 – 1:00 PM
• Tuesday, January 13 – 10:00 AM
All sessions to be held in 138 Cargill; register at msi.umn.edu
the overall assembly strategy is the same…
…but the data and tools are fundamentally different
[Figure: both workflows pass through three stages: 1 pre-processing, 2 assembly, 3 finishing/polishing]
Short Reads (Illumina) - graph assembly
• adapter removal
• quality trimming
• de Bruijn or string graph construction
• error correction
• scaffolding (contigs joined by read pairs, gaps shown as NNNNNN)
• read mapping
Long Reads (PacBio) - HGAP assembly
• reads (read length)
• read self-correction
• overlap-layout-consensus assembly
• consensus calling with quiver
• assembled genome
.h5 files contain a lot more than just basecalls and quality scores
$ h5dump -n <smrtcell_data_file>.bax.h5
[Figure: Typical (size-selected) read length distribution, P4-C2 chemistry; data from 1 SMRTcell; axes: subread length, subreads, Mb > subread length]
PacBio data is noisy
• PacBio data cannot (currently) be assembled in its raw state
• several strategies exist for correcting reads prior to assembly
• correction without complementary technology used to be difficult
  – until recently, was limited by computational power and SMRT cell throughput
Koren & Phillippy Curr Op Micro 2014
size matters
[Figure: subread length distribution before size selection; data from 1 SMRTcell, ~4 Mb genome; axes: subread length, subreads, Mb > subread length]
before size selection: mean 1,527 bp, N50 1,866 bp
[Figure: subread length distribution after size selection; data from 1 SMRTcell, ~4 Mb genome; axes: subread length, subreads, Mb > subread length]
after size selection: mean 4,505 bp, N50 6,591 bp
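The mean and N50 figures quoted for each SMRT cell can be reproduced from a plain list of subread lengths. A minimal sketch follows; the function name `n50` and the toy lengths are illustrative assumptions, not taken from the slides or from SMRT Portal.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    if not lengths:
        raise ValueError("need at least one read length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest read down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy subread lengths (made up for illustration, not real SMRT cell data).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_mean = sum(toy_lengths) / len(toy_lengths)
toy_n50 = n50(toy_lengths)
print(f"mean {toy_mean:.1f} bp, N50 {toy_n50} bp")
```

Because N50 weights reads by the bases they contribute, a few long reads pull it well above the mean, which is why size selection shifts N50 so strongly.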
other options for assembling PacBio reads
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads
data delivery and organization
• files typically transferred as gzipped tarballs (.tgz)
• deposited by Matt Bockol (Mayo) onto MSI to /project/scratch/bockolm2
  – MSI has plans to streamline data delivery
• recommend organizing chronologically by run
• create separate project/sample directories with symbolic links
• SMRT cell directory names are not informative
• within untarred parent directory, run
$ get_smrtcell_info.sh
badalame@login02 [/home/bonddr/shared/pacbio_data/runs/2014-10-15] % get_smrtcell_info.sh
2014-10-15 A01_1 WT_Gsul_BPSS_repeat_0.050_nM
2014-10-15 A01_2 WT_Gsul_BPSS_repeat_0.050_nM
2014-10-15 B01_1 WTL_BPSS_repeat_0.050_nM
2014-10-15 B01_2 WTL_BPSS_repeat_0.050_nM
2014-10-15 C01_1 CES_3_BPSS_0.050_nM
2014-10-15 C01_2 CES_3_BPSS_0.050_nM
2014-10-15 D01_1 JG233_raw_BPSS_0.050_nM
2014-10-15 D01_2 JG233_raw_BPSS_0.050_nM
2014-10-15 E01_1 JG233_S_C_BPSS_0.050_nM
2014-10-15 E01_2 JG233_S_C_BPSS_0.050_nM
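Since SMRT cell directory names are not informative, the listing above is what ties each cell to a sample. A rough sketch of grouping cells by sample name, assuming every output line has exactly three whitespace-separated fields (run date, cell ID, sample name) as in the example output; `group_cells_by_sample` is a hypothetical helper, not part of get_smrtcell_info.sh.

```python
from collections import defaultdict


def group_cells_by_sample(listing):
    """Map sample name -> list of (run_date, cell_id) from the listing text.

    Assumption: each line holds run date, cell ID and sample name,
    separated by whitespace. Lines of any other shape are skipped.
    """
    groups = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # blank or unexpected line
        run_date, cell_id, sample = fields
        groups[sample].append((run_date, cell_id))
    return dict(groups)


# First four lines of the example output shown in the slides.
listing = """\
2014-10-15 A01_1 WT_Gsul_BPSS_repeat_0.050_nM
2014-10-15 A01_2 WT_Gsul_BPSS_repeat_0.050_nM
2014-10-15 B01_1 WTL_BPSS_repeat_0.050_nM
2014-10-15 B01_2 WTL_BPSS_repeat_0.050_nM
"""
cells = group_cells_by_sample(listing)
for sample, entries in cells.items():
    print(sample, [cell_id for _, cell_id in entries])
```

The resulting groups map naturally onto one project/sample directory each.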
typical workflow: de novo microbial assembly
• gather, organize, and verify data
• start isub session within NX Client on MSI
• import SMRT cells
• run subread filtering / standard QC
• run HGAP with length cutoff to provide 100x coverage
• interpret results / re-run with modifications
• circularize chromosome(s) and plasmids
  – reorient to begin at replication origin if desired
  – upload as new “reference” sequence
• run base modification and motif detection
• iteratively run quiver until QV > 50
• final polish with short reads (if available) using Pilon
• annotate
pull reads into other pipeline(s)
import SMRT cells
• any readable file path can be scanned
• three options for importing data
1. physically move or copy SMRT cell data to /smrtanalysis/userdata/inputs_dropbox
   this is dangerous if you ever need to remove smrtanalysis from your home directory
2. create symbolic links to SMRT cell data in inputs_dropbox
$ ln -s /path/to/smrtcells ~/smrtanalysis/userdata/inputs_dropbox
   better option
3. scan defined file path(s) for SMRT cell data, e.g. /home/PIjoe/shared/pacbio_data/projects/sampleID/smrtcells
   best option: allows for personalized data organization outside SMRT Portal
once imported, SMRT cells cannot be (easily) removed from the available list
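Option 2 can be scripted when many SMRT cells need linking at once. The sketch below is one possible way to do it, run here against a throwaway temporary directory; `link_smrtcells` and the directory layout are hypothetical, and nothing here is SMRT Portal's own tooling.

```python
import os
import tempfile
from pathlib import Path


def link_smrtcells(cell_dirs, dropbox):
    """Create one symbolic link per SMRT cell directory inside dropbox.

    Names that already exist in dropbox are left untouched.
    Returns the list of links that were created.
    """
    dropbox = Path(dropbox)
    dropbox.mkdir(parents=True, exist_ok=True)
    created = []
    for cell in map(Path, cell_dirs):
        link = dropbox / cell.name
        if link.is_symlink() or link.exists():
            continue  # do not overwrite anything already present
        os.symlink(cell.resolve(), link, target_is_directory=True)
        created.append(link)
    return created


# Demonstration in a temporary directory; the layout is made up.
with tempfile.TemporaryDirectory() as tmp:
    cell = Path(tmp) / "runs" / "2014-10-15" / "A01_1"
    cell.mkdir(parents=True)
    dropbox = Path(tmp) / "smrtanalysis" / "userdata" / "inputs_dropbox"
    links = link_smrtcells([cell], dropbox)
    demo_is_link = links[0].is_symlink()
    demo_target_ok = links[0].resolve() == cell.resolve()
    print(demo_is_link, demo_target_ok)
```

Linking keeps the raw data outside the smrtanalysis directory, so removing smrtanalysis removes only the links.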
QC: adapter removal and subread filtering
key terms:
• SMRTbell library
• polymerase read
• adapter
• full pass subread
• subreads
• filtered subreads: subreads passing specified length and quality filters
• CCS reads: circular consensus sequence (ccs)
[Figure: SMRTbell template with 5' to 3' forward strand, 3' to 5' reverse strand, adapters and DNA polymerase; Travers et al. Nucl. Acids Res. (2010)]
[Figure: 1. generate amplicon, 2. ligate adaptors, 3. sequence, 4. data analysis (raw long read, single-molecule fragments, processed long read, circular consensus sequence); 1° analysis; Fichot et al. Microbiome 1:10 (2013)]
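The key terms fit together as follows: a polymerase read is cut at each adapter, the pieces are subreads, and a subread flanked by adapters on both sides is a full pass. A simplified sketch, assuming adapter positions are already known as half-open (start, end) intervals; real adapter detection and quality filtering are not modeled, and `split_into_subreads` is a hypothetical name.

```python
def split_into_subreads(read_length, adapter_hits, min_length=1):
    """Split a polymerase read at adapter hits.

    Returns a list of (start, end, full_pass) tuples, where full_pass is
    True only when the subread has an adapter on both sides. Subreads
    shorter than min_length are dropped (the length filter).
    """
    subreads = []
    cursor = 0            # start of the current candidate subread
    seen_adapter = False  # True once any adapter precedes the cursor
    for hit_start, hit_end in sorted(adapter_hits):
        if hit_start - cursor >= min_length:
            subreads.append((cursor, hit_start, seen_adapter))
        cursor = max(cursor, hit_end)
        seen_adapter = True
    # Trailing piece after the last adapter is never a full pass.
    if read_length - cursor >= min_length:
        subreads.append((cursor, read_length, False))
    return subreads


# Hypothetical 1,000-base polymerase read with two adapter hits.
subreads = split_into_subreads(1000, [(300, 345), (645, 690)], min_length=50)
for start, end, full_pass in subreads:
    print(start, end, "full pass" if full_pass else "partial")
```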
running HGAP
always run with 100x coverage of longest reads
key parameters:
• minimum subread length
  – set to value that provides ~100x coverage based on subread filtering curve
• minimum polymerase read quality
  – some pipelines default to 0.75, but I always set to 0.8 unless limited by coverage
• anticipated genome size
  – your best guess based on related species
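Choosing the minimum subread length amounts to walking down from the longest reads until the accumulated bases reach the target coverage. A minimal sketch of that arithmetic, assuming a list of subread lengths is available; `length_cutoff_for_coverage` and the toy numbers are illustrative, and this is not how SMRT Portal itself reports the filtering curve.

```python
def length_cutoff_for_coverage(subread_lengths, genome_size, target_coverage):
    """Return the largest length cutoff that still leaves at least
    target_coverage * genome_size bases in reads at or above the cutoff.

    Returns None when all reads together fall short of the target.
    """
    needed = target_coverage * genome_size
    running = 0
    for length in sorted(subread_lengths, reverse=True):
        running += length
        if running >= needed:
            return length
    return None


# Toy numbers (hypothetical): a 100 bp "genome" and a handful of reads.
toy_reads = [50, 40, 30, 30, 20, 20, 10, 10]
cutoff_1x = length_cutoff_for_coverage(toy_reads, genome_size=100, target_coverage=1)
cutoff_2x = length_cutoff_for_coverage(toy_reads, genome_size=100, target_coverage=2)
cutoff_3x = length_cutoff_for_coverage(toy_reads, genome_size=100, target_coverage=3)
print(cutoff_1x, cutoff_2x, cutoff_3x)
```

A result of None corresponds to the "limited by coverage" case, where the filters have to be relaxed or more SMRT cells added.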
pre-assembly
• HGAP automatically sets length cutoff providing 30x coverage in longest reads
• blasr maps shorter reads to longer reads
• pbdagcon calls consensus and spits out corrected long reads
  – these can be useful for other pipelines
  – some long reads get shorter!
• pre-assembled yield
  – the fraction of total seed bases (i.e. 30x in longest reads) that survived self correction
  – a low yield can result from ends being truncated and/or long reads that could not be corrected
Polymerase Read Bases        370,004,973
Length Cutoff                12,944
Seed Bases                   114,063,209
Pre-Assembled bases          76,895,237
Pre-Assembled Yield          0.674
Pre-Assembled Reads          7,719
Pre-Assembled Reads Length   9,961
Pre-Assembled N50            13,168
[Figure: plot with Read Length axis, 0 to 30000]
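The yield in the pre-assembly report is simply pre-assembled bases divided by seed bases, which can be checked against the numbers above. The interpretation of "Pre-Assembled Reads Length" as a mean (apparently truncated to a whole number) is an assumption based on the arithmetic, not something the report states.

```python
# Values copied from the pre-assembly report above.
seed_bases = 114_063_209
pre_assembled_bases = 76_895_237
pre_assembled_reads = 7_719

# Fraction of seed bases that survived self-correction.
pre_assembled_yield = pre_assembled_bases / seed_bases

# Assumed meaning of "Pre-Assembled Reads Length": bases per read, truncated.
mean_read_length = pre_assembled_bases // pre_assembled_reads

print(f"pre-assembled yield: {pre_assembled_yield:.3f}")
print(f"mean pre-assembled read length: {mean_read_length} bp")
```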
interpreting HGAP results: final assembly
my assembly always has a ~4 kb plasmid?! (not really)
• check coverage plots
  – should be even, without large spikes (collapsed repeats) or dips
• check for plasmids
  – previous versions sent plasmids to separate file
• why might you have multiple contigs? (for microbes)
  – anticipated genome size was incorrect
  – long, unresolvable/complex repeats
  – low pre-assembled yield
  – some contigs with abnormally low or high coverage might be spurious and can possibly be ignored
• BLAST any small contigs
• sum lengths of contigs and re-run HGAP if necessary
• try HGAP.2 (slower, but more accurate)
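Summing contig lengths gives a revised genome size estimate to feed back into HGAP. A small sketch that parses FASTA text directly; `contig_lengths` is a hypothetical helper and the two contigs are made up, so in practice the text would be read from the assembly .fasta of the job.

```python
def contig_lengths(fasta_text):
    """Return {contig_name: length} parsed from FASTA-formatted text."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()  # use the whole header as the name
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Tiny made-up assembly for illustration.
toy_fasta = """\
>unitig_0
ACGTACGTAC
GTACG
>unitig_1
ACGT
"""
lengths = contig_lengths(toy_fasta)
total = sum(lengths.values())
for name, length in sorted(lengths.items(), key=lambda item: item[1]):
    print(name, length)
print("total", total)
```

Listing contigs from shortest to longest also makes the small ones worth BLASTing easy to spot.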
when an assembly returns a circular genome
See http://files.pacb.com/Training/CircularContigConfirmationGepard/story.html
script for separating contigs for individual circularization: $ extract_contigs.sh
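Once a contig has been confirmed circular and its overlapping ends trimmed, reorienting it to begin at the replication origin is a rotation of the sequence. A minimal sketch, assuming the origin position is already known as a 0-based index; `reorient` is an illustrative name and is unrelated to extract_contigs.sh.

```python
def reorient(sequence, origin):
    """Rotate a circular sequence so that it begins at 0-based index origin."""
    if not 0 <= origin < len(sequence):
        raise ValueError("origin must fall inside the sequence")
    return sequence[origin:] + sequence[:origin]


# Toy circular sequence; the origin index is made up.
toy = "GGGATGCCC"
rotated = reorient(toy, 3)
print(rotated)
```

The rotation changes coordinates but not content, so the rotated sequence has the same length and base composition as the input.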
uploading reference sequences
unlike SMRT cell data, reference sequences cannot be symlinked!
• make copies of reference genomes in ~/smrtanalysis/userdata/references_dropbox
• larger genomes can take several minutes to finish uploading
• sequence(s) should be in a single .fasta file (including plasmids)
base modification and motif detection
• makes use of real-time kinetic data to evaluate potential base modifications based on
  – long active site residence time
  – interpulse duration
• pipeline also runs RS_Resequencing (i.e. quiver) by default
https://github.com/PacificBiosciences/SMRT-Analysis/wiki/SMRT-Pipe-Reference-Guide-v2.2.0
quiver isn’t perfect: using Pilon to polish remaining indels
• makes use of short read mapping to identify potential indels, SNPs, ambiguous bases, local misassemblies
$ java -Xmx16G -jar path/to/pilon-1.8.jar \
    --genome path/to/fasta --unpaired path/to/mapping.bam \
    --output sample_name --changes --variant --tracks \
    --mindepth 100
Pilon removed 128 remaining indels in 3.8 Mbp genome despite Quiver calling > QV 55 consensus
final quiver polish: 3,820,756 bp, 99.9999% (QV 60)
pilon: 128 indels detected, 3,820,884 bp
re-run quiver: 3,820,866 bp
Sequence                                     Position   Variant       Type   Coverage   Confidence   Genotype
unitig_0|quiver|quiver|quiver|quiver|pilon   3328288    3328288delA   DEL    100        50           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   3782112    3782112delG   DEL    100        50           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   1370128    1370128delC   DEL    100        49           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   2555272    2555272delG   DEL    100        49           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   3063922    3063922delG   DEL    100        49           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   2620561    2620561delG   DEL    100        48           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   2782988    2782988delG   DEL    100        48           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   2924523    2924523delT   DEL    100        48           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   2962387    2962387delC   DEL    100        48           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   3342764    3342764delA   DEL    100        48           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   218678     218678delG    DEL    100        47           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   731966     731966delG    DEL    100        47           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   2962119    2962119delC   DEL    100        47           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   520394     520394delC    DEL    100        46           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   1081259    1081259delG   DEL    100        45           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   3038349    3038349delC   DEL    100        44           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   830503     830503delG    DEL    100        43           haploid
unitig_0|quiver|quiver|quiver|quiver|pilon   3790899    3790899delC   DEL    100        41           haploid
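The QV figures follow the usual Phred scale, QV = -10 · log10(error rate), so QV 60 corresponds to 99.9999% accuracy. A short sketch of the conversion in both directions; treating all 128 indels as true consensus errors is an assumption made only to illustrate the arithmetic, and the function names are hypothetical.

```python
import math


def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1 - 10 ** (-qv / 10)


def errors_to_qv(errors, length):
    """Phred-scaled quality implied by a number of errors in length bases."""
    if errors <= 0 or length <= 0:
        raise ValueError("errors and length must be positive")
    return -10 * math.log10(errors / length)


acc60 = qv_to_accuracy(60)
# Assumption: every one of the 128 indels is a real error in the
# 3,820,756 bp quiver consensus.
implied_qv = errors_to_qv(128, 3_820_756)
print(f"QV 60 -> accuracy {acc60:.6%}")
print(f"128 errors in 3,820,756 bp -> QV {implied_qv:.1f}")
```

Under that assumption the implied quality sits noticeably below the QV reported by quiver, which is the point of the slide: the reported consensus QV can be optimistic about indels.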
where is my data?
• all secondary analysis data resides in ~/smrtanalysis/userdata/jobs
• each job is assigned a six-digit ID corresponding to its directory name
• graphs and HTML files are in /results
• filtered subreads, assembly .fasta files and anything else are in /data
typical run times on MSI
Pipeline                    Genome size (Mbp)   # contigs   # SMRT cells   coverage   wall time
RS_Subreads                 3.7                 n/a         8              580x       34 m
RS_HGAP.3                   3.7                 2           8              100x       2 h 45 m
RS_HGAP.3                   7.2                 1           11             100x       13 h 45 m
RS_Modification_and_Motif   3.7                 2           8              580x       10 h 12 m
RS_Resequencing             3.7                 2           8              140x       2 h 7 m
NOTE: all pipelines begin with adapter removal and subread filtering
visualizing results in SMRT View
• start an isub session with extra memory:
$ isub -m 32gb
• click on the SMRT View button within the job
• launches a java web application
• must be run from an active SMRT Portal session
troubleshooting
• stop SMRT Portal
$ /panfs/roc/pacbio/stop_user_portal.sh
• LAST RESORT
$ pbsave.sh
$ /panfs/roc/pacbio/delete_user_portal.sh
THIS WILL REMOVE ALL ANALYSIS DATA UNLESS SAVED!
contact [email protected]
resources and additional information
• visit http://github.com/PacificBiosciences
• email [email protected] [email protected] [email protected]
check Twitter! @PacBio @SageSci @UMNmsi @lexnederbragt @aphillippy @sergekoren @mike_schatz @pathogenomenick @BaCh_mira @LizzyWilbanks @TheGeneMyers @infoecho @BioInfoBrett @OmicsOmicsBlog @ctitusbrown
data for 5 organisms just released and freely available