2014 anu-canberra-streaming
Transcript of 2014 anu-canberra-streaming
Memory- and time-efficient approaches to sequence analysis with streaming algorithms
C. Titus Brown, [email protected]
Part I: Digital normalization
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
Problem: De Bruijn assembly graphs scale with data size, not information.
This is the effect of errors: single nucleotide variations cause long branches, and they don't rejoin quickly.
Can we change this scaling behavior?
An apparent digression: much of next-gen sequencing is redundant.
Shotgun sequencing and coverage
"Coverage" is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (300 Gbp for human)
An apparent digression: much of next-gen sequencing is redundant.
Can we eliminate this redundancy?
Digital normalization
Basic diginorm algorithm:

    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            save(read)
        else:
            pass  # discard read

Note: single pass; sublinear memory.
We can build the approach on anything that lets us estimate coverage of a read.
The median k-mer count in a read (its "sentence" of k-mers) is a reasonably good estimator of coverage. This gives us a reference-free measure of coverage.
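The loop above, plus the median k-mer estimator, can be sketched in runnable form. This is an illustrative toy, not khmer's implementation: it uses an exact Counter (khmer uses a probabilistic CountMin-style sketch to get sublinear memory), and K and CUTOFF are made-up small values.

```python
from collections import Counter
from statistics import median

K = 5        # toy k-mer size (illustrative; real analyses use k ~ 20)
CUTOFF = 3   # toy coverage cutoff

kmer_counts = Counter()

def kmers(read):
    return [read[i:i+K] for i in range(len(read) - K + 1)]

def estimated_coverage(read):
    # The median k-mer count is robust: a few rare (erroneous) k-mers
    # don't drag it down, so it tracks coverage without a reference.
    return median(kmer_counts[km] for km in kmers(read))

def diginorm(reads):
    kept = []
    for read in reads:
        if estimated_coverage(read) < CUTOFF:
            for km in kmers(read):     # update_kmer_counts(read)
                kmer_counts[km] += 1
            kept.append(read)          # save(read)
        # else: discard read
    return kept

# A read seen many times saturates quickly and gets discarded;
# novel reads are always kept.
reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
result = diginorm(reads)
print(result)
```

Of the ten identical reads, only the first two pass the coverage test; the rest are redundant data and are dropped, while the novel read at the end survives.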
Digital normalization is streaming
Digital normalization retains information, while discarding data and errors
Digital normalization is streaming error correction
Contig assembly now scales with underlying genome size
Transcriptomes, microbial genomes (incl. MDA), and most metagenomes can be assembled in under 50 GB of RAM, with ~identical or improved results.
Victory! (?)
A few “minor” drawbacks…
1. Repeats are eliminated preferentially.
2. Genuine graph tips are truncated.
3. Polyploidy is downsampled.
4. It’s not clear what happens to polymorphism.
(For these reasons, we have been pursuing alternate approaches.)
Partially discussed in Brown et al., 2012 (arXiv)
But still quite useful…
1. Assembling soil metagenomes. Howe et al., PNAS, 2014 (w/Tiedje)
2. Understanding bone-eating worm symbionts. Goffredi et al., ISME, 2014.
3. An ultra-deep look at the lamprey transcriptome. Scott et al., in preparation (w/Li)
4. Understanding development in Molgulid ascidians. Stolfi et al., eLife 2014; etc.
…and widely used (?)
Estimated ~1000 users of our software.
Diginorm algorithm now included in Trinity software from Broad Institute (~10,000 users)
Illumina TruSeq long-read technology now incorporates our approach (~100,000 users)
Part II: Wait, did you say streaming?
Diginorm can detect graph saturation
Graph saturation:

    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            save(read)
        else:
            pass  # high-coverage read: do something clever!
"Few-pass" approach: by 20% of the way through a 100x data set, more than half the reads are saturated to 20x.
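One way to act on saturation in a single streaming pass is to watch the fraction of recent reads that are already high-coverage. A minimal sketch, with all parameters (WINDOW, the 50% rule, the toy K and CUTOFF) assumed for illustration rather than taken from khmer:

```python
from collections import Counter, deque
from statistics import median

K, CUTOFF, WINDOW = 5, 3, 20   # toy values, all assumed

kmer_counts = Counter()

def med_cov(read):
    return median(kmer_counts[read[i:i+K]] for i in range(len(read) - K + 1))

def saturation_point(reads):
    recent = deque(maxlen=WINDOW)   # 1 = saturated read, 0 = novel read
    for n, read in enumerate(reads, 1):
        if med_cov(read) < CUTOFF:
            for i in range(len(read) - K + 1):
                kmer_counts[read[i:i+K]] += 1
            recent.append(0)
        else:
            # high-coverage read: "do something clever" here instead
            recent.append(1)
        if len(recent) == WINDOW and sum(recent) / WINDOW > 0.5:
            return n                # majority of recent reads saturated
    return None

reads = ["ACGTACGTACGTACGT"] * 100
sat = saturation_point(reads)
print(sat)
```

On this degenerate input the detector fires as soon as the sliding window fills, after 20 of the 100 reads; on real data the trigger point depends on coverage depth and the window settings.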
(A) Streaming error detection for metagenomes and transcriptomes
Illumina has between 0.1% and 1% error rate.
These errors confound mapping, assembly, etc.
(Think: what if you had error-free reads? Life would be much better.)
Spectral error detection for genomes
[Figure: k-mer abundance histogram separating low-count erroneous k-mers from high-count true k-mers; Chaisson et al., 2009]
Spectral error detection on reads: error location!
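The localization idea can be sketched directly: an error at base p corrupts all K k-mers covering p, so the run of low-count k-mers pins down the error. A toy version, assuming an interior single error with a solid prefix (K and SOLID are illustrative values, not the paper's):

```python
from collections import Counter

K = 5       # toy k-mer size
SOLID = 2   # abundance threshold separating "true" from "erroneous"

def count_kmers(reads):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            counts[r[i:i+K]] += 1
    return counts

def error_position(read, counts):
    # K-mers covering an erroneous base are all low-count; for an
    # interior error with a solid prefix, the error sits at the end
    # of the first low-count k-mer.
    bad = [i for i in range(len(read) - K + 1)
           if counts[read[i:i+K]] < SOLID]
    if not bad:
        return None
    return bad[0] + K - 1

truth = "ACGTTAGCCATGGACC"
counts = count_kmers([truth] * 10)       # deep, error-free coverage
bad_read = "ACGTTAGCTATGGACC"            # single error at position 8
pos = error_position(bad_read, counts)
print(pos)
```

For the error-free read the bad list is empty and no position is reported; for the corrupted read the first low-count k-mer starts at index 4, placing the error at base 8.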
…spectral error detection for reads => transcriptome, metagenome
Spectral error detection on variable coverage data
Dataset         f saturated   Specificity   Sensitivity
Genome          100%          71.4%         77.9%
Transcriptome    92%          67.7%         63.8%
Metagenome       96%          71.2%         68.9%
Real E. coli    100%          51.1%         72.4%
How many of the errors can we pinpoint exactly?
(B) Streaming error trimming for all shotgun data
Dataset         f saturated   Error rate   Total bases trimmed   Errors remaining
Genome          100%          0.63%        31.90%                0.00%
Transcriptome    92%          0.65%        34.34%                0.07%
Metagenome       96%          0.62%        31.70%                0.04%
Real E. coli    100%          1.59%        12.96%                0.05%
We can trim reads at first error.
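Trimming at the first error follows directly from error location: keep only the prefix that precedes the first low-abundance k-mer. A minimal sketch with assumed toy thresholds (not the khmer trimming script itself):

```python
from collections import Counter

K, SOLID = 5, 2   # toy thresholds (assumed)

def count_kmers(reads):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            counts[r[i:i+K]] += 1
    return counts

def trim_at_first_error(read, counts):
    # Scan k-mers left to right; the first low-abundance k-mer marks
    # the first suspect base, so keep only the prefix before it.
    for i in range(len(read) - K + 1):
        if counts[read[i:i+K]] < SOLID:
            return read[:i + K - 1]
    return read

truth = "ACGTTAGCCATGGACC"
counts = count_kmers([truth] * 10)
trimmed = trim_at_first_error("ACGTTAGCTATGGACC", counts)
print(trimmed)
```

The corrupted read is cut back to its solid 8-base prefix; an error-free read passes through untouched.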
(C) Streaming error correction
Once you can do error detection and trimming on a streaming basis, why not error correction?
…using a new approach: streaming error correction of genomic, transcriptomic, and metagenomic data via graph alignment.
Jason Pell, Jordan Fish, Michael Crusoe
Pair-HMM-based graph alignment
Jordan Fish and Michael Crusoe
…a bit more complex...
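The corrector described here uses pair-HMM-based graph alignment; as a far simpler stand-in for the underlying idea ("use k-mer abundances to fix low-coverage bases"), here is a greedy spectral corrector. Everything about it is an illustrative assumption, not the actual method: it locates the first suspect base and keeps whichever single substitution makes every k-mer in the read solid.

```python
from collections import Counter

K, SOLID = 5, 2   # toy thresholds (assumed)

def count_kmers(reads):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            counts[r[i:i+K]] += 1
    return counts

def first_bad(read, counts):
    for i in range(len(read) - K + 1):
        if counts[read[i:i+K]] < SOLID:
            return i
    return None

def correct(read, counts):
    i = first_bad(read, counts)
    if i is None:
        return read                 # already all-solid
    p = i + K - 1                   # likely error position
    for base in "ACGT":
        candidate = read[:p] + base + read[p+1:]
        if first_bad(candidate, counts) is None:
            return candidate        # this substitution makes the read solid
    return read                     # no single substitution fixes it

truth = "ACGTTAGCCATGGACC"
counts = count_kmers([truth] * 10)
fixed = correct("ACGTTAGCTATGGACC", counts)
print(fixed)
```

Greedy substitution like this breaks down near repeats and multiple errors, which is one motivation for aligning the whole read to the k-mer graph instead.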
Error correction on simulated E. coli data (1% error rate, 100x coverage).
Michael Crusoe, Jordan Fish, Jason Pell
            TP            FP           TN             FN
Streaming   3,494,631     3,865        460,601,171    5,533
            (corrected)   (mistakes)   (OK)           (missed)
A few additional thoughts --
Sequence-to-graph alignment is a very general concept.
Could replace mapping, variant calling, BLAST, HMMER…
“Ask me for anything but time!”
-- Napoleon Bonaparte
(D) Calculating read error rates by position within read
Shotgun data is randomly sampled; any variation in mismatches with the reference by position is likely due to errors or bias.
Reads from Shakya et al., pmid 23387867
Sequencing run error profiles
Via bowtie mapping against a reference. But we can do this sub-linearly from the data, with no reference!
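A sketch of the reference-free version: instead of mapping with bowtie and tallying mismatches against a reference, tally, for each position in the read, how often spectral error detection flags that position across a sample of reads. The thresholds and the "blame the end of the first bad k-mer" rule are illustrative assumptions.

```python
from collections import Counter

K, SOLID = 5, 3   # toy thresholds (assumed)

def count_kmers(reads):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            counts[r[i:i+K]] += 1
    return counts

def error_profile(reads, counts, read_len):
    # For each read, find the first low-abundance k-mer and blame the
    # base at its end; tally per position, normalize by read count.
    flagged = [0] * read_len
    for read in reads:
        for i in range(len(read) - K + 1):
            if counts[read[i:i+K]] < SOLID:
                flagged[i + K - 1] += 1
                break
    return [f / len(reads) for f in flagged]

truth = "ACGTTAGCCATGGACC"
sample = [truth] * 8 + ["ACGTTAGCTATGGACC"] * 2   # 20% of reads err at base 8
counts = count_kmers(sample)                      # counts from the data itself
profile = error_profile(sample, counts, len(truth))
print(profile)
```

Note the counts come from the sample itself, with no reference anywhere: positions where errors concentrate show up as peaks in the profile (here, 0.2 at base 8).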
Reference-free error profile analysis
1. Requires no prior information!
2. Immediate feedback on sequencing quality (for cores & users)
3. Fast, lightweight (~100 MB, ~2 minutes)
4. Works for any shotgun sample (genomic, metagenomic, transcriptomic).
5. Not affected by polymorphisms.
7. …if we know where the errors are, we can trim them.
8. …if we know where the errors are, we can correct them.
9. …if we look at differences by graph position instead of by read position, we can call variants.
=> Streaming, online variant calling?
Future thoughts / streaming
How far can we take this?
Streaming approach supports more compute-intensive interludes – remapping, etc.
Rimmer et al., 2014
Single-pass, reference-free, tunable, streaming online variant calling.
Streaming with reads: analysis is done after sequencing.
Streaming with bases: integrate sequencing and analysis.
Directions for streaming graph analysis
Generate error profile for shotgun reads;
Variable coverage error trimming;
Streaming low-memory error correction for genomes, metagenomes, and transcriptomes;
Strain variant detection & resolution;
Streaming variant analysis.
Michael Crusoe, Jordan Fish & Jason Pell
Our software is open source
Methods that aren’t broadly available are limited in their utility!
Everything I talked about is in our github repository,
http://github.com/ged-lab/khmer
…it’s not necessarily trivial to use…
…but we’re happy to help.
We have recipes!
Planned work: distributed graph database server
ivory.idyll.org/blog/2014-moore-ddd-talk.html
Thanks for listening!