2014 anu-canberra-streaming
Transcript of 2014 anu-canberra-streaming
Memory- and time-efficient approaches to sequence analysis with streaming algorithms
C. Titus Brown, [email protected]
Part I: Digital normalization
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
Problem: De Bruijn assembly graphs scale with data size, not information.
This is the effect of errors: single nucleotide variations cause long branches, and they don't rejoin quickly.
Can we change this scaling behavior?
An apparent digression: much of next-gen sequencing is redundant.
Shotgun sequencing and coverage
"Coverage" is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (300 Gbp for human)
An apparent digression: much of next-gen sequencing is redundant.
Can we eliminate this redundancy?
Digital normalization
Basic diginorm algorithm:

    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            save(read)
        else:
            pass  # discard read

Note: single pass; sublinear memory.
We can build the approach on anything that lets us estimate coverage of a read.
The median k-mer count in a read (its "sentence" of k-mers) is a reasonably good estimator of coverage. This gives us a reference-free measure of coverage.
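The loop above, plus the median k-mer estimator, can be sketched in runnable form. This is an illustrative toy, not khmer's implementation: it uses an exact Counter (khmer uses a probabilistic CountMin-style sketch to get sublinear memory), and K and CUTOFF are made-up small values.

```python
from collections import Counter
from statistics import median

K = 5        # toy k-mer size (illustrative; real analyses use k ~ 20)
CUTOFF = 3   # toy coverage cutoff

kmer_counts = Counter()

def kmers(read):
    return [read[i:i+K] for i in range(len(read) - K + 1)]

def estimated_coverage(read):
    # The median k-mer count is robust: a few rare (erroneous) k-mers
    # don't drag it down, so it tracks coverage without a reference.
    return median(kmer_counts[km] for km in kmers(read))

def diginorm(reads):
    kept = []
    for read in reads:
        if estimated_coverage(read) < CUTOFF:
            for km in kmers(read):     # update_kmer_counts(read)
                kmer_counts[km] += 1
            kept.append(read)          # save(read)
        # else: discard read
    return kept

# A read seen many times saturates quickly and gets discarded;
# novel reads are always kept.
reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
result = diginorm(reads)
print(result)
```

Of the ten identical reads, only the first two pass the coverage test; the rest are redundant data and are dropped, while the novel read at the end survives.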
Digital normalization is streaming
Digital normalization retains information, while discarding data and errors
Digital normalization is streaming error correction
Contig assembly now scales with underlying genome size
Transcriptomes, microbial genomes (incl. MDA), and most metagenomes can be assembled in under 50 GB of RAM, with ~identical or improved results.
Victory! (?)
A few “minor” drawbacks…
1. Repeats are eliminated preferentially.
2. Genuine graph tips are truncated.
3. Polyploidy is downsampled.
4. It’s not clear what happens to polymorphism.
(For these reasons, we have been pursuing alternate approaches.)
Partially discussed in Brown et al., 2012 (arXiv)
But still quite useful…
1. Assembling soil metagenomes. Howe et al., PNAS, 2014 (w/Tiedje)
2. Understanding bone-eating worm symbionts. Goffredi et al., ISME, 2014.
3. An ultra-deep look at the lamprey transcriptome. Scott et al., in preparation (w/Li)
4. Understanding development in Molgulid ascidians. Stolfi et al., eLife 2014; etc.
…and widely used (?)
Estimated ~1000 users of our software.
Diginorm algorithm now included in Trinity software from Broad Institute (~10,000 users)
Illumina TruSeq long-read technology now incorporates our approach (~100,000 users)
Part II: Wait, did you say streaming?
Diginorm can detect graph saturation
Graph saturation:

    for read in dataset:
        if estimated_coverage(read) < CUTOFF:
            update_kmer_counts(read)
            save(read)
        else:
            pass  # high-coverage read: do something clever!
"Few-pass" approach: by 20% of the way through a 100x data set, more than half the reads are saturated to 20x.
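One way to act on saturation in a single streaming pass is to watch the fraction of recent reads that are already high-coverage. A minimal sketch, with all parameters (WINDOW, the 50% rule, the toy K and CUTOFF) assumed for illustration rather than taken from khmer:

```python
from collections import Counter, deque
from statistics import median

K, CUTOFF, WINDOW = 5, 3, 20   # toy values, all assumed

kmer_counts = Counter()

def med_cov(read):
    return median(kmer_counts[read[i:i+K]] for i in range(len(read) - K + 1))

def saturation_point(reads):
    recent = deque(maxlen=WINDOW)   # 1 = saturated read, 0 = novel read
    for n, read in enumerate(reads, 1):
        if med_cov(read) < CUTOFF:
            for i in range(len(read) - K + 1):
                kmer_counts[read[i:i+K]] += 1
            recent.append(0)
        else:
            # high-coverage read: "do something clever" here instead
            recent.append(1)
        if len(recent) == WINDOW and sum(recent) / WINDOW > 0.5:
            return n                # majority of recent reads saturated
    return None

reads = ["ACGTACGTACGTACGT"] * 100
sat = saturation_point(reads)
print(sat)
```

On this degenerate input the detector fires as soon as the sliding window fills, after 20 of the 100 reads; on real data the trigger point depends on coverage depth and the window settings.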
(A) Streaming error detection for metagenomes and transcriptomes
Illumina has between 0.1% and 1% error rate.
These errors confound mapping, assembly, etc.
(Think: what if you had error-free reads? Life would be much better.)
Spectral error detection for genomes
[Figure: k-mer abundance histogram separating low-count erroneous k-mers from high-count true k-mers; Chaisson et al., 2009]
Spectral error detection on reads: error location!
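The localization idea can be sketched directly: an error at base p corrupts all K k-mers covering p, so the run of low-count k-mers pins down the error. A toy version, assuming an interior single error with a solid prefix (K and SOLID are illustrative values, not the paper's):

```python
from collections import Counter

K = 5       # toy k-mer size
SOLID = 2   # abundance threshold separating "true" from "erroneous"

def count_kmers(reads):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            counts[r[i:i+K]] += 1
    return counts

def error_position(read, counts):
    # K-mers covering an erroneous base are all low-count; for an
    # interior error with a solid prefix, the error sits at the end
    # of the first low-count k-mer.
    bad = [i for i in range(len(read) - K + 1)
           if counts[read[i:i+K]] < SOLID]
    if not bad:
        return None
    return bad[0] + K - 1

truth = "ACGTTAGCCATGGACC"
counts = count_kmers([truth] * 10)       # deep, error-free coverage
bad_read = "ACGTTAGCTATGGACC"            # single error at position 8
pos = error_position(bad_read, counts)
print(pos)
```

For the error-free read the bad list is empty and no position is reported; for the corrupted read the first low-count k-mer starts at index 4, placing the error at base 8.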
…spectral error detection for reads => transcriptome, metagenome
Spectral error detection on variable coverage data
Dataset         f saturated   Specificity   Sensitivity
Genome          100%          71.4%         77.9%
Transcriptome    92%          67.7%         63.8%
Metagenome       96%          71.2%         68.9%
Real E. coli    100%          51.1%         72.4%
How many of the errors can we pinpoint exactly?
(B) Streaming error trimming for all shotgun data
Dataset         f saturated   Error rate   Total bases trimmed   Errors remaining
Genome          100%          0.63%        31.90%                0.00%
Transcriptome    92%          0.65%        34.34%                0.07%
Metagenome       96%          0.62%        31.70%                0.04%
Real E. coli    100%          1.59%        12.96%                0.05%
We can trim reads at first error.
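Trimming at the first error follows directly from error location: keep only the prefix that precedes the first low-abundance k-mer. A minimal sketch with assumed toy thresholds (not the khmer trimming script itself):

```python
from collections import Counter

K, SOLID = 5, 2   # toy thresholds (assumed)

def count_kmers(reads):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            counts[r[i:i+K]] += 1
    return counts

def trim_at_first_error(read, counts):
    # Scan k-mers left to right; the first low-abundance k-mer marks
    # the first suspect base, so keep only the prefix before it.
    for i in range(len(read) - K + 1):
        if counts[read[i:i+K]] < SOLID:
            return read[:i + K - 1]
    return read

truth = "ACGTTAGCCATGGACC"
counts = count_kmers([truth] * 10)
trimmed = trim_at_first_error("ACGTTAGCTATGGACC", counts)
print(trimmed)
```

The corrupted read is cut back to its solid 8-base prefix; an error-free read passes through untouched.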
(C) Streaming error correction
Once you can do error detection and trimming on a streaming basis, why not error correction?
…using a new approach: streaming error correction of genomic, transcriptomic, and metagenomic data via graph alignment.
Jason Pell, Jordan Fish, Michael Crusoe
Pair-HMM-based graph alignment
Jordan Fish and Michael Crusoe
…a bit more complex...
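The corrector described here uses pair-HMM-based graph alignment; as a far simpler stand-in for the underlying idea ("use k-mer abundances to fix low-coverage bases"), here is a greedy spectral corrector. Everything about it is an illustrative assumption, not the actual method: it locates the first suspect base and keeps whichever single substitution makes every k-mer in the read solid.

```python
from collections import Counter

K, SOLID = 5, 2   # toy thresholds (assumed)

def count_kmers(reads):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            counts[r[i:i+K]] += 1
    return counts

def first_bad(read, counts):
    for i in range(len(read) - K + 1):
        if counts[read[i:i+K]] < SOLID:
            return i
    return None

def correct(read, counts):
    i = first_bad(read, counts)
    if i is None:
        return read                 # already all-solid
    p = i + K - 1                   # likely error position
    for base in "ACGT":
        candidate = read[:p] + base + read[p+1:]
        if first_bad(candidate, counts) is None:
            return candidate        # this substitution makes the read solid
    return read                     # no single substitution fixes it

truth = "ACGTTAGCCATGGACC"
counts = count_kmers([truth] * 10)
fixed = correct("ACGTTAGCTATGGACC", counts)
print(fixed)
```

Greedy substitution like this breaks down near repeats and multiple errors, which is one motivation for aligning the whole read to the k-mer graph instead.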
Error correction on simulated E. coli data (1% error rate, 100x coverage).
Michael Crusoe, Jordan Fish, Jason Pell
            TP            FP           TN             FN
Streaming   3,494,631     3,865        460,601,171    5,533
            (corrected)   (mistakes)   (OK)           (missed)
A few additional thoughts --
Sequence-to-graph alignment is a very general concept.
Could replace mapping, variant calling, BLAST, HMMER…
“Ask me for anything but time!”
-- Napoleon Bonaparte
(D) Calculating read error rates by position within read
Shotgun data is randomly sampled; any variation in mismatches with the reference by position is likely due to errors or bias.
Reads from Shakya et al., pmid 23387867
Sequencing run error profiles
Via bowtie mapping against a reference. But we can do this sub-linearly from the data, with no reference!
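A sketch of the reference-free version: instead of mapping with bowtie and tallying mismatches against a reference, tally, for each position in the read, how often spectral error detection flags that position across a sample of reads. The thresholds and the "blame the end of the first bad k-mer" rule are illustrative assumptions.

```python
from collections import Counter

K, SOLID = 5, 3   # toy thresholds (assumed)

def count_kmers(reads):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            counts[r[i:i+K]] += 1
    return counts

def error_profile(reads, counts, read_len):
    # For each read, find the first low-abundance k-mer and blame the
    # base at its end; tally per position, normalize by read count.
    flagged = [0] * read_len
    for read in reads:
        for i in range(len(read) - K + 1):
            if counts[read[i:i+K]] < SOLID:
                flagged[i + K - 1] += 1
                break
    return [f / len(reads) for f in flagged]

truth = "ACGTTAGCCATGGACC"
sample = [truth] * 8 + ["ACGTTAGCTATGGACC"] * 2   # 20% of reads err at base 8
counts = count_kmers(sample)                      # counts from the data itself
profile = error_profile(sample, counts, len(truth))
print(profile)
```

Note the counts come from the sample itself, with no reference anywhere: positions where errors concentrate show up as peaks in the profile (here, 0.2 at base 8).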
Reference-free error profile analysis
1. Requires no prior information!
2. Immediate feedback on sequencing quality (for cores & users)
3. Fast, lightweight (~100 MB, ~2 minutes)
4. Works for any shotgun sample (genomic, metagenomic, transcriptomic).
5. Not affected by polymorphisms.
7. …if we know where the errors are, we can trim them.
8. …if we know where the errors are, we can correct them.
9. …if we look at differences by graph position instead of by read position, we can call variants.
=> Streaming, online variant calling?
Future thoughts / streaming
How far can we take this?
Streaming approach supports more compute-intensive interludes – remapping, etc.
Rimmer et al., 2014
Single-pass, reference-free, tunable, streaming online variant calling.
Streaming with reads: analysis is done after sequencing.
Streaming with bases: integrate sequencing and analysis.
Directions for streaming graph analysis
Generate error profile for shotgun reads;
Variable coverage error trimming;
Streaming low-memory error correction for genomes, metagenomes, and transcriptomes;
Strain variant detection & resolution;
Streaming variant analysis.
Michael Crusoe, Jordan Fish & Jason Pell
Our software is open source
Methods that aren’t broadly available are limited in their utility!
Everything I talked about is in our github repository,
http://github.com/ged-lab/khmer
…it’s not necessarily trivial to use…
…but we’re happy to help.
We have recipes!
Planned work: distributed graph database server
ivory.idyll.org/blog/2014-moore-ddd-talk.html
Thanks for listening!