2014 anu-canberra-streaming
Memory- and time-efficient approaches to sequence analysis with streaming algorithms C. Titus Brown [email protected]

Talk at ANU.

Transcript of 2014 anu-canberra-streaming

Page 1: 2014 anu-canberra-streaming

Memory- and time-efficient approaches to sequence analysis with streaming algorithms

C. Titus Brown [email protected]

Page 2: 2014 anu-canberra-streaming

Part I: Digital normalization

Page 3: 2014 anu-canberra-streaming

Conway T C , Bromage A J Bioinformatics 2011;27:479-486

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Problem: De Bruijn assembly graphs scale with data size, not information.

Page 4: 2014 anu-canberra-streaming

This is the effect of errors:

Single nucleotide variations cause long branches

Page 5: 2014 anu-canberra-streaming

This is the effect of errors:

Single nucleotide variations cause long branches;

They don’t rejoin quickly.

Page 6: 2014 anu-canberra-streaming

Conway T C , Bromage A J Bioinformatics 2011;27:479-486

Can we change this scaling behavior?

Page 7: 2014 anu-canberra-streaming

An apparent digression:

Much of next-gen sequencing is redundant.

Page 8: 2014 anu-canberra-streaming

Shotgun sequencing and coverage

“Coverage” is simply the average number of reads that overlap each true base in the genome.

Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
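
The definition above reduces to simple arithmetic: total sequenced bases divided by genome length. A minimal sketch, where the function name and the example numbers are illustrative assumptions and not taken from the talk:

```python
def average_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads overlapping each base: (N * L) / G."""
    return num_reads * read_length / genome_length


# Hypothetical example: 50,000 reads of 100 bp sampled from a 500 kbp genome.
print(average_coverage(50_000, 100, 500_000))  # 10.0
```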

Page 9: 2014 anu-canberra-streaming

Random sampling => deep sampling needed

Typically 10-100x needed for robust recovery (300 Gbp for human)
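
The 300 Gbp figure is consistent with 100x depth over a human genome of roughly 3 Gbp. The sketch below works through that arithmetic and, under the idealized assumption that reads land uniformly at random (a Poisson model, which is an assumption added here, not something stated on the slide), estimates how much of the genome ends up with little or no coverage at a given average depth. All names are hypothetical.

```python
import math

HUMAN_GENOME_BP = 3_000_000_000  # rough size, assumed for illustration


def bases_needed(coverage, genome_length=HUMAN_GENOME_BP):
    """Total sequenced bases required to reach a given average coverage."""
    return coverage * genome_length


def fraction_below(coverage, min_depth):
    """Fraction of bases with depth < min_depth, assuming Poisson sampling."""
    return sum(
        math.exp(-coverage) * coverage ** d / math.factorial(d)
        for d in range(min_depth)
    )


print(bases_needed(100))  # 300000000000, i.e. 300 Gbp
for c in (5, 10, 30):
    print(c, fraction_below(c, 1), fraction_below(c, 5))
```

Even at 10x average coverage, a noticeable share of bases falls below 5x in this model, which is one way to see why random sampling pushes depth requirements up.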

Page 10: 2014 anu-canberra-streaming

An apparent digression:

Much of next-gen sequencing is redundant.

Can we eliminate this redundancy?

Page 11: 2014 anu-canberra-streaming

Digital normalization

Page 12: 2014 anu-canberra-streaming

Digital normalization

Page 13: 2014 anu-canberra-streaming

Digital normalization

Page 14: 2014 anu-canberra-streaming

Digital normalization

Page 15: 2014 anu-canberra-streaming

Digital normalization

Page 16: 2014 anu-canberra-streaming

Digital normalization

Page 17: 2014 anu-canberra-streaming

Basic diginorm algorithm

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read

Note: single pass; sublinear memory.

We can build the approach on anything that lets us estimate coverage of a read.
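
A minimal runnable sketch of the loop above, assuming the median k-mer count as the coverage estimator. The helper names, the tiny `K` and `CUTOFF` values, and the toy reads are illustrative assumptions. The exact `Counter` used here grows with the number of distinct k-mers; it is a stand-in for clarity, and a fixed-size approximate counting structure would be needed to get the memory behavior the slide describes.

```python
from collections import Counter
from statistics import median

K = 4        # illustrative; real data would use a much larger k
CUTOFF = 3   # illustrative coverage cutoff


def kmers(seq, k=K):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def estimated_coverage(read, counts, k=K):
    """Median count of the read's k-mers among the reads kept so far."""
    ks = kmers(read, k)
    return median(counts[km] for km in ks) if ks else 0


def normalize(reads, cutoff=CUTOFF, k=K):
    """Single pass over the reads: keep a read only if its region is still under-covered."""
    counts = Counter()
    kept = []
    for read in reads:
        if estimated_coverage(read, counts, k) < cutoff:
            counts.update(kmers(read, k))  # only kept reads update the counts
            kept.append(read)
        # else: the read is redundant and is discarded
    return kept


# Toy data: one sequence seen 10 times, another seen twice.
reads = ["ACGTACGGTCA"] * 10 + ["TTGACCATGCA"] * 2
print(len(normalize(reads)))  # 5: three copies of the first read, both copies of the second
```

The high-coverage sequence is capped at the cutoff while the low-coverage one is kept in full, which is the sense in which data is discarded but information is retained.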

Page 18: 2014 anu-canberra-streaming

The median k-mer count in a “sentence” is a ~good estimator of coverage.

This gives us a reference-free measure of coverage.
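
One reason the median is a reasonable choice: a single sequencing error only disturbs the k-mers that overlap it, so the median count is unchanged while the minimum and the mean are pulled down. The toy sequence and values below are assumptions made up for illustration.

```python
from collections import Counter
from statistics import mean, median

K = 5  # illustrative k-mer size


def kmers(seq, k=K):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


fragment = "ACGTTGCAAGCTTAGGCTAACGGAT"  # hypothetical 25 bp "genome" region

# Pretend the region has already been sequenced 20 times without error.
counts = Counter()
for _ in range(20):
    counts.update(kmers(fragment))

# A new read with one substitution at position 12 (T -> C).
read_with_error = fragment[:12] + "C" + fragment[13:]
per_kmer = [counts[km] for km in kmers(read_with_error)]

print(median(per_kmer), min(per_kmer))  # 20 0
print(round(mean(per_kmer), 1))         # 15.2
```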

Page 19: 2014 anu-canberra-streaming

Digital normalization is streaming

Page 20: 2014 anu-canberra-streaming

Digital normalization is streaming

Page 21: 2014 anu-canberra-streaming

Digital normalization is streaming

Page 22: 2014 anu-canberra-streaming

Digital normalization is streaming

Page 23: 2014 anu-canberra-streaming

Digital normalization is streaming

Page 24: 2014 anu-canberra-streaming

Digital normalization is streaming

Page 25: 2014 anu-canberra-streaming

Digital normalization retains information, while discarding data and errors

Page 26: 2014 anu-canberra-streaming

Digital normalization is streaming error correction

Page 27: 2014 anu-canberra-streaming

Digital normalization retains information, while discarding data and errors

Page 28: 2014 anu-canberra-streaming

Contig assembly now scales with underlying genome size

Transcriptomes, microbial genomes (including MDA), and most metagenomes can be assembled in under 50 GB of RAM, with ~identical or improved results.

Page 29: 2014 anu-canberra-streaming

Conway T C , Bromage A J Bioinformatics 2011;27:479-486

Victory! (?)

Page 30: 2014 anu-canberra-streaming

A few “minor” drawbacks…

1. Repeats are eliminated preferentially.

2. Genuine graph tips are truncated.

3. Polyploidy is downsampled.

4. It’s not clear what happens to polymorphism.

(For these reasons, we have been pursuing alternate approaches.)

Partially discussed in Brown et al., 2012 (arXiv)

Page 31: 2014 anu-canberra-streaming

But still quite useful…

1. Assembling soil metagenomes. Howe et al., PNAS, 2014 (w/Tiedje)

2. Understanding bone-eating worm symbionts. Goffredi et al., ISME, 2014.

3. An ultra-deep look at the lamprey transcriptome. Scott et al., in preparation (w/Li)

4. Understanding development in Molgulid ascidians. Stolfi et al., eLife 2014; etc.

Page 32: 2014 anu-canberra-streaming

…and widely used (?)

Estimated ~1000 users of our software.

Diginorm algorithm now included in Trinity software from Broad Institute (~10,000 users)

Illumina TruSeq long-read technology now incorporates our approach (~100,000 users)

Page 33: 2014 anu-canberra-streaming

Part II: Wait, did you say streaming?

Page 34: 2014 anu-canberra-streaming

Diginorm can detect graph saturation

Page 35: 2014 anu-canberra-streaming

Graph saturation

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # high coverage read: do something clever!

Page 36: 2014 anu-canberra-streaming

“Few-pass” approach

By 20% of the way through a 100x data set, more than half the reads are saturated to 20x.
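
A rough sketch of how saturation could be measured during a single pass: count how many reads already have an estimated coverage at or above the cutoff when they arrive. The function names, the exact `Counter`, and the toy data are assumptions for illustration and do not reproduce the measurement on the slide.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def saturated_fraction(reads, cutoff, k):
    """Fraction of reads whose region was already at >= cutoff coverage when seen."""
    counts = Counter()
    saturated = total = 0
    for read in reads:
        ks = kmers(read, k)
        total += 1
        if ks and median(counts[km] for km in ks) >= cutoff:
            saturated += 1      # high-coverage read: candidate for further analysis
        else:
            counts.update(ks)   # low-coverage read: load it into the counts
    return saturated / total if total else 0.0


# Toy data: the same read ten times, cutoff of 3.
print(saturated_fraction(["ACGTACGGTCA"] * 10, cutoff=3, k=4))  # 0.7
```

Reads that arrive after saturation can be analyzed immediately against the counts already collected; only the early, unsaturated reads would need a second look, which is what makes the approach "few-pass" instead of strictly single-pass.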

Page 37: 2014 anu-canberra-streaming

Graph saturation

for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # high coverage read: do something clever!

Page 38: 2014 anu-canberra-streaming

(A) Streaming error detection for metagenomes and transcriptomes

Illumina has between 0.1% and 1% error rate.

These errors confound mapping, assembly, etc.

(Think: what if you had error-free reads? Life would be much better.)

Page 39: 2014 anu-canberra-streaming

Spectral error detection for genomes

Chaisson et al., 2009

Erroneous k-mers

True k-mers

Page 40: 2014 anu-canberra-streaming

Spectral error detection on reads --

Error location!
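
A minimal sketch of how low-abundance k-mers can point to an error position within a read, assuming isolated substitution errors in a region that is already well covered. The function name, thresholds, and toy sequence are hypothetical; this is an illustration of the idea, not the implementation used for the results in the talk.

```python
from collections import Counter

K = 5  # illustrative k-mer size


def kmers(seq, k=K):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def find_error_positions(read, counts, k=K, min_count=2):
    """Estimate error positions from runs of low-abundance k-mers."""
    low = [counts[km] < min_count for km in kmers(read, k)]
    positions = []
    i = 0
    while i < len(low):
        if not low[i]:
            i += 1
            continue
        j = i
        while j + 1 < len(low) and low[j + 1]:
            j += 1
        # Run of bad k-mers starting at i and ending at j.
        # Normally the error is the last base of the first bad k-mer; if the run
        # touches the read start, use the first base of the last bad k-mer instead.
        positions.append(i + k - 1 if i > 0 else j)
        i = j + 1
    return positions


fragment = "ACGTTGCAAGCTTAGGCTAACGGAT"  # hypothetical well-covered region
counts = Counter()
for _ in range(20):
    counts.update(kmers(fragment))

read = fragment[:12] + "C" + fragment[13:]  # substitution at position 12
print(find_error_positions(read, counts))   # [12]
```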

Page 41: 2014 anu-canberra-streaming

…spectral error detection for reads => transcriptome, metagenome

Chaisson et al., 2009

Erroneous k-mers

True k-mers

Page 42: 2014 anu-canberra-streaming

Spectral error detection on variable coverage data

                f saturated   Specificity   Sensitivity

Genome          100%          71.4%         77.9%

Transcriptome   92%           67.7%         63.8%

Metagenome      96%           71.2%         68.9%

Real E. coli    100%          51.1%         72.4%

How many of the errors can we pinpoint exactly?

Page 43: 2014 anu-canberra-streaming

(B) Streaming error trimming for all shotgun data

                f saturated   error rate   total bases trimmed   errors remaining

Genome          100%          0.63%        31.90%                0.00%

Transcriptome   92%           0.65%        34.34%                0.07%

Metagenome      96%           0.62%        31.70%                0.04%

Real E. coli    100%          1.59%        12.96%                0.05%

We can trim reads at first error.
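
Trimming at the first error can be sketched as follows, under the assumption that only reads from saturated (high-coverage) regions are trimmed, since in low-coverage regions a rare k-mer may simply be real. The function name, thresholds, and toy data are hypothetical.

```python
from collections import Counter
from statistics import median

K = 5  # illustrative k-mer size


def kmers(seq, k=K):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trim_at_first_error(read, counts, k=K, cutoff=10, min_count=2):
    """Cut a saturated read just before the first base that creates a rare k-mer."""
    per_kmer = [counts[km] for km in kmers(read, k)]
    if not per_kmer or median(per_kmer) < cutoff:
        return read  # not saturated: low counts are not evidence of error
    for i, count in enumerate(per_kmer):
        if count < min_count:
            # The suspect base is the last base of the first bad k-mer.
            # If the very first k-mer is bad, nothing trustworthy precedes it.
            return read[:i + k - 1] if i > 0 else ""
    return read


fragment = "ACGTTGCAAGCTTAGGCTAACGGAT"  # hypothetical well-covered region
counts = Counter()
for _ in range(20):
    counts.update(kmers(fragment))

read = fragment[:12] + "C" + fragment[13:]  # substitution at position 12
print(trim_at_first_error(read, counts))    # ACGTTGCAAGCT
```

Trimming at the first error discards everything downstream of it, including correct bases, which is consistent with a large share of total bases being trimmed while very few errors remain.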

Page 44: 2014 anu-canberra-streaming

(C) Streaming error correction

Once you can do error detection and trimming on a streaming basis, why not error correction?

…using a new approach…

Page 45: 2014 anu-canberra-streaming

Streaming error correction of genomic, transcriptomic, metagenomic data via graph alignment

Jason Pell, Jordan Fish, Michael Crusoe

Page 46: 2014 anu-canberra-streaming

Pair-HMM-based graph alignment

Jordan Fish and Michael Crusoe

Page 47: 2014 anu-canberra-streaming

…a bit more complex...

Jordan Fish and Michael Crusoe

Page 48: 2014 anu-canberra-streaming

Error correction on simulated E. coli data

1% error rate, 100x coverage.

Michael Crusoe, Jordan Fish, Jason Pell

            TP            FP           TN            FN

            (corrected)   (mistakes)   (OK)          (missed)

Streaming   3,494,631     3,865        460,601,171   5,533
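
For reference, the standard summary rates can be derived from the four counts above. The counts are taken from the slide; the choice of metrics and the variable names are added here for illustration.

```python
# Counts from the slide.
tp, fp, tn, fn = 3_494_631, 3_865, 460_601_171, 5_533

sensitivity = tp / (tp + fn)  # share of true errors that were corrected
specificity = tn / (tn + fp)  # share of correct bases left untouched
precision = tp / (tp + fp)    # share of corrections that were right

print(f"sensitivity={sensitivity:.4f}")  # sensitivity=0.9984
print(f"specificity={specificity:.6f}")  # specificity=0.999992
print(f"precision={precision:.4f}")      # precision=0.9989
```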

Page 49: 2014 anu-canberra-streaming
Page 50: 2014 anu-canberra-streaming
Page 51: 2014 anu-canberra-streaming

A few additional thoughts --

Sequence-to-graph alignment is a very general concept.

Could replace mapping, variant calling, BLAST, HMMER…

“Ask me for anything but time!”

-- Napoleon Bonaparte

Page 52: 2014 anu-canberra-streaming

(D) Calculating read error rates by position within read

Shotgun data is randomly sampled;

any variation in mismatches with reference by position is likely due to errors or bias.

Page 53: 2014 anu-canberra-streaming

Reads from Shakya et al., pmid 23387867

Sequencing run error profiles

Via bowtie mapping against reference --

Page 54: 2014 anu-canberra-streaming

We can do this sub-linearly from data w/no reference!

Reads from Shakya et al., pmid 23387867
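
A rough sketch of how a per-position error profile might be tallied without a reference: load reads into the k-mer counts until a region saturates, then, for each saturated read, record where a run of rare k-mers begins. It assumes isolated substitutions and uses an exact `Counter`, so it illustrates the idea only; the names, thresholds, and toy data are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def error_profile(reads, k=5, cutoff=10, min_count=2):
    """Return {read position: fraction of saturated reads with an error there}."""
    counts = Counter()
    errors_at = Counter()
    examined = 0
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue
        per_kmer = [counts[km] for km in ks]
        if median(per_kmer) < cutoff:
            counts.update(ks)  # region not yet saturated: just load the read
            continue
        examined += 1
        for i in range(1, len(per_kmer)):
            # Transition from a trusted k-mer to a rare one marks an error
            # at the last base of the rare k-mer.
            if per_kmer[i] < min_count and per_kmer[i - 1] >= min_count:
                errors_at[i + k - 1] += 1
    if not examined:
        return {}
    return {pos: n / examined for pos, n in sorted(errors_at.items())}


fragment = "ACGTTGCAAGCTTAGGCTAACGGAT"         # hypothetical region
err12 = fragment[:12] + "C" + fragment[13:]    # substitution at position 12
err20 = fragment[:20] + "A" + fragment[21:]    # substitution at position 20
profile = error_profile([fragment] * 20 + [err12, err12, err20])
print(profile)  # positions 12 and 20, as fractions of the 13 saturated reads
```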

Page 55: 2014 anu-canberra-streaming

Reference-free error profile analysis

1. Requires no prior information!

2. Immediate feedback on sequencing quality (for cores & users)

3. Fast, lightweight (~100 MB, ~2 minutes)

4. Works for any shotgun sample (genomic, metagenomic, transcriptomic).

5. Not affected by polymorphisms.

Page 56: 2014 anu-canberra-streaming

Reference-free error profile analysis

7. …if we know where the errors are, we can trim them.

8. …if we know where the errors are, we can correct them.

9. …if we look at differences by graph position instead of by read position, we can call variants.

=> Streaming, online variant calling?

Page 57: 2014 anu-canberra-streaming

Future thoughts / streaming

How far can we take this?

Page 58: 2014 anu-canberra-streaming

Streaming approach supports more compute-intensive interludes – remapping, etc.

Rimmer et al., 2014

Page 59: 2014 anu-canberra-streaming

Streaming online reference-free variant calling.

Single-pass, reference-free, tunable, streaming online variant calling.

Page 60: 2014 anu-canberra-streaming

Streaming with reads…

Page 61: 2014 anu-canberra-streaming

Analysis is done after sequencing.

Page 62: 2014 anu-canberra-streaming

Streaming with bases

Page 63: 2014 anu-canberra-streaming

Integrate sequencing and analysis

Page 64: 2014 anu-canberra-streaming

Directions for streaming graph analysis

Generate error profile for shotgun reads;

Variable coverage error trimming;

Streaming low-memory error correction for genomes, metagenomes, and transcriptomes;

Strain variant detection & resolution;

Streaming variant analysis.

Michael Crusoe, Jordan Fish & Jason Pell

Page 65: 2014 anu-canberra-streaming

Our software is open source

Methods that aren’t broadly available are limited in their utility!

Everything I talked about is in our github repository,

http://github.com/ged-lab/khmer

…it’s not necessarily trivial to use…

…but we’re happy to help.

Page 66: 2014 anu-canberra-streaming

We have recipes!

Page 67: 2014 anu-canberra-streaming

Planned work: distributed graph database server

ivory.idyll.org/blog/2014-moore-ddd-talk.html

Page 68: 2014 anu-canberra-streaming

Thanks for listening!