PacMin @ AMPLab All-Hands

PacMin: rethinking genome analysis with long reads

Frank Austin Nothaft, AMPLab

Joint work with Adam Bloniarz

10/14/2014

Note:

• This talk is mostly speculative.

• I.e., the methods we’ll talk about are partially* implemented.

• This means you have an opportunity to steer the direction of this work!

* I’m being generous to myself.

Sequencing 101

• Most sequence data today comes from Illumina machines, which perform sequencing-by-synthesis

• We get short (100-250 bp) reads, with high accuracy

• Reads are (usually) paired

Image: http://en.wikipedia.org/wiki/File:Sequencing_by_synthesis_Reversible_terminators.png

Current Pipelines are Reference Based

• Map subsequences to a “reference genome”

• Compute variants (diffs) against the reference

From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
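The two steps above can be illustrated with a toy example. This is a minimal sketch, not the GATK pipeline: the mapping step is assumed to be already done (each read arrives with a hypothetical start position), and `call_snps` and `min_support` are illustrative names that only handle substitutions.

```python
from collections import Counter

def call_snps(reference, mapped_reads, min_support=2):
    """Pile up mapped reads and report positions where the most common
    observed base differs from the reference (substitutions only)."""
    pileup = [Counter() for _ in reference]
    for start, read in mapped_reads:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1
    variants = []
    for pos, counts in enumerate(pileup):
        if not counts:
            continue  # no read covers this position
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and support >= min_support:
            variants.append((pos, reference[pos], base))
    return variants

reference = "ACGTACGTAC"
# (start position on the reference, read sequence); positions are assumed inputs
mapped_reads = [(0, "ACGTTC"), (2, "GTTCGT"), (4, "TCGTAC")]
print(call_snps(reference, mapped_reads))  # [(4, 'A', 'T')]
```

Everything the caller reports is expressed as a diff against the reference, which is why the quality of the reference matters so much in this style of pipeline.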

An aside: What is the reference genome?

• Pool together n individuals, and assemble their genomes together

• A few problems:

• How does the reference genome handle polymorphisms?

• What about structural rearrangements?

• Subpopulation specific alternate haplotypes?

• It has gaps. 14 years after the first human reference genome was released, it is still incomplete.*

* This problem is Hard.

The Sequencing Abstraction

• Sample Poisson-distributed substrings from a larger string

• Reads are more or less unique and correct

Source string: It was the best of times, it was the worst of times…

Sampled reads: “It was the”, “the best of”, “times, it was”, “the worst of”, “worst of times”, “best of times”, “was the worst”

Metaphor borrowed from Michael Schatz
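The sampling abstraction can be sketched in a few lines. This is a toy model under stated assumptions: read start positions are drawn uniformly at random, which makes per-position depth approximately Poisson for long strings, and the function names and parameters are illustrative only.

```python
import random

def sample_reads(genome, n_reads, read_len, seed=0):
    """Draw fixed-length substrings at uniformly random start positions."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append((start, genome[start:start + read_len]))
    return reads

def coverage(genome_len, reads):
    """Count how many reads cover each position of the source string."""
    depth = [0] * genome_len
    for start, read in reads:
        for pos in range(start, start + len(read)):
            depth[pos] += 1
    return depth

text = "It was the best of times, it was the worst of times"
reads = sample_reads(text, n_reads=20, read_len=10)
depth = coverage(len(text), reads)
print("mean depth:", sum(depth) / len(depth))
print("uncovered positions:", depth.count(0))
```

Note that a read such as "was the" can come from two places in the source string, which is the uniqueness problem the following slides return to.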

…is a leaky abstraction

• We frequently encounter “gaps” in the sequence

Ross et al, Genome Biology 2013

…is a leakier abstraction

• We preferentially sequence from “biased” regions:

Ross et al, Genome Biology 2013

A very leaky abstraction!

• Reads aren’t actually correct

• >2% error (expect 0.1% variation)

• Error probability estimates are cruddy

• Reads aren’t actually unique

• >7% of the genome is not unique (K. Curtis, SiRen)
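To see why the error rate matters, compare expected errors with expected true variants per read. This is back-of-the-envelope arithmetic using the figures above (2% treated as the error rate although the slide gives it as a lower bound, 0.1% variation) and a 100 bp read; the function name is illustrative.

```python
def expected_mismatches(read_len, error_rate=0.02, variation_rate=0.001):
    """Expected sequencing errors and true variants in one read."""
    errors = read_len * error_rate
    variants = read_len * variation_rate
    return errors, variants

errors, variants = expected_mismatches(100)
print("expected errors per read:", errors)
print("expected true variants per read:", variants)
print("errors per true variant:", errors / variants)
```

Under these assumptions, a mismatch seen in a single read is far more likely to be an error than a real variant, so calls have to rest on agreement across many reads.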

The State of Analysis

• We’re really good at calling SNPs!

• But we’re still pretty bad at calling INDELs and SVs

• And we’re also bad at expressing diffs

• Hence, SMaSH! But really, reference + diff format need to be burnt to the ground and redesigned.

• And it’s slow: 2 weeks to sequence, 1 week to analyze. Not fast enough for practical clinical use.

Opportunities

• New read technologies are available

• Provide much longer reads (>10 kbp vs. 250 bp)

• Different error model… (15% INDEL errors, vs. 2% SNP errors)

• Generally, lower sequence-specific bias

Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/

If long reads are available…

• We can use conventional methods:

Carneiro et al, Genome Biology 2012

But!

• Why not make raw assemblies out of the reads?

• Find overlapping reads: for all pairs of reads (i, j)

• Find consensus sequence: …ACACTGCGACTCATCGACTC…

• Problems:

1. Overlapping is O(n^2), and a single evaluation is expensive anyway

2. Typical algorithms find a single consensus sequence; what if we’ve got polymorphisms?
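The quadratic cost in problem 1 is easy to see in a naive overlapper. This is a minimal sketch that assumes error-free reads and exact suffix/prefix matching; with noisy long reads each pair would need an alignment instead, which is what makes a single evaluation expensive. The reads are cut from the example sequence above, and the function names are illustrative.

```python
from itertools import permutations

def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def all_pairs_overlaps(reads, min_len=3):
    """Evaluate every ordered pair of reads: n * (n - 1) comparisons."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        length = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if length:
            overlaps[(i, j)] = length
    return overlaps

# Three reads cut from ACACTGCGACTCATCGACTC
reads = ["ACACTGCGA", "GCGACTCAT", "TCATCGACTC"]
print(all_pairs_overlaps(reads))  # {(0, 1): 4, (1, 2): 4}
```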

Fast Overlapping with MinHashing

• Wonderful realization by Berlin et al.¹: overlapping is similar to the document similarity problem

• Use MinHashing to approximate similarity:

1: Berlin et al, bioRxiv 2014

Per document/read, compute signature:

1. Cut into shingles

2. Apply random hashes to shingles

3. For each random hash, take the min over all shingles

Hash into buckets: Signatures of length l can be hashed into b buckets, so we expect to compare all elements with similarity ≥ (1/b)^(b/l)

Compare: For two documents with signatures of length l, Jaccard similarity is estimated by (# equal hashes) / l

• Easy to implement in Spark: map, groupBy, map, filter
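The scheme above can be sketched in plain Python (not Spark), with the map / groupBy / map / filter stages marked in comments. Assumptions: `sig_len` and `bands` stand for l and b, with b read as the number of bands the signature is split into; the "random hashes" are seeded BLAKE2b digests; shingles are exact k-mers; reads are at least k long. None of these choices come from the talk or from Berlin et al.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(read, k=4):
    """Set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def hash_shingle(shingle, seed):
    """One member of a family of hash functions, selected by seed."""
    data = f"{seed}:{shingle}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def signature(read, sig_len=16, k=4):
    """For each hash function, keep the minimum hash over all shingles."""
    s = shingles(read, k)
    return [min(hash_shingle(x, seed) for x in s) for seed in range(sig_len)]

def candidate_pairs(signatures, bands=4):
    """Reads that agree on any whole band fall into the same bucket."""
    sig_len = len(next(iter(signatures.values())))
    rows = sig_len // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(name)                    # groupBy bucket key
    pairs = set()
    for names in buckets.values():                    # map buckets to pairs
        pairs.update(combinations(sorted(names), 2))
    return pairs

def estimated_jaccard(sig_a, sig_b):
    """(# equal hashes) / l"""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def banding_threshold(sig_len, bands):
    """Similarity above which pairs are expected to be compared: (1/b)^(b/l)."""
    return (1 / bands) ** (bands / sig_len)

reads = {
    "r1": "ACACTGCGACTCATCGACTC",
    "r2": "CACTGCGACTCATCGACTCA",
    "r3": "GGGGTTTTTTTTGGGGTTTT",
}
sigs = {name: signature(read) for name, read in reads.items()}   # map
overlaps = [
    (a, b, estimated_jaccard(sigs[a], sigs[b]))
    for a, b in sorted(candidate_pairs(sigs))
    if estimated_jaccard(sigs[a], sigs[b]) >= 0.5                # filter
]
print("threshold:", banding_threshold(16, 4))
print("overlaps:", overlaps)
```

Only reads that collide in some bucket are ever compared, which is what avoids the all-pairs cost; the 0.5 cutoff in the filter is an arbitrary value for the example.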

Overlaps to Assemblies

• Finding pairwise overlaps gives us a directed graph between reads (lots of edges!)

Transitive Reduction

• We can find a consensus between clique members

• Or, we can reduce down:

• Via two iterations of Pregel!
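The reduction step can be illustrated on a tiny overlap graph. This is a single-machine sketch, not the Pregel formulation mentioned above: it only removes edges implied by a two-hop path, and it assumes the graph has no self-loops or cycles. The function and read names are illustrative.

```python
def transitive_reduction(edges):
    """Drop edge (u, w) whenever some v gives a two-hop path u -> v -> w."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)
    reduced = set()
    for u, w in edges:
        # Look for an intermediate read v with edges u -> v and v -> w
        implied = any(
            w in successors.get(v, ()) for v in successors[u] if v != w
        )
        if not implied:
            reduced.add((u, w))
    return reduced

# r1 overlaps r2 and r3; r2 overlaps r3, so r1 -> r3 is redundant
edges = {("r1", "r2"), ("r2", "r3"), ("r1", "r3")}
print(sorted(transitive_reduction(edges)))  # [('r1', 'r2'), ('r2', 'r3')]
```

In a message-passing setting, each vertex would need its successors' successor lists to make the same decision locally, which is one plausible reading of why a couple of iterations suffice.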

Actually Making Calls

• From here, we need to call copy number per edge

• Probably via Newton-Raphson based on coverage; we’re not sure yet.

• Then, per position in each edge, call alleles:

Notes:

• Equation is from Li, Bioinformatics 2011

• g = genotype state

• m = ploidy

• 𝜖 = probability allele was erroneously observed

• k = number of reads observed

• l = number of reads observed matching “reference” allele

• TBD: equation assumes biallelic observations at site and reference allele; we won’t have either of those conveniences…
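For reference, the genotype likelihood in Li (2011) that these symbols appear to describe can be written as below. This is a reconstruction under assumptions: g is taken to count the reference alleles in the genotype (0 ≤ g ≤ m), reads are ordered so that the first l match the reference allele, and 𝜖_j is the error probability of read j. Check it against the paper before relying on it.

```latex
\mathcal{L}(g) = \frac{1}{m^{k}}
  \prod_{j=1}^{l} \left[ (m - g)\,\epsilon_j + g\,(1 - \epsilon_j) \right]
  \prod_{j=l+1}^{k} \left[ (m - g)\,(1 - \epsilon_j) + g\,\epsilon_j \right]
```

With a single error probability 𝜖 for all reads and a diploid sample (m = 2), this collapses to

```latex
\mathcal{L}(g) = \frac{1}{2^{k}}
  \left[ (2 - g)\,\epsilon + g\,(1 - \epsilon) \right]^{l}
  \left[ (2 - g)\,(1 - \epsilon) + g\,\epsilon \right]^{k - l},
  \qquad g \in \{0, 1, 2\}
```

and the called genotype is the g that maximizes it.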

Output

• Current assemblers emit FASTA contigs

• In layperson’s terms: long strings

• We’ll emit “multigs”, which we’ll map back to reference graph

• Multig = multi-allelic (polymorphic) contig

• Working with UCSC, who’ve done some really neat work¹ deriving formalisms & building software for mapping between sequence graphs, and with the GA4GH reference variation team

1. Paten et al, “Mapping to a Reference Genome Structure”, arXiv 2014.