PacMin: rethinking genome analysis with long reads
Frank Austin Nothaft, AMPLab
Joint work with Adam Bloniarz
10/14/2014
Note:
• This talk is mostly speculative.
• I.e., the methods we’ll talk about are partially* implemented.
• This means you have an opportunity to steer the direction of this work!
* I’m being generous to myself.
Sequencing 101
• Most sequence data today comes from Illumina machines, which perform sequencing-by-synthesis
• We get short (100-250 bp) reads, with high accuracy
• Reads are (usually) paired
http://en.wikipedia.org/wiki/File:Sequencing_by_synthesis_Reversible_terminators.png
Current Pipelines are Reference Based
• Map subsequences to a “reference genome”
• Compute variants (diffs) against the reference
From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
An aside: What is the reference genome?
• Pool together n individuals, and assemble their genomes together
• A few problems:
• How does the reference genome handle polymorphisms?
• What about structural rearrangements?
• Subpopulation specific alternate haplotypes?
• It has gaps. 14 years after the first human reference genome was released, it is still incomplete.*
* This problem is Hard.
The Sequencing Abstraction
• Sample Poisson-distributed substrings from a larger string
• Reads are more or less unique and correct
Source string: “It was the best of times, it was the worst of times…”
Sampled reads: “It was the”, “the best of”, “times, it was”, “the worst of”, “worst of times”, “best of times”, “was the worst”
Metaphor borrowed from Michael Schatz
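A minimal sketch of the sampling abstraction, assuming fixed-length reads drawn at uniformly random start positions (under that assumption, per-position coverage is approximately Poisson). The names `sample_reads` and `coverage` and all parameter values are illustrative, not part of PacMin:

```python
import random

def sample_reads(genome, num_reads, read_length, seed=0):
    """Draw fixed-length substrings at uniformly random start positions."""
    rng = random.Random(seed)
    last_start = len(genome) - read_length
    starts = sorted(rng.randint(0, last_start) for _ in range(num_reads))
    return [(start, genome[start:start + read_length]) for start in starts]

def coverage(genome_length, reads):
    """Count how many sampled reads cover each position of the source string."""
    depth = [0] * genome_length
    for start, read in reads:
        for pos in range(start, start + len(read)):
            depth[pos] += 1
    return depth

text = "It was the best of times, it was the worst of times"
reads = sample_reads(text, num_reads=12, read_length=10)
depth = coverage(len(text), reads)
```

Positions where `depth` is 0 are the "gaps" that an assembler never sees; real sequencers add errors and bias on top of this idealized model.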
…is a leaky abstraction
• We frequently encounter “gaps” in the sequence
Ross et al, Genome Biology 2013
…is a leakier abstraction
• We preferentially sequence from “biased” regions:
Ross et al, Genome Biology 2013
A very leaky abstraction!
• Reads aren’t actually correct
• >2% error (expect 0.1% variation)
• Error probability estimates are cruddy
• Reads aren’t actually unique
• >7% of the genome is not unique (K. Curtis, SiRen)
The State of Analysis
• We’re really good at calling SNPs!
• But we’re still pretty bad at calling INDELs and SVs
• And we’re also bad at expressing diffs
• Hence, SMaSH! But really, reference + diff format need to be burnt to the ground and redesigned.
• And it’s slow: 2 weeks to sequence, 1 week to analyze. Not fast enough for practical clinical use.
Opportunities
• New read technologies are available
• Provide much longer reads (>10 kbp vs. 250 bp)
• Different error model… (15% INDEL errors, vs. 2% SNP errors)
• Generally, lower sequence-specific bias
Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
If long reads are available…
• We can use conventional methods:
Carneiro et al, Genome Biology 2012
But!
• Why not make raw assemblies out of the reads?
Find overlapping reads: for all pairs of reads (i, j)
Find consensus sequence: …ACACTGCGACTCATCGACTC…
• Problems:
1. Overlapping is O(n²), and a single evaluation is expensive anyway
2. Typical algorithms find a single consensus sequence; what if we’ve got polymorphisms?
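To make the O(n²) cost concrete, here is a naive all-pairs overlapper. It is a sketch under a strong simplifying assumption: overlaps are exact suffix/prefix matches, which would not hold for noisy long reads (those would need an alignment per pair, which is the expensive "single evaluation"). The function names and the three toy reads are hypothetical; the reads were cut from the example sequence above.

```python
from itertools import permutations

def suffix_prefix_overlap(a, b, min_overlap):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def all_pairs_overlaps(reads, min_overlap=3):
    """Evaluate every ordered pair (i, j): n * (n - 1) overlap checks."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        length = suffix_prefix_overlap(reads[i], reads[j], min_overlap)
        if length:
            overlaps[(i, j)] = length
    return overlaps

reads = ["ACACTGCGAC", "GCGACTCATC", "TCATCGACTC"]
overlaps = all_pairs_overlaps(reads)
print(overlaps)  # {(0, 1): 5, (1, 2): 5}
```

Chaining the two overlaps reconstructs ACACTGCGACTCATCGACTC, but the number of pair evaluations grows quadratically with the number of reads.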
Fast Overlapping with MinHashing
• Wonderful realization by Berlin et al.¹: overlapping is similar to the document similarity problem
• Use MinHashing to approximate similarity:
1: Berlin et al, bioRxiv 2014
Per document/read, compute signature:
1. Cut into shingles
2. Apply random hashes to shingles
3. For each random hash, take the min over all shingles
Hash into buckets: Signatures of length l can be hashed into b buckets, so we expect to compare all elements with similarity ≥ (1/b)^(b/l)
Compare: For two documents with signatures of length l, Jaccard similarity is estimated by (# equal hashes) / l
• Easy to implement in Spark: map, groupBy, map, filter
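The signature, bucketing, and comparison steps can be sketched in plain Python. This is not the Spark implementation and not the code from Berlin et al.; it assumes seeded BLAKE2 digests stand in for the "random hashes", and that the b buckets correspond to b bands of l/b signature entries each (the usual locality-sensitive hashing layout). All function names and the toy reads are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(read, k):
    """All distinct length-k substrings (shingles) of a read; needs len(read) >= k."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def hash_shingle(shingle, seed):
    """One 'random' hash function per seed, built from a seeded BLAKE2 digest."""
    data = f"{seed}:{shingle}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def signature(read, k, length):
    """For each hash function, keep the minimum hash over all shingles."""
    grams = shingles(read, k)
    return [min(hash_shingle(g, seed) for g in grams) for seed in range(length)]

def estimate_jaccard(sig_a, sig_b):
    """(# equal hashes) / l, as in the Compare step."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def candidate_pairs(signatures, bands):
    """Bucket each band of each signature; reads sharing a bucket get compared."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for name, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

reads = {
    "r1": "ACACTGCGACTCATCGACTC",
    "r2": "ACACTGCGACTCATCGACTC",
    "r3": "TTTTGGGGTTTTGGGGTTTT",
}
sigs = {name: signature(seq, k=4, length=20) for name, seq in reads.items()}
pairs = candidate_pairs(sigs, bands=5)
similarity = estimate_jaccard(sigs["r1"], sigs["r2"])
```

One plausible reading of the Spark recipe is: map (read to signature and bucket keys), groupBy (bucket key), map (pairs within a bucket to estimated similarity), filter (keep pairs above a threshold). Only pairs that collide in at least one bucket are ever compared, which is what avoids the all-pairs scan.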
Overlaps to Assemblies
• Finding pairwise overlaps gives us a directed graph between reads (lots of edges!)
Transitive Reduction
• We can find a consensus between clique members
• Or, we can reduce down:
• Via two iterations of Pregel!
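For reference, a sequential sketch of transitive reduction on a small overlap graph. It is not the Pregel formulation mentioned above: it assumes the graph is acyclic and small enough to search from each node, and the edge set is a made-up example. An edge (u, w) is dropped when some other successor of u can already reach w.

```python
def transitive_reduction(edges):
    """Drop every edge (u, w) that is implied by a longer path u -> ... -> w.

    Assumes the input is a directed acyclic graph given as (source, target) pairs.
    """
    successors = {}
    for u, w in edges:
        successors.setdefault(u, set()).add(w)
        successors.setdefault(w, set())

    def reachable_from(start):
        # Depth-first search over everything downstream of start.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in successors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reduced = set()
    for u, targets in successors.items():
        for w in targets:
            # (u, w) is redundant if another direct successor of u reaches w.
            if not any(w in reachable_from(v) for v in targets if v != w):
                reduced.add((u, w))
    return reduced

# Reads 0..3 laid out left to right; 0 also overlaps 2, and 1 also overlaps 3.
overlap_edges = {(0, 1), (1, 2), (0, 2), (2, 3), (1, 3)}
reduced = transitive_reduction(overlap_edges)
print(sorted(reduced))  # [(0, 1), (1, 2), (2, 3)]
```

The reduced graph keeps only the chain of adjacent reads, which is the "reduce down" alternative to finding a consensus within each clique.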
Actually Making Calls
• From here, we need to call copy number per edge
• Probably via Newton-Raphson based on coverage; we’re not sure yet.
• Then, per position in each edge, call alleles:
Notes:
• Equation is from Li, Bioinformatics 2011
• g = genotype state
• m = ploidy
• ε = probability allele was erroneously observed
• k = number of reads observed
• l = number of reads observed matching “reference” allele
• TBD: equation assumes biallelic observations at site and reference allele; we won’t have either of those conveniences…
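The equation itself did not survive in this text. A sketch follows, assuming it is the biallelic genotype likelihood from Li 2011, with g read as the number of reference alleles in the genotype: L(g) = (1/m^k) · ∏ over the l reference-matching reads of [(m − g)ε + gε̄] · ∏ over the remaining k − l reads of [(m − g)ε̄ + gε], where ε̄ = 1 − ε and ε is per-read. The function name and the example counts are hypothetical.

```python
from math import prod

def genotype_likelihood(g, m, ref_errors, alt_errors):
    """Likelihood of the observed reads given g reference alleles out of ploidy m.

    ref_errors: error probabilities of the l reads matching the reference allele.
    alt_errors: error probabilities of the k - l reads matching the other allele.
    Assumes a biallelic site, as the TBD note above points out.
    """
    k = len(ref_errors) + len(alt_errors)
    ref_term = prod((m - g) * e + g * (1 - e) for e in ref_errors)
    alt_term = prod((m - g) * (1 - e) + g * e for e in alt_errors)
    return ref_term * alt_term / m ** k

# Diploid site (m = 2) with 6 reference-matching and 4 alternate reads, all at 1% error.
ref_errors, alt_errors = [0.01] * 6, [0.01] * 4
likelihoods = {g: genotype_likelihood(g, 2, ref_errors, alt_errors) for g in range(3)}
best = max(likelihoods, key=likelihoods.get)
```

With a roughly even split of reads, the heterozygous state g = 1 has the highest likelihood. Extending this to multi-allelic sites without a designated reference allele is the open problem noted above.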
Output
• Current assemblers emit FASTA contigs
• In layperson’s speak: long strings
• We’ll emit “multigs”, which we’ll map back to reference graph
• Multig = multi-allelic (polymorphic) contig
• Working with UCSC, who’ve done some really neat work¹ deriving formalisms and building software for mapping between sequence graphs, and with the GA4GH reference variation team
1: Paten et al., “Mapping to a Reference Genome Structure”, arXiv 2014.