Parallel Pair-HMM SNP Detection

GNUMAP-SNP

Nathan Clement
The University of Texas
Austin, TX, USA

Outline

Motivation
NGS Issues and Requirements
Pair-HMM
Memory Optimizations
Results
Conclusion

Motivation

Mutation Detection: SNP discovery
HapMap and resequencing
Species Identification
Bisulfite Sequencing
Epigenetic influences
RNA editing

Error Rates*

Instrument              Run Time  Mb/run     Bases/read  Primary Error Type  Error Rate (%)
3730xl (Capillary)      2 h       0.06       650         Substitution        0.1-1
454 FLX+                18-20 h   900        700         Indel               1
Illumina HiSeq2000      10 days   ≤ 600,000  100+100     Substitution        ≥0.1
Ion Torrent – 318 chip  2 h       >1000      >100        Indel               ~1
PacBio RS               0.5-2 h   5-10       860-1100    CG Deletions

* Data current as of May 2011: Glenn, Travis C., “Field guide to next-generation DNA sequencers,” Molecular Ecology Resources, vol. 11, pp. 759-769, 2011

Pair-HMM

Pair-HMM (Mathematics)

Match
Gap (in both directions)
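As a rough illustration of what the match and gap states compute, here is a minimal sketch of the textbook three-state pair-HMM forward recursion (the formulation found in Durbin et al., Biological Sequence Analysis). It is not taken from GNUMAP-SNP. The function name and the parameters `delta` (gap open), `epsilon` (gap extend) and `match` (probability that aligned bases agree), together with the uniform gap emission of 0.25, are assumptions chosen only for illustration, and the end state is omitted.

```python
def pair_hmm_forward(x, y, delta=0.2, epsilon=0.1, match=0.9):
    """Sum the probability of every alignment of x and y (no end state).

    All parameter values are illustrative assumptions, not GNUMAP-SNP settings.
    """
    n, m = len(x), len(y)
    q = 0.25  # emission probability of a single base in a gap state

    def p(a, b):
        # Joint emission of an aligned pair; sums to 1 over all 16 pairs.
        return 0.25 * (match if a == b else (1.0 - match) / 3.0)

    # One forward table per state: M (match), X (gap in y), Y (gap in x).
    f_m = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_x = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_y = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_m[0][0] = 1.0  # start in the match state before any base is emitted

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # Match: consume one base from each sequence.
                f_m[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1.0 - 2.0 * delta) * f_m[i - 1][j - 1]
                    + (1.0 - epsilon) * (f_x[i - 1][j - 1] + f_y[i - 1][j - 1])
                )
            if i > 0:
                # Gap in y: consume one base from x only.
                f_x[i][j] = q * (delta * f_m[i - 1][j] + epsilon * f_x[i - 1][j])
            if j > 0:
                # Gap in x: consume one base from y only.
                f_y[i][j] = q * (delta * f_m[i][j - 1] + epsilon * f_y[i][j - 1])

    return f_m[n][m] + f_x[n][m] + f_y[n][m]
```

Per-cell posterior tables like the M, X and Y matrices on the following slides would additionally need the matching backward pass, which this sketch leaves out.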

Pair-HMM (M)

     a     t     a     c     g     a     c     t
a  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.00  0.68  0.00  0.00  0.00  0.00  0.00  0.00
t  0.00  0.32  0.68  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.32  0.68  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00

Pair-HMM (X)

     a     t     a     c     g     a     c     t
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.31  0.00  0.00  0.00  0.00  0.00  0.00  0.00
t  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Pair-HMM (Y)

     a     t     a     c     g     a     c     t
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
t  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.31  0.00  0.00  0.00  0.00
g  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
a  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
c  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

Pair-HMM

     A     C     G     T
a  1.00  0.00  0.00  0.00
g  0.00  0.00  0.68  0.31
t  0.32  0.00  0.00  0.68
a  0.99  0.00  0.00  0.00
g  0.00  0.00  1.00  0.00
a  1.00  0.00  0.00  0.00
c  0.00  1.00  0.00  0.00

Expected Results

CHR   POS      TOT    A     C      G      T      SNP?      PVAL
chrX  1755234  17.00  0.00  0.00   17.00  0.00   N
chrX  1755235  18.00  0.00  18.00  0.00   0.00   N
chrX  1755236  19.00  9.99  0.00   9.00   0.01   Y:g->a/g  2.54e-08
chrX  1755237  19.50  0.00  0.00   0.00   19.50  N
chrX  1755238  19.50  0.00  0.00   19.50  0.00   N
chrX  1755239  46.00  0.01  19.49  0.00   0.00   N

Why Inline SNP Calling?

Post-Processing
  More disk space, less memory
Inline
  Requires more memory
  Less disk space
  Can include specific probabilities for each

Previous Optimizations

Two methods for speeding up mapping:

1. Entire genome on one machine
2. Split memory among different machines
   ○ Must normalize across all genome portions
   ○ MPI reduction
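The normalization step in the second method can be sketched as follows. This is a single-process simulation under stated assumptions: each "node" is a plain dictionary of hypothetical alignment scores for one read, and the global sum stands in for what an MPI sum reduction would compute across machines. The function and variable names are illustrative and not taken from GNUMAP-SNP.

```python
def normalize_across_nodes(per_node_scores):
    """Normalize one read's alignment scores over all genome portions.

    per_node_scores: list with one dict per node, mapping a genome
    position to that node's (unnormalized) score for the read.
    """
    # Each node can only sum the scores it holds locally.
    local_sums = [sum(scores.values()) for scores in per_node_scores]

    # Stand-in for the MPI reduction: every node learns the global sum.
    global_sum = sum(local_sums)
    if global_sum == 0:
        return [dict(scores) for scores in per_node_scores]

    # With the global sum known, each node rescales its own scores.
    return [
        {pos: score / global_sum for pos, score in scores.items()}
        for scores in per_node_scores
    ]


# Two nodes, each holding a different portion of the genome.
normalized = normalize_across_nodes([{100: 3.0}, {5000: 1.0}])
```

Without the shared sum, each node would normalize against its local total only and overstate how confidently the read maps to its own portion.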

Memory Requirements

Human Genome (3 Gbp)

HashMap ≈ 12 GB
4 bits/character = 1.5 GB
5 floating point values per base (A, C, G, T plus N) = sizeof(float) * 5 * 3 Gbp = 60 GB
Also stores total for easy computation = sizeof(float) * 3 Gbp = 12 GB
Total of ≈ 90 GB per run
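The arithmetic behind this estimate can be reproduced with a short sketch. It assumes a 4-byte float and decimal gigabytes (1 GB = 1e9 bytes), and takes the 12 GB hash map figure from the slide as given. The function and key names are illustrative.

```python
GENOME_BASES = 3e9   # human genome, about 3 billion bases
SIZEOF_FLOAT = 4     # bytes; assumes single-precision floats
GB = 1e9             # decimal gigabytes


def memory_estimate_gb(bases=GENOME_BASES, hash_gb=12.0, bytes_per_value=SIZEOF_FLOAT):
    """Break down the per-run memory footprint in GB."""
    estimate = {
        "hash_map": hash_gb,                               # taken from the slide
        "packed_genome": bases * 4 / 8 / GB,               # 4 bits per character
        "per_base_values": bytes_per_value * 5 * bases / GB,  # A, C, G, T, N
        "per_base_total": SIZEOF_FLOAT * bases / GB,       # running total per base
    }
    estimate["sum"] = sum(estimate.values())
    return estimate


normal = memory_estimate_gb()
```

The components sum to 85.5 GB, which the slide rounds up to about 90 GB. Calling `memory_estimate_gb(bytes_per_value=1)` shows the effect of storing one byte per nucleotide instead of one float.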

Three Memory Optimizations

Normal (no optimization)
Integer discretization
Centroid discretization

Integer Discretization

Only need one floating point value (for total) and 1 byte/nucleotide.
“Parts per 255”
Biggest hit: Going into and out of “integer space”

Integer Discretization

Added from r_i:

Total  A     C     G     T     N
1.0    0.00  0.68  0.31  0.01  0.00

Genome:

Total  A     C     G     T     N
12.0   3     231   7     12    3

Step 1: Convert from Integer Space

Total  A     C     G     T     N
12.0   0.15  10.9  0.33  0.56  0.15

Step 2: Add from r_i to Genome

Total  A     C     G     T     N
13.0   0.15  11.6  0.64  0.57  0.15

Step 3: Convert back to Integer Space

Total  A     C     G     T     N
13.0   2     228   13    11    2
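The three steps can be sketched in a few lines. This is a minimal illustration assuming each byte stores a nucleotide's share of the total in parts per 255 and that conversion back uses round-to-nearest. The function names and the rounding rule are assumptions, so results may differ by a count or so from the figures on the slide, which appear to be rounded for display.

```python
SCALE = 255  # "parts per 255": each nucleotide's share fits in one byte


def from_integer_space(total, counts):
    """Step 1: expand byte-sized shares back into floating point values."""
    return [c / SCALE * total for c in counts]


def to_integer_space(total, values):
    """Step 3: compress floating point values into shares of the total."""
    if total == 0:
        return [0] * len(values)
    # Rounding rule is an assumption; clamp so every share fits in a byte.
    return [min(SCALE, round(v / total * SCALE)) for v in values]


def add_read(total, counts, read_total, read_values):
    """Add one read's contribution (A, C, G, T, N) to a genome position."""
    values = from_integer_space(total, counts)               # step 1
    updated = [g + r for g, r in zip(values, read_values)]   # step 2
    new_total = total + read_total
    return new_total, to_integer_space(new_total, updated)   # step 3


new_total, new_counts = add_read(
    12.0, [3, 231, 7, 12, 3], 1.0, [0.00, 0.68, 0.31, 0.01, 0.00]
)
```

Every update pays for two conversions, which is the cost the previous slide calls the biggest hit.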

Centroid Discretization

Many states not used:
  [255, 255, 255, 255, 255]
  [0, 0, 0, 0, 0]
Many states not biologically relevant
  SNP transition (common) vs transversion (not likely)
MSA uses this compression to perform fast one-to-many alignment

Centroid Discretization (cont)

Benefits:
  Doesn’t waste impossible or infrequently used space
  Much smaller memory footprint
Drawbacks:
  Slight overhead in converting from centroid to floating point spaces
  Rounding error (how significant?)
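The idea can be sketched as a nearest-centroid lookup: each genome position stores only the index of the closest entry in a small codebook, plus its total. The codebook below is hypothetical and far smaller than a real one would be. It holds the five pure states and two transition heterozygotes (A/G and C/T), and the squared-distance metric is likewise an assumption rather than the method used in GNUMAP-SNP.

```python
# Hypothetical codebook of (A, C, G, T, N) proportions.
CENTROIDS = [
    (1.0, 0.0, 0.0, 0.0, 0.0),  # 0: all A
    (0.0, 1.0, 0.0, 0.0, 0.0),  # 1: all C
    (0.0, 0.0, 1.0, 0.0, 0.0),  # 2: all G
    (0.0, 0.0, 0.0, 1.0, 0.0),  # 3: all T
    (0.0, 0.0, 0.0, 0.0, 1.0),  # 4: all N
    (0.5, 0.0, 0.5, 0.0, 0.0),  # 5: A/G transition heterozygote
    (0.0, 0.5, 0.0, 0.5, 0.0),  # 6: C/T transition heterozygote
]


def encode(values):
    """Return the index of the centroid closest to the given base counts."""
    total = sum(values)
    if total == 0:
        return 4  # nothing observed yet; treat as N (an arbitrary choice)
    proportions = [v / total for v in values]

    def squared_distance(index):
        return sum((p - c) ** 2 for p, c in zip(proportions, CENTROIDS[index]))

    return min(range(len(CENTROIDS)), key=squared_distance)


def decode(index, total):
    """Expand a centroid index back into floating point base counts."""
    return [total * c for c in CENTROIDS[index]]


# A position with roughly equal A and G support snaps to the A/G centroid.
index = encode([9.99, 0.0, 9.0, 0.01, 0.0])
restored = decode(index, 19.0)
```

The gap between the original counts and `restored` is the rounding error listed among the drawbacks. With a coarse codebook it can be large enough to change a SNP call.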

Speed Comparison

Optimization Stats (chrX)

Optimization  Memory   Mem %   Wallclock  TP    FP
Normal        4.76GB   100%    04:25:55   1309  127
CharDisc      2.58GB   54.2%   04:36:58   677   0
CentDisc      2.01GB   42.2%   04:27:29   166   9058
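The Mem % column and the precision implied by the TP and FP counts can be checked with a short sketch. The dictionary layout and function name are illustrative. Recall is not computed because the table does not give the number of true SNPs.

```python
# (memory in GB, true positives, false positives) from the chrX table.
RESULTS = {
    "Normal": (4.76, 1309, 127),
    "CharDisc": (2.58, 677, 0),
    "CentDisc": (2.01, 166, 9058),
}


def summarize(results, baseline="Normal"):
    """Compute memory relative to the baseline and precision per method."""
    base_mem = results[baseline][0]
    summary = {}
    for name, (mem_gb, tp, fp) in results.items():
        calls = tp + fp
        summary[name] = {
            "mem_pct": 100.0 * mem_gb / base_mem,
            "precision": tp / calls if calls else float("nan"),
        }
    return summary


summary = summarize(RESULTS)
```

Precision is about 91% for the normal run, 100% for CharDisc (at roughly half the true positives), and under 2% for CentDisc.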

Conclusion

For high error rates, the HMM approach is ideal, but requires more memory
Distributing the genome across processors doesn’t scale linearly
Discretization methods provide good memory reductions (memory use falls to as little as 42% of the unoptimized run)
Centroid discretization performs poorly
Integer discretization can be used when available memory is low

Questions
