Param selection phase1summary_v2

Systematic Analysis of Parameter Selection for Sequence Alignment Algorithms: Project Recap and Phase 1 Summary. Aaron Smalter Hall, Molecular Graphics and Modeling Laboratory, University of Kansas, June 26, 2013

description

Aaron Smalter Hall, KU. http://msg.dept.ku.edu/webs/msg/mgm/contacts.shtml Results of the aligner parameter selection study.

Transcript of Param selection phase1summary_v2

Page 1: Param selection phase1summary_v2

Systematic Analysis of Parameter Selection for Sequence Alignment Algorithms

Project Recap and Phase 1 Summary

Aaron Smalter Hall

Molecular Graphics and Modeling Laboratory

University of Kansas

June 26, 2013

Page 2: Param selection phase1summary_v2

Motivation

● Genomics has become heavily dependent on the use of sequence alignment tools

● Performance of sequence alignment is directly dependent on parameters

● To date there is no systematic analysis of sequence alignment parameters and their effects on alignment performance

Page 3: Param selection phase1summary_v2

Challenges

● Sequence alignment is computationally intensive

● Sequence alignment is often controlled by many different parameters

● Often not tractable to perform alignment with multiple parameter combinations

● Number of reads in a data set is growing – partially offset by better hardware

Page 4: Param selection phase1summary_v2

Approach

● Systematic analysis of effects of parameter perturbation on sequence alignment behavior

– Analyze performance sensitivity of individual parameters (phase 1)

– Analyze performance sensitivity of parameter combinations (phase 2)

– Compare performance characteristics across sequence alignment tools (phase 3)

Page 5: Param selection phase1summary_v2

Experimental Design (Phase 1)

● Identify a broad set of interesting alignment tools

● Identify a broad set of interesting parameters for each tool

● Identify interesting data sets to test tools/parameters

● Execute tools on the same data sets, while changing parameters individually over a wide range
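
The one-at-a-time design above can be sketched as follows. This is a minimal illustration, not the scripts used in the study: the `DEFAULTS` values are assumed BWA-mem defaults, the swept values are copied from the BWA-mem sensitivity table later in this deck, and the file names and helper names are placeholders.

```python
# Assumed defaults for three BWA-mem flags (an assumption, not taken from the slides).
DEFAULTS = {"-k": 19, "-B": 4, "-O": 6}

# Values to sweep, copied from the BWA-mem parameter table in this deck.
SWEEPS = {
    "-k": [0, 1, 10, 19, 100],
    "-B": [0, 1, 4, 10, 100, 1000],
    "-O": [0, 1, 6, 10, 100, 1000],
}


def one_at_a_time(defaults, sweeps):
    """Yield (flag, value, settings) with only one flag moved off its default."""
    for flag, values in sweeps.items():
        for value in values:
            settings = dict(defaults)  # every other flag stays at its default
            settings[flag] = value
            yield flag, value, settings


def to_command(settings, ref="ref.fa", r1="reads_1.fq", r2="reads_2.fq"):
    """Build a `bwa mem` argument list; the file names are hypothetical."""
    cmd = ["bwa", "mem"]
    for flag, value in settings.items():
        cmd += [flag, str(value)]
    return cmd + [ref, r1, r2]


runs = list(one_at_a_time(DEFAULTS, SWEEPS))
for flag, value, settings in runs[:3]:
    print(flag, value, " ".join(to_command(settings)))
```

Each run differs from the default configuration in at most one flag, so any change in CPU time, reads mapped, or MAPQ can be attributed to that flag alone.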

Page 6: Param selection phase1summary_v2

Experimental Design (Phase 2)

● Identify alignment parameters for each tool that are individually sensitive to changes

● Define sensitivity w.r.t.:

– Computation time and memory required

– Read mapping rate

– Read mapping quality

● Identify functional ranges for each individual parameter

● Execute alignment tools on combinations of parameters, while perturbing parameters across functional range

● Identify regions of increased sensitivity and execute alignment tools with finer grained parameter value intervals
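
The last step, refining the grid where a metric changes fastest, might look like the sketch below. The function names and the example numbers are made up for illustration. The deck does not say how "regions of increased sensitivity" will be detected, so the largest jump between adjacent tested values is used here as one simple assumption.

```python
def most_sensitive_interval(values, metric):
    """Return the adjacent pair of tested values with the largest metric change."""
    best = max(
        range(len(values) - 1),
        key=lambda i: abs(metric[i + 1] - metric[i]),
    )
    return values[best], values[best + 1]


def refine(lo, hi, steps):
    """Return `steps` evenly spaced values strictly between lo and hi."""
    width = (hi - lo) / (steps + 1)
    return [lo + width * i for i in range(1, steps + 1)]


# Hypothetical coarse sweep: parameter values and a metric observed at each one.
coarse_values = [0, 1, 10, 100]
coarse_metric = [5, 5, 9, 10]

lo, hi = most_sensitive_interval(coarse_values, coarse_metric)
print(lo, hi, refine(lo, hi, 2))
```

Here the metric changes most between 1 and 10, so the next round of runs would sample that interval more densely.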

Page 7: Param selection phase1summary_v2

Experimental Design (Phase 3)

● Identify relationships between parameters

● Identify parameter space regions of best performance

● Compare performance and sensitivities across sequence alignment tools

Page 8: Param selection phase1summary_v2

Current Status

● We are at the end of phase one, ready to move into phase two

● Completed:

– Select alignment tools

– Select data sets

– Select parameters of interest

– Run experiments across broad parameter ranges

– Collect performance and sensitivity data

● Now: visualize and assess data for each alignment tool

Page 9: Param selection phase1summary_v2

Phase 2 Requirements

● Identify sensitive parameters

● Identify functional range of parameters

● Write software scripts to automatically generate parameter combination jobs to run on cluster (there will be many, many jobs)

● Execute jobs and collect results
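
A job generator for the combination runs could be sketched as below. The ranges in `RANGES` are placeholders rather than the functional ranges found in this study, the file names are hypothetical, and no scheduler directives are included because the cluster's submission syntax is not described in these slides.

```python
import itertools

# Hypothetical functional ranges for three BWA-mem flags (placeholders, not results).
RANGES = {"-k": [10, 19, 30], "-B": [1, 4, 10], "-O": [1, 6, 10]}


def combinations(ranges):
    """Yield one settings dict per point in the full parameter grid."""
    flags = list(ranges)
    for values in itertools.product(*(ranges[f] for f in flags)):
        yield dict(zip(flags, values))


def job_script(job_id, settings):
    """Return the text of a shell script that runs one parameter combination."""
    opts = " ".join(f"{flag} {value}" for flag, value in settings.items())
    return (
        "#!/bin/sh\n"
        f"# job {job_id}: {opts}\n"
        f"bwa mem {opts} ref.fa reads_1.fq reads_2.fq > job_{job_id}.sam\n"
    )


jobs = [job_script(i, s) for i, s in enumerate(combinations(RANGES))]
print(len(jobs))
print(jobs[0])
```

The job count is the product of the range sizes (3 x 3 x 3 = 27 here), which is why restricting each parameter to its functional range matters before combining them.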

Page 10: Param selection phase1summary_v2

Experimental Choices

● Datasets

● Alignment tools

● Collected Results

– Parameters

– Sensitivities

– Functional ranges

Page 11: Param selection phase1summary_v2

Datasets

● Collected several data sets:

– DePristo/Broad NA12878 Whole Genome

– DePristo/Broad NA17878 Whole Exome

– Synthetic paired end

  ● 10 million pairs

  ● 1 million pairs

  ● 100k pairs

  ● 10k pairs

  ● 1k pairs

Page 12: Param selection phase1summary_v2

Alignment Tools

● BWA-mem

● BWA-sw

● SOAP2

● Bowtie2

● Novoalign

● SeqAlto

● RazerS

Page 13: Param selection phase1summary_v2

Collected Results

● For each alignment tool:

– Table of parameters ranked (roughly) by magnitude of effect on performance

  ● according to standard deviation of performance characteristics

– Figures of most sensitive parameters showing performance results over entire parameter range tested

– Scatter plots of every experimental result showing tradeoffs:

  ● CPU time vs. reads mapped

  ● CPU time vs. mean MAPQ

  ● Reads mapped vs. mean MAPQ
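
The ranking by standard deviation can be illustrated with a short sketch. The input numbers are invented, and using the population standard deviation (`statistics.pstdev`) is an assumption, since the slides do not say which estimator was used.

```python
from statistics import pstdev

# Made-up example: for each parameter, the reads-mapped count at each tested value.
reads_mapped = {
    "a": [1, 1, 1],    # no effect at all
    "b": [0, 10, 20],  # large effect
    "c": [4, 5, 6],    # small effect
}


def rank_by_spread(results):
    """Rank parameters by the standard deviation of a metric across tested values."""
    spread = {name: pstdev(values) for name, values in results.items()}
    return sorted(spread.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_spread(reads_mapped)
for name, sd in ranking:
    print(f"{name}: {sd:.2f}")
```

A parameter whose metric barely moves across its tested values lands at the bottom of the ranking, which matches the "ranked (roughly) by magnitude of effect" description above.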

Page 14: Param selection phase1summary_v2

Comparison at Defaults

Aligner    CPU Usage  Max V.Mem  Mapped reads  MAPQ mean  MAPQ stdev  Pct mismatch
BWA-mem       483.35     5.576G       5827211    44.0732     20.0053       25.4545
BWA-sw       1102.11     5.215G       2853347    64.7333     69.9756       25.3673
SOAP2         463.01     5.613G       5645049    22.8406     12.7877       30.0828
Bowtie2      1565.36     3.343G       5803539    25.1817     14.6297       25.7779
Novoalign    4247.55     7.981G       5296829     65.422      11.862       25.3865
SeqAlto      2823.06     7.022G       5669412     49.164     18.4466       25.8334
RazerS      26738.65     8.577G       1777714        255           0       38.9175
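
One way to read the default-settings table is to ask which aligners are not beaten on both CPU usage and mapped reads at once. The sketch below applies a simple Pareto-dominance check to those two columns. This analysis is not part of the original deck, the CPU values are taken as given (the slides do not state units), and the function name is hypothetical.

```python
# (CPU usage, mapped reads) at default settings, copied from the table above.
DEFAULT_RUNS = {
    "BWA-mem": (483.35, 5827211),
    "BWA-sw": (1102.11, 2853347),
    "SOAP2": (463.01, 5645049),
    "Bowtie2": (1565.36, 5803539),
    "Novoalign": (4247.55, 5296829),
    "SeqAlto": (2823.06, 5669412),
    "RazerS": (26738.65, 1777714),
}


def pareto_front(runs):
    """Return aligners that no other aligner beats on both CPU and mapped reads."""
    front = []
    for name, (cpu, reads) in runs.items():
        dominated = any(
            o_cpu <= cpu and o_reads >= reads and (o_cpu < cpu or o_reads > reads)
            for other, (o_cpu, o_reads) in runs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(DEFAULT_RUNS))
```

On these two columns only BWA-mem and SOAP2 remain. The picture changes once MAPQ or mismatch percentage is included, so this check says nothing about alignment accuracy.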

Page 15: Param selection phase1summary_v2

BWA-mem

● A recent addition to the BWA package

– Designed for short reads up to 100bp

● Based on Burrows-Wheeler Transform index structures

● Some parameter values caused BWA to find more reads than should be present

● Fairly typical set of parameters

● Released 2012

Page 16: Param selection phase1summary_v2

BWA-mem Parameter Sensitivities

Name                              Flag       CPU    Memory         Reads   MAPQ  Parameter Values                  Invalid Values
minimum seed length               -k      274.71  1,439.18  1,231,012.26  20.08  [0,1,10,19,100]                   1,000.00
occurrence threshold for discard  -c    1,432.77  7,771.84    178,211.44   5.45  [1,10,100,1000,10000,100000]      0.00
mismatch penalty                  -B       55.54    290.55    892,401.53   8.28  [0,1,4,10,100,1000]               []
matching score                    -A      137.85    713.91    481,096.44   2.29  [1,10,100,1000]                   0.00
unpaired penalty                  -U       32.41    166.97          8.52   7.25  [0,1,9,10,100,1000]               []
re-seeding threshold              -r       59.50    301.14        620.75   1.64  [0,1,1.01,1.1,1.5,2,10,100,1000]  []
gap open penalty                  -O       50.71    264.05     13,670.66   0.76  [0,1,6,10,100,1000]               []
band width                        -w       35.58    185.60      9,465.24   0.10  [0,1,10,100,1000,10000]           []
gap extension penalty             -E       35.12    182.84     11,426.13   0.08  [0,1,10,100,1000]                 []
clipping penalty                  -L       12.18     62.41      6,357.34   0.04  [0,1,5,10,100,1000]               []

Page 17: Param selection phase1summary_v2

BWA-mem – k, c

Page 18: Param selection phase1summary_v2

BWA-mem – B, A

Page 19: Param selection phase1summary_v2

BWA-mem – U, O

Page 20: Param selection phase1summary_v2

BWA-mem Tradeoffs

Page 21: Param selection phase1summary_v2

BWA-sw

● Doesn't work on paired ends

– Treat each end as an individual read

– The reported number of mapped reads is unreliable because both ends share identical read IDs

● Works on reads 70bp-1Mbp

● Similar features to BWA-mem

● Similar parameters to BWA-mem

● Released 2010

Page 22: Param selection phase1summary_v2

BWA-sw – Parameter Sensitivities

Name                       Flag        CPU      Memory       Reads   MAPQ  Parameter Values     Invalid Values
min score threshold        -T       410.90    2,121.72  727,340.56  27.76  [0,1,10,37,100]      1,000.00
z-best heuristics          -z    27,174.89  144,452.72   10,993.91  21.56  [1,10,100]           0.00
threshold adjustment coef  -c       114.96      593.24   27,795.00  22.45  [0,1,5.5,10]         [100,1000]
mismatch penalty           -b       122.87      634.87    7,638.45  16.90  [0,1,3,10,100,1000]  []
gap open penalty           -q       311.60    1,609.43   18,425.36  16.73  [0,1,5,10,100,1000]  []
max SA interval for seed   -s    11,048.94   57,081.69       24.42   1.13  [1,3,10,100,1000]    []
min number seeds           -N       265.15    1,368.47      345.99   4.47  [0,1,5,10,100,1000]  []
gap extension penalty      -r       271.25    1,400.37    2,992.56   0.62  [1,2,10,100,1000]    []
band width                 -w       111.93      577.31    2,800.64   0.01  [1,10,33,100,1000]   []
match score                -a         0.00        0.00        0.00   0.00  1.00                 [0,10,100,1000]

Page 23: Param selection phase1summary_v2

BWA-sw – T, z

Page 24: Param selection phase1summary_v2

BWA-sw – c, b

Page 25: Param selection phase1summary_v2

BWA-sw – q, N

Page 26: Param selection phase1summary_v2

BWA-sw Tradeoffs

Page 27: Param selection phase1summary_v2

SOAP2

● Also based on BWT index structures

● Order of magnitude improvement over previous version

● Similar parameters to BWA

● Original release in 2008, latest release in 2011

Page 28: Param selection phase1summary_v2

SOAP2 – Parameter Sensitivities

Name                         Flag     CPU    Memory      Reads     MAPQ  Parameter Values                    Invalid Values
min insert size              -m    268.29  1,505.68       0.00     4.49  [0,1,10,100,400,1000,10000,100000]  []
continuous gap size allowed  -g     50.62    284.38  36,521.81     0.07  [0,1,10,100,1000]                   []
min alignment length         -s     60.41    338.90  33,200.06     0.07  [10,100,255,1000]                   []
max insert size              -x    186.14  1,044.54       0.00     2.70  [0,1,10,100,600,1000,10000,100000]  []
disallow gap within e-bp     -e     24.60    137.84       0.00     0.00  [0,1,5,10,100,1000]                 []
max mismatch per read        -v     20.21    113.54      31.70     0.00  [0,1,5,10,100,1000]                 []
seed length                  -l     11.50     64.80     412.81     0.00  [100,256,1000]                      []
number Ns to allow           -n      7.33     41.21     115.89  0.00004  [0,1,5,10,100,1000]                 []

Page 29: Param selection phase1summary_v2

SOAP2 – m, g

Page 30: Param selection phase1summary_v2

SOAP2 – s, x

Page 31: Param selection phase1summary_v2

SOAP2 - Tradeoffs

Page 32: Param selection phase1summary_v2

Bowtie 2

● Works on reads from 50-1000bp

● Compresses BWT index to limit memory footprint

● Similar parameters to BWA and SOAP2, with a few additions

● Released 2012, latest release in 2013

Page 33: Param selection phase1summary_v2

Bowtie2 – Parameter Sensitivities

Name                                        Flag               CPU    Memory       Reads   MAPQ  Parameter Values              Invalid Values
length of seed substring                    -L           27,039.75  2,857.25   34,224.99   2.68  [3,6,9,13,16,19,22,26,29,32]  []
end of interval between seed substrings     -i2             950.37  3,500.88   15,031.23   1.21  [0,1,1.25,2,4,8,16]           []
min acceptable alignment score coefficient  -score-min2     501.10    718.53  801,965.66  13.57  [-0.9,-0.6,-0.3,0,1]          []
reference gap open penalty                  -rfg1           181.85  2,152.54    2,889.80   0.11  [0,1,3,5,6,10,32,100]         []
reference gap extend penalty                -rfg2           462.52  1,624.29    5,373.45   0.15  [1,3,5,6,10,32,100]           []
max mismatch penalty                        -mp1             91.02  1,071.79   59,024.02   5.69  [2,3,5,6,10,32,100]           []
stop gap extension after <D> failures       -D               76.67  1,441.92   12,201.07   0.93  [5,9,13,15,17,21,25]          []
read gap open penalty                       -rdg1            31.19  1,395.54    6,491.17   0.04  [0,1,3,5,6,10,32,100]         []
min mismatch penalty                        -mp2             24.00  1,292.41    1,068.41   0.09  [2,3,4,5]                     []
try <R> sets of seeds for repetitive seeds  -R               47.47  1,185.47      290.98   0.00  [1,2,3]                       []
penalty for Ns                              -np              27.86  1,161.23    3,349.96   0.01  [0,1,2,3,5,10,32,100]         []
read gap extension penalty                  -rdg2            30.43  1,072.39    6,812.47   0.04  [1,3,5,6,10,32,100]           []
max mismatches in seed                      -N                0.00      0.00        0.00   0.00  0.00                          1.00

Page 34: Param selection phase1summary_v2

Bowtie2 – L, i2

Page 35: Param selection phase1summary_v2

Bowtie2 – score-min2, rfg1

Page 36: Param selection phase1summary_v2

Bowtie2 – rfg2, mp1

Page 37: Param selection phase1summary_v2

Bowtie2 - Tradeoffs

Page 38: Param selection phase1summary_v2

Novoalign

● Smallest number of parameters

● Requires paid license for commercial use

● Does global alignment with full Needleman-Wunsch algorithm

● Some nice 'bonus' features:

– multithreaded support

– Base quality calibration

– Adapter stripping

● Originally released 2008, newest version 3 released last month

Page 39: Param selection phase1summary_v2

Novoalign – Parameter Sensitivities

Name                                                 Flag         CPU        Memory         Reads  MAPQ  Parameter Values                                    Invalid Values
gap open penalty                                     g     201,109.30  3,731,245.16      4,535.13  3.09  [0,10,20,30,40,50,60,70,80,90,99]                   []
threshold for highest alignment score                t         948.24      7,490.47  1,095,323.89  1.11  [-1,0,10,20,40,50,60,70,80,90,100]                  []
minimum good qual bases for read                     l         565.53      4,489.02    244,324.72  0.07  [15,20,25,35,45,55,65,75,85,95,100]                 []
structural variation penalty for chimeric fragments  v         423.15      5,630.93      4,542.41  0.18  [0,10,20,30,40,50,60,70,80,90,100,110,120,130,140]  []
gap extend penalty                                   x         251.55      1,985.14      6,112.46  0.08  [6,10,20,30,40,50,60,70,80,90,99]                   []
threshold for homopolymer filter                     h         269.90      2,297.69        136.19  0.00  [0,10,20,30,40]                                     []

Page 40: Param selection phase1summary_v2

Novoalign – g, t

Page 41: Param selection phase1summary_v2

Novoalign – l, v

Page 42: Param selection phase1summary_v2

Novoalign – x, h

Page 43: Param selection phase1summary_v2

Novoalign Tradeoffs

Page 44: Param selection phase1summary_v2

SeqAlto

● More parameters than other aligners

● Uses standard hashing index structures with larger seeds and adaptive stopping

● Designed for reads about 100bp or more

● Claimed to be 2-4x faster than BWA, but our results do not agree

● Initially released in 2012

Page 45: Param selection phase1summary_v2

SeqAlto – Parameter Table

Name                                                               Flag                CPU     Memory      Reads   MAPQ  Parameter Values              Invalid Values
k-mer maximum occurrence threshold (Needleman-Wunsch)              max_occ_nw        78.04     545.72  65,393.48   0.73  [2,10,100,1000,100000]        []
minimum gap open rate                                              o              5,074.01  35,925.63     281.74   0.01  [0.005,0.05,0.5,0.99,1]       []
maximum template size                                              i              2,707.76  18,995.95     676.10  10.24  [250,550,5500,55000]          []
k-mer maximum occurrence threshold                                 max_occ          174.30   1,222.60  58,578.73   0.66  [2,10,100,1000,10000,100000]  []
Phred score pairing prior                                          d                103.76     727.66     928.91   1.96  [0,8,80,100,800,8000]         []
maximum gap extension length                                       e                892.84   6,259.12   4,723.80   0.05  [0,5,25,50,75,100,1000]       []
Needleman-Wunsch mismatch penalty                                  nw_sub           308.02   2,156.92   1,242.50   1.60  [0,10,15,100,1000]            []
Needleman-Wunsch match score                                       nw_mat            30.85     215.38   4,575.00   1.23  [0,2,5,10,100,1000]           []
Needleman-Wunsch gap extension penalty                             nw_ext            16.84     116.35   5,343.32   0.12  [2,10,100,1000]               []
Smith-Waterman match score                                         sw_mat            32.53     225.79   2,322.35   0.05  [0,2,5,10,100,1000]           []
Needleman-Wunsch gap open penalty                                  nw_gap            33.41     232.77   1,922.74   0.01  [0,10,40,100,1000]            []
additional k-mer look-ahead for high mismatch (Needleman-Wunsch)  kmer_pen_nw       38.24     267.76   1,899.81   0.09  [0,1,10]                      []
additional k-mer look-ahead for high mismatch                      kmer_pen          34.28     240.52   1,647.21   0.08  [0,1,10,100]                  []
k-mer look ahead                                                   look_ahead        85.48     600.54      42.25   0.04  [0,2,10,100]                  []
minimum unclipped read percentage                                  c                  5.46      38.80     320.08   0.00  [0,5,25,50,75,100]            []
Smith-Waterman gap open penalty                                    sw_gap             6.87      48.40     313.79   0.01  [0,10,40,100,1000]            []
Smith-Waterman mismatch penalty                                    sw_sub            14.78     101.98      98.20   0.02  [0,10,15,100,1000]            []
Smith-Waterman gap extension penalty                               sw_ext             9.57      65.70       4.49   0.00  [0,2,10,100,1000]             []
k-mer look ahead (Needleman-Wunsch)                                look_ahead_nw      3.74      26.81       0.00   0.00  [0,2,10,100]                  []
average template size                                              m                  3.35      24.88       0.00   0.00  [0,100,200,300,550]           []

Page 46: Param selection phase1summary_v2

SeqAlto – max_occ_nw, o

Page 47: Param selection phase1summary_v2

SeqAlto – i, max_occ

Page 48: Param selection phase1summary_v2

SeqAlto – d, e

Page 49: Param selection phase1summary_v2

SeqAlto - Tradeoffs

Page 50: Param selection phase1summary_v2

RazerS

● Some irregularities:

– Majority of experiments report only 1.7 million reads mapped, but some experiments report over 13 million reads mapped

– MAPQ reported as 255 for all experiments

– Mean read length for most experiments is ~4.5

● Uses q-gram counting for approximate search

● Latest version 3 supports parallelization

● Initially released 2012

Page 51: Param selection phase1summary_v2

RazerS – Parameter Sensitivities

Name                                                  Flag        CPU      Memory         Reads  MAPQ  Parameter Values          Invalid Values
tolerated deviation from library size                 le     3,808.44   28,396.98  6,406,142.49  0.00  [0,25,50,100,1000,10000]  []
threshold of common kmers between read and reference  t     40,775.85  297,513.56    886,521.11  0.00  [-1,1,10,100]             []
percent identity threshold                            i     15,076.96   79,300.67    695,578.11  0.00  [92,100]                  [50,60]
mean library length                                   ll     1,492.70   16,932.42  1,578,820.93  0.00  [100,120,220,320,2200]    []
no gaps flag                                          ng     2,623.38   21,989.04    669,685.28  0.00  [0,1]                     []
repeat length                                         rl     2,826.93   18,235.32          6.00  0.00  [10,100,1000,10000]       []
distance range for best match errors                  dr     1,371.85    7,688.67    230,520.49  0.00  [-1,0,1,10,100]           []
read kmers overabundance cutoff                       oc     1,205.15    5,543.36          0.00  0.00  [0,1]                     []
overlap length                                        ol     1,107.97    5,205.90          0.00  0.00  [-1,0,1,10,100]           []
mutation rate                                         mr       734.48    3,577.62          0.00  0.00  [0,1,5,10]                []
percent recognition rate                              rr       354.60    1,607.40          0.00  0.00  [82,85,90,99]             []
taboo length                                          tl         0.00        0.00          0.00  0.00  1.00                      [10,100]

Page 52: Param selection phase1summary_v2

RazerS – le, t

Page 53: Param selection phase1summary_v2

RazerS – i, ll

Page 54: Param selection phase1summary_v2

RazerS – ng, dr

Page 55: Param selection phase1summary_v2

RazerS - Tradeoffs

Page 56: Param selection phase1summary_v2

CPU Time Histograms

Panels: BWA-mem, BWA-sw, SOAP2, Bowtie2, Novoalign, SeqAlto

Page 57: Param selection phase1summary_v2

Mean MAPQ Histograms

Panels: BWA-mem, BWA-sw, SOAP2, Bowtie2, Novoalign, SeqAlto

Page 58: Param selection phase1summary_v2

Reads Mapped Histograms

Panels: BWA-mem, BWA-sw, SOAP2, Bowtie2, Novoalign, SeqAlto

Page 59: Param selection phase1summary_v2

Some Conclusions

● BWA-mem and SOAP2 are fastest, and execute in minutes

– But, BWA accuracy is not great

– SOAP2 accuracy is even worse

● Novoalign is the most accurate but requires more time and memory, and aligns fewer reads

● Bowtie2 is the most memory efficient

● Novoalign and SeqAlto appear to be the most stable aligners

● SeqAlto is decent all around, not the best, not the worst

● RazerS has some basic issues

● In many cases, the best performance characteristics can be achieved without sacrificing performance in other areas

Page 60: Param selection phase1summary_v2

Next Steps

● Generate parameter combination jobs

● Submit jobs to Beocat for execution

– Beocat has been under a lot of maintenance lately, is that more or less finished?

● Consolidate results for next round of analysis

● Interpret results and start working on manuscript

Page 61: Param selection phase1summary_v2

Acknowledgements

● Faculty

– Brooke Fridley

– Jeremy Chen

– Sue Brown

● Students, Staff, and Post-docs

– Byunggil Yoo

– Jennifer Shelton

– Rama Raghavan

– Greg Matuszek

Page 62: Param selection phase1summary_v2

BWA-mem Supplementary

Page 67: Param selection phase1summary_v2

BWA-sw Supplementary

Page 72: Param selection phase1summary_v2

Bowtie2 Supplementary

Page 80: Param selection phase1summary_v2

SeqAlto Supplementary

Page 95: Param selection phase1summary_v2

RazerS Supplementary
