
Department of Biomedical Informatics

Parallel Short Sequence Mapping for High Throughput Genome Sequencing 

Doruk Bozdag, Umit Catalyurek Dept. of Biomedical Informatics 

Dept. of Electrical & Computer Engineering The Ohio State University 

Catalin Barbacioru Applied Biosystems 


Outline 

•  Short Sequence Mapping Problem

•  Related Work

•  Covering Designs

•  New Parallelization Methods

•  Experimental Results

•  Conclusion and Future Work

Dagstuhl - Combinatorial Sci. Comp. - Feb 5, 2009 Umit V. Catalyurek - "Parallel Short Sequence Mapping"


Short Sequence Mapping 

•  Next generation sequencing instruments (SOLiD, Solexa, 454) can sequence up to 1 billion bases a day
   -  35-50 base reads

•  Reads should be efficiently mapped to a reference genome
   -  Human genome: 3 billion bases

•  Sequentially mapping a single run (~130M reads, 3G base genome) takes about a day

•  Fast, resource efficient, parallel algorithms that can handle mismatches are required


The Problem 

•  Identify matching locations of short sequences (reads) on reference genome
   -  Allow mismatches


Related Work  

•  Local sequence alignment
   -  BLAST, PatternHunter
   -  Start with a short sequence (anchor) mapping operation (14-18 bases long), then extend matches
   -  No mismatch tolerance at anchor positions
      •  PatternHunter improves sensitivity using spaced seeds
   -  Target longer alignments and more general alignment problems

•  Short sequence mapping
   -  xMAN, Eland, PASS, SOAP, BLAT, SSAHA, MAQ, MosaicAligner, ZOOM, SHRIMP, MapReads
   -  Often a hash or an index table is utilized
   -  Mismatches are either not allowed or handled by enumeration


  Sequence Mapping Example 

•  Two step approach
   -  Index/hash table construction using sliding window
   -  Table lookup to find matches for each read
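The two steps can be sketched as follows. This is a minimal illustration for exact matches only: a Python dictionary keyed by the window string stands in for the hash/index table, and the names `build_index` and `map_reads` are hypothetical, not taken from MapReads or any other tool named in the talk.

```python
from collections import defaultdict

def build_index(genome: str, read_len: int) -> dict:
    """Step 1: slide a window over the genome and record where each window starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - read_len + 1):
        index[genome[pos:pos + read_len]].append(pos)
    return index

def map_reads(index: dict, reads: list) -> dict:
    """Step 2: one table lookup per read; unmatched reads get an empty list."""
    return {read: index.get(read, []) for read in reads}

genome = "ACGTACGTTACG"
index = build_index(genome, read_len=4)
hits = map_reads(index, ["ACGT", "TACG", "GGGG"])
```

A production implementation would typically update a numeric hash incrementally as the window slides rather than re-slicing the string at every position.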


Covering Designs (1) 

•  Consider a sequence of length 5

•  Checking for every possible 2-mismatch scenario requires $\binom{5}{2} = 10$ comparisons

•  Find a set of 3-mismatch patterns to cover all possible 2-mismatch cases
   -  Minimize the number of 3-mismatch patterns
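For this small case the minimum can be found by brute force. The sketch below treats each 3-mismatch pattern as a set of three "don't care" positions and searches for the smallest collection whose sets contain every pair of positions. The function names are illustrative, and exhaustive search is only feasible for tiny parameters; for realistic read lengths one would use published designs instead.

```python
from itertools import combinations

def covers(blocks, n, t):
    """True if every t-subset of positions 0..n-1 lies inside some block."""
    return all(any(set(sub) <= set(b) for b in blocks)
               for sub in combinations(range(n), t))

def smallest_cover(n, k, t):
    """Brute-force the smallest set of k-position blocks covering all t-subsets."""
    candidates = list(combinations(range(n), k))
    for size in range(1, len(candidates) + 1):
        for blocks in combinations(candidates, size):
            if covers(blocks, n, t):
                return list(blocks)

cover = smallest_cover(n=5, k=3, t=2)
print(len(cover), cover)
```

For length 5 the search returns 4 patterns, so 4 lookups replace the 10 comparisons of the enumeration approach.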


Covering Designs (2) 

Pros: 

•  Accounts for mismatches with full sensitivity

•  Reduces the size of the hash table and the number of table lookups while matching a read

•  Best known covering designs are publicly available (La Jolla Covering Repository)

Cons:

•  Requires post processing to remove hits having mismatches greater than the targeted number
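The trade-off can be seen in a small sketch. It assumes a hand-picked cover for reads of length 5 (the `PATTERNS` list is hypothetical, chosen so that its don't-care sets contain every pair of positions) and rebuilds the tables on every call for brevity; a real mapper would build them once. Candidates found through a 3-mismatch pattern may have three mismatches, so a Hamming-distance filter is applied afterwards.

```python
from collections import defaultdict

# Hypothetical cover: each tuple lists the "don't care" positions of one pattern.
PATTERNS = [(0, 1, 2), (0, 3, 4), (1, 2, 3), (1, 2, 4)]

def masked_key(seq, dont_care):
    """Keep only the care positions of seq, in order."""
    return "".join(ch for i, ch in enumerate(seq) if i not in dont_care)

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def map_with_mismatches(genome, read, max_mm=2):
    n = len(read)
    candidates = set()
    for dont_care in PATTERNS:
        # One table per pattern, keyed by the care positions of each window.
        table = defaultdict(list)
        for pos in range(len(genome) - n + 1):
            table[masked_key(genome[pos:pos + n], dont_care)].append(pos)
        candidates.update(table.get(masked_key(read, dont_care), []))
    # Post-processing: drop candidates with more than max_mm mismatches.
    return sorted(p for p in candidates
                  if hamming(genome[p:p + n], read) <= max_mm)

positions = map_with_mismatches("AAAAACCCCC", "AAATT")
```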


Parallelization Methods

•  We propose 6 parallelization methods for hash/index based short sequence mapping
   -  Performance depends on data size and number of nodes
      •  G: Size of reference genome
      •  R: Number of reads
      •  N: Number of computation nodes

•  Modeling unit computation cost
   -  $c_g$: Time to compute an index/hash for a single sequence
      •  In sequential algorithm, index table is constructed in $c_g G$
   -  $c_r$: Time to process a single read if no collision
   -  $c_c$: Average time to resolve a collision
      •  In sequential algorithm, all reads are processed in $(c_r + c_c G) R$
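Adding the two phases gives the total sequential running time implied by this model. The symbol $T_{\mathrm{seq}}$ is introduced here for convenience and does not appear in the slides.

```latex
T_{\mathrm{seq}} = c_g\,G + (c_r + c_c\,G)\,R
```

The $c_c\,G\,R$ term grows with the product of genome size and read count, which is why partitioning either input can pay off.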


Partitioning Reads Only (PRO)

•  Partition reads into N equal parts.

•  Useful when R is large and G is small. 

•  Memory requirement does not scale 


Partitioning Genome Only (PGO)

•  Partition genome into N equal parts

•  Useful when G is large and R is small. 

•  Memory requirement scales perfectly 
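One way to split the genome is sketched below. It is an assumption of this sketch, not something stated in the slides, that each part is extended by `read_len - 1` bases so that windows starting near a boundary remain intact. The function name `partition_genome` is hypothetical.

```python
def partition_genome(genome: str, n_nodes: int, read_len: int):
    """Split the genome's windows into n_nodes contiguous, near-equal parts.

    Returns (start_offset, chunk) pairs. Each chunk carries read_len - 1
    extra bases so that every window it owns is complete.
    """
    n_windows = len(genome) - read_len + 1
    base, extra = divmod(n_windows, n_nodes)
    chunks, start = [], 0
    for node in range(n_nodes):
        count = base + (1 if node < extra else 0)  # windows owned by this node
        chunks.append((start, genome[start:start + count + read_len - 1]))
        start += count
    return chunks

parts = partition_genome("ACGTACGTTACG", n_nodes=3, read_len=4)
```

Each node then builds an index over its own chunk only, and every read is looked up on every node.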


Partition Reads and Genome (PRG)

•  A generalization of PRO and PGO

•  Nodes are arranged in an $N = N_1 \times N_2$ mesh

•  Useful unless G>>R or G<<R

•  Memory scales worse than PGO, but better than PRO
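The cost formulas shown on the original slides are not preserved in this transcript. One plausible per-node estimate is given below. It assumes perfect balance, no communication cost, a collision term that scales with the portion of the genome indexed on a node, and that $N_1$ splits the genome while $N_2$ splits the reads (as the SPG and SPR slides suggest). The symbols $T_{\mathrm{PRO}}$, $T_{\mathrm{PGO}}$ and $T_{\mathrm{PRG}}$ are introduced here.

```latex
\begin{aligned}
T_{\mathrm{PRO}} &= c_g\,G + (c_r + c_c\,G)\,\frac{R}{N} \\
T_{\mathrm{PGO}} &= c_g\,\frac{G}{N} + \left(c_r + c_c\,\frac{G}{N}\right) R \\
T_{\mathrm{PRG}} &= c_g\,\frac{G}{N_1} + \left(c_r + c_c\,\frac{G}{N_1}\right)\frac{R}{N_2}
\end{aligned}
```

Under these assumptions, setting $N_1 = 1, N_2 = N$ in the PRG expression recovers PRO, and $N_1 = N, N_2 = 1$ recovers PGO, which matches the description of PRG as a generalization of both.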


Suffix Based Assignment (SBA) 

•  Need to avoid processing entire genome (PRO) or all reads (PGO)

•  Assign a set of suffixes of length s to each node
   -  $4^s$ suffixes for a given s

•  Each node only processes reads and genome sequences that end with assigned suffixes
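A minimal sketch of the idea follows. It assigns suffixes round-robin, which is a simplification chosen here for brevity, and it uses the literal last `s` symbols of each sequence. Both function names are hypothetical.

```python
from itertools import product

def assign_suffixes(s: int, n_nodes: int, alphabet: str = "ACGT") -> dict:
    """Map each of the len(alphabet)**s suffixes to a node, round-robin."""
    owner = {}
    for i, p in enumerate(product(alphabet, repeat=s)):
        owner["".join(p)] = i % n_nodes
    return owner

def sequences_for_node(seqs, owner, node, s):
    """Keep only the sequences whose last s symbols are owned by `node`."""
    return [q for q in seqs if owner[q[-s:]] == node]

owner = assign_suffixes(s=2, n_nodes=4)
reads = ["ACGTA", "TTTAC", "GGGCA"]
node0 = sequences_for_node(reads, owner, node=0, s=2)
node1 = sequences_for_node(reads, owner, node=1, s=2)
```

The same filter is applied to genome windows, so a node only indexes and matches sequences that share one of its suffixes.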


SBA Example 

•  Consider suffixes at the last s care positions 


 SBA Load Balancing 

•  The number of reads (genome sequences) ending in sfxA is not necessarily the same as that ending in sfxB 

•  Count the occurrences of each color/base in a sample of reference genome 

•  Estimate the load for each suffix based on occurrence frequency of colors/bases in the suffix
   -  Independent of pattern being considered

•  Balance the load using bin packing

•  Balance can be improved using larger s, but this increases the number of suffixes and hence the comparison cost
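The estimate-then-pack procedure might look like the sketch below. Two choices here are assumptions rather than details from the slides: the load of a suffix is taken as the product of its symbols' frequencies (treating positions as independent), and the packing uses a greedy heaviest-first heuristic. The names `suffix_loads` and `pack` are hypothetical.

```python
from collections import Counter
from itertools import product
import heapq

def suffix_loads(sample: str, s: int, alphabet: str = "ACGT") -> dict:
    """Estimate each suffix's load as the product of its symbols' frequencies."""
    counts = Counter(sample)
    total = sum(counts[a] for a in alphabet)
    freq = {a: counts[a] / total for a in alphabet}
    loads = {}
    for p in product(alphabet, repeat=s):
        load = 1.0
        for a in p:
            load *= freq[a]
        loads["".join(p)] = load
    return loads

def pack(loads: dict, n_nodes: int) -> list:
    """Greedy bin packing: place the heaviest suffix on the lightest node."""
    heap = [(0.0, node, []) for node in range(n_nodes)]
    heapq.heapify(heap)
    for suffix, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        total, node, assigned = heapq.heappop(heap)
        assigned.append(suffix)
        heapq.heappush(heap, (total + load, node, assigned))
    return sorted(heap, key=lambda b: b[1])  # (load, node, suffixes) per node

bins = pack(suffix_loads("AAAACCGT", s=1), n_nodes=2)
```

With `s=1` there are only four items to pack; a larger `s` gives the packer finer-grained items at the price of more suffix comparisons per sequence.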


SBA cost model 

•  $c_{gs}$: Time to compare a genome sequence against assigned suffixes

•  $c_{rs}$: Time to compare a read against assigned suffixes

•  Assigned genome sequences are not consecutive
   -  Sliding window is no longer efficient
   -  $c_g'$: Time to compute index/hash from scratch


SBA Load Distribution

•  Useful for medium values of N
   -  Under perfect balance G and R are partitioned equally
   -  Limited scalability due to $c_{gs}$ and $c_{rs}$ terms

•  Memory requirement scales well 
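The slide's formula is not preserved in this transcript. One reading consistent with the bullets, assuming perfect balance and no communication cost, is that every node still scans all genome sequences and all reads to test their suffixes, then hashes and matches only its own share. The symbol $T_{\mathrm{SBA}}$ is introduced here.

```latex
T_{\mathrm{SBA}} \approx c_{gs}\,G + c_g'\,\frac{G}{N}
  + c_{rs}\,R + \left(c_r + c_c\,\frac{G}{N}\right)\frac{R}{N}
```

In this form the $c_{gs}\,G$ and $c_{rs}\,R$ terms do not shrink as $N$ grows, which would explain the limited scalability noted above.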


SBA after Partitioning Reads (SPR)

•  Form $N_2$ node groups and assign $R/N_2$ reads to each.

•  Apply SBA on each group 

•  Takes advantage of SBA when R is large 


SBA after Partitioning Genome (SPG)

•  Form $N_1$ node groups and assign $G/N_1$ sequences to each.

•  Apply SBA on each group 

•  Takes advantage of SBA when G is large 


Experimental Setup 

•  Our implementation is based on MapReads, a part of SOLiD System Color Space Mapping Tool

•  Implemented in C using MPI

•  Used default covers, allowing up to 2 mismatches

•  Experiments on 64-node dual 2.4GHz Opteron cluster with 8GB memory

•  Nodes are interconnected via Infiniband
   -  MVAPICH v0.9.8

•  Reads from a single run of SOLiD system

•  Human Genome Build 36.1 (http://genome.uscs.edu)

•  $N_1 = N_2 = \sqrt{N}$


Varying Number of Reads 

•  G: 800M, R: (16M, 32M, 64M, 130M), N:16 

•  Partitioning reads helps reduce matching time


Varying Genome Size 

•  G: (50M, 200M, 800M, 3080M), R: 130M, N:16 

•  Partitioning genome helps reduce hashing time

•  Memory problems with PRO and SPR when G=3080M (SBA failed unexpectedly) 


Varying Number of Nodes 

•  G: 800M, R: 130M, N: (4, 16, 64) 

•  Up to 22x speedup: From a day to an hour! 

•  PRO failed unexpectedly 
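As a quick check of the headline figure, taking "a day" as 24 hours:

```latex
\frac{24\ \text{h}}{22} \approx 1.1\ \text{h}
```

The slide does not say at which node count the 22x figure was measured. If it was at $N = 64$, the parallel efficiency would be $22/64 \approx 0.34$.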


Runtime Estimates for Various Configurations

[Figure: runtime estimates; panels for 4 processors and 16 processors]


Runtime Estimates for Various Configurations

[Figure: runtime estimates; panels for 36 processors and 64 processors]


Predicting the Best Method

•  Imbalance problem with SBA related methods 
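The slide's figure is not preserved in this transcript. As an illustration of what a cost-based prediction could look like, the sketch below picks the cheapest mesh shape for the PRG family, which includes PRO and PGO as its extremes. The cost formula is an assumed per-node estimate (perfect balance, no communication), the constants are made up, and the SBA-based methods and their imbalance are deliberately left out.

```python
def prg_cost(G, R, n1, n2, c_g, c_r, c_c):
    """Assumed per-node cost: genome split n1 ways, reads split n2 ways."""
    return c_g * G / n1 + (c_r + c_c * G / n1) * R / n2

def best_mesh(G, R, N, c_g, c_r, c_c):
    """Try every factorization N = n1 * n2; return the cheapest (cost, n1, n2)."""
    options = [(prg_cost(G, R, n1, N // n1, c_g, c_r, c_c), n1, N // n1)
               for n1 in range(1, N + 1) if N % n1 == 0]
    return min(options)

# Made-up unit costs, chosen only to exercise the function.
balanced = best_mesh(G=100, R=100, N=4, c_g=1, c_r=1, c_c=0)
read_heavy = best_mesh(G=10, R=1000, N=4, c_g=1, c_r=1, c_c=0)
```

With equal genome and read sizes the square mesh wins, while the read-heavy case selects `n1 = 1`, which corresponds to PRO.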


Conclusions 

•  Proposed 6 different parallelization methods for short sequence mapping

•  Extensively analyzed performance of each method with respect to genome size, number of reads and number of nodes
   -  Described theoretical cost models
   -  Evaluated performance experimentally

•  Proposed a prediction function to select the best method for a given scenario

•  Achieved up to 22x speedup, reducing the mapping time from a day to an hour.


Future Work 

•  Fixing and cleanup
   -  Investigate causes of imbalance for SBA related methods
      •  Identify reasons specific to MapReads
   -  Investigate additional factors in the cost functions
   -  Improve prediction method to allow tolerance to potential imbalances
   -  Implement the most general model that encompasses all parallelization methods
      •  SBA after partitioning both reads and the genome

•  Compute best values for $N_1$ and $N_2$ for given G, R and N
   -  including $N_1 \times N_2 < N$
   -  also a scheduling problem for multi-user server deployment

•  Can we improve load balance?

•  How to choose the tolerance for Cover Design?

•  Can we improve the patterns of optimum Cover Designs?

•  Next Problem: Genome Assembly


Thanks 

•  More Info:
   -  http://bmi.osu.edu/~umit
   -  [email protected]
