Accelerating HMMER Search on GPUs using Hybrid Task and Data Parallelism
Narayan Ganesan¹, Roger Chamberlain², Jeremy Buhler² and Michela Taufer¹
¹ Computer and Information Sciences Department, University of Delaware
² Computer Science and Engineering, Washington University in St. Louis
Motivation
• Dataset sizes are growing rapidly for many applications
• Problem sizes grow as the product of the input data sizes
  E.g., genome sequence alignment, protein motif finding
• The number of problems also grows as the product of the number of available datasets
• Many algorithms with sequential dependencies are still executed serially
• Processor clock speed has hit a brick wall (~3.5–4.0 GHz, around 2003), and serial evaluation is simply not feasible for large-data applications
  The need for parallelism is greater than ever
• Parallel hardware is ubiquitous: GPUs, multi-core, SIMD, hybrid, MIMD, FPGAs
• Effort must be spent parallelizing algorithms and applications for this hardware
Application: Protein Motif Finding
• Proteins are synthesized by biological processes that can be described by HMMs
• Each class is described by short characteristic sequences, or motifs
• Each class, or “generator,” is described by a Hidden Markov Model
• Protein motif finding answers the questions:
  Given a sequence, what class does it belong to?
  Given a sequence and an HMM, what is the probability that the sequence belongs to that class?

[Figure: generators for Class A through Class N, each synthesizing its own characteristic sequences (Class A: …PAQVEMYKFLLRISQLNRD…, …CHTEARGLEGVCIDPKK…, …DGEACNSPYLDWRKDTEQ…; Class N: …CTPPSLAACTPPTS…, …LTITNLMKSLGFKPKPKKI…, …DELAAMVRDYLKKTPEF…)]
Protein Motif Finding
• Protein motif finding is very similar to signal-source identification
• A protein sequence may contain multiple motifs:
  . . A C G F T D F W A P S L T H L T I K N L . .
• Motifs in a sequence are sometimes modified by insertion and deletion of random amino acids
• Occurrences of motifs can be modeled by profile Hidden Markov Models
• The Viterbi algorithm is used to “decode” a given protein sequence against a model
• The result is the probability that the sequence belongs to the class
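The decoding step above can be sketched with a generic log-space Viterbi scorer. The two-state toy model below (a "motif" state favoring W versus a uniform background state) and all of its probabilities are made-up illustrations, not HMMER's actual Plan7 model:

```python
import math

def viterbi_log_score(seq, states, log_init, log_trans, log_emit):
    """Best log-probability of any state path generating seq (Viterbi)."""
    NEG = float("-inf")
    # v[s]: best log-prob of a path ending in state s after the current symbol
    v = {s: log_init[s] + log_emit[s].get(seq[0], NEG) for s in states}
    for x in seq[1:]:
        v = {s: max(v[p] + log_trans[p][s] for p in states)
                + log_emit[s].get(x, NEG)
             for s in states}
    return max(v.values())

# Toy two-state model (all numbers invented for illustration):
# "M" strongly prefers the motif residue W, "B" is a uniform background.
log = math.log
states = ["B", "M"]
log_init = {"B": log(0.5), "M": log(0.5)}
log_trans = {"B": {"B": log(0.9), "M": log(0.1)},
             "M": {"B": log(0.1), "M": log(0.9)}}
log_emit = {"B": {a: log(0.25) for a in "ACDW"},
            "M": {"W": log(0.7), "A": log(0.1), "C": log(0.1), "D": log(0.1)}}

motif_score = viterbi_log_score("WWWW", states, log_init, log_trans, log_emit)
background_score = viterbi_log_score("ACDA", states, log_init, log_trans, log_emit)
```

A sequence rich in the motif-preferred residue decodes to a higher log-probability; this score is the quantity that is thresholded to decide class membership.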
HMMER Search - Protein Motif Finding
[Figure: Plan7 profile HMM with flanking states S, N, C, T, begin/end states B and E, loop state J, match states M1–M5, insert states I1–I4, and delete states D2–D4.]

Sample paths through the model: the original motif
  . . A C G F T D F W A P S L T H L T I K N L . .
a copy with an inserted residue
  . . A C G F T D F W A P S A L T H L T I K N L . .
and copies with both insertions and a deletion
  . . A C G F T D F W A P S A G L - H L T I K N L . .
  . . A C G F T D F W A P S A G L T H L T I K N L . .
HMMER Search – Protein Motif Finding
[Figure: dynamic programming matrix with model positions j = 1…m across the top and sequence positions i = 1…L down the side; the special state X_E collects exits at the end of each row.]

Match:  V_M(i, j) = e_M(x_i, j) + max( V_M(i-1, j-1) + tr(M_{j-1}, M_j),
                                       V_I(i-1, j-1) + tr(I_{j-1}, M_j),
                                       V_D(i-1, j-1) + tr(D_{j-1}, M_j) )
Insert: V_I(i, j) = e_I(x_i, j) + max( V_M(i-1, j) + tr(M_j, I_j),
                                       V_I(i-1, j) + tr(I_j, I_j) )
Delete: V_D(i, j) = max( V_M(i, j-1) + tr(M_{j-1}, D_j),
                         V_D(i, j-1) + tr(D_{j-1}, D_j) )
• The dependence on X_E imposes a row-major order of computation
• Delete-state costs impose a serial dependency on the previous element in the row:
  V_D(i, j) = max( V_D(i, j-1) + c_j,  d_j )
  where c_j is the delete-to-delete transition cost and d_j collects the contribution entering D_j from the match state
• This makes the computation harder to parallelize by conventional means
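Written out as code, the serial dependency is explicit: each V_D(i, j) needs its left neighbour before it can be computed. In this minimal sketch, c[j] stands for the delete-to-delete transition cost and d[j] for the contribution entering the delete state from the match path (the vector names are our own):

```python
def delete_row_serial(c, d):
    """Serial evaluation of V_D(i, j) = max(V_D(i, j-1) + c[j], d[j])
    along one row: each element depends on its left neighbour."""
    v = [d[0]]                       # no delete state feeds the first column
    for j in range(1, len(d)):
        v.append(max(v[-1] + c[j], d[j]))
    return v
```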
HMMER Search – Protein Motif Finding
[Figure: row i of the DP matrix, columns 1…m, divided into blocks.]

Within a row, the delete recurrence unrolls into a chain:
  V_D(i, k+1) = max( V_D(i, k)   + c_{k+1},  d_{k+1} )
  V_D(i, k+2) = max( V_D(i, k+1) + c_{k+2},  d_{k+2} )
  V_D(i, k+3) = max( V_D(i, k+2) + c_{k+3},  d_{k+3} )
  …
  V_D(i, m)   = max( V_D(i, m-1) + c_m,      d_m )

• Delete costs impose a sequential dependency
• We parallelize the row calculations: the recurrence for V_D is parallelizable by a blocking strategy
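Because max(x + c, d) chains associatively (a max-plus prefix scan), the row can be blocked: each block computes its local result and accumulated cost independently, one carry value per block boundary is then propagated, and the blocks finally merge the carry into their local results in parallel. A sequential sketch of those three phases (block size and variable names are illustrative, not the paper's code):

```python
def delete_row_blocked(c, d, block=4):
    """Blocked evaluation of v[j] = max(v[j-1] + c[j], d[j]).
    Phase 1 (independent per block): local scan assuming nothing enters
    the block, plus the running sum of c inside the block.
    Phase 2 (serial, but only one value per block): propagate carries.
    Phase 3 (independent per block): merge carry into local results."""
    n = len(d)
    inf = float("inf")
    loc = [0.0] * n                  # in-block result, ignoring the carry-in
    csum = [0.0] * n                 # running sum of c within the block
    for s in range(0, n, block):     # phase 1
        e = min(s + block, n)
        loc[s], csum[s] = d[s], c[s]
        for j in range(s + 1, e):
            loc[j] = max(loc[j - 1] + c[j], d[j])
            csum[j] = csum[j - 1] + c[j]
    nblocks = (n + block - 1) // block
    carry = [-inf] * nblocks         # value entering each block from the left
    for b in range(1, nblocks):      # phase 2
        e = min(b * block, n)
        carry[b] = max(loc[e - 1], carry[b - 1] + csum[e - 1])
    return [max(loc[j], carry[j // block] + csum[j]) for j in range(n)]  # phase 3
```

Phases 1 and 3 are where the GPU threads cooperate; phase 2 touches only one value per block, so the serial chain shrinks from m elements to m/block.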
NVIDIA-CUDA Programming Interface
[Figure: CUDA hardware/software model. Threads are grouped into thread blocks. Each multiprocessor contains M scalar processors with private registers and a shared instruction unit, on-chip shared memory, and constant and texture caches; all multiprocessors access the off-chip global memory.]
Traditional Task Parallelism for GPUs
[Figure: thread blocks 1…P, each assigned its own sequences Seq i, Seq i+M, ….]

• Different GPU thread blocks work on independent tasks
• The time taken by different tasks varies with the nature of the job: sequences of different lengths take different amounts of time to complete
• Load-imbalance issues are possible: the total time depends on where the longest sequences sit in the database
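A quick way to see the imbalance: assign sequences round-robin to thread blocks and take the maximum per-block load, which bounds the batch completion time. The sequence lengths below are made-up numbers:

```python
def batch_time_task_parallel(lengths, num_blocks):
    """Completion time of one batch under pure task parallelism:
    sequence i goes to block i % num_blocks, and the batch finishes
    only when the most-loaded block does."""
    loads = [0] * num_blocks
    for i, n in enumerate(lengths):
        loads[i % num_blocks] += n
    return max(loads)

# Seven short sequences plus one long one, spread over 4 thread blocks:
lengths = [100] * 7 + [1000]
ideal = sum(lengths) / 4                       # perfectly balanced work
actual = batch_time_task_parallel(lengths, 4)  # one block gets the long one
```

Shuffling the long sequence changes which block it lands on but not the imbalance itself; only splitting the work within a sequence removes it.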
Hybrid Data and Task Parallelism
[Figure: thread blocks 1…P cooperating on chunks of the same sequences Seq i, Seq i+M, ….]

• We extract parallelism out of the data dependency: multiple GPU threads cooperate on the same task
• By dividing the database into roughly equal chunks, we naturally avoid load-imbalance problems
• The technique works for uniform recurrence equations, which are ubiquitous in computational biology: local sequence alignment, multiple sequence alignment, motif finding
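One way to form "roughly equal chunks" is greedy longest-first balancing of sequences by residue count. This balancer is an illustrative sketch under our own assumptions, not the paper's exact partitioning scheme:

```python
def partition_database(lengths, num_chunks):
    """Greedy longest-first balancing: hand each sequence (longest
    first) to the currently least-loaded chunk, so the chunk totals
    stay roughly equal."""
    chunks = [[] for _ in range(num_chunks)]
    loads = [0] * num_chunks
    for i in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        b = loads.index(min(loads))       # least-loaded chunk so far
        chunks[b].append(i)
        loads[b] += lengths[i]
    return chunks, loads

chunks, loads = partition_database([100] * 7 + [1000], 4)
```

The maximum chunk load is now bounded by the longest single sequence, and since threads also cooperate within a sequence, even that chunk is processed in parallel.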
GPU Implementation
• Multiple threads cooperate by partitioning a single sequence and working on different partitions
• The working set is one row of the DP matrix: 507 x 3 32-bit integers
• A model of size 507 has 507 x 9 transition probabilities and 507 x 40 emission probabilities, stored as short integers
• Model data is read into shared memory in coalesced form
• The working set is stored and updated within shared memory

[Figure: row i of the DP matrix, columns 1…m, partitioned among cooperating threads.]
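The shared-memory budget can be sanity-checked from the numbers on this slide (the 16 KB per-multiprocessor shared-memory size typical of GPUs of that generation is our assumption):

```python
M = 507                   # model length from the slide

working_set = M * 3 * 4   # one DP row: 3 state values per column, 32-bit ints
transitions = M * 9 * 2   # 9 transition probabilities per state, short ints
emissions = M * 40 * 2    # 40 emission probabilities per state, short ints

# The DP row (about 6 KB) fits comfortably in a 16 KB shared memory,
# while the full model tables (about 49 KB) do not, so model data must
# be staged into shared memory in coalesced pieces rather than kept
# fully resident.
```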
Performance and Results
Three HMM sizes:
• 128
• 256
• 507

NCBI NR protein database:
• 5.5 GB in size
• 10.5 million protein sequences

• We compared our implementation on 1 and 4 GPUs against mpi-gpu HMMER on the same dataset
Conclusions and Future Work
• Our GPU implementation of HMMER search is:
  5–8x faster than the GPU-HMMER search implementation on the same dataset
  100x faster than the CPU implementation
• Future work: phylogenetic motif identification
  Identify common genetic motifs among several groups of organisms
  Representative of evolutionary relatedness among different species (typically thousands)
  Closely related to the multiple sequence alignment problem via Hidden Markov Models
  Computationally intensive, so faster motif finding is absolutely necessary