Accelerating HMMER Search on GPUs using Hybrid Task and Data Parallelism

Accelerating HMMER Search on GPUs using Hybrid Task and Data Parallelism

Narayan Ganesan1, Roger Chamberlain2, Jeremy Buhler2 and Michela Taufer1

Computer and Info. Sciences Dept, University of Delaware.1

Computer Science and Engineering, Washington University in St. Louis2

Motivation

• Dataset size is growing for many applications rapidly• Problem sizes grow as the product of the data sizes

E.g., Genome sequence alignment, protein motif finding

• Number of problems also grows as the product of number of available data• Many sequentially dependent algorithms are executed serially• Processor speed has hit a brick wall (~3.5-4.00GHz in 2003) and

serial evaluation is just not feasible for large data applications Need for parallelism is greater than ever

• Parallel hardware is ubiquitous GPU, Multi-core, SIMD, Hybrid, MIMD, FPGA

• Efforts must be spent on parallelizing algorithms and applications for these hardware

Application: Protein Motif Finding

• Proteins are synthesized by biological Processes that can be described by HMMs

• Each class is described by short characteristic sequences or motifs• Each class or “generator” is described by a Hidden Markov model• Protein Motif Finding answers the questions:

Given a sequence what class does it belong to? Given a sequence and a HMM what is the probability that the sequence

belongs to that class?

Class AProtein Synthesis

...PAQVEMYKFLLRISQLNRD...... CHTEARGLEGVCIDPKK...... DGEACNSPYLDWRKDTEQ...

Class NProtein Synthesis

.

.

. ... CTPPSLAACTPPTS...... LTITNLMKSLGFKPKPKKI...... DELAAMVRDYLKKTPEF...

Protein Motif Finding

• Protein motif finding is very similar to signal source identification• A protein sequence may contain multiple motifs:

. . A C G F T D F W A P S L T H L T I K N L . .• Motifs in the sequence are sometimes modified by addition and deletion

of random amino acids• Occurrence of motifs can be modeled by Profile Hidden Markov models• Viterbi algorithm is used to “decode” a given protein sequence against a

model• The result is the probability that the sequence belongs to the class

Protein Synthesis

...PAQVEMYKFLLRISQLNRD...... CHTEARGLEGVCIDPKK...... DGEACNSPYLDWRKDTEQ...

HMMER Search - Protein Motif Finding

D2

I2 I3 I4

D3 D4

J

BS

I1

E TN CM1 M2 M3 M4 M5

. . . . A C G F T D F W A P S L T H L T I K N L . . . .

. . . . A C G F T D F W A P S L T H L T I K N L . . . .

. . . . A C G F T D F W A P S A L T H L T I K N L . . . .

. . . . A C G F T D F W A P S A G L - H L T I K N L . . .. . . . A C G F T D F W A P S A G L T H L T I K N L . . .

Sample Paths

HMMER Search – Recurrence Equations

M1 M2 M3

D2

M4 M5

I2 I3 I4

D3 D4

J

VD

VI

VM

BS

I1

E TN C

HMMER Search- Protein Motif Finding

Model

Sequence

1 m1

L

XE

i

j

Insert:

Match:

Delete:

• Dependence on XE imposes a row major order computation• Delete state costs impose a serial dependency on previous element in the row• Harder to parallelize by conventional means

),)1,(max(),( iiDD dcjiVjiV

HMMER Search – Protein Motif Finding

1 m

),),(max()2,( 22 kkDD dckiVkiV ),)3,(max(),( mmDD dckiVmiV

),)1,(max(),( kkDD dciVkiV

),)2,(max()3,( 33 kkDD dckiVkiV

row i

• Delete costs impose sequential dependency

• Parallelize the row calculations• The recurrence for VD is parallelizable by blocking strategy

NVIDIA-CUDA Programming Interface

. . .

ThreadsThread Blocks

. . .

Global Memory

Shared Memory

Constant Cache

Texture Cache

Multiprocessor 1

Registers Registers Registers

Processor 1 Processor 2 Processor M… InstructionUnit

Shared Memory

Constant Cache

Texture Cache

Multiprocessor 2

Registers Registers Registers

Processor 1 Processor 2 Processor M… InstructionUnit

Traditional Task Parallelism for GPUs

Seq i Seq i+M

ThreadBlock 1

ThreadBlock 2

ThreadBlock P

• Different GPU threads work on independent tasks• Time taken for different tasks

vary according to the nature of the job Different sequences of

different lengths take different time to complete

Load imbalance issues are possible

Total time depends on location of the longest sequences in the database

Hybrid data and task parallelism

Seq i Seq i+M

ThreadBlock 1

ThreadBlock 2

ThreadBlock P

• We extract parallelism out of data dependency Multiple GPU threads cooperate

to work on the same task• By dividing the database into roughly

equal chunks, we naturally solve any load imbalance problems• This technique works for uniform

recurrence equations Ubiquitous in computational

biology including local sequence alignment, multiple sequence alignment, motif finding

GPU Implementation

• Multiple threads cooperate by partitioning a single sequence and working on different partitions• Working set is one row of the DP matrix, 507x3, 32-bit integers

Model of size 507 has 507x9 transition probabilities and 507x40 emission probabilities stored as short integers

• Model data is read in a coalesced form into the shared memory• Working set is stored and updated within shared memory

1 mrow i

Performance and Results

3 HMM Sizes:• 128• 256 • 507

NCBI NR Protein Database:• 5.5GB in Size• 10.5 Million Protein

Sequences

• We compared our implementation on 1 and 4 GPUs versus the mpi-gpu HMMER for the same dataset

Conclusions and Future Work

• Our GPU implementation of HMMER is: 5-8x faster than GPU-HMMER search implementation for the same

data set 100x faster than the CPU implementation

• Future work: Phylogenetic Motif Identification Identify common genetic motifs among several groups of organisms Representative of evolutionary relatedness among different (typically

1000s) species Closely related to multiple sequence alignment problem via Hidden

Markov Models Computationally intensive for which faster motif finding is absolutely

necessary

Accelerating HMMER Search on GPUs using Hybrid Task and Data Parallelism

Documents

Transcript of Accelerating HMMER Search on GPUs using Hybrid Task and Data Parallelism