Author: Yongchao Liu, Douglas L. Maskell, Bertil Schmidt. Publisher: BMC Research Notes, 2009

CUDASW++: optimizing Smith-Waterman sequence
database searches for CUDA-enabled graphics processing units
Author: Yongchao Liu, Douglas L. Maskell, Bertil Schmidt; Publisher: BMC Research Notes, 2009; Speaker: De Yu Chen; Date: 2011/4/20

Outline
Introduction
Smith-Waterman Algorithm
CUDA Programming Model
Methods
Results and discussion
2

Introduction
In this paper, the compute power of CUDA-enabled GPUs is further explored to
accelerate SW sequence database searches.
Two versions of CUDASW++ are implemented: a single-GPU version and a
multi-GPU version. Our CUDASW++ implementations achieve better performance
for protein sequence database searches compared to:
(1) SWPS3: PlayStation 3, 2008
(2) CBESW: PlayStation 3, 2008
(3) SW-CUDA: CUDA-enabled GPU, 2008
(4) NCBI-BLAST: the Basic Local Alignment Search Tool, designed by the
National Center for Biotechnology Information, USA, 1997
3

Smith-Waterman Algorithm
4
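The local-alignment recurrence behind the algorithm can be sketched in Python. This is a minimal version with a linear gap penalty; the actual CUDASW++ implementation uses a substitution matrix and affine gap penalties, so the scoring parameters here are illustrative assumptions:

```python
def smith_waterman(query, subject, match=2, mismatch=-1, gap=-1):
    """Optimal local alignment score (linear gap penalty sketch)."""
    rows, cols = len(query) + 1, len(subject) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # diagonal: align two symbols
                          H[i - 1][j] + gap,     # up: gap in the subject
                          H[i][j - 1] + gap)     # left: gap in the query
            best = max(best, H[i][j])
    return best
```

The max with 0 is what makes the alignment local: a negative-scoring prefix is simply discarded.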

CUDA Programming Model
5
CUDA execution model: CPU (host) and GPU (device)

CUDA Programming Model (cont.)
6
CUDA hardware model

CUDA Programming Model (cont.)
7
※ Shared memory access patterns
__shared__ int data[32];
[Figure: data[0..31] distributed across 16 banks; bank k holds data[k] and data[k+16]]
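This layout matters because, on the GT200-class GPUs discussed here, shared memory is split into 16 banks of 32-bit words, and a half-warp stalls when two of its threads hit the same bank. A small sketch of the mapping (the helper names are hypothetical):

```python
NUM_BANKS = 16  # banks per half-warp on GT200-class GPUs

def bank_of(index):
    """Bank that the 32-bit word data[index] falls into."""
    return index % NUM_BANKS

def has_conflict(indices):
    """True if any two threads of a half-warp touch the same bank."""
    banks = [bank_of(i) for i in indices]
    return len(set(banks)) < len(banks)
```

With 16 threads accessing data[0..15] there is no conflict, while data[0] and data[16] collide in bank 0.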

CUDA Programming Model (cont.)
8
[Figure: FPGA implementation vs. GPU implementation; forward array and backward array]

Methods
9
Considering the optimal local alignment of a query sequence and a subject
sequence as a task, we have investigated two approaches to parallelizing
sequence database searches using CUDA.
Inter-task parallelization
Each task is assigned to exactly one thread and dimBlock tasks are performed
in parallel by different threads in a thread block.
[Figure: query vs. subject alignment matrix; one alignment task per thread]
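The task-to-thread mapping can be sketched as follows (the function name is hypothetical): each task index is converted to a (block, thread) pair, with dimBlock tasks per thread block.

```python
def inter_task_schedule(num_tasks, dim_block):
    """Assign each alignment task to exactly one (block, thread) pair:
    dim_block tasks are performed in parallel by one thread block."""
    return {task: (task // dim_block, task % dim_block)
            for task in range(num_tasks)}
```

On the GPU the inverse mapping (blockIdx.x * blockDim.x + threadIdx.x) recovers the task index inside the kernel.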

Methods (cont.)
Intra-task parallelization
Each task is assigned to one thread block and all dimBlock threads in the thread
block cooperate to perform the task in parallel, exploiting the parallel
characteristics of cells in the minor diagonals.
10
[Figure: query vs. subject alignment matrix; threads cooperate along the minor diagonals]
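The independence of cells on a minor diagonal can be sketched as follows: all cells (i, j) with the same i + j depend only on cells from earlier diagonals, so each wavefront can be computed in parallel by the threads of a block (helper name is hypothetical):

```python
def antidiagonals(rows, cols):
    """Yield, per wavefront, the DP cells (i, j) that can be computed
    in parallel: cells with i + j == d depend only on diagonals < d."""
    for d in range(rows + cols - 1):
        yield [(i, d - i) for i in range(rows) if 0 <= d - i < cols]
```

A rows x cols matrix has rows + cols - 1 wavefronts, and the widest one has min(rows, cols) independent cells, which bounds the usable parallelism per task.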

Methods (cont.)
Inter-task parallelization occupies more device memory but achieves better
performance than intra-task parallelization. However, intra-task parallelization
occupies significantly less device memory and therefore can support longer
query/subject sequences.
In our implementation, two stages are used: the first stage exploits inter-task
parallelization and the second intra-task parallelization. For subject sequences of
length less than or equal to a threshold, the alignments with a query sequence are
performed in the first stage in order to maximize performance.
The alignments of subject sequences of length greater than the threshold are carried
out in the second stage. In our implementation, the threshold is set to 3,072.
11
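The two-stage split can be sketched as a simple host-side partition of the database by subject length (the function name is hypothetical; the cutoff value is the one stated above):

```python
THRESHOLD = 3072  # subject-length cutoff between the two stages

def split_stages(subjects):
    """Stage 1 (inter-task): subjects of length <= THRESHOLD.
    Stage 2 (intra-task): the longer subjects."""
    stage1 = [s for s in subjects if len(s) <= THRESHOLD]
    stage2 = [s for s in subjects if len(s) > THRESHOLD]
    return stage1, stage2
```

Since most database sequences are short, stage 1 handles the bulk of the workload with the faster inter-task kernel.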

Methods (cont.)
12
※ Device memory access patterns for coalescing:
(1) Placing matrix elements into linear order
[Figure: a 4 × 4 matrix M with elements 1..16 placed into a linear array]

Methods (cont.)
13
(2) A coalesced access pattern
[Figure: matrix M and its linear layout; in each load iteration, threads T(0)..T(3) access four consecutive elements, with the access direction running across the threads]

Methods (cont.)
14
(3) An uncoalesced access pattern
[Figure: matrix M and its linear layout; in load iterations 1 and 2, threads T(0)..T(3) each walk their own row, touching elements a full row apart]
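The two patterns can be contrasted by the word addresses each thread of a half-warp touches per load iteration (a sketch assuming 4 threads and a 4 × 4 matrix, as in the figures; the function names are hypothetical):

```python
def coalesced_addresses(iteration, num_threads):
    """Threads t = 0..n-1 read consecutive words, so one load
    iteration becomes a single coalesced memory transaction."""
    return [iteration * num_threads + t for t in range(num_threads)]

def uncoalesced_addresses(iteration, num_threads, row_len):
    """Each thread walks its own row, so threads touch words
    row_len apart and the accesses cannot be coalesced."""
    return [t * row_len + iteration for t in range(num_threads)]
```

In the coalesced case iteration 0 touches words 0..3 in one transaction; in the uncoalesced case it touches 0, 4, 8, 12, costing one transaction per thread.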

Methods (cont.)
Our implementation uses three techniques to improve performance:
(I) Coalesced subject sequence arrangement.
(II) Coalesced global memory access.
(III) Cell block division method.
I. Coalesced subject sequence arrangement
15

Methods (cont.)
II. Coalesced global memory access
16

Methods (cont.)
III. Cell block division method
To maximize performance and to reduce the bandwidth demand of global memory,
we propose a cell block division method for the inter-task parallelization, where
the alignment matrix is divided into cell blocks of equal size.
17
[Figure: the alignment matrix between query and subject sequences divided into a grid of n × n cell blocks]
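The division of the alignment matrix into equally sized cell blocks can be sketched on the host side as follows (the function name is hypothetical; edge blocks are smaller when the sequence lengths are not multiples of n):

```python
def cell_blocks(q_len, s_len, n):
    """Divide the q_len x s_len alignment matrix into n x n cell
    blocks, returned as (row, col, height, width) tuples."""
    blocks = []
    for i in range(0, q_len, n):
        for j in range(0, s_len, n):
            blocks.append((i, j, min(n, q_len - i), min(n, s_len - j)))
    return blocks
```

Processing one block at a time keeps the working set of intermediate cells small, which is what reduces the global-memory bandwidth demand.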

Results and discussion
Cell Updates Per Second (CUPS) is a commonly used performance measure in
bioinformatics because it removes the dependency on the particular query
sequences and databases used in the different tests.
Given a query sequence Q and a database D,
the GCUPS (billion cell updates per second) value is calculated by:
|Q| is the total number of symbols in the query sequence;
|D| is the total number of symbols in the database;
t is the runtime in seconds. In this paper, for the single-GPU version, the runtime t includes the transfer time of the query sequences from host to GPU, the calculation time of the SW algorithm, and the transfer-back time of the scores.
18
GCUPS = |Q| × |D| / (t × 10^9)
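The measure translates directly into code (a sketch; the function name is hypothetical):

```python
def gcups(query_len, db_len, runtime_s):
    """Billion cell updates per second: |Q| * |D| / (t * 10^9)."""
    return query_len * db_len / (runtime_s * 1e9)
```

For example, a 1,000-symbol query against a 1,000,000-symbol database in 1 second is exactly 1 GCUPS.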

Results and discussion (cont.)
19
(1) Database: Swiss-Prot release 56.6
(2) Number of query sequences: 25
(3) Query length: 144 ~ 5,478
(4) Single-GPU: NVIDIA GeForce GTX 280 (30 SMs, 240 cores, 1 GB RAM)
(5) Multi-GPU: NVIDIA GeForce GTX 295 (60 SMs, 480 cores, 1.8 GB RAM)

Results and discussion (cont.)
20

Results and discussion (cont.)
21
(1) Number of query sequences: 1
(2) Number of database sequences: 30
(3) Database length: 256 ~ 5,000
(4) Single-GPU: NVIDIA Tesla C1060 (30 SMs, 240 cores, 4 GB RAM)
(5) CPU: Intel E6420 1.8 GHz
Execution time T (s):

Database Length | CUDASW++ | CPU SW | Our implementation
128             | 0.007    | 0.02   | 0.002
256             | 0.017    | 0.082  | 0.008
512             | 0.058    | 0.277  | 0.029
1024            | 0.223    | 1.25   | 0.113
1500            | 0.476    | 2.81   | 0.328
2048            | 0.88     | 5.39   | 0.437
3500            | 1.45     | 10.86  | 0.83