Author: Yongchao Liu, Douglas L. Maskell, Bertil Schmidt. Publisher: BMC Research Notes, 2009

CUDASW++: optimizing Smith-Waterman sequence
database searches for CUDA-enabled graphics processing units
Author: Yongchao Liu, Douglas L. Maskell, Bertil Schmidt; Publisher: BMC Research Notes, 2009; Speaker: De Yu Chen; Date: 2011/4/20

Outline
Introduction
Smith-Waterman Algorithm
CUDA Programming Model
Methods
Results and discussion
2

Introduction
In this paper, the compute power of CUDA-enabled GPUs is further explored to
accelerate SW sequence database searches.
Two versions of CUDASW++ are implemented: a single-GPU version and a
multi-GPU version. Our CUDASW++ implementations achieve better performance
for protein sequence database searches compared to:
(1) SWPS3: PlayStation 3, 2008
(2) CBESW: PlayStation 3, 2008
(3) SW-CUDA: CUDA-enabled GPU, 2008
(4) NCBI-BLAST: the Basic Local Alignment Search Tool, designed by the
National Center for Biotechnology Information, USA, 1997
3

Smith-Waterman Algorithm
4
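The local-alignment recurrence behind the algorithm can be sketched in Python. This is a minimal version with a linear gap penalty; the actual CUDASW++ implementation uses a substitution matrix and affine gap penalties, so the scoring parameters here are illustrative assumptions:

```python
def smith_waterman(query, subject, match=2, mismatch=-1, gap=-1):
    """Optimal local alignment score (linear gap penalty sketch)."""
    rows, cols = len(query) + 1, len(subject) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # diagonal: align two symbols
                          H[i - 1][j] + gap,     # up: gap in the subject
                          H[i][j - 1] + gap)     # left: gap in the query
            best = max(best, H[i][j])
    return best
```

The max with 0 is what makes the alignment local: a negative-scoring prefix is simply discarded.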

CUDA Programming Model
5
CUDA execution model: CPU (host) and GPU (device)

CUDA Programming Model (cont.)
6
CUDA hardware model

CUDA Programming Model (cont.)
7
※ Shared memory access patterns
__shared__ int data[32];
[Figure: data[0..31] distributed across 16 banks; bank k holds data[k] and data[k+16]]
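This layout matters because, on the GT200-class GPUs discussed here, shared memory is split into 16 banks of 32-bit words, and a half-warp stalls when two of its threads hit the same bank. A small sketch of the mapping (the helper names are hypothetical):

```python
NUM_BANKS = 16  # banks per half-warp on GT200-class GPUs

def bank_of(index):
    """Bank that the 32-bit word data[index] falls into."""
    return index % NUM_BANKS

def has_conflict(indices):
    """True if any two threads of a half-warp touch the same bank."""
    banks = [bank_of(i) for i in indices]
    return len(set(banks)) < len(banks)
```

With 16 threads accessing data[0..15] there is no conflict, while data[0] and data[16] collide in bank 0.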

CUDA Programming Model (cont.)
8
[Figure: FPGA implementation vs. GPU implementation; forward array and backward array]

Methods
9
Considering the optimal local alignment of a query sequence and a subject
sequence as a task, we have investigated two approaches to parallelizing
sequence database searches using CUDA.
Inter-task parallelization
Each task is assigned to exactly one thread and dimBlock tasks are performed
in parallel by different threads in a thread block.
[Figure: query vs. subject alignment matrix; one alignment task per thread]
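The task-to-thread mapping can be sketched as follows (the function name is hypothetical): each task index is converted to a (block, thread) pair, with dimBlock tasks per thread block.

```python
def inter_task_schedule(num_tasks, dim_block):
    """Assign each alignment task to exactly one (block, thread) pair:
    dim_block tasks are performed in parallel by one thread block."""
    return {task: (task // dim_block, task % dim_block)
            for task in range(num_tasks)}
```

On the GPU the inverse mapping (blockIdx.x * blockDim.x + threadIdx.x) recovers the task index inside the kernel.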

Methods (cont.)
Intra-task parallelization
Each task is assigned to one thread block and all dimBlock threads in the thread
block cooperate to perform the task in parallel, exploiting the parallel
characteristics of cells in the minor diagonals.
10
[Figure: query vs. subject alignment matrix; threads cooperate along the minor diagonals]
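The independence of cells on a minor diagonal can be sketched as follows: all cells (i, j) with the same i + j depend only on cells from earlier diagonals, so each wavefront can be computed in parallel by the threads of a block (helper name is hypothetical):

```python
def antidiagonals(rows, cols):
    """Yield, per wavefront, the DP cells (i, j) that can be computed
    in parallel: cells with i + j == d depend only on diagonals < d."""
    for d in range(rows + cols - 1):
        yield [(i, d - i) for i in range(rows) if 0 <= d - i < cols]
```

A rows x cols matrix has rows + cols - 1 wavefronts, and the widest one has min(rows, cols) independent cells, which bounds the usable parallelism per task.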

Methods (cont.)
Inter-task parallelization occupies more device memory but achieves better
performance than intra-task parallelization. However, intra-task parallelization
occupies significantly less device memory and therefore can support longer
query/subject sequences.
In our implementation, two stages are used: the first stage exploits inter-task
parallelization and the second intra-task parallelization. For subject sequences of
length less than or equal to a threshold, the alignments with a query sequence are
performed in the first stage in order to maximize performance.
The alignments of subject sequences of length greater than the threshold are carried
out in the second stage. In our implementation, the threshold is set to 3,072.
11
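The two-stage split can be sketched as a simple host-side partition of the database by subject length (the function name is hypothetical; the cutoff value is the one stated above):

```python
THRESHOLD = 3072  # subject-length cutoff between the two stages

def split_stages(subjects):
    """Stage 1 (inter-task): subjects of length <= THRESHOLD.
    Stage 2 (intra-task): the longer subjects."""
    stage1 = [s for s in subjects if len(s) <= THRESHOLD]
    stage2 = [s for s in subjects if len(s) > THRESHOLD]
    return stage1, stage2
```

Since most database sequences are short, stage 1 handles the bulk of the workload with the faster inter-task kernel.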

Methods (cont.)
12
※ Device memory access patterns for coalescing:
(1) Placing matrix elements into linear order
[Figure: a 4 × 4 matrix M with elements 1..16 placed into a linear array]

Methods (cont.)
13
(2) A coalesced access pattern
[Figure: matrix M and its linear layout; in each load iteration, threads T(0)..T(3) access four consecutive elements, with the access direction running across the threads]

Methods (cont.)
14
(3) An uncoalesced access pattern
[Figure: matrix M and its linear layout; in load iterations 1 and 2, threads T(0)..T(3) each walk their own row, touching elements a full row apart]
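The two patterns can be contrasted by the word addresses each thread of a half-warp touches per load iteration (a sketch assuming 4 threads and a 4 × 4 matrix, as in the figures; the function names are hypothetical):

```python
def coalesced_addresses(iteration, num_threads):
    """Threads t = 0..n-1 read consecutive words, so one load
    iteration becomes a single coalesced memory transaction."""
    return [iteration * num_threads + t for t in range(num_threads)]

def uncoalesced_addresses(iteration, num_threads, row_len):
    """Each thread walks its own row, so threads touch words
    row_len apart and the accesses cannot be coalesced."""
    return [t * row_len + iteration for t in range(num_threads)]
```

In the coalesced case iteration 0 touches words 0..3 in one transaction; in the uncoalesced case it touches 0, 4, 8, 12, costing one transaction per thread.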

Methods (cont.)
Our implementation uses three techniques to improve performance:
(I) Coalesced subject sequence arrangement.
(II) Coalesced global memory access.
(III) Cell block division method.
I. Coalesced subject sequence arrangement
15

Methods (cont.)
II. Coalesced global memory access
16

Methods (cont.)
III. Cell block division method
To maximize performance and to reduce the bandwidth demand of global memory,
we propose a cell block division method for the inter-task parallelization, where
the alignment matrix is divided into cell blocks of equal size.
17
[Figure: the alignment matrix between query and subject sequences divided into a grid of n × n cell blocks]
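The division of the alignment matrix into equally sized cell blocks can be sketched on the host side as follows (the function name is hypothetical; edge blocks are smaller when the sequence lengths are not multiples of n):

```python
def cell_blocks(q_len, s_len, n):
    """Divide the q_len x s_len alignment matrix into n x n cell
    blocks, returned as (row, col, height, width) tuples."""
    blocks = []
    for i in range(0, q_len, n):
        for j in range(0, s_len, n):
            blocks.append((i, j, min(n, q_len - i), min(n, s_len - j)))
    return blocks
```

Processing one block at a time keeps the working set of intermediate cells small, which is what reduces the global-memory bandwidth demand.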

Results and discussion
Cell Updates Per Second (CUPS) is a commonly used performance measure in
bioinformatics because it removes the dependency on the particular query
sequences and databases used in the different tests.
Given a query sequence Q and a database D,
the GCUPS (billion cell updates per second) value is calculated by:
|Q| is the total number of symbols in the query sequence;
|D| is the total number of symbols in the database;
t is the runtime in seconds. In this paper, for the single-GPU version, the runtime t includes the transfer time of the query sequences from host to GPU, the calculation time of the SW algorithm, and the transfer-back time of the scores.
18
GCUPS = |Q| × |D| / (t × 10^9)
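The measure translates directly into code (a sketch; the function name is hypothetical):

```python
def gcups(query_len, db_len, runtime_s):
    """Billion cell updates per second: |Q| * |D| / (t * 10^9)."""
    return query_len * db_len / (runtime_s * 1e9)
```

For example, a 1,000-symbol query against a 1,000,000-symbol database in 1 second is exactly 1 GCUPS.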

Results and discussion (cont.)
19
(1) Database: Swiss-Prot release 56.6
(2) Number of query sequences: 25
(3) Query length: 144 ~ 5,478
(4) Single-GPU: NVIDIA GeForce GTX 280 (30 SMs, 240 cores, 1 GB RAM)
(5) Multi-GPU: NVIDIA GeForce GTX 295 (60 SMs, 480 cores, 1.8 GB RAM)

Results and discussion (cont.)
20

Results and discussion (cont.)
21
(1) Number of query sequences: 1
(2) Number of database sequences: 30
(3) Database length: 256 ~ 5,000
(4) Single-GPU: NVIDIA Tesla C1060 (30 SMs, 240 cores, 4 GB RAM)
(5) CPU: Intel E6420 1.8 GHz
Execution time T (s):

Database Length | CUDASW++ | CPU SW | Our implementation
128             | 0.007    | 0.02   | 0.002
256             | 0.017    | 0.082  | 0.008
512             | 0.058    | 0.277  | 0.029
1024            | 0.223    | 1.25   | 0.113
1500            | 0.476    | 2.81   | 0.328
2048            | 0.88     | 5.39   | 0.437
3500            | 1.45     | 10.86  | 0.83