
Page 1: CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units

Author: Yongchao Liu, Douglas L Maskell, Bertil Schmidt
Publisher: BMC Research Notes, 2009
Speaker: De Yu Chen
Date: 2011/4/20

Page 2: Outline

Introduction

Smith-Waterman Algorithm

CUDA Programming Model

Methods

Results and discussion


Page 3: Introduction

In this paper, the compute power of CUDA-enabled GPUs is further explored to accelerate Smith-Waterman (SW) sequence database searches.

Two versions of CUDASW++ are implemented: a single-GPU version and a multi-GPU version. Our CUDASW++ implementations provide better performance guarantees for protein sequence database searches compared to:

(1) SWPS3: PlayStation 3, 2008
(2) CBESW: PlayStation 3, 2008
(3) SW-CUDA: CUDA-enabled GPU, 2008
(4) NCBI-BLAST: the Basic Local Alignment Search Tool, developed by the National Center for Biotechnology Information, USA, 1997


Page 4: Smith-Waterman Algorithm

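The slide's figure did not survive the transcript. For reference, the Smith-Waterman recurrences with affine gap penalties that CUDASW++ evaluates can be written as follows (notation assumed here: sbt is the substitution matrix, \rho the gap-open penalty charged for the first residue of a gap, \sigma the gap-extension penalty):

H(i,j) = \max\{\, 0,\; E(i,j),\; F(i,j),\; H(i-1,j-1) + \mathrm{sbt}(q_i, s_j) \,\}
E(i,j) = \max\{\, E(i,j-1) - \sigma,\; H(i,j-1) - \rho \,\}
F(i,j) = \max\{\, F(i-1,j) - \sigma,\; H(i-1,j) - \rho \,\}

The optimal local alignment score is the maximum of H(i,j) over all cells of the matrix.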

Page 5: CUDA Programming Model

(Figure: CUDA execution model, showing the CPU (host) and the GPU (device))
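As a minimal, illustrative sketch of this host/device split (not code from the paper; the kernel and variable names below are invented), the host allocates device memory, launches a grid of thread blocks on the GPU, and copies the result back:

#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: each thread writes its global index into the output array.
__global__ void fillIndices(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main()
{
    const int n = 1 << 20;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));              // device (GPU) memory

    dim3 dimBlock(256);                               // threads per block
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);  // blocks per grid
    fillIndices<<<dimGrid, dimBlock>>>(d_out, n);     // host launches the kernel on the device

    int first = -1;
    cudaMemcpy(&first, d_out, sizeof(int), cudaMemcpyDeviceToHost);  // copy a result back
    printf("out[0] = %d\n", first);
    cudaFree(d_out);
    return 0;
}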

Page 6: CUDA Programming Model (cont.)

(Figure: CUDA hardware model)

Page 7: CUDA Programming Model (cont.)


※ Shared memory access patterns

__shared__ int data[32];

(Figure: the 32-element array mapped onto the 16 shared memory banks, with bank k holding data[k] and data[k + 16] for k = 0 ... 15)
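A short sketch (not from the paper) of what this bank layout implies for access patterns on the 16-bank GPUs of that generation: unit-stride indexing sends each thread of a half-warp to a different bank, while a stride of 16 makes all of them collide on one bank.

// Assumes blockDim.x == 32 and an output buffer of at least 32 ints.
__global__ void sharedAccessDemo(int *out)
{
    __shared__ int data[32];
    int t = threadIdx.x;

    data[t] = t;                      // each thread fills one element
    __syncthreads();

    // Conflict-free: thread t reads bank (t % 16), so a half-warp hits 16 distinct banks.
    int a = data[t];

    // 16-way bank conflict: every thread reads element 0 or 16, both of which
    // live in bank 0, so the half-warp's reads are serialized.
    int b = data[(t * 16) % 32];

    out[t] = a + b;
}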

Page 8: CUDA Programming Model (cont.)

(Figure: FPGA implementation vs. GPU implementation, showing the forward array and the backward array)

Page 9: Methods


Considering the optimal local alignment of a query sequence and a subject sequence as a task, we have investigated two approaches for parallelizing the sequence database searches using CUDA.

Inter-task parallelization

Each task is assigned to exactly one thread, and dimBlock tasks are performed in parallel by different threads in a thread block.

(Figure: inter-task parallelization, with each thread computing the full alignment matrix of the query against one subject sequence)
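A minimal sketch of the inter-task scheme (illustrative only, with invented names; it omits the memory-layout optimizations described on the following slides and assumes sequences are already encoded as small integer codes): each thread scores the query against one subject sequence and writes a single result.

#define MAX_QUERY_LEN 512   // sketch-only bound so the per-thread DP rows fit in local arrays

// Smith-Waterman score of one query/subject pair with affine gaps
// (gap-open penalty rho, gap-extension penalty sigma); sub is a flattened
// 32 x 32 substitution matrix indexed by residue codes.
__device__ int swScore(const char *query, int qlen,
                       const char *subject, int slen,
                       const int *sub, int rho, int sigma)
{
    int H[MAX_QUERY_LEN + 1], E[MAX_QUERY_LEN + 1];
    for (int i = 0; i <= qlen; ++i) { H[i] = 0; E[i] = 0; }

    int best = 0;
    for (int j = 1; j <= slen; ++j) {
        int s = subject[j - 1];
        int diag = 0;                                 // H(i-1, j-1)
        int F = 0;                                    // F(i, j), running down the column
        for (int i = 1; i <= qlen; ++i) {
            int oldH = H[i];                          // H(i, j-1)
            int e = max(E[i] - sigma, oldH - rho);    // E(i, j)
            int h = max(0, diag + sub[query[i - 1] * 32 + s]);
            h = max(h, max(e, F));                    // H(i, j)
            E[i] = e; H[i] = h;
            best = max(best, h);
            F = max(F - sigma, h - rho);              // F(i+1, j)
            diag = oldH;
        }
    }
    return best;
}

// One thread per task: thread k aligns the query against subject sequence k.
__global__ void interTaskKernel(const char *query, int qlen,
                                const char *subjects, const int *offsets, const int *lengths,
                                int nSubjects, const int *sub, int rho, int sigma, int *scores)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < nSubjects)
        scores[k] = swScore(query, qlen, subjects + offsets[k], lengths[k], sub, rho, sigma);
}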

Page 10: Methods (cont.)

Intra-task parallelization

Each task is assigned to one thread block, and all dimBlock threads in the thread block cooperate to perform the task in parallel, exploiting the parallel characteristics of cells in the minor diagonals.

(Figure: intra-task parallelization, with the threads of one block computing the cells of each minor diagonal of the query/subject alignment matrix in parallel)
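A sketch of the intra-task idea (illustrative only, not the paper's kernel; for clarity it keeps the full H, E and F matrices in pre-zeroed global scratch buffers, which is exactly the memory cost a production kernel would avoid): all cells on an anti-diagonal are independent, so the threads of the block compute one diagonal per step and synchronize between diagonals.

// One thread block per task. H, E, F are (qlen+1) x (slen+1) row-major scratch
// matrices that the host has zero-initialized; sub is a flattened 32 x 32
// substitution matrix; rho/sigma are the gap-open/gap-extension penalties;
// sequences are encoded as small integer codes.
__global__ void intraTaskKernel(const char *query, int qlen,
                                const char *subject, int slen,
                                const int *sub, int rho, int sigma,
                                int *H, int *E, int *F, int *score)
{
    const int w = slen + 1;                   // row width of the DP matrices
    __shared__ int blockBest;
    if (threadIdx.x == 0) blockBest = 0;
    __syncthreads();

    int best = 0;
    // Cells on anti-diagonal d (where i + j == d) depend only on diagonals
    // d-1 and d-2, so the whole block computes one diagonal in parallel.
    for (int d = 2; d <= qlen + slen; ++d) {
        int iMin = max(1, d - slen), iMax = min(qlen, d - 1);
        for (int i = iMin + threadIdx.x; i <= iMax; i += blockDim.x) {
            int j = d - i;
            int e = max(E[i * w + j - 1] - sigma, H[i * w + j - 1] - rho);
            int f = max(F[(i - 1) * w + j] - sigma, H[(i - 1) * w + j] - rho);
            int h = max(0, H[(i - 1) * w + j - 1] + sub[query[i - 1] * 32 + subject[j - 1]]);
            h = max(h, max(e, f));
            E[i * w + j] = e; F[i * w + j] = f; H[i * w + j] = h;
            best = max(best, h);
        }
        __syncthreads();                      // finish diagonal d before starting d + 1
    }

    atomicMax(&blockBest, best);              // reduce the per-thread maxima
    __syncthreads();
    if (threadIdx.x == 0) *score = blockBest;
}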

Page 11: Methods (cont.)

Inter-task parallelization occupies more device memory but achieves better performance than intra-task parallelization. Intra-task parallelization occupies significantly less device memory and can therefore support longer query/subject sequences.

In our implementation, two stages are used: the first stage exploits inter-task parallelization and the second stage exploits intra-task parallelization. For subject sequences of length less than or equal to a threshold, the alignments with a query sequence are performed in the first stage in order to maximize performance. The alignments of subject sequences of length greater than the threshold are carried out in the second stage. In our implementation, the threshold is set to 3,072.
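A small host-side sketch of this two-stage dispatch (assumed helper, not the paper's code): subject sequences are partitioned by the length threshold, and each partition is handed to the corresponding kernel.

#include <utility>
#include <vector>

// Split subject-sequence indices by length: stage 1 (inter-task kernel) gets
// the sequences of length <= threshold, stage 2 (intra-task kernel) the rest.
std::pair<std::vector<int>, std::vector<int>>
splitByThreshold(const std::vector<int> &subjectLengths, int threshold = 3072)
{
    std::vector<int> stage1, stage2;
    for (int k = 0; k < (int)subjectLengths.size(); ++k) {
        if (subjectLengths[k] <= threshold)
            stage1.push_back(k);   // aligned by the inter-task kernel
        else
            stage2.push_back(k);   // aligned by the intra-task kernel
    }
    return {stage1, stage2};
}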

Page 12: Methods (cont.)

※ Device memory access patterns for coalescing:

(1) Placing matrix elements into linear order

(Figure: a 4 × 4 matrix M with elements 1-16 is laid out in device memory as the linear array 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16)

Page 13: Methods (cont.)

(2) A coalesced access pattern

(Figure: in each load iteration, threads T(0)-T(3) access consecutive elements of the linearized matrix M, so the accesses of a load iteration can be coalesced)

Page 14: Methods (cont.)

(3) An uncoalesced access pattern

(Figure: in load iterations 1 and 2, threads T(0)-T(3) access non-consecutive elements of the linearized matrix M, so the accesses cannot be coalesced)
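The two patterns above can be written as a short sketch (illustrative kernels, not from the paper); both kernels assume a single thread block and a buffer M of blockDim.x * iters elements.

// Coalesced: in iteration k, thread t reads M[k * blockDim.x + t], so the
// threads of a warp touch a contiguous range of addresses.
__global__ void coalescedLoad(const int *M, int *out, int iters)
{
    int sum = 0;
    for (int k = 0; k < iters; ++k)
        sum += M[k * blockDim.x + threadIdx.x];
    out[threadIdx.x] = sum;
}

// Uncoalesced: in iteration k, thread t reads M[t * iters + k], so neighbouring
// threads touch addresses that are `iters` elements apart.
__global__ void uncoalescedLoad(const int *M, int *out, int iters)
{
    int sum = 0;
    for (int k = 0; k < iters; ++k)
        sum += M[threadIdx.x * iters + k];
    out[threadIdx.x] = sum;
}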

Page 15: Methods (cont.)

Our implementation uses three techniques to improve performance:

(I) Coalesced subject sequence arrangement.

(II) Coalesced global memory access.

(III) Cell block division method.

I. Coalesced subject sequence arrangement

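A sketch of one plausible arrangement (assumed layout and names, not the paper's exact code): the subject sequences assigned to one thread block are interleaved so that the k-th residues of all dimBlock sequences are adjacent in memory; when every thread then reads the k-th residue of its own sequence, the accesses are coalesced.

#include <string>
#include <vector>

// Interleave dimBlock subject sequences, padded to maxLen with code 0, so that
// residue k of sequence t is stored at packed[k * dimBlock + t].
std::vector<char> interleaveSubjects(const std::vector<std::string> &seqs, int maxLen)
{
    const int dimBlock = (int)seqs.size();
    std::vector<char> packed((size_t)maxLen * dimBlock, 0);
    for (int t = 0; t < dimBlock; ++t)
        for (int k = 0; k < (int)seqs[t].size() && k < maxLen; ++k)
            packed[(size_t)k * dimBlock + t] = seqs[t][k];
    return packed;
}

// On the device, thread t then fetches packed[k * dimBlock + t] in iteration k,
// which is a coalesced access across the thread block.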

Page 16: Methods (cont.)

II. Coalesced global memory access


Page 17: Methods (cont.)

III. Cell block division method

To maximize performance and to reduce the bandwidth demand of global memory,

we propose a cell block division method for the inter-task parallelization, where

the alignment matrix is divided into cell blocks of equal size.

(Figure: the alignment matrices between the query sequence and the subject sequences are divided into cell blocks of size n × n)
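A simplified, host-side sketch of the blocking idea (illustrative only: it blocks along the query dimension only and uses plain C++ containers with integer-encoded sequences, whereas the paper divides the matrix into n × n cell blocks inside the CUDA kernel): values inside a block stay in small fixed-size arrays, and only the block's boundary row is written to the buffers consumed by the next block, which is what reduces the demand on memory bandwidth.

#include <algorithm>
#include <vector>
using std::max;
using std::min;

constexpr int N = 8;   // assumed cell-block height along the query dimension

// Smith-Waterman score with affine gaps (gap-open rho, gap-extension sigma),
// computed one block of N query rows at a time. Only Hbot/Fbot (the boundary
// row below the previous block) is kept between blocks.
int swScoreBlocked(const std::vector<int> &q, const std::vector<int> &s,
                   const std::vector<std::vector<int>> &sub, int rho, int sigma)
{
    const int qlen = (int)q.size(), slen = (int)s.size();
    int best = 0;
    std::vector<int> Hbot(slen + 1, 0), Fbot(slen + 1, 0);

    for (int i0 = 0; i0 < qlen; i0 += N) {                    // one cell block of rows
        const int rows = min(N, qlen - i0);
        int h[N] = {0}, e[N] = {0};                           // H(i, j-1), E(i, j-1) inside the block
        std::vector<int> newHbot(slen + 1, 0), newFbot(slen + 1, 0);

        for (int j = 1; j <= slen; ++j) {
            int diag = Hbot[j - 1];                           // H(i0, j-1) from the block above
            int hUp  = Hbot[j];                               // H(i0, j)
            int F    = max(Fbot[j] - sigma, Hbot[j] - rho);   // F(i0+1, j)
            for (int r = 0; r < rows; ++r) {
                const int i = i0 + r + 1;
                const int oldH = h[r];                        // H(i, j-1)
                e[r] = max(e[r] - sigma, oldH - rho);         // E(i, j)
                int cell = max(0, diag + sub[q[i - 1]][s[j - 1]]);
                cell = max(cell, max(e[r], F));               // H(i, j)
                h[r] = cell; best = max(best, cell);
                diag = oldH; hUp = cell;
                if (r + 1 < rows) F = max(F - sigma, cell - rho);   // F(i+1, j)
            }
            newHbot[j] = hUp;                                 // H at the block's bottom row
            newFbot[j] = F;                                   // F at the block's bottom row
        }
        Hbot.swap(newHbot);
        Fbot.swap(newFbot);
    }
    return best;
}

With N = 1 this reduces to the usual unblocked column sweep; a larger N trades a few extra registers for fewer boundary reads and writes, which is the effect the slide describes.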

Page 18: Results and discussion

To remove the dependency on the query sequences and the databases used for the different tests, Cell Updates Per Second (CUPS) is a commonly used performance measure in bioinformatics. Given a query sequence Q and a database D, the GCUPS (billion cell updates per second) value is calculated by:

GCUPS = (|Q| × |D|) / (t × 10^9)

where |Q| is the total number of symbols in the query sequence, |D| is the total number of symbols in the database, and t is the runtime in seconds. In this paper, for the single-GPU version, the runtime t includes the transfer time of the query sequences from host to GPU, the calculation time of the SW algorithm, and the transfer-back time of the scores.
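For illustration only (hypothetical numbers, not results from the paper): a query of |Q| = 1,000 residues scanned against a database of |D| = 100,000,000 residues in t = 10 s gives GCUPS = (1,000 × 100,000,000) / (10 × 10^9) = 10.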

Page 19: Results and discussion (cont.)

(1) Database: Swiss-Prot release 56.6
(2) Number of query sequences: 25
(3) Query length: 144 ~ 5,478
(4) Single-GPU: NVIDIA GeForce GTX 280 (30 multiprocessors, 240 cores, 1 GB RAM)
(5) Multi-GPU: NVIDIA GeForce GTX 295 (60 multiprocessors, 480 cores, 1.8 GB RAM)

Page 20: Results and discussion (cont.)


Page 21: Results and discussion (cont.)

(1) Number of query sequences: 1
(2) Number of database sequences: 30
(3) Database length: 256 ~ 5,000
(4) Single-GPU: NVIDIA Tesla C1060 (30 multiprocessors, 240 cores, 4 GB RAM)
(5) CPU: Intel E6420 1.8 GHz

Execution time T (s):

Database length | CUDASW++ | CPU SW | Our implementation
128             | 0.007    | 0.02   | 0.002
256             | 0.017    | 0.082  | 0.008
512             | 0.058    | 0.277  | 0.029
1024            | 0.223    | 1.25   | 0.113
1500            | 0.476    | 2.81   | 0.328
2048            | 0.88     | 5.39   | 0.437
3500            | 1.45     | 10.86  | 0.83