Parallel Longest Common Subsequence using Graphics Hardware

Post on 14-Feb-2017


John Kloetzli, Brian Strege, Jonathan Decker, Dr. Marc Olano

Presented by: Brian Strege

Overview

• Introduction
  – Problem Statement
• Background and Related Work
  – The NVIDIA G80 Architecture
• Algorithm Description
• Results and Analysis
• Conclusion

Introduction

• Worked on GPU acceleration of Dynamic Programming
  – Specifically, problems in the Gaussian Elimination Paradigm (GEP)
  – More specifically, Longest Common Subsequence as a representative problem belonging to the GEP

Problem Statement

• Design and implement an algorithm for finding the LCS of two arbitrary length strings on a CPU + GPU machine
  – Must make efficient use of both CPU and GPU architectures
  – Must have theoretical justification of design


Related Work

• General-Purpose Computation on Graphics Hardware
  – NVIDIA CUDA
  – Owens et al. (2005)
• Linear Dynamic Programming
  – Hirschberg (1975)
  – Chowdhury et al. (2006)
• GPU Sequence Alignment
  – Liu et al. (2007)
  – Schatz et al. (2007)

The NVIDIA G80 Architecture

• 16 multiprocessors, 8 cores each: 128 logical processors
• 1.35 GHz
• 768 MB of RAM
• 86.4 GB/sec transfer rate (8.5 GB/sec Core 2 Duo)
• 520 GFLOPS (22 GFLOPS Core 2 Duo)

Source: NVIDIA CUDA Programming Guide, 1.0

The NVIDIA G80 Architecture

Program workflow:
• CPU (host) creates kernel program
• GPU maps kernel “blocks” to processors
• Processors map kernel “threads” to processor cores
• Cores execute in parallel

Source: NVIDIA CUDA Programming Guide, 1.0


Algorithm Description

• The SIMPLE-LCS recurrence
  – Requires quadratic space, which limits scalability
  – Faster than Chowdhury et al. linear space method
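For reference, the standard LCS-length recurrence that a table-based method such as SIMPLE-LCS fills in can be written as below. The symbols c, X, Y, i and j are notation assumed here; the slides do not spell the recurrence out.

```latex
c[i,j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0,\\
c[i-1,\,j-1] + 1 & \text{if } i, j > 0 \text{ and } X_i = Y_j,\\
\max\bigl(c[i-1,\,j],\; c[i,\,j-1]\bigr) & \text{otherwise.}
\end{cases}
```

Each entry depends only on its upper, left, and upper-left neighbours, which is what allows cells on the same anti-diagonal to be computed independently.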

SIMPLE-LCS Example

The table is filled one cell at a time, row by row. Completed table:

       A  B  A  B
    0  0  0  0  0
A   0  1  1  1  1
A   0  1  1  2  2
B   0  1  2  2  3
B   0  1  2  2  3
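A minimal Python sketch of the classic quadratic-space table fill, run on the strings from the example (ABAB across the top, AABB down the side). The function names `simple_lcs_table` and `traceback` are hypothetical and chosen here for illustration; this is not the authors' implementation.

```python
def simple_lcs_table(x, y):
    """Fill the (len(y)+1) x (len(x)+1) LCS-length table.

    x labels the columns and y labels the rows, as in the slide example.
    """
    rows, cols = len(y), len(x)
    c = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if y[i - 1] == x[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1      # extend the diagonal match
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def traceback(x, y, c):
    """Walk back from the bottom-right cell to recover one LCS."""
    i, j = len(y), len(x)
    out = []
    while i > 0 and j > 0:
        if y[i - 1] == x[j - 1]:
            out.append(x[j - 1])
            i, j = i - 1, j - 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


table = simple_lcs_table("ABAB", "AABB")
for row in table:
    print(row)
print(len(traceback("ABAB", "AABB", table)))
```

The table needs (n + 1)(m + 1) entries, which is the quadratic space cost that limits this method to small subproblems.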

Algorithm Description

• Chowdhury et al. perform CPU quadratic space algorithm on small subproblems
  – CH-LCS is their linear space algorithm
  – CUTOFF ranges from 2^8 to 2^10

Algorithm Description

• Our approach is to add another base case solved quickly on the GPU
  – GPU-LCS is our new algorithm (not recursive)
  – GPU-CUTOFF is 2^16
  – CUTOFF is 2^11

Algorithm Description

• CH: CPU Linear Space DP
• GPU: GPU DP
  – GPU level 1: GPU Quadratic Space DP (block level)
  – GPU level 2: GPU Linear Space DP (thread level)
• Simple: CPU Quadratic Space DP
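One way to read this three-tier structure is as a dispatch on subproblem size. The sketch below assumes cutoffs of 2^16 (GPU) and 2^11 (CPU), and it assumes that a subproblem exactly at a cutoff goes to the smaller-size solver; the slides do not state how the boundary cases are handled. `choose_solver` is a hypothetical helper, not part of the authors' code.

```python
GPU_CUTOFF = 2 ** 16   # assumed: largest side length handed to the GPU
CUTOFF = 2 ** 11       # assumed: at or below this, the CPU table method is used


def choose_solver(n):
    """Return the name of the solver a subproblem of side length n would use."""
    if n > GPU_CUTOFF:
        return "CH-LCS"       # CPU linear-space recursion splits the problem further
    if n > CUTOFF:
        return "GPU-LCS"      # non-recursive GPU base case
    return "SIMPLE-LCS"       # classic CPU quadratic-space table


for size in (2 ** 20, 2 ** 16, 2 ** 12, 2 ** 11, 2 ** 8):
    print(size, choose_solver(size))
```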

CH: CPU Linear Space DP

Two recursive functions used:
• Output boundary
• LCS reconstruction

CH: CPU Linear Space DP

Output boundary:
• Given input boundary, computes output boundary
• Expects subproblem size to be square, with power-of-two lengths

Pushing Example

Input boundary (top row and left column):

         A   B   A   B
    19  20  21  22  22
A   20
A   20
B   20
B   20

Boundary array: 20 20 20 20 19 20 21 22 22

Cells are filled quadrant by quadrant (top-left, top-right, bottom-left, bottom-right), and each new value overwrites one entry of the boundary array. Final state:

         A   B   A   B
    19  20  21  22  22
A   20  20  21  22  22
A   20  21  21  22  22
B   20  21  22  22  23
B   20  21  22  22  23

Boundary array: 20 21 22 22 23 23 22 22 22
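The example can be reproduced with a short Python sketch. It assumes the boundary array holds one slot per diagonal of the block: the left column from bottom to top, then the corner, then the top row. Under that reading, each cell overwrites the slot of its own diagonal, so no other storage is needed. The function name `push_boundary` and the slot indexing are illustrative assumptions, not the authors' code.

```python
def push_boundary(x, y, boundary):
    """Push an input boundary across a square DP block in linear space.

    x labels the columns, y labels the rows. Slot k = j - i + n holds the most
    recent value on diagonal j - i, so the input is the left column bottom-to-top,
    the corner, then the top row; the output is the bottom row, the corner, then
    the right column going upward.
    """
    n = len(x)
    if len(y) != n or n == 0 or n & (n - 1):
        raise ValueError("sketch expects equal power-of-two lengths")
    if len(boundary) != 2 * n + 1:
        raise ValueError("boundary must hold 2n + 1 values")
    slots = list(boundary)

    def solve(i0, j0, size):
        if size == 1:
            k = j0 - i0 + n
            if x[j0 - 1] == y[i0 - 1]:
                slots[k] += 1                      # slot k still holds c[i-1][j-1]
            else:
                slots[k] = max(slots[k - 1],       # c[i][j-1]
                               slots[k + 1])       # c[i-1][j]
            return
        h = size // 2
        solve(i0, j0, h)            # top-left quadrant
        solve(i0, j0 + h, h)        # top-right quadrant
        solve(i0 + h, j0, h)        # bottom-left quadrant
        solve(i0 + h, j0 + h, h)    # bottom-right quadrant

    solve(1, 1, n)
    return slots


print(push_boundary("ABAB", "AABB", [20, 20, 20, 20, 19, 20, 21, 22, 22]))
```

The in-place update is safe for any evaluation order that respects the DP dependencies, because the three neighbours a cell needs are always the latest values written on its own and the two adjacent diagonals.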


GPU Processing Overview

• Two levels of parallelism
  – Blocks are executed on a processor
  – Threads are executed on a processor core
  – Each thread is computed by exactly one processor core

GPU Level 1: Quadratic Space

• Computes the LCS length for sequences of length up to 2^16
• Divide DP matrix into “blocks,” each block is solved by one of the GPU processors
• We must enforce the correct order of block execution
  – Each diagonal can be computed in parallel
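The diagonal ordering can be sketched in Python as follows. Blocks whose row and column indices sum to the same value lie on one anti-diagonal and depend only on earlier diagonals, so each group could be issued together. `block_wavefronts` is a hypothetical helper; how the authors actually schedule blocks on the GPU is not shown in the slides.

```python
def block_wavefronts(rows, cols):
    """Group block coordinates (row, col) by anti-diagonal, in execution order."""
    waves = []
    for d in range(rows + cols - 1):
        lo = max(0, d - cols + 1)
        hi = min(rows, d + 1)
        waves.append([(i, d - i) for i in range(lo, hi)])
    return waves


# Small grid: three diagonals, the middle one holding two independent blocks.
print(block_wavefronts(2, 2))

# Assuming a 2^16 x 2^16 matrix cut into 512-wide blocks: a 128 x 128 grid.
waves = block_wavefronts(2 ** 16 // 512, 2 ** 16 // 512)
print(len(waves), max(len(w) for w in waves))
```

Under those assumed sizes there are 255 diagonals, and the widest one contains 128 independent blocks.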

GPU Level 1: Quadratic Space

• The basic quadratic space DP algorithm would require 16 GB of memory
  – We “fold” the memory to store only the input/output boundary for each block
  – Reduces the storage required to 64 MB
  – From n^2 to 2(n^2/m) where m = 512
  – Duplicate some values to avoid memory contention
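The two memory figures can be checked with a little arithmetic. The sketch below assumes n = 2^16 and 4 bytes per table entry; the entry size is not stated on the slide, but it is the value that makes the full table come out at 16 GB. With (n/m)^2 blocks each storing about 2m boundary values, the total is 2(n^2/m) entries. `dp_storage_bytes` is a hypothetical helper.

```python
def dp_storage_bytes(n, m, bytes_per_entry=4):
    """Return (full table bytes, folded boundary-only bytes) for an n x n DP."""
    full = n * n * bytes_per_entry
    folded = 2 * (n * n // m) * bytes_per_entry   # 2m values for each of (n/m)^2 blocks
    return full, folded


full, folded = dp_storage_bytes(2 ** 16, 512)
print(full // 2 ** 30, "GiB for the full table")
print(folded // 2 ** 20, "MiB for block boundaries only")
```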


GPU Level 2: Linear Space

• Within each block we also have more parallelism
  – Divide each block into “threads”
  – Each processor core computes one thread at a time
  – Hardware-level synchronization ensures the correct diagonal ordering
  – Each core reuses the same space (white) and computes the entire logical matrix (grey)

GPU Level 2: Linear Space

• Each thread is a 4x4 subproblem
  – The size was determined by experimentation
  – This memory is on chip, so we do not have to worry about memory conflicts
  – The linear space algorithm allows us to make each block as large as possible, which allows for very fast execution



Simple: CPU Quadratic Space DP

• Only gets called when a subproblem is too small for the GPU

• Implements SIMPLE-LCS, the “classic” matrix-based LCS algorithm


Results and Analysis

• GPU thread width of 4 proves optimal
• GPU block width of 512 is slightly faster
• CPU/GPU cutoff sizes determined experimentally
• Test DNA sequence data obtained from Mike Brudno
• Over five-fold performance improvement from results in Chowdhury et al. on all sequence comparisons

Species    Length (millions)
Human      1.80
Chimp      1.32
Baboon     1.51
Chicken    0.42
Fugu       0.27
Cow        1.46
Mouse      1.49
Rat        1.50
Cat        1.16
Dog        1.05

Conclusion

• We present a GPU-based Dynamic Programming algorithm to compute the LCS of very large sequences
• The GPU implementation achieves an over five-fold performance boost over a single-CPU implementation

Future Work

• We believe our algorithm can be accelerated further with careful optimization
  – Memory management on the GPU
  – Memory transfer between CPU and GPU
• Investigation of other computation models
  – Implementations using 8xCPU + 2xGPU?


Questions?

Special thanks to Rezaul Chowdhury for his support and Mike Brudno for the DNA sequence data

NVIDIA CUDA

• Compute Unified Device Architecture
• Available on G80 Series
• Architecture for utilizing the GPU as a data-parallel computing device
• Eliminates the need to map computation through graphics API
• User writes a C style function which is then run in parallel on the GPU

CH: CPU Linear Space DP

LCS reconstruction:
• Computes output boundaries in specific order
• Traces back through boundaries to generate LCS
• Linear space


CH: CPU Linear Space DP

LCS reconstruction omissions:

• Non-power-of-two sequence lengths

• Non-equal sequence lengths

Integration with Parallel CPUs

• Chowdhury et al. implemented a parallel version of their algorithm
  – No data available for LCS, but results from other algorithms show we should expect ~6 times speedup for LCS using 8 server processors
  – Disadvantages:
    • Number of processors which can be effectively used scales poorly with input size
    • Server CPUs cost between $500 and $1600 each, while the GPU we used cost $550