Parallel Longest Common Subsequence using Graphics Hardware

John Kloetzli, Brian Strege

Jonathan Decker, Dr. Marc Olano

Presented by: Brian Strege

Overview

• Introduction
  – Problem Statement
• Background and Related Work
  – The NVIDIA G80 Architecture
• Algorithm Description
• Results and Analysis
• Conclusion

Introduction

• Worked on GPU acceleration of Dynamic Programming
  – Specifically, problems in the Gaussian Elimination Paradigm (GEP)
  – More specifically, Longest Common Subsequence (LCS) as a representative problem belonging to the GEP

Problem Statement

• Design and implement an algorithm for finding the LCS of two arbitrary-length strings on a CPU + GPU machine
  – Must make efficient use of both CPU and GPU architectures
  – Must have theoretical justification of design

Related Work

• General-Purpose Computation on Graphics Hardware
  – NVIDIA CUDA
  – Owens et al. (2005)
• Linear Dynamic Programming
  – Hirschberg (1975)
  – Chowdhury et al. (2006)
• GPU Sequence Alignment
  – Liu et al. (2007)
  – Schatz et al. (2007)

The NVIDIA G80 Architecture

• 16 multiprocessors with 8 cores each, for 128 logical processors
• 1.35 GHz
• 768 MB of RAM
• 86.4 GB/sec transfer rate (8.5 GB/sec for a Core 2 Duo)
• 520 GFLOPS (22 GFLOPS for a Core 2 Duo)

Program workflow:
• CPU (host) creates kernel program
• GPU maps kernel “blocks” to processors
• Processors map kernel “threads” to processor cores
• Cores execute in parallel

Algorithm Description

• The SIMPLE-LCS recurrence
  – Requires quadratic space, which limits scalability
  – Faster than the linear space method of Chowdhury et al.
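The SIMPLE-LCS recurrence can be sketched in a few lines of Python. This is a minimal illustration of the classic quadratic space table fill, not the authors' implementation. The function name `simple_lcs_table` is hypothetical, and the row string "AABB" is an assumption: the slides only show the column string A B A B, and "AABB" was chosen because it yields the table values shown in the example.

```python
def simple_lcs_table(x, y):
    """Fill the full (len(x)+1) x (len(y)+1) table of LCS lengths.

    table[i][j] is the LCS length of x[:i] and y[:j]; row 0 and
    column 0 are the all-zero boundary.
    """
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Matching symbols extend the LCS of the two prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise carry over the better of "top" and "left".
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


# "AABB" is an assumed row string; "ABAB" is the column string from the slides.
for row in simple_lcs_table("AABB", "ABAB"):
    print(*row)
```

The table needs (n+1)(m+1) entries, which is the quadratic space cost that limits scalability for long sequences.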

SIMPLE-LCS Example

[Figure: animated fill of the DP table, one cell per step; final state shown, columns labeled A B A B]

      A  B  A  B
   0  0  0  0  0
   0  1  1  1  1
   0  1  1  2  2
   0  1  2  2  3
   0  1  2  2  3

Algorithm Description

• Chowdhury et al. perform a CPU quadratic space algorithm on small subproblems
  – CH-LCS is their linear space algorithm
  – CUTOFF ranges from 2^8 to 2^10
• Our approach is to add another base case solved quickly on the GPU
  – GPU-LCS is our new algorithm (not recursive)
  – GPU-CUTOFF is 2^16
  – CUTOFF is 2^11

• CH: CPU Linear Space DP
• GPU: GPU DP
  – GPU level 1: GPU Quadratic Space DP (block level)
  – GPU level 2: GPU Linear Space DP (thread level)
• Simple: CPU Quadratic Space DP
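One way to read the three-level hierarchy is as a dispatch on subproblem size. The sketch below is an assumption about how the cutoffs might be applied, not the authors' code. The cutoff values come from the slides, but the function name `choose_solver` and the choice of inclusive upper bounds are hypothetical.

```python
CUTOFF = 2 ** 11       # from the slides: below this, the CPU base case is used
GPU_CUTOFF = 2 ** 16   # from the slides: largest subproblem handed to the GPU


def choose_solver(n):
    """Pick a solver for a square subproblem of side n.

    Assumption: bounds are inclusive on the upper side, i.e. a problem of
    exactly GPU_CUTOFF goes to the GPU and one of exactly CUTOFF goes to
    the simple CPU algorithm.
    """
    if n > GPU_CUTOFF:
        return "CH"      # CPU linear space DP, recursively splits the problem
    if n > CUTOFF:
        return "GPU"     # GPU DP (block level, then thread level)
    return "SIMPLE"      # CPU quadratic space DP


for size in (2 ** 20, 2 ** 16, 2 ** 11):
    print(size, choose_solver(size))
```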

CH: CPU Linear Space DP

Two recursive functions used:
• Output boundary
• LCS reconstruction

Output boundary:
• Given input boundary, computes output boundary
• Expects subproblem size to be square, with power-of-two lengths

Pushing Example

[Figure: animated push of the input boundary through a 4x4 subproblem, one cell per step; final state shown, columns labeled A B A B]

Logical matrix (first row and first column are the input boundary):

  19 20 21 22 22
  20 20 21 22 22
  20 21 21 22 22
  20 21 22 22 23
  20 21 22 22 23

Linear space array, initial (input boundary): 20 20 20 20 19 20 21 22 22

Linear space array, final (output boundary): 20 21 22 22 23 23 22 22 22
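The pushing step can be sketched as an in-place update of a single array of length rows + cols + 1. This is a minimal illustration under stated assumptions, not the authors' code. The name `push_boundary` is hypothetical, cells are visited in row-major order (the slides appear to use a different visiting order, which gives the same final array), and the row string "AABB" is assumed because the slides only show the column string A B A B.

```python
def push_boundary(x, y, boundary):
    """Push an input boundary through the DP matrix of x (rows) vs y (cols).

    `boundary` lists the left column bottom-to-top, then the corner, then
    the top row left-to-right. The returned array lists the bottom row
    left-to-right, then the right column bottom-to-top.
    Slot k = j - i + len(x) always holds the most recently computed cell
    on diagonal j - i, so computing cell (i, j) overwrites its own
    diagonal neighbour (i-1, j-1).
    """
    m, n = len(x), len(y)
    b = list(boundary)
    if len(b) != m + n + 1:
        raise ValueError("boundary must have len(x) + len(y) + 1 entries")
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            k = j - i + m
            if x[i - 1] == y[j - 1]:
                b[k] = b[k] + 1                 # diagonal neighbour + 1
            else:
                b[k] = max(b[k - 1], b[k + 1])  # left and top neighbours
    return b


# Input boundary taken from the slides; "AABB" is an assumed row string.
print(push_boundary("AABB", "ABAB", [20, 20, 20, 20, 19, 20, 21, 22, 22]))
```

Only 2n + 1 values are live at any time for an n-by-n subproblem, which is what makes the method linear in space.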

GPU Processing Overview

• Two levels of parallelism
  – Blocks are executed on a processor
  – Threads are executed on a processor core
  – Each thread is computed by exactly one processor core

GPU Level 1: Quadratic Space

• Computes the length of the LCS for sequences of length up to 2^16
• Divide the DP matrix into “blocks”; each block is solved by one of the GPU processors
• We must enforce the correct order of block execution
  – Each diagonal can be computed in parallel
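The diagonal ordering can be illustrated with a short sketch that groups blocks into anti-diagonal "waves". Every block in a wave depends only on blocks from earlier waves, so the blocks within one wave can run in parallel. The function name `block_wavefronts` is hypothetical, and the 128-by-128 grid assumes the maximum length of 2^16 divided by a block width of 512.

```python
def block_wavefronts(block_rows, block_cols):
    """Group block coordinates (row, col) by anti-diagonal.

    Block (r, c) needs the output boundaries of (r-1, c) and (r, c-1),
    both of which lie on diagonal r + c - 1, so processing diagonals in
    increasing order respects every dependency.
    """
    waves = []
    for d in range(block_rows + block_cols - 1):
        first = max(0, d - block_cols + 1)
        last = min(block_rows - 1, d)
        waves.append([(r, d - r) for r in range(first, last + 1)])
    return waves


# Assumed grid: 2**16 / 512 = 128 blocks per side.
waves = block_wavefronts(128, 128)
print(len(waves), "waves, widest has", max(len(w) for w in waves), "blocks")
```

The first and last waves contain a single block each, so the available parallelism ramps up and then back down across the matrix.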

GPU Level 1: Quadratic Space

• The basic quadratic space DP algorithm would require 16 GB of memory
  – We “fold” the memory to store only the input/output boundary for each block
  – Reduces the storage required to 64 MB
  – From n^2 to 2(n^2/m), where m = 512
  – Duplicate some values to avoid memory contention
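The memory figures can be checked with a few lines of arithmetic. The values of n and m come from the slides. The 4 bytes per table entry is an assumption, chosen because it reproduces the quoted 16 GB and 64 MB totals.

```python
n = 2 ** 16          # maximum sequence length handled on the GPU (from the slides)
m = 512              # block width (from the slides)
BYTES_PER_CELL = 4   # assumption: one 32-bit integer per table entry

# Full quadratic table: n^2 entries.
full_bytes = n * n * BYTES_PER_CELL

# Folded storage: 2 * (n^2 / m) entries, i.e. only block boundaries.
folded_bytes = 2 * (n * n // m) * BYTES_PER_CELL

print(full_bytes // 2 ** 30, "GiB")    # 16 GiB
print(folded_bytes // 2 ** 20, "MiB")  # 64 MiB
```

Folding shrinks the storage by a factor of m/2 = 256, which is what lets the table fit in the 768 MB of memory on the card.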

GPU Level 2: Linear Space

• Within each block we also have more parallelism
  – Divide each block into “threads”
  – Each processor core computes one thread at a time
  – Hardware-level synchronization ensures the correct diagonal ordering
  – Each core reuses the same space (white) and computes the entire logical matrix (grey)

GPU Level 2: Linear Space

• Each thread is a 4x4 subproblem
  – The size was determined by experimentation
  – This memory is on chip, so we do not have to worry about memory conflicts
  – The linear space algorithm allows us to make each block as large as possible, which allows for very fast execution
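The tiling idea can be simulated sequentially: split the matrix into 4x4 tiles, visit the tiles one anti-diagonal at a time, and pass only tile boundaries between them. This is a rough CPU sketch under stated assumptions, not the GPU kernel. The names `solve_tile` and `lcs_length_tiled` are hypothetical, each tile uses a small local table rather than the linear space update, and all tile boundaries are kept in dictionaries for simplicity instead of reusing on-chip space.

```python
def solve_tile(xs, ys, top, left):
    """Solve one tile given its top row and left column (corner included)."""
    rows, cols = len(xs), len(ys)
    tab = [list(top)] + [[left[i]] + [0] * cols for i in range(1, rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if xs[i - 1] == ys[j - 1]:
                tab[i][j] = tab[i - 1][j - 1] + 1
            else:
                tab[i][j] = max(tab[i - 1][j], tab[i][j - 1])
    # Output boundary: bottom row and right column of the tile.
    return tab[rows], [row[cols] for row in tab]


def lcs_length_tiled(x, y, t=4):
    """LCS length using t-by-t tiles visited in anti-diagonal order."""
    if len(x) % t or len(y) % t:
        raise ValueError("sequence lengths must be multiples of the tile width")
    tile_rows, tile_cols = len(x) // t, len(y) // t
    if tile_rows == 0 or tile_cols == 0:
        return 0
    bottom, right = {}, {}
    for d in range(tile_rows + tile_cols - 1):
        # Tiles on one anti-diagonal are independent; a GPU would run them
        # concurrently and synchronize before the next diagonal.
        for r in range(max(0, d - tile_cols + 1), min(tile_rows, d + 1)):
            c = d - r
            top = bottom[(r - 1, c)] if r > 0 else [0] * (t + 1)
            left = right[(r, c - 1)] if c > 0 else [0] * (t + 1)
            bottom[(r, c)], right[(r, c)] = solve_tile(
                x[r * t:(r + 1) * t], y[c * t:(c + 1) * t], top, left
            )
    return bottom[(tile_rows - 1, tile_cols - 1)][t]


print(lcs_length_tiled("AABBAABB", "ABABABAB"))
```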

Simple: CPU Quadratic Space DP

• Only gets called when a subproblem is too small for the GPU

• Implements SIMPLE-LCS, the “classic” matrix-based LCS algorithm

Results and Analysis

GPU thread width of 4 proves optimal

GPU block width of 512 is slightly faster

CPU/GPU cutoff sizes determined experimentally

• Test DNA sequence data obtained from Mike Brudno
• Over five-fold performance improvement over the results in Chowdhury et al. on all sequence comparisons

Species   Length (millions)
Human     1.80
Chimp     1.32
Baboon    1.51
Chicken   0.42
Fugu      0.27
Cow       1.46
Mouse     1.49
Rat       1.50
Cat       1.16
Dog       1.05

Conclusion

• We present a GPU-based Dynamic Programming algorithm to compute the LCS of very large sequences

• The GPU implementation achieves an over five-fold performance boost over a single-CPU implementation

Future Work

• We believe our algorithm can be accelerated further with careful optimization
  – Memory management on the GPU
  – Memory transfer between CPU and GPU
• Investigation of other computation models
  – Implementations using 8xCPU + 2xGPU?

Questions?

Special thanks to Rezaul Chowdhury for his support and Mike Brudno for the DNA sequence data

NVIDIA CUDA

• Compute Unified Device Architecture
• Available on G80 Series
• Architecture for utilizing the GPU as a data-parallel computing device
• Eliminates the need to map computation through graphics API
• User writes a C style function which is then run in parallel on the GPU

LCS reconstruction
• Computes output boundaries in a specific order
• Traces back through boundaries to generate the LCS
• Linear space
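For intuition about linear space reconstruction, here is a sketch of Hirschberg's divide-and-conquer method (cited in the related work). It is a swapped-in classic technique, not the boundary-based CH-LCS reconstruction of Chowdhury et al., and the names `lcs_last_row` and `hirschberg_lcs` are hypothetical. The row string "AABB" in the usage line is an assumed input.

```python
def lcs_last_row(x, y):
    """Last row of the LCS-length table for x vs y, in O(len(y)) space."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def hirschberg_lcs(x, y):
    """Return one longest common subsequence of x and y."""
    if not x or not y:
        return ""
    if len(x) == 1:
        return x if x in y else ""
    mid = len(x) // 2
    # LCS lengths of the top half against every prefix of y ...
    fwd = lcs_last_row(x[:mid], y)
    # ... and of the bottom half against every suffix of y.
    bwd = lcs_last_row(x[mid:][::-1], y[::-1])
    n = len(y)
    # Split y where the two halves together give the longest result.
    split = max(range(n + 1), key=lambda j: fwd[j] + bwd[n - j])
    return hirschberg_lcs(x[:mid], y[:split]) + hirschberg_lcs(x[mid:], y[split:])


print(hirschberg_lcs("AABB", "ABAB"))
```

Only two table rows are live at any time, so the sequence itself is recovered without ever storing the full quadratic table.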

LCS reconstruction omissions:

• Non-power-of-two sequence lengths

• Non-equal sequence lengths

Integration with Parallel CPUs

• Chowdhury et al. implemented a parallel version of their algorithm
  – No data available for LCS, but results from other algorithms show we should expect ~6 times speedup for LCS using 8 server processors
  – Disadvantages:
    • Number of processors which can be effectively used scales poorly with input size
    • Server CPUs cost between $500 and $1600 each, while the GPU we used cost $550
