Download - Transforming Irregular Algorithms for Heterogeneous ...research.cs.vt.edu/seec/2015-08-workshop/Irregular App on GPU-SEE… · Transforming Irregular Algorithms for Heterogeneous

synergy.cs.vt.edu

Transforming Irregular Algorithms for Heterogeneous Computing - Case Studies in Bioinformatics

Jing ZhangAdvisor: Dr. Wu FengCollaborator: Hao Wang

synergy.cs.vt.edu

Irregular Algorithms• Characterized by …

– Operate on irregular data structures - trees, graphs, linked lists, priority queues

– Data are processed in variable-iteration loops– Both memory access and control flow are irregular and

unpredictable

• Emerging data intensive applications have increasing irregularity in memory access and control flow over massive data set

– Bioinformatics applications• Human genome: 3.2 billion letters (3.2 Gb)

– Social network analysis, graph algorithms• Twitter (June-Dec 2009) - 17 million users, 476 million tweets (6Gb) 2

synergy.cs.vt.edu

Heterogeneous Computing Systems

• Heterogeneous computing systems can offer massively parallel computation with high power efficiency and throughput

– GPU, AMD APU– Intel Xeon Phi– FPGA, DSP, …

• But, designed for regular programs, relying on data locality and regular computation to tolerate access latencies

– Thousands of smaller (weak), more efficient (simple) cores– Limited cache size, i.e., short cache line lifetimes

3

synergy.cs.vt.edu

Heterogeneous Computing Systems

• Heterogeneous computing systems can offer massively parallel computation with high power efficiency and throughput

– GPU, AMD APU– Intel Xeon Phi– FPGA, DSP, …

• But, designed for regular programs, relying on data locality and regular computation to tolerate access latencies

– Thousands of smaller (weak), more efficient (simple) cores, which are lack of branch prediction, out-of-order execution, etc.

– Limited cache size, i.e., short cache line lifetimes 4

It is very challenging to map irregular algorithms on heterogeneous

computing systems!

synergy.cs.vt.edu

Impacts of Irregularity on GPU Performance

• Irregular control flow– Branch divergence

• Occurs when threads inside warps branches to different execution paths

• Waste computational resource

5

Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt, Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow, micro’07

Branch

Path A

Path BPath B

Branch

Path A

synergy.cs.vt.edu

Impacts of Irregularity on GPU Perf (cont’d)

• Irregular memory access– Uncoalesced memory access

• Lower memory bus utilization

6

…

One segmentsA single 128-byte transactionCoalesced Memory Access

…

Cross N segmentsN 32-byte transaction

Uncoalesced Memory Access

128 256 128 256

Warp Warp

synergy.cs.vt.edu

Irregular Algorithms Optimizations on GPU

• Irregular control flow– Kernel fission– Sorting work– Warp-centric execution…...

• Irregular memory access– Memory access reordering – Exploiting memory hierarchy– Data structure adaptation…...

7

synergy.cs.vt.edu

Case Study: Mapping BLASTp on GPU*

• BLAST - Basic Local Alignment Search Tool– Find most similar sequences from database for a query

sequence– Use heuristic algorithms, which have significant irregularities in

both memory access and control flow

• P is for protein sequence; N is for nucleotide sequence

8*: J. Zhang, H. Wang, H. Lin, W. Feng, “cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on a GPU”, IPDPS 2014

synergy.cs.vt.edu

BLAST Algorithm

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0

1

2

3

5

6

7

Que

rySe

quen

ce

Subject Sequence (database)

C

B4

B

4 7

9

Four stages in BLAST1. Hit detection2. Ungapped extension3. Gapped extension4. Final extension

synergy.cs.vt.edu

BLAST Algorithm

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0

1

2

3

5

6

7

Que

rySe

quen

ce

Subject Sequence

C

B4

B

4 7

A

B

C

B

C

C

C

C

B

0

1

2

…

A

B

A

5

10


synergy.cs.vt.edu

BLAST Algorithm

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0

1

2

3

5

6

7

Que

rySe

quen

ce

Subject Sequence

C

B4

B

4 7

ABA none

ABB

ABC 0,5

none

…

…

Lookup Table

11


synergy.cs.vt.edu

BLAST Algorithm

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0,0

5,5

0

1

2

3

5

6

7

0,5

Que

rySe

quen

ce

Subject Sequence

query position, subject position

C

B4

B

4 7

ABA none

ABB

ABC 0,5

none

…

…

Lookup Table5,0

12


synergy.cs.vt.edu

BLAST Algorithm

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0,0

5,5

0

1

2

3

5

6

7

Subject Sequence

C

B4

B

4 7Q

uery

Sequ

ence

distance < threshold A


0,5

5,0

13

synergy.cs.vt.edu

BLAST Algorithm

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0,0

5,5

0

1

2

3

5

6

7

Subject Sequence

C

B4

B

4 7Q

uery

Sequ

ence


14

synergy.cs.vt.edu

BLAST Algorithm

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0,0

5,5

0

1

2

3

5

6

7

Subject Sequence

C

B4

B

4 7

X√

√√

Que

rySe

quen

ce


15

synergy.cs.vt.edu

BLAST Algorithm

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0,0

5,5

0

1

2

3

5

6

7

Subject Sequence

C

B4

B

4 7

X√

√√

Que

rySe

quen

ce


16

synergy.cs.vt.edu

BLAST Algorithm

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0,0

5,5

0

1

2

3

5

6

7

Subject Sequence

C

B4

B

4 7

X√

√√

Que

rySe

quen

ce


17Knowledge: “seed-and-extend” –the most widely used alignment paradigm

synergy.cs.vt.edu

State-of-the-Art GPU BLAST Approaches*• Use intuitive parallelization - “coarse-grained

parallelization”, where a thread compares the query sequence to a subject sequence

18

Thread0

Thread1

Thread2

Thread3

...

SubjectSeq0 SubjectSeq1 SubjectSeq2 SubjectSeq3 …...

Que

rySeq

uence

*: Design and Implementation of a CUDA-compatible GPU-based Core for Gapped BLAST Algorithm (ICCS ’10) GPU- BLAST: Using Graphics Processors to Accelerate Protein Sequence Alignment CUDA-BLASTP (Binformatics)CUDA-BLASTP: Accelerating BLASTP on CUDA-enabled Graphics Hardware (TCBB)Accelerating Protein Sequence Search in a Heterogeneous Computing System. (IPDPS ’11)

synergy.cs.vt.edu

Challenge #1: Control Flow Irregularity• With coarse-grained parallelism, different threads may

execute across different stages

… … … …subject_seq 0 subject_seq 1 subject_seq 2 subject_seq 3

Thread 0 Thread 1 Thread 2 Thread 3

Hit detection

Subject sequence

Hit found Hit found & distance < A No hit

Ungapped extension extension

No hit

19

Warp

…

…

Divergence

synergy.cs.vt.edu

Solution: Kernel Fission• Split hit detection and ungapped extension stage into

separate kernels• Use intermediate kernels to bridge the two stages

20

synergy.cs.vt.edu

Challenge #2: Control Flow Irregularity (2)

21

• Coarse-grained extension, where a thread is responsible for a diagonal, can result in divergence, since different diagonals could be extended to different lengths

Thread 1

Thread 0

Thread 2

One thread per diagonal

synergy.cs.vt.edu

Solution: Warp-centric Extension

22

• Map a warp of threads to one diagonal• Threads in a warp checks different positions

concurrently

Warp 1

Warp 0

Warp 2

Warp-based Extension

synergy.cs.vt.edu

Challenge #3: Memory Access Irregularity• Use a global array, called lastHit Array, to record the previous

hit in each diagonal

23

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0

1

2

3

5

6

7

Que

rySe

quen

ce

C

B4

B

4 7

n

n

n

n

n

n

n

nLa

stH

itA

rray

7

6

5

4

2

1

0

3

…

-5 n

Subject Sequence

synergy.cs.vt.edu

Challenge #2: Memory Access Irregularity

24

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0

1

2

3

5

6

7

Que

rySe

quen

ce

C

B4

B

4 7

n

n

n

n

n

n

n

n

0,07

6

5

4

2

1

0

3

Operation Position

Read 0

n

…

-5

Write 0

• Read LastHit Array to get last hit position in the diagonal• Write back current position to lastHit Array

R0 W

Subject Sequence

Thread 0

synergy.cs.vt.eduR


25

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0

1

2

3

5

6

7

Que

rySe

quen

ce

C

B4

B

4 7

n

n

n

n

n

n

0

n

0,07

6

5

4

2

1

0

3

Operation Position

Read 0

5,0

n

…

-5

Write 0


5 W

Write -5

Read -5

Subject Sequence

Thread 0

synergy.cs.vt.edu


26

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0

1

2

3

5

6

7

0,5

Que

rySe

quen

ce

C

B4

B

4 7

n

n

n

n

n

n

n

n

0,0

0

Operation Position

Read7

6

5

4

2

1

0

3

0

Write -5R

Read 5

Write 5

5 W

5,0

5

…

-5

Write 0

Read -5

Subject Sequence

Thread 0


synergy.cs.vt.edu

R


27

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0

1

2

3

5

6

7

0,5

Que

rySe

quen

ce

C

B4

B

4 7

n

n

n

n

n

n

n

n

0,0

5,5

0

5

Operation Position

Read7

6

5

4

2

1

0

3

0

Write -5

Read 5

Write 5

5 W

Read 0

Write 0

5,0

5

…

-5

Write 0

Read -5

Subject Sequence

Thread 0


synergy.cs.vt.edu

Challenge #2: Memory Access Irregularity• Irregular Read/Write Ops on LastHit array

28

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

C

0

1

2

3

5

6

7

0,5

Que

rySe

quen

ce

Subject SequenceC

B4

B

4 7

n

n

n

n

n

n

n

n

0,0

5,5

05

5

Operation Position

Read7

6

5

4

2

1

0

3

0

Write -5

Read 5

Write 5

Read 0

Write 0

Irregular Memory Access

5,0

5

…

-5

Write 0

Read -5

Thread 0

synergy.cs.vt.edu

7

6

5

4

2

1

0

3

Bins

-5

…

Solution: Memory Access Reordering

• Collect and group hits by diagonal numbers via binning

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

A

0,0

5,5

0

1

2

3

5

6

7

0,5

Que

rySe

quen

ce

A

B4

B

4 7

0,05,5

0,5

Out-of-order5,0

5,0

29

Threads

Binning

Subject Sequence

synergy.cs.vt.edu

7

6

5

4

2

1

0

3

Bins

-5

…


• Sort hits by positions in each bin

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

A

0,0

5,5

0

1

2

3

5

6

7

0,5

A

B4

B

4 7

0,05,5

0,5

5,0

5,0

Sort

7

6

5

4

2

1

0

3

Bins

-5

…5,50,0

0,5

5,0

30

Threads

Binning

synergy.cs.vt.edu

7

6

5

4

2

1

0

3

Bins

-5

…


• Filter out non-extenable hits by distance of adjacent hits

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

A

0,0

5,5

0

1

2

3

5

6

7

0,5

A

B4

B

4 7

0,05,5

0,5

5,0

5,0

Sort

7

6

5

4

2

1

0

3

Bins

-5

…5,50,0

0,5

5,0

√

X

X

Filter

7

6

5

4

2

1

0

3

Bins

-5

…5,5

31

Threads

Binning

synergy.cs.vt.edu

7

6

5

4

2

1

0

3

Bins

-5

…


• Transform an irregular algorithm (lastHit array method) into three regular phases

A B C B A B

0 1 2 3 5 6

A

B

C

C

A

B

A

0,0

5,5

0

1

2

3

5

6

7

0,5

A

B4

B

4 7

0,05,5

0,5

5,0

5,0

Sort

7

6

5

4

2

1

0

3

Bins

-5

…5,50,0

0,5

5,0

√

X

X

Filter

7

6

5

4

2

1

0

3

Bins

-5

…5,5

32

Threads

Regular read, irregular write

Binning

Regular memory access

synergy.cs.vt.edu

0%

20%

40%

60%

80%

100%

BranchDivergenceOverhead*

OccupancyAchieved GlobalLoadEfficiency

cuBLASTP- HitDetection cuBLASTP- HitSortingcuBLASTP- HitFiltering cuBLASTP- UngappedExtensionOriginal

Performance Numbers• Compared to original version, cuBLASTP has ...

– Less branch divergence overhead

33

*: Branch Divergence Overhead: the ratio of divergent execution to total execution

cuBLASTPcuBLASTP cuBLASTP

The

low

er th

e be

tter

The

high

er th

e be

tter

The

high

er th

e be

tter

synergy.cs.vt.edu

0%

20%

40%

60%

80%

100%





– Less branch divergence overhead– Better occupancy

34



The

low

er th

e be

tter

The

high

er th

e be

tter

The

high

er th

e be

tter

synergy.cs.vt.edu

0%

20%

40%

60%

80%

100%





– Less branch divergence overhead– Better occupancy– Better load efficiency

35



The

low

er th

e be

tter

The

high

er th

e be

tter

The

high

er th

e be

tter

synergy.cs.vt.edu

Speedup

• With swissprot database, cuBLASTP achieves …• Up to 2.6-fold speedup of hit detection and ungapped extension

over GPU-BLASTP, which is the fastest GPU-based BLAST• Up to 1.6-fold overall speedup over GPU-BLASTP (cuBLASTP has

parallelization of the rest of stages)

36

0

0.5

1

1.5

2

2.5

3

query127 query517 query1054SpeedupofHitDetection&

UngappedExtension

0

0.4

0.8

1.2

1.6

2

query127 query517 query1054OverallSpeedup

synergy.cs.vt.edu

Conclusion and Future Work

Conclusion• Mapping irregular algorithms to heterogeneous system is

non-trivial, need couples of transformation and adaption on algorithms and data structures

• But these concepts of transformations are propagable– Kernel fission, memory access reordering, warp-centric execution, …

Future Work• Generalize optimizations from cuBLASTP for different

algorithms and other heterogeneous systems• Design a library for irregular algorithms with optimized

irregular data structure and operations 37

synergy.cs.vt.edu

Other Research Topics

• Optimizing BWA (Burrows Wheeler Alignment) on multicore CPU and Intel Xeon Phi

• dbblast: database index based BLAST on multicore CPU and Intel Xeon Phi

• Dynamic parallelism on AMD APU (New)• Optimizing irregular applications for GPU/Xeon Phi

clusters (Future)

• Other Research Interests– Cloud computing– MapReduce with accelerators

38