synergy.cs.vt.edu
Transforming Irregular Algorithms for Heterogeneous Computing - Case Studies in Bioinformatics
Jing ZhangAdvisor: Dr. Wu FengCollaborator: Hao Wang
synergy.cs.vt.edu
Irregular Algorithms• Characterized by …
– Operate on irregular data structures - trees, graphs, linked lists, priority queues
– Data are processed in variable-iteration loops– Both memory access and control flow are irregular and
unpredictable
• Emerging data intensive applications have increasing irregularity in memory access and control flow over massive data set
– Bioinformatics applications• Human genome: 3.2 billion letters (3.2 Gb)
– Social network analysis, graph algorithms• Twitter (June-Dec 2009) - 17 million users, 476 million tweets (6Gb) 2
synergy.cs.vt.edu
Heterogeneous Computing Systems
• Heterogeneous computing systems can offer massively parallel computation with high power efficiency and throughput
– GPU, AMD APU– Intel Xeon Phi– FPGA, DSP, …
• But, designed for regular programs, relying on data locality and regular computation to tolerate access latencies
– Thousands of smaller (weak), more efficient (simple) cores– Limited cache size, i.e., short cache line lifetimes
3
synergy.cs.vt.edu
Heterogeneous Computing Systems
• Heterogeneous computing systems can offer massively parallel computation with high power efficiency and throughput
– GPU, AMD APU– Intel Xeon Phi– FPGA, DSP, …
• But, designed for regular programs, relying on data locality and regular computation to tolerate access latencies
– Thousands of smaller (weak), more efficient (simple) cores, which are lack of branch prediction, out-of-order execution, etc.
– Limited cache size, i.e., short cache line lifetimes 4
It is very challenging to map irregular algorithms on heterogeneous
computing systems!
synergy.cs.vt.edu
Impacts of Irregularity on GPU Performance
• Irregular control flow– Branch divergence
• Occurs when threads inside warps branches to different execution paths
• Waste computational resource
5
Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt, Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow, micro’07
Branch
Path A
Path BPath B
Branch
Path A
synergy.cs.vt.edu
Impacts of Irregularity on GPU Perf (cont’d)
• Irregular memory access– Uncoalesced memory access
• Lower memory bus utilization
6
…
One segmentsA single 128-byte transactionCoalesced Memory Access
…
Cross N segmentsN 32-byte transaction
Uncoalesced Memory Access
128 256 128 256
Warp Warp
synergy.cs.vt.edu
Irregular Algorithms Optimizations on GPU
• Irregular control flow– Kernel fission– Sorting work– Warp-centric execution…...
• Irregular memory access– Memory access reordering – Exploiting memory hierarchy– Data structure adaptation…...
7
synergy.cs.vt.edu
Case Study: Mapping BLASTp on GPU*
• BLAST - Basic Local Alignment Search Tool– Find most similar sequences from database for a query
sequence– Use heuristic algorithms, which have significant irregularities in
both memory access and control flow
• P is for protein sequence; N is for nucleotide sequence
8*: J. Zhang, H. Wang, H. Lin, W. Feng, “cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on a GPU”, IPDPS 2014
synergy.cs.vt.edu
BLAST Algorithm
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0
1
2
3
5
6
7
Que
rySe
quen
ce
Subject Sequence (database)
C
B4
B
4 7
9
Four stages in BLAST1. Hit detection2. Ungapped extension3. Gapped extension4. Final extension
synergy.cs.vt.edu
BLAST Algorithm
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0
1
2
3
5
6
7
Que
rySe
quen
ce
Subject Sequence
C
B4
B
4 7
A
B
C
B
C
C
C
C
B
0
1
2
…
A
B
A
5
10
Four stages in BLAST1. Hit detection2. Ungapped extension3. Gapped extension4. Final extension
synergy.cs.vt.edu
BLAST Algorithm
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0
1
2
3
5
6
7
Que
rySe
quen
ce
Subject Sequence
C
B4
B
4 7
ABA none
ABB
ABC 0,5
none
…
…
Lookup Table
11
Four stages in BLAST1. Hit detection2. Ungapped extension3. Gapped extension4. Final extension
synergy.cs.vt.edu
BLAST Algorithm
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0,0
5,5
0
1
2
3
5
6
7
0,5
Que
rySe
quen
ce
Subject Sequence
query position, subject position
C
B4
B
4 7
ABA none
ABB
ABC 0,5
none
…
…
Lookup Table5,0
12
Four stages in BLAST1. Hit detection2. Ungapped extension3. Gapped extension4. Final extension
synergy.cs.vt.edu
BLAST Algorithm
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0,0
5,5
0
1
2
3
5
6
7
Subject Sequence
C
B4
B
4 7Q
uery
Sequ
ence
distance < threshold A
Four stages in BLAST1. Hit detection2. Ungapped extension3. Gapped extension4. Final extension
0,5
5,0
13
synergy.cs.vt.edu
BLAST Algorithm
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0,0
5,5
0
1
2
3
5
6
7
Subject Sequence
C
B4
B
4 7Q
uery
Sequ
ence
Four stages in BLAST1. Hit detection2. Ungapped extension3. Gapped extension4. Final extension
14
synergy.cs.vt.edu
BLAST Algorithm
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0,0
5,5
0
1
2
3
5
6
7
Subject Sequence
C
B4
B
4 7
X√
√√
Que
rySe
quen
ce
Four stages in BLAST1. Hit detection2. Ungapped extension3. Gapped extension4. Final extension
15
synergy.cs.vt.edu
BLAST Algorithm
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0,0
5,5
0
1
2
3
5
6
7
Subject Sequence
C
B4
B
4 7
X√
√√
Que
rySe
quen
ce
Four stages in BLAST1. Hit detection2. Ungapped extension3. Gapped extension4. Final extension
16
synergy.cs.vt.edu
BLAST Algorithm
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0,0
5,5
0
1
2
3
5
6
7
Subject Sequence
C
B4
B
4 7
X√
√√
Que
rySe
quen
ce
Four stages in BLAST1. Hit detection2. Ungapped extension3. Gapped extension4. Final extension
17Knowledge: “seed-and-extend” –the most widely used alignment paradigm
synergy.cs.vt.edu
State-of-the-Art GPU BLAST Approaches*• Use intuitive parallelization - “coarse-grained
parallelization”, where a thread compares the query sequence to a subject sequence
18
Thread0
Thread1
Thread2
Thread3
...
SubjectSeq0 SubjectSeq1 SubjectSeq2 SubjectSeq3 …...
Que
rySeq
uence
*: Design and Implementation of a CUDA-compatible GPU-based Core for Gapped BLAST Algorithm (ICCS ’10) GPU- BLAST: Using Graphics Processors to Accelerate Protein Sequence Alignment CUDA-BLASTP (Binformatics)CUDA-BLASTP: Accelerating BLASTP on CUDA-enabled Graphics Hardware (TCBB)Accelerating Protein Sequence Search in a Heterogeneous Computing System. (IPDPS ’11)
synergy.cs.vt.edu
Challenge #1: Control Flow Irregularity• With coarse-grained parallelism, different threads may
execute across different stages
… … … …subject_seq 0 subject_seq 1 subject_seq 2 subject_seq 3
Thread 0 Thread 1 Thread 2 Thread 3
Hit detection
Subject sequence
Hit found Hit found & distance < A No hit
Ungapped extension extension
No hit
19
Warp
…
…
Divergence
synergy.cs.vt.edu
Solution: Kernel Fission• Split hit detection and ungapped extension stage into
separate kernels• Use intermediate kernels to bridge the two stages
20
synergy.cs.vt.edu
Challenge #2: Control Flow Irregularity (2)
21
• Coarse-grained extension, where a thread is responsible for a diagonal, can result in divergence, since different diagonals could be extended to different lengths
Thread 1
Thread 0
Thread 2
One thread per diagonal
synergy.cs.vt.edu
Solution: Warp-centric Extension
22
• Map a warp of threads to one diagonal• Threads in a warp checks different positions
concurrently
Warp 1
Warp 0
Warp 2
Warp-based Extension
synergy.cs.vt.edu
Challenge #3: Memory Access Irregularity• Use a global array, called lastHit Array, to record the previous
hit in each diagonal
23
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0
1
2
3
5
6
7
Que
rySe
quen
ce
C
B4
B
4 7
n
n
n
n
n
n
n
nLa
stH
itA
rray
7
6
5
4
2
1
0
3
…
-5 n
Subject Sequence
synergy.cs.vt.edu
Challenge #2: Memory Access Irregularity
24
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0
1
2
3
5
6
7
Que
rySe
quen
ce
C
B4
B
4 7
n
n
n
n
n
n
n
n
0,07
6
5
4
2
1
0
3
Operation Position
Read 0
n
…
-5
Write 0
• Read LastHit Array to get last hit position in the diagonal• Write back current position to lastHit Array
R0 W
Subject Sequence
Thread 0
synergy.cs.vt.eduR
Challenge #2: Memory Access Irregularity
25
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0
1
2
3
5
6
7
Que
rySe
quen
ce
C
B4
B
4 7
n
n
n
n
n
n
0
n
0,07
6
5
4
2
1
0
3
Operation Position
Read 0
5,0
n
…
-5
Write 0
• Read LastHit Array to get last hit position in the diagonal• Write back current position to lastHit Array
5 W
Write -5
Read -5
Subject Sequence
Thread 0
synergy.cs.vt.edu
Challenge #2: Memory Access Irregularity
26
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0
1
2
3
5
6
7
0,5
Que
rySe
quen
ce
C
B4
B
4 7
n
n
n
n
n
n
n
n
0,0
0
Operation Position
Read7
6
5
4
2
1
0
3
0
Write -5R
Read 5
Write 5
5 W
5,0
5
…
-5
Write 0
Read -5
Subject Sequence
Thread 0
• Read LastHit Array to get last hit position in the diagonal• Write back current position to lastHit Array
synergy.cs.vt.edu
R
Challenge #2: Memory Access Irregularity
27
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0
1
2
3
5
6
7
0,5
Que
rySe
quen
ce
C
B4
B
4 7
n
n
n
n
n
n
n
n
0,0
5,5
0
5
Operation Position
Read7
6
5
4
2
1
0
3
0
Write -5
Read 5
Write 5
5 W
Read 0
Write 0
5,0
5
…
-5
Write 0
Read -5
Subject Sequence
Thread 0
• Read LastHit Array to get last hit position in the diagonal• Write back current position to lastHit Array
synergy.cs.vt.edu
Challenge #2: Memory Access Irregularity• Irregular Read/Write Ops on LastHit array
28
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
C
0
1
2
3
5
6
7
0,5
Que
rySe
quen
ce
Subject SequenceC
B4
B
4 7
n
n
n
n
n
n
n
n
0,0
5,5
05
5
Operation Position
Read7
6
5
4
2
1
0
3
0
Write -5
Read 5
Write 5
Read 0
Write 0
Irregular Memory Access
5,0
5
…
-5
Write 0
Read -5
Thread 0
synergy.cs.vt.edu
7
6
5
4
2
1
0
3
Bins
-5
…
Solution: Memory Access Reordering
• Collect and group hits by diagonal numbers via binning
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
A
0,0
5,5
0
1
2
3
5
6
7
0,5
Que
rySe
quen
ce
A
B4
B
4 7
0,05,5
0,5
Out-of-order5,0
5,0
29
Threads
Binning
Subject Sequence
synergy.cs.vt.edu
7
6
5
4
2
1
0
3
Bins
-5
…
Solution: Memory Access Reordering
• Sort hits by positions in each bin
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
A
0,0
5,5
0
1
2
3
5
6
7
0,5
A
B4
B
4 7
0,05,5
0,5
5,0
5,0
Sort
7
6
5
4
2
1
0
3
Bins
-5
…5,50,0
0,5
5,0
30
Threads
Binning
synergy.cs.vt.edu
7
6
5
4
2
1
0
3
Bins
-5
…
Solution: Memory Access Reordering
• Filter out non-extenable hits by distance of adjacent hits
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
A
0,0
5,5
0
1
2
3
5
6
7
0,5
A
B4
B
4 7
0,05,5
0,5
5,0
5,0
Sort
7
6
5
4
2
1
0
3
Bins
-5
…5,50,0
0,5
5,0
√
X
X
Filter
7
6
5
4
2
1
0
3
Bins
-5
…5,5
31
Threads
Binning
synergy.cs.vt.edu
7
6
5
4
2
1
0
3
Bins
-5
…
Solution: Memory Access Reordering
• Transform an irregular algorithm (lastHit array method) into three regular phases
A B C B A B
0 1 2 3 5 6
A
B
C
C
A
B
A
0,0
5,5
0
1
2
3
5
6
7
0,5
A
B4
B
4 7
0,05,5
0,5
5,0
5,0
Sort
7
6
5
4
2
1
0
3
Bins
-5
…5,50,0
0,5
5,0
√
X
X
Filter
7
6
5
4
2
1
0
3
Bins
-5
…5,5
32
Threads
Regular read, irregular write
Binning
Regular memory access
synergy.cs.vt.edu
0%
20%
40%
60%
80%
100%
BranchDivergenceOverhead*
OccupancyAchieved GlobalLoadEfficiency
cuBLASTP- HitDetection cuBLASTP- HitSortingcuBLASTP- HitFiltering cuBLASTP- UngappedExtensionOriginal
Performance Numbers• Compared to original version, cuBLASTP has ...
– Less branch divergence overhead
33
*: Branch Divergence Overhead: the ratio of divergent execution to total execution
cuBLASTPcuBLASTP cuBLASTP
The
low
er th
e be
tter
The
high
er th
e be
tter
The
high
er th
e be
tter
synergy.cs.vt.edu
0%
20%
40%
60%
80%
100%
BranchDivergenceOverhead*
OccupancyAchieved GlobalLoadEfficiency
cuBLASTP- HitDetection cuBLASTP- HitSortingcuBLASTP- HitFiltering cuBLASTP- UngappedExtensionOriginal
Performance Numbers• Compared to original version, cuBLASTP has ...
– Less branch divergence overhead– Better occupancy
34
*: Branch Divergence Overhead: the ratio of divergent execution to total execution
cuBLASTPcuBLASTP cuBLASTP
The
low
er th
e be
tter
The
high
er th
e be
tter
The
high
er th
e be
tter
synergy.cs.vt.edu
0%
20%
40%
60%
80%
100%
BranchDivergenceOverhead*
OccupancyAchieved GlobalLoadEfficiency
cuBLASTP- HitDetection cuBLASTP- HitSortingcuBLASTP- HitFiltering cuBLASTP- UngappedExtensionOriginal
Performance Numbers• Compared to original version, cuBLASTP has ...
– Less branch divergence overhead– Better occupancy– Better load efficiency
35
*: Branch Divergence Overhead: the ratio of divergent execution to total execution
cuBLASTPcuBLASTP cuBLASTP
The
low
er th
e be
tter
The
high
er th
e be
tter
The
high
er th
e be
tter
synergy.cs.vt.edu
Speedup
• With swissprot database, cuBLASTP achieves …• Up to 2.6-fold speedup of hit detection and ungapped extension
over GPU-BLASTP, which is the fastest GPU-based BLAST• Up to 1.6-fold overall speedup over GPU-BLASTP (cuBLASTP has
parallelization of the rest of stages)
36
0
0.5
1
1.5
2
2.5
3
query127 query517 query1054SpeedupofHitDetection&
UngappedExtension
0
0.4
0.8
1.2
1.6
2
query127 query517 query1054OverallSpeedup
synergy.cs.vt.edu
Conclusion and Future Work
Conclusion• Mapping irregular algorithms to heterogeneous system is
non-trivial, need couples of transformation and adaption on algorithms and data structures
• But these concepts of transformations are propagable– Kernel fission, memory access reordering, warp-centric execution, …
Future Work• Generalize optimizations from cuBLASTP for different
algorithms and other heterogeneous systems• Design a library for irregular algorithms with optimized
irregular data structure and operations 37
synergy.cs.vt.edu
Other Research Topics
• Optimizing BWA (Burrows Wheeler Alignment) on multicore CPU and Intel Xeon Phi
• dbblast: database index based BLAST on multicore CPU and Intel Xeon Phi
• Dynamic parallelism on AMD APU (New)• Optimizing irregular applications for GPU/Xeon Phi
clusters (Future)
• Other Research Interests– Cloud computing– MapReduce with accelerators
38
Top Related