Streamlining GPU Applications On the Fly
Transcript of Streamlining GPU Applications On the Fly
[email protected]
Streamlining GPU Applications On the Fly
Thread Divergence Elimination through Runtime Thread-Data Remapping
Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Xipeng Shen
Department of Computer Science, College of William and Mary
GPU Divergence
• GPU Features
  – Streaming multiprocessors
    • SIMD
    • Single instruction issue per SM
  – Warp / Half Warp
    • SIMD execution unit
• Divergence
  – Threads in a warp take different execution paths
Example of GPU Divergence
[Figure: a warp executing a branch with control flow A --> {B, C}; threads that take different paths through instructions A, B, and C are serialized over time.]
Impact of GPU Divergence
• Degrading GPU throughput
  – E.g., up to 15/16 degradation on Tesla 1060
• Impairing GPU usage
  – Esp. when having non-trivial condition statements
Related Work
• Stream packing and unpacking
  – [Popa: U. Waterloo C.S. master thesis '04]
  – Simulates hardware packing on CPU
• Dynamic warp formation & scheduling
  – [Fung+: MICRO '07]
  – Hardware solution
• Control structure splitting
  – [Carrillo+: CF '09]
  – Reduces register pressure but removes no divergences
Basic Idea of Our Solution
Swapping Jobs of Threads through Thread-Data Remapping

    if ( A[tid] ) {      // "green" path
        C[tid] += 1;
    } else {             // "red" path
        C[tid] -= 1;
    }

[Figure: elements of A[ ] colored by branch outcome across warp 1, warp 2, and warp 3; remapping threads to data lets each warp see a single color.]
Challenges
• How to determine a desirable mapping?
  – Complexities: irregular accesses, complex indexing expressions, side effects on memory reference patterns, ...
• How to realize the new mapping?
  – Data movement, or redirecting threads' data references
  – Limitations, effectiveness, and safety
• How to do it on the fly?
  – Large overhead vs. need for runtime remapping
  – Dependence on runtime data values; minimizing and hiding overhead
Outline
• Thread-data Remapping
  – Concept & mechanisms
• Transformation on the Fly
  – CPU-GPU pipelining & LAM
• Evaluation
• Conclusion
GPU Divergence Causes
• Control Flows in Code
  – E.g., if, do, for, while, switch
• Input Data Dependence
  – Input data set --> execution path
  – Thread-data mapping --> amount of thread divergence
Define Divergence
• Control Flow Path Vector for One Thread
  – Def: Pvector[tid] = < b1, b2, b3, ..., bn >

Path Vector Example
    if ( A[tid] % 2 ) {...};  if ( A[tid] < 10 ) {...};

    tid | A[tid] | Pvector
     0  |   2    | <0,1>
     1  |  11    | <1,0>
     2  |  14    | <0,0>
    ... |  ...   |  ...
Regroup Threads
• To Satisfy Convergence Condition:
  – Sort Pvector[0], Pvector[1], Pvector[2], ... for all threads
  – E.g., after sorting, the grouping of threads:

    Thread Index: 0 12 8 11 | 9 4 6 7 | 2 5 10 3 | 1 13 14 15
    Path Vector:    <0,0>       <0,1>       <1,1>
    (each group of four threads forms one warp)
Example of GPU Thread Divergence

    Thread Index i: 1 2 3 4 | 5 6 7 8   (WARP 1 | WARP 2)
    Data Index j:   1 2 3 4   5 6 7 8
    Path vectors <0,0> and <1,0> are mixed within each warp.
    Mapping: Thread[i] --> Data[j], with i == j
Remapping by Reference Redirection

    Thread Index i: 1 2 3 4 | 5 6 7 8   (WARP 1 | WARP 2)
    Data Index j:   1 2 3 4   5 6 7 8
    After redirection, each warp sees a single path vector (<0,0> or <1,0>).
    Mapping: Thread[i] --> Data[j], with j == IND[i]
Remapping by Data Transformation

    Thread Index i: 1 2 3 4 | 5 6 7 8   (WARP 1 | WARP 2)
    Data Index j:   1 2 3 4   5 6 7 8   (elements swapped in place)
    After the swap, each warp sees a single path vector (<0,0> or <1,0>).
    Mapping: Thread[i] --> Data[j], with i == j
Outline
• Thread-data Remapping
  – Concept & mechanisms
• Transformation on the Fly
  – CPU-GPU pipelining & LAM
• Evaluation
• Conclusion
Overview of CPU-GPU Synergy
[Figure: the GPU collects branch info and sends it to the CPU; the CPU computes path vectors, the best mapping, and feedback control, then sends remapping info back; the GPU realizes the desired thread-data mapping. The two sides run independent, protected, and pipelined.]
CPU-GPU Pipeline Scheme
• Without pipelining scheme:
  – T_div / (T_no-div + T_remap) : may be >= 1 or < 1 (remapping overhead fully exposed)
• With pipelining scheme:
  – T_div / (T_no-div + T_remap) : >= 1 -- No slow-down! (remapping overlaps with kernel execution)
CPU-GPU Pipeline Example I
[Figure: timeline in which the GPU kernel function and a CPU remapping thread alternate; the remapping for the next invocation runs while the GPU executes the current kernel.]
CPU-GPU Pipeline Example II
[Figure: timeline with the GPU kernel function overlapped with multiple CPU pipeline threads running the remap function.]
• Controllable threading
• Adaptive to available resources
Applicable Scenarios
• Loops
  – Multiple invocations of the same kernel function
• Input Data Partition for a Kernel Function
  – Create multiple iterations
• Across Different Kernels
  – With idle CPU processing resources in the system
LAM: Reduce Data Movements
• For data layout transformation: the LAM scheme
  – 3 steps: Label, Assign & Move (LAM)
  – Label: classify path vectors into multiple classes, based on similarity
  – Assign: assign warps to different classes, based on occupation ratio
  – Move: determine the destination of each data element
  – The number of classes is tunable, increasing the number of no-need-moves
Outline
• Thread-data Remapping
  – Concept & mechanisms
• Transformation on the Fly
  – CPU-GPU pipelining & LAM
• Evaluation
• Conclusion
Experiment Settings
• Host Machine
  – Dual-socket quad-core Intel Xeon E5540
• GPU Device
  – NVIDIA Tesla 1060
• Runtime Library
  – Reference redirection & data transformation
  – Pipeline threads management
Benchmarks

    Program       | Comments             | Potential Div. Source | Percent of Div. Warps | Div. Reduction
    Blackscholes  | option pricing       | if statement          | 0%                    | 0%
    Reduction     | parallel sum         | if statement          | 100%                  | 50%
    MarchingCubes | graphics algorithm   | if statement          | 100%                  | 99%
    GAFORT        | genetic algorithm    | if statement & loop   | 100%                  | 44%
    3D-LBM        | LBM-based PDE solver | if statement          | 50-100%               | 50-100%
Evaluation: Data Transformation
• GAFORT
  – Mutation probabilities
  – Regular mem. access
  – select_cld kernel
  – Remap scheme: data layout transform.
  – Efficiency control: LAM & Pipeline

    Perform.   | Before | After
    Div. Ratio | 100%   | 56%
    Time       | 67225  | 51325

[Chart: speedups of the Baseline, No-LAM, +LAM, and +Pipeline configurations (y-axis 0 to 1.4): 44% of the divergence removed, 1.31x speedup.]
Evaluation: Reference Redirect.
• MarchingCubes
  – Div: number of vertices that intersect the isosurface
  – Random memory access
  – generateTriangles2 kernel
  – Remap scheme: reference redirection
  – Efficiency control: Pipeline

    Performance (time in micro-sec)
    BlockSize   | 32    | 64    | 256
    Org. Time   | 17414 | 16707 | 16673
    Opt. Time   | 12666 | 12371 | 12425
    Div. Reduct | 99%   | 99%   | 99%
    SpeedUp     | 1.37  | 1.35  | 1.34
Evaluation: All Benchmarks
[Chart: speedups (y-axis 0 to 1.6) of Org, Opt without efficiency control, and Opt with efficiency control for 3D-LBM, GAFORT, MARCH, REDUCT, and BLACK; with efficiency control there is no performance loss on any benchmark.]
Conclusion
• An Efficient Software Solution for GPU Divergence
  – Mechanism
    • On-the-fly thread-data remapping
  – Overhead control
    • CPU-GPU pipelining: whole-system synergy
    • LAM scheme: balance between benefit & overhead
  – Effectiveness
    • Up to 1.4x speedup
    • Efficiency protection: no slowdown
Acknowledgement
• Ye Zhao
• Xiaoming Li
• NVIDIA
  – Donation of GPU device
• NSF funds & IBM CAS Fellowship
• Anonymous reviewers