(1)
Introduction to Control Divergence
Lecture Slides and Figures contributed from sources as noted
(2)
Objective
• Understand the occurrence of control divergence and the concept of thread reconvergence (also described as branch divergence and thread divergence)
• Cover a basic thread reconvergence mechanism, PDOM, and set up the discussion of further optimizations and advanced techniques
• Explore one approach for mitigating the performance degradation due to control divergence: dynamic warp formation
(3)
Reading
• W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
• W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
• M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
• Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(4)
Handling Branches
• CUDA code:
if (…) … (true for some threads)
else … (true for others)
• What if threads take different branches?
• Branch divergence!
[Figure: four threads (T T T T) splitting between the taken and not-taken paths]
(5)
Branch Divergence
• Occurs within a warp
• Branches lead to serialization of branch-dependent code
• Performance issue: low warp utilization (see the sketch below)
if (…) {…}
else {…}
[Figure: threads idle in each branch path (idle threads), then reconverge after the else block: reconvergence!]
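As a concrete illustration, here is a minimal CUDA sketch of the situation on this slide; the kernel, its names, and the even/odd predicate are illustrative assumptions, not code from the lecture:

```cuda
#include <cstdio>

// Illustrative kernel: threads in the same warp take different paths,
// so the hardware serializes the if- and else-sides while masking off
// the inactive lanes (idle threads), then reconverges after the branch.
__global__ void divergent(int *out) {
    int tid = threadIdx.x;
    if (tid % 2 == 0) {   // true for even lanes: first serialized pass
        out[tid] = tid * 2;
    } else {              // true for odd lanes: second serialized pass
        out[tid] = tid + 1;
    }
    // Reconvergence point: all 32 lanes of the warp are active again.
}

int main() {
    int *d_out;
    cudaMalloc(&d_out, 32 * sizeof(int));
    divergent<<<1, 32>>>(d_out);   // one warp
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```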
(6)
Branch Divergence
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
[Figure: a thread warp (Thread1–Thread4) sharing a common PC, and a control flow graph with basic blocks A–G]
• Different threads follow different control flow paths through the kernel code
• Thread execution is (partially) serialized: subsets of threads that follow the same path execute in parallel
(7)
Basic Idea
• Split: partition a warp into two mutually exclusive thread subsets, each branching to a different target; identify the subsets with two activity masks, effectively forming two warps
• Join: merge the two subsets of a previously split warp, reconverging the mutually exclusive sets of threads
• Orchestrate correct execution for nested branches
• Note the long history of such techniques in SIMD processors (see the background in Fung et al.)
[Figure: a thread warp (T1–T4) with a common PC and an activity mask of 0011]
(8)
Thread Reconvergence
• Fundamental problem: merge threads with the same PC. How do we sequence the execution of threads? This can affect the ability to reconverge
• Question: when can threads productively reconverge?
• Question: when is the best time to reconverge?
[Figure: control flow graph with basic blocks A–G]
(9)
Dominator
• Node d dominates node n if every path from the entry node to n must go through d
[Figure: control flow graph with basic blocks A–G]
(10)
Immediate Dominator
• Node d immediately dominates node n if every path from the entry node to n must go through d, and no other node that dominates n lies between d and n
[Figure: control flow graph with basic blocks A–G]
(11)
Post Dominator
• Node d post dominates node n if every path from the node n to the exit node must go through d
[Figure: control flow graph with basic blocks A–G]
(12)
Immediate Post Dominator
• Node d immediately post-dominates node n if every path from node n to the exit node must go through d, and no other node that post-dominates n lies between d and n (a worked computation follows the figure)
[Figure: control flow graph with basic blocks A–G]
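The four definitions above can be made concrete with a small worked example. The following host-side C++ sketch (buildable as a .cu file) computes post-dominator sets by a standard fixed-point iteration; the CFG edges are an assumption shaped like the slides' A–G figure (A→B,F; B→C,D; C,D→E; E,F→G), and the algorithm is the textbook dataflow formulation, not code from the lecture:

```cuda
#include <cstdio>
#include <cstdint>
#include <vector>

int main() {
    const int N = 7;                      // nodes A..G = 0..6
    const char *name = "ABCDEFG";
    std::vector<std::vector<int>> succ(N);
    succ[0] = {1, 5};                     // A -> B, F   (assumed edges)
    succ[1] = {2, 3};                     // B -> C, D
    succ[2] = {4};                        // C -> E
    succ[3] = {4};                        // D -> E
    succ[4] = {6};                        // E -> G
    succ[5] = {6};                        // F -> G
    const int exitNode = 6;               // G

    // pdom[n]: bitmask of nodes that post-dominate n.
    std::vector<uint32_t> pdom(N, (1u << N) - 1);
    pdom[exitNode] = 1u << exitNode;

    // Fixed point: pdom(n) = {n} | AND over successors s of pdom(s).
    bool changed = true;
    while (changed) {
        changed = false;
        for (int n = 0; n < N; ++n) {
            if (n == exitNode) continue;
            uint32_t meet = (1u << N) - 1;
            for (int s : succ[n]) meet &= pdom[s];
            uint32_t next = meet | (1u << n);
            if (next != pdom[n]) { pdom[n] = next; changed = true; }
        }
    }

    for (int n = 0; n < N; ++n) {
        printf("pdom(%c): ", name[n]);
        for (int d = 0; d < N; ++d)
            if (pdom[n] & (1u << d)) printf("%c ", name[d]);
        printf("\n");   // e.g. pdom(B) = {B, E, G}, so IPDOM(B) = E
    }
    return 0;
}
```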
(13)
Baseline: PDOM
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
[Figure: PDOM reconvergence-stack walkthrough. CFG blocks carry active masks A/1111, B/1111, C/1001, D/0110, E/1111, G/1111; one warp of four threads executes A, B, C, D, E, G over time. Stack snapshots show entries of (Reconv. PC, Next PC, Active Mask): after the branch in B, the TOS holds (E, C, 1001) above (E, D, 0110) and (-, E, 1111); popping at each reconvergence leaves (-, E, 1111) and finally (-, G, 1111).]
(14)
Stack Entry
[Figure: same PDOM example; each stack entry holds (Reconv. PC, Next PC, Active Mask)]
• A stack entry is a specification of a group of active threads that will execute that basic block
• The natural nested structure of control flow exposes the use of stack-based serialization
(15)
More Complex Example
From W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
• Stack-based implementation for nested control flow: the stack entry's RPC is set to the IPDOM
• Reconvergence at the immediate post-dominator of the branch
(16)
Implementation
[Figure: SIMT core pipeline — I-Fetch, I-Buffer, Decode, Issue, pending warps, register file (PRF/RF), scalar pipelines, D-Cache (all hit?), Writeback — alongside the per-warp reconvergence stack of (Reconv. PC, Next PC, Active Mask) entries. From GPGPU-Sim documentation: http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores]
• GPGPU-Sim model: implement the per-warp stack at the issue stage
– Acquire the active mask and PC from the TOS
– Scoreboard check prior to issue
– Register writeback updates the scoreboard and the ready bit in the instruction buffer
– When RPC = Next PC, pop the stack
• Implications for instruction fetch?
(17)
Implementation (2)
• warpPC (the next instruction) is compared to the reconvergence PC
• On a branch:
– The reconvergence PC can be stored as part of the branch instruction
– The branch unit has the NextPC, TargetPC, and reconvergence PC to update the stack
• On reaching a reconvergence point:
– Pop the stack
– Continue fetching from the NextPC of the next entry on the stack
(A sketch of these stack operations follows.)
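A minimal sketch of the per-warp stack operations described above, assuming the (Reconv. PC, Next PC, Active Mask) entry layout from the earlier slides; this is an illustrative reading of the PDOM mechanism, not GPGPU-Sim source:

```cuda
#include <cstdint>
#include <vector>

struct StackEntry {
    uint32_t reconvPC;   // RPC: where this entry's threads rejoin (IPDOM)
    uint32_t nextPC;     // PC to fetch for this entry's threads
    uint32_t activeMask; // which lanes execute
};

struct SimtStack {
    std::vector<StackEntry> s;

    // On a divergent branch: rewrite the TOS to resume at the IPDOM,
    // then push one entry per path (the taken path executes first).
    void diverge(uint32_t ipdomPC, uint32_t takenPC, uint32_t takenMask,
                 uint32_t notTakenPC, uint32_t notTakenMask) {
        s.back().nextPC = ipdomPC;               // resume here afterwards
        s.push_back({ipdomPC, notTakenPC, notTakenMask});
        s.push_back({ipdomPC, takenPC, takenMask});
    }

    // At issue: when the warp's next PC reaches the RPC, pop (reconverge).
    void advance(uint32_t warpNextPC) {
        s.back().nextPC = warpNextPC;
        if (s.size() > 1 && warpNextPC == s.back().reconvPC) s.pop_back();
    }
};

int main() {
    SimtStack st{{{0, 0xA0, 0xF}}};          // one warp, 4 lanes, at block A
    st.diverge(0xE0, 0xC0, 0x9, 0xD0, 0x6);  // branch in B: C/1001, D/0110
    st.advance(0xE0);                        // C side reaches E: pop
    st.advance(0xE0);                        // D side reaches E: reconverged
    return 0;
}
```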
(18)
Can We Do Better?
• Warps are formed statically
• Key idea of dynamic warp formation: show a pool of warps and how they can be merged
• At a high level, what are the requirements?
[Figure: control flow graph with basic blocks A–G]
(19)
Compaction Techniques
• Can we reform warps so as to increase utilization?
• Basic idea: compaction. Reform warps with threads that follow the same control flow path, increasing the utilization of warps
• Two basic types of compaction techniques:
– Inter-warp compaction: group threads from different warps
– Intra-warp compaction: group threads within a warp, changing the effective warp size
(20)
Inter-Warp Thread Compaction Techniques
Lecture Slides and Figures contributed from sources as noted
(21)
Goal
[Figure: six warps (Warp 0–Warp 5) split between the taken and not-taken paths of an if(…) {…} else {…}; can the partially full warps on each path be merged?]
(22)
Reading
• W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
• W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
(23)
DWF: Example
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
[Figure: baseline vs. dynamic warp formation. Warps x and y execute blocks A–G with per-warp active masks (A x/1111 y/1111; B x/1110 y/0011; C x/1000 y/0010; D x/0110 y/0001; E x/1110 y/0011; F x/0001 y/1100; G x/1111 y/1111). The baseline serializes each warp's path over time; DWF creates a new warp from scalar threads of both warp x and warp y executing at basic block D, shortening the timeline.]
(24)
How Does This Work?
• Criteria for merging:
– Same PC
– Complementary sets of active threads in each warp (see the sketch below)
– Recall: many warps per TB, all executing the same code
• What information do we need to merge two warps? Thread IDs and PCs
• Ideally, how would you find and merge warps?
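A small sketch of the merge test these bullets imply: two warps are merge candidates when they sit at the same PC and their active masks do not collide in any lane. The types and names are illustrative, not from the paper:

```cuda
#include <cstdint>

struct WarpState { uint32_t pc; uint32_t activeMask; };

// Mergeable: same PC, and no lane holds an active thread in both warps.
bool canMerge(const WarpState &a, const WarpState &b) {
    return a.pc == b.pc && (a.activeMask & b.activeMask) == 0;
}

// The merged warp simply ORs the masks; thread IDs travel with their lanes.
WarpState merge(const WarpState &a, const WarpState &b) {
    return {a.pc, a.activeMask | b.activeMask};
}

int main() {
    WarpState x{0xD0, 0b0110};   // warp x's threads at block D
    WarpState y{0xD0, 0b0001};   // warp y's threads at block D
    if (canMerge(x, y)) {
        WarpState z = merge(x, y);   // z covers lanes 0111 at PC D
        (void)z;
    }
    return 0;
}
```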
(25)
DWF: Microarchitecture Implementation
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
[Figure: DWF hardware — I-Cache, Decode, Commit/Writeback; per-lane ALUs and register files RF1–RF4 addressed by (TID, Reg#); a thread scheduler containing the PC-Warp LUT (OCC, PC, IDX fields: the occupancy vector identifies occupied lanes and IDX points to the warp being formed), the warp pool (TID × N, PC, Prio entries), warp update registers T and NT (TID × N, PC, REQ, and H bits that assist in aggregating threads and identifying available lanes), a warp allocator, and issue logic.]
• Warps are formed dynamically in the warp pool
• After commit, check the PC-Warp LUT and either merge into or allocate a newly forming warp
(26)
DWF: Microarchitecture Implementation
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
[Figure: DWF microarchitecture walkthrough at the branch A: BEQ R2, B (fall-through to C). Warps X (threads 1 2 3 4) and Y (threads 5 6 7 8) both execute A; as each commits, the warp update registers and PC-Warp LUT aggregate their taken threads into a newly forming warp at B and their fall-through threads at C. The result includes the dynamic warp Z = (5 2 3 8) at B with no lane conflict.]
(27)
Resource Usage
• Ideally we would like a small number of unique PCs in progress at a time, to minimize overhead
• Warp divergence will increase the number of unique PCs; mitigate via warp scheduling
• Scheduling policies:
– FIFO
– Program counter: address variation as a measure of divergence
– Majority/Minority: serve the most common PC vs. helping stragglers (majority selection is sketched below)
– Post-dominator (catch up)
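As a sketch of the Majority policy named in the list above (the data structures and names are illustrative assumptions, not the paper's implementation): pick the PC currently backed by the most pending threads and issue warps at that PC first.

```cuda
#include <cstdint>
#include <map>
#include <vector>

struct PendingWarp { uint32_t pc; uint32_t activeMask; };

// Returns the PC shared by the largest number of pending threads.
// Assumes a non-empty warp pool.
uint32_t majorityPC(const std::vector<PendingWarp> &pool) {
    std::map<uint32_t, int> votes;              // PC -> pending thread count
    for (const auto &w : pool)
        votes[w.pc] += __builtin_popcount(w.activeMask);
    uint32_t best = pool.front().pc;
    int bestVotes = -1;
    for (const auto &kv : votes)
        if (kv.second > bestVotes) { best = kv.first; bestVotes = kv.second; }
    return best;                                // schedule warps at this PC first
}

int main() {
    std::vector<PendingWarp> pool = {{0xC0, 0xF}, {0xD0, 0x3}, {0xC0, 0x7}};
    return majorityPC(pool) == 0xC0 ? 0 : 1;    // C has 7 threads vs. D's 2
}
```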
(28)
Hardware Consequences
• Expose the implications that warps have in the base design: register file access implications motivate lane-aware DWF
• Register bank conflicts
From W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
(29)
Relaxing Implications of Warps
• Thread swizzling: essentially remap work to threads so as to create more opportunities for DWF; requires a deep understanding of algorithm behavior and data sets
• Lane swizzling in hardware: provide limited connectivity between register banks and lanes, avoiding full crossbars
(30)
Summary
• Control flow divergence is a fundamental performance limiter for SIMT execution
• Dynamic warp formation is one way to mitigate these effects; we will look at several others
• Must balance a complex set of effects: memory behaviors, synchronization behaviors, scheduler behaviors
(31)
Thread Block Compaction
W. Fung and T. Aamodt, HPCA 2011
(32)
Goal
• Overcome some of the disadvantages of dynamic warp formation: impact of scheduling, breaking implicit synchronization, and reduction of memory coalescing opportunities
(33)
DWF Pathologies: Starvation
• Majority scheduling: best performing; prioritize the largest group of threads with the same PC
• Starvation: lower SIMD efficiency! The minority threads can wait 1000s of cycles behind the majority
• Other warp scheduler? Tricky: variable memory latency
B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;
[Figure: timeline — compacted warps (1 2 7 8) and (5 -- 11 12) at C proceed to E while (9 6 3 4) and (-- 10 -- --) at D are starved for 1000s of cycles; E then executes as fragments (1 2 7 8), (5 -- 11 12), (9 6 3 4), (-- 10 -- --) instead of full warps]
(34)
DWF Pathologies: Extra Uncoalesced Accesses
• Coalesced memory access = memory SIMD; a first-order CUDA programmer optimization
• Not preserved by DWF
E: B = C[tid.x] + K;
[Figure: without DWF, static warps (1 2 3 4), (5 6 7 8), (9 10 11 12) touch lines 0x100, 0x140, 0x180 with #Acc = 3; with DWF, shuffled warps (1 2 7 12), (9 6 3 8), (5 10 11 4) generate #Acc = 9. The L1 cache absorbs the redundant memory traffic, at the cost of L1$ port conflicts.]
(35)
DWF Pathologies: Implicit Warp Sync.
• Some CUDA applications depend on the lockstep execution of “static warps”: Thread 0…31 → Warp 0, Thread 32…63 → Warp 1, Thread 64…95 → Warp 2
From W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
(36)
Performance Impact
From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
(37)
Thread Block Compaction: Block-Wide Reconvergence Stack
• Regroup threads within a block
• Better reconvergence stack: likely convergence, converging before the immediate post-dominator
• Robust: avg. 22% speedup on divergent CUDA apps, no penalty on others
Per-warp stacks (PC, RPC, AMask):
– Warp 0: (E, --, 1111), (D, E, 0011), (C, E, 1100)
– Warp 1: (E, --, 1111), (D, E, 0100), (C, E, 1011)
– Warp 2: (E, --, 1111), (D, E, 1100), (C, E, 0011)
Block-wide stack (PC, RPC, Active Mask) for Thread Block 0:
– (E, --, 1111 1111 1111): static warps E Warp 0, E Warp 1, E Warp 2
– (D, E, 0011 0100 1100): compacted warps D Warp T, D Warp U
– (C, E, 1100 1011 0011): compacted warps C Warp X, C Warp Y
(38)
GPU Microarchitecture
[Figure: multiple SIMT cores connect through an interconnection network to memory partitions, each with a last-level cache bank and an off-chip DRAM channel. Each SIMT core has a SIMT front end (Fetch, Decode, Schedule, Branch), a SIMD datapath, and a memory subsystem (SMem, L1 D$, Tex $, Const $) attached to the interconnect; warp completion signals Done (Warp ID). More details in the paper.]
(39)
Observation
• Compute kernels usually contain divergent and non-divergent (coherent) code segments
• Coalesced memory accesses usually occur in coherent code segments; DWF provides no benefit there
[Figure: static warps in coherent code become dynamic warps after a divergence point, then reset back to static warps at the reconvergence point, where coalesced loads/stores resume]
(40)
Thread Block Compaction
• Barrier at branch/reconvergence points: all available threads arrive at the branch; insensitive to warp scheduling (addresses starvation)
• Run a thread block like a warp: the whole block moves between coherent and divergent code; a block-wide stack tracks execution paths and reconvergence (addresses implicit warp sync)
• Warp compaction: regroup with all available threads; if there is no divergence, this gives the static warp arrangement (addresses extra uncoalesced memory accesses)
(41)
Thread Block Compaction
A: K = A[tid.x];
B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;
Block-wide stack (PC, RPC, Active Threads):
– (A, --, 1 2 3 4 5 6 7 8 9 10 11 12)
– (D, E, -- -- 3 4 -- 6 -- -- 9 10 -- --)
– (C, E, 1 2 -- -- 5 -- 7 8 -- -- 11 12)
– (E, --, 1 2 3 4 5 6 7 8 9 10 11 12)
[Figure: timeline — static warps (1 2 3 4), (5 6 7 8), (9 10 11 12) execute A; after the branch, the C path is compacted into (1 2 7 8), (5 -- 11 12) and the D path into (9 6 3 4), (-- 10 -- --); at E the stack reconverges and the original static warps resume]
(42)
Thread Block Compaction
• Barrier at every basic block?! (idle pipeline)
– Switch to warps from other thread blocks
• Multiple thread blocks run on a core; already done in most CUDA applications
[Figure: timeline for Blocks 0–2 — while one block stalls at a branch for warp compaction, execution proceeds with warps from the other blocks]
(43)
High Level View
• DWF: warps are broken down every cycle, and threads in a warp are shepherded into new warps (LUT and warp pool)
• TBC: warps are broken down at potentially divergent points, and threads are compacted across the thread block
(44)
Microarchitecture Modification
• Per-warp stack → block-wide stack
• I-buffer + TIDs → warp buffer: stores the dynamic warps
• New unit: thread compactor translates the active mask into compacted dynamic warps
• More detail in the paper
[Figure: modified pipeline — Fetch and the block-wide stack (fed by Done (WID), Valid[1:N], the branch target PC, the active mask, and the predicate unit) drive the thread compactor, which fills the warp buffer behind I-Cache/Decode; the scoreboard and issue logic feed the register file, ALUs, and MEM]
(45)
Microarchitecture Modification (2)
• The thread compactor outputs the number of compacted warps and the TIDs of those warps
From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
(46)
Operation
[Figure callouts: all threads mapped to the same lane form a column; when this count reaches zero, compact (priority encoder); pick a thread mapped to this lane]
• Warp 2 arrives first, creating two target entries
• Subsequent warps update the active mask
From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
(47)
Thread Compactor
• Convert the active mask from the block-wide stack into thread IDs in the warp buffer
• Array of priority encoders: for the C-path mask (1 2 -- -- | 5 -- 7 8 | -- -- 11 12), each lane's P-Enc picks one thread per cycle, emitting compacted warps (1 2 7 8) and (5 -- 11 12) into the warp buffer (a software model is sketched below)
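A behavioral sketch of the priority-encoder array just described: each lane's encoder pulls one pending thread ID per step, so compacted warps fall out column by column. The function below is an illustrative software model, not the hardware description from the paper:

```cuda
#include <cstdio>
#include <vector>

// For each SIMD lane, a list of candidate thread IDs (same PC, same lane).
// Returns compacted warps; -1 marks an empty lane.
std::vector<std::vector<int>> compact(std::vector<std::vector<int>> lanes) {
    std::vector<std::vector<int>> warps;
    bool any = true;
    while (any) {
        any = false;
        std::vector<int> w(lanes.size(), -1);
        for (size_t l = 0; l < lanes.size(); ++l) {
            if (!lanes[l].empty()) {            // priority encoder: take first
                w[l] = lanes[l].front();
                lanes[l].erase(lanes[l].begin());
                any = true;
            }
        }
        if (any) warps.push_back(w);
    }
    return warps;
}

int main() {
    // Slide's C-path example: lane candidates {1,5}, {2}, {7,11}, {8,12}.
    auto warps = compact({{1, 5}, {2}, {7, 11}, {8, 12}});
    for (auto &w : warps) {
        for (int t : w) printf("%4d", t);       // -> (1 2 7 8), (5 -1 11 12)
        printf("\n");
    }
    return 0;
}
```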
(48)
Likely-Convergence
• The immediate post-dominator is conservative: all paths from a divergent branch must merge there, but convergence can happen earlier, when any two of the paths merge (e.g., around a rarely taken break)
• Extended the reconvergence stack to exploit this; in TBC it yields a 30% speedup for ray tracing
while (i < K) {
    X = data[i];
A:  if (X == 0)
B:      result[i] = Y;
C:  else if (X == 1)
D:      break;
E:  i++;
}
F: return result[i];
[Figure: CFG — A branches to B and C; C branches to D and E; B falls through to E; E loops back; D exits to F, the iPDOM of A. Details in the paper.]
(49)
Likely-Convergence (2)
• NVIDIA uses a break instruction for loop exits; that handles the last example
• Our solution: likely-convergence points. In this paper they are used only to capture loop breaks
[Figure: stack walkthrough with columns (PC, RPC, LPC, LPos, Active Threads) — e.g., (F, --, --, --, 1 2 3 4) at the bottom, with entries such as (B, F, E, 1, 1), (C, F, E, 1, 2 3 4), and (D, F, E, 1, 3 4) above; when two entries meet at the likely-convergence point E, they merge: convergence!]
(50)
Likely-Convergence (3)
• Check for and merge likely-convergent entries inside the stack
From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
(51)
Likely-Convergence (4)
• Applies to both the per-warp stack (PDOM) and thread block compaction (TBC)
• Enables more thread grouping for TBC
• Side effect: reduces stack usage in some cases
(52)
Evaluation
• Simulation: GPGPU-Sim (2.2.1b), modeling a Quadro FX5800 plus L1 & L2 caches
• 21 benchmarks:
– All of the original GPGPU-Sim benchmarks
– Rodinia benchmarks
– Other important applications: Face Detection from VisBench (UIUC), DNA sequencing (MUMmerGPU++), molecular dynamics simulation (NAMD), ray tracing from NVIDIA Research
(53)
Experimental Results
• Two benchmark groups: COHE = non-divergent CUDA applications, DIVG = divergent CUDA applications
• TBC vs. the per-warp-stack baseline: 22% speedup on DIVG, no penalty for COHE
• DWF: serious slowdown from the pathologies
[Chart: IPC relative to baseline (0.6–1.3) for TBC and DWF over the DIVG and COHE groups]
(54)
Effect on Memory Traffic
• TBC still generates some extra uncoalesced memory accesses: compacted warps (1 2 7 8) and (5 -- 11 12) at C touch 0x100, 0x140, 0x180 with #Acc = 4, but the second access hits the L1 cache
• No change to overall memory traffic into/out of a core
[Charts: normalized memory stalls (0%–300%) for TBC, DWF, and baseline, and memory traffic normalized to baseline (0.6–1.2) for TBC-AGE and TBC-RRB, over the DIVG benchmarks (BFS2, FCDT, HOTSP, LPS, MUM, MUMpp, NAMD, NVRT) and the COHE benchmarks (AES, BACKP, CP, DG, HRTWL, LIB, LKYT, MGST, NNC, RAY, STMCL, STO, WP); one bar is clipped at 2.67×]
(55)
Thread Block Compaction: Conclusion
• Thread block compaction addresses some key challenges of DWF; one significant step closer to reality
• It benefits from advancements in the reconvergence stack (likely-convergence points) and is extensible: other stack-based proposals can be integrated
(56)
CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures
M. Rhu and M. Erez, ISCA 2012
(57)
Goals
• Improve the performance of inter-warp compaction techniques
• Predict when branches diverge, borrowing philosophy from branch prediction
• Use prediction to apply compaction only when it is beneficial
(58)
Issues with Thread Block Compaction
[Figure: control flow graph with basic blocks A–G]
• TBC: warps are broken down at potentially divergent points and threads compacted across the thread block; an implicit barrier across warps collects the compaction candidates
• The barrier synchronization overhead cannot always be hidden
• When it works, it works well
(59)
Divergence Behavior: A Closer Look
• The loop executes a fixed number of times
• A compaction-ineffective branch
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(60)
Compaction-Resistant Branches
Control flow graph
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(61)
Compaction-Resistant Branches (2)
Control flow graph
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(62)
Impact of Ineffective Compaction
• Threads are shuffled around with no performance improvement
• However, this can lead to increased memory divergence!
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(63)
Basic Idea
• Only stall and compact when there is a high probability of compaction success
• Otherwise, allow warps to bypass the (implicit) barrier
• A compaction-adequacy predictor: think branch prediction!
(64)
Example: TBC vs. CAPRI
Bypassing enables increased overlap of memory references
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(65)
CAPRI: Example
• No divergence: no stalling
• First divergence: stall, initialize the history; all other warps will now stall
• Divergence with history available: predict, then update the prediction
• All other warps follow (one prediction per branch)
• The CAPT is updated
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(66)
The Predictor
• Prediction uses the active masks of all warps: we need to understand what could have happened
• Actual compaction only uses the warps that actually stalled
• The minimum (available threads in a lane) provides the maximum compaction ability, i.e., the number of compacted warps (sketched below)
• Update the history predictor accordingly
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
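A sketch of the adequacy test these bullets describe, under the usual compaction arithmetic (an assumption on my part, not CAPRI's exact logic): the number of compacted warps equals the population of the fullest lane, so compaction is predicted adequate only when that number beats the uncompacted warp count.

```cuda
#include <cstdint>
#include <vector>

// Given the active masks of the warps that would compact together,
// decide whether compaction reduces the number of dynamic warps.
bool compactionAdequate(const std::vector<uint32_t> &activeMasks,
                        int simdWidth = 32) {
    int maxLaneCount = 0;
    for (int lane = 0; lane < simdWidth; ++lane) {
        int count = 0;
        for (uint32_t m : activeMasks)
            count += (m >> lane) & 1u;       // threads competing for this lane
        if (count > maxLaneCount) maxLaneCount = count;
    }
    // #compacted warps = maxLaneCount; adequate only if it beats no compaction.
    return maxLaneCount < static_cast<int>(activeMasks.size());
}

int main() {
    // Two warps whose masks collide in lane 0: compaction cannot help.
    std::vector<uint32_t> masks = {0x1, 0x3};
    return compactionAdequate(masks, 4) ? 1 : 0;
}
```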
(67)
Behavior
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(68)
Impact of Implicit Barriers
• Idle cycle count helps us understand the negative effects of implicit barriers in TBC
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(69)
Summary
• The synchronization overhead of thread block compaction can introduce performance degradation
• Some branches diverge more than others
• Apply TBC judiciously: predict when it is beneficial
• Effectively predict when inter-warp compaction will be effective
(70)
M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(71)
Goals
• Understand the limitations of compaction techniques and their proximity to ideal compaction
• Provide mechanisms to overcome these limitations and approach ideal compaction rates
(72)
Limitations of Compaction
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(73)
Mapping Threads to Lanes: Today
[Figure: today's mapping — threads of Grid 1, Block (1,1) are linearized (Thread (0,0,0)…(0,0,3) and (0,1,0)…(0,1,3) fill Warp 0; Thread (1,0,0)…(1,0,3) start Warp 1) and assigned to eight scalar pipelines by modulo, each pipeline with its register file slice RF0…RFn-1. Key labels: linearization of thread IDs; modulo assignment to lanes (sketched in code below).]
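The mapping in this figure can be written in a few lines of device code; the kernel below is an illustrative sketch (the kernel name and output arrays are assumptions), not vendor code:

```cuda
// Thread IDs within a block are linearized, then assigned to warps and
// lanes by division and modulo (warpSize is 32 on current NVIDIA GPUs).
__global__ void whereAmI(int *warpOf, int *laneOf) {
    int linear = threadIdx.x +
                 threadIdx.y * blockDim.x +
                 threadIdx.z * blockDim.x * blockDim.y;  // linearization
    warpOf[linear] = linear / warpSize;   // consecutive 32 threads per warp
    laneOf[linear] = linear % warpSize;   // modulo assignment to lanes
}

int main() {
    int *w, *l;
    cudaMalloc(&w, 64 * sizeof(int));
    cudaMalloc(&l, 64 * sizeof(int));
    whereAmI<<<1, dim3(8, 8, 1)>>>(w, l);  // 64 threads = 2 warps
    cudaDeviceSynchronize();
    cudaFree(w); cudaFree(l);
    return 0;
}
```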
(74)
Data Dependent Branches
• Data-dependent control flow is less likely to produce lane conflicts
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(75)
Programmatic Branches
• Programmatic branches can be correlated with lane assignments (under modulo assignment)
• Program variables that behave like constants across threads can produce correlated branching behavior
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(76)
P-Branches vs. D-Branches
P-branches are the problem!
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(77)
Compaction Opportunities
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
• Lane reassignment can improve compaction opportunities
(78)
Aligned Divergence
• Threads mapped to the same lane tend to evaluate (programmatic) predicates the same way; empirically, this is rarely exhibited for input- or data-dependent control flow behavior
• Compaction cannot help in the presence of lane conflicts
• The performance of compaction mechanisms depends on both divergence patterns and lane conflicts
• We need to understand the impact of lane assignment
(79)
Impact of Lane Reassignment
Goal: Improve “compactability”
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(80)
Random Permutations
• Does not always work well, but works well on average
• A better understanding of programs can lead to better permutation choices
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(81)
Mapping Threads to Lanes: New
[Figure: Warp 0, Warp 1, …, Warp N-1 distributed across eight scalar pipelines (SIMD width = 8), each with its register file slice RF0…RFn-1. What criteria do we use for lane assignment?]
(82)
Balanced Permutation
• Each lane has a single instance of a logical thread from each warp
• Even warps: permutation within a half warp
• Odd warps: additionally swap the upper and lower halves
(A sketch follows the figure citation.)
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
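The following host-side C++ sketch shows one permutation in the spirit of this slide: even warps rotate lanes within each half warp by a warp-dependent amount, and odd warps additionally swap the halves. The exact permutation function used in the paper may differ; this is an illustrative assumption that still satisfies the balance properties discussed on the next slides.

```cuda
#include <cstdio>

// Map a logical lane of a warp to a physical lane. Within each warp the
// mapping is a bijection; across warps, logical TID 0 lands in a
// different physical lane (illustrative, not the paper's exact function).
int permuteLane(int lane, int warpId, int width = 32) {
    int half = width / 2;
    int rot  = warpId / 2;                    // rotation amount per warp pair
    int base = (lane / half) * half;          // which half the lane is in
    int newLane = base + (lane + rot) % half; // rotate within the half
    if (warpId % 2 == 1)                      // odd warps: swap the halves
        newLane = (newLane + half) % width;
    return newLane;
}

int main() {
    for (int w = 0; w < 4; ++w)               // lane of logical TID 0
        printf("warp %d: tid 0 -> lane %d\n", w, permuteLane(0, w));
    return 0;                                 // lanes 0, 16, 1, 17: all distinct
}
```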
(83)
Balanced Permutation (2)
• Logical TID 0 of each warp is now assigned to a different lane
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(84)
Characteristics
• Vertical balance: each lane only has logical TIDs of distinct threads in a warp
• Horizontal balance: logical TID x in each of the warps is bound to a different lane
• This works when a CTA has fewer than SIMD_Width warps: why?
• Note that random permutations achieve this only on average
(85)
Impact on Memory Coalescing
• Modern GPUs do not require ordered requests
• Coalescing can occur across a set of requests, so specific lane assignments do not affect coalescing behavior
• The increase in the L1 miss rate is offset by the benefits of compaction
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(86)
Speedup of Compaction
• Can improve the compaction rate for divergence due to the majority of programmatic branches
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(87)
Compaction Rate vs. Utilization
Distinguish between compaction rate and utilization!
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(88)
Application of Balanced Permutation
[Figure: SIMT core pipeline as before — I-Fetch, I-Buffer, Decode, Issue, pending warps, PRF/RF, scalar pipelines, D-Cache (all hit?), Writeback]
• Permutation is applied when the warp is launched
• Maintained for the life of the warp
• Does not affect the baseline compaction mechanism
• Enable/disable SLP to preserve target-specific, programmer-implemented optimizations
(89)
Summary
• Structural hazards limit the performance improvements from inter-warp compaction
• Program behaviors produce correlated lane assignments today
• Remapping threads to lanes extends compaction opportunities
(90)
Summary: Inter-Warp Compaction
[Figure: summary — a control flow graph (blocks A–G) and a thread block, positioned along axes of scope (thread block, resource management) and mechanism (program properties, μarch); the takeaway is co-design of applications, resource management, software, and microarchitecture]