Introduction to Control Divergence · 2020. 7. 31.
(1)
Introduction to Control Divergence
Lecture slides and figures contributed from sources as noted
(2)
Objectives
• Understand the occurrence of control divergence and the concept of thread reconvergence
  – Also described as branch divergence and thread divergence
• Cover a basic thread reconvergence mechanism – stack-based reconvergence
• Cover a non-stack-based reconvergence mechanism – convergence barriers
• Dynamic warp formation – inter-warp compaction
  – Increase lane utilization by merging warps
• Intra-warp compaction
  – Increase lane utilization by compacting threads within a warp
(3)
Reading
• W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
• T. Aamodt, W. Fung, and T. Rogers, “General-Purpose Graphics Processor Architectures,” Draft Text, Sections 3.1.1, 3.1.2
• G. Diamos, R. Johnson, V. Grover, O. Giroux, J. Choquette, M. Fetterman, A. Tirumala, P. Nelson, R. Krashinsky, “Execution of Divergent Threads Using a Convergence Barrier,” U.S. Patent Application US 2016/0019066 A1, January 2016
• A. S. Vaidya, A. Shayesteh, D. H. Woo, R. Saharoy, M. Azimi, “SIMD Divergence Optimization Through Intra-Warp Compaction,” ISCA 2013
(4)
Handling Branches
• CUDA code:
  if (…) { … }  // true for some threads
  else { … }    // true for the others
• What if threads take different paths?
• Branch divergence!
[Figure: four threads in a warp; some take the branch, others do not]
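As a concrete illustration (a minimal sketch, not vendor hardware code), the hardware's view of a divergent branch is a pair of activity masks computed from each thread's branch predicate:

```python
# Sketch: computing taken / not-taken activity masks for a 4-thread warp
# at a divergent branch. Bit i of each mask corresponds to thread i.

def branch_masks(predicates):
    """Return (taken_mask, not_taken_mask) as bit masks over the warp."""
    taken = 0
    not_taken = 0
    for lane, p in enumerate(predicates):
        if p:
            taken |= 1 << lane
        else:
            not_taken |= 1 << lane
    return taken, not_taken

# Threads 0 and 2 take the branch; threads 1 and 3 do not.
taken, not_taken = branch_masks([True, False, True, False])
print(format(taken, "04b"), format(not_taken, "04b"))  # 0101 1010
```

The two masks are mutually exclusive by construction, which is exactly the property the split mechanism on the next slides relies on.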
(5)
Branch Divergence
• Occurs within a warp
• Branches lead to serialization of branch-dependent code
  – Performance issue: low warp utilization
if(…)
{… }
else {…}
[Figure: branch divergence – idle threads on each path until reconvergence]
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
[Figure: a thread warp – threads 1–4 sharing a common PC]
[Figure: example control-flow graph with basic blocks A–G]
• Different threads follow different control-flow paths through the kernel code
• Thread execution is (partially) serialized
  – Subsets of threads that follow the same path execute in parallel
Example:
(7)
Basic Idea
• Split: partition a warpv Two mutually exclusive thread subsets, each branching to a
different targetv Identify subsets with two activity masks àeffectively two warps
• Join: merge two subsets of a previously split warpv Reconverge the mutually exclusive sets of threads
• Orchestrate the correct execution for nested branches
• Note the long history of technques in SIMD processors (see background in Fung et. al.)
[Figure: a thread warp – threads T1–T4, common PC, activity mask 0011]
(8)
Thread Reconvergence
• Fundamental problem:
  – Merge threads with the same PC
  – How do we sequence the execution of threads? This can affect the ability to reconverge
• Question: When can threads productively reconverge?
• Question: When is the best time to reconverge?
[Figure: example CFG A–G – threads must be at the same PC to reconverge]
(9)
Dominator
• Node d dominates node n if every path from the entry node to n must go through d
(10)
Immediate Dominator
• Node d immediately dominates node n if d dominates n and no other dominator of n lies between d and n (i.e., every other dominator of n also dominates d)
(11)
Post Dominator
• Node d post-dominates node n if every path from node n to the exit node must go through d
(12)
Immediate Post Dominator
• Node d immediately post-dominates node n if d post-dominates n and no other post-dominator of n lies between d and n (i.e., every other post-dominator of n also post-dominates d)
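These definitions can be checked mechanically. The sketch below assumes the CFG edges A→{B, F}, B→{C, D}, C→E, D→E, E→G, F→G from the lecture figure (an assumption about the figure, since only the node names survive in the text); it computes post-dominator sets with a simple iterative data-flow pass and then extracts immediate post-dominators:

```python
# Sketch: post-dominator sets and immediate post-dominators for the
# example CFG. Edges are assumed from the lecture figure; G is the exit.

succ = {"A": ["B", "F"], "B": ["C", "D"], "C": ["E"],
        "D": ["E"], "E": ["G"], "F": ["G"], "G": []}
nodes = list(succ)
exit_node = "G"

# Iterative data flow: pdom(n) = {n} ∪ intersection of pdom(s) over successors s.
pdom = {n: set(nodes) for n in nodes}
pdom[exit_node] = {exit_node}
changed = True
while changed:
    changed = False
    for n in nodes:
        if n == exit_node:
            continue
        new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
        if new != pdom[n]:
            pdom[n] = new
            changed = True

def ipdom(n):
    # The immediate post-dominator is the strict post-dominator of n that
    # every other strict post-dominator of n also post-dominates.
    strict = pdom[n] - {n}
    for d in strict:
        if all(o in pdom[d] for o in strict):
            return d

print(ipdom("B"), ipdom("A"))  # E G
```

This matches the slides: the branch at B reconverges at E (its IPDOM), while the branch at A reconverges only at G.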
Baseline: PDOM
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
[Figure: PDOM reconvergence stack example]
CFG with active masks: A/1111 → B/1111; B → C/1001 and D/0110; C, D → E/1111; E → G/1111
Execution order over time: A, B, C, D, E, G
Stack snapshots – each entry is (Reconv. PC, Next PC, Active Mask); TOS = top of stack:
1. Before the branch at B: (–, B, 1111) ← TOS
2. After the branch at B: (–, E, 1111), (E, D, 0110), (E, C, 1001) ← TOS
3. C reaches E, pop: (–, E, 1111), (E, D, 0110) ← TOS
4. D reaches E, pop: (–, E, 1111) ← TOS
5. After E executes: (–, G, 1111) ← TOS
(14)
The Stack-Based Structure
• A stack entry is a specification of a group of active threads that will execute that basic block
• The naturally nested structure of control flow motivates stack-based serialization
[Figure: the same PDOM stack example as on the previous slide]
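The stack behavior above can be sketched in a few lines. This toy simulator (block names, masks, and the divergent branch at B taken from the lecture example; it is an illustration, not GPU hardware code) replays the warp and reproduces the serialized execution order:

```python
# Sketch: a minimal per-warp SIMT reconvergence stack (PDOM scheme) run on
# the lecture example: A -> B -> {C: mask 1001, D: mask 0110} -> E -> G.
# Each stack entry is [reconvergence PC, next PC, active mask].

def run_warp():
    trace = []                       # (block, mask) in execution order
    stack = [[None, "A", 0b1111]]    # bottom entry: full warp at A
    straight = {"A": "B", "C": "E", "D": "E", "E": "G", "G": None}
    while stack:
        rpc, pc, mask = stack[-1]
        if pc == rpc:                # reached the reconvergence PC: pop
            stack.pop()
            continue
        trace.append((pc, mask))
        if pc == "B":                # divergent branch: RPC = IPDOM(B) = E
            stack[-1][1] = "E"       # this entry resumes, merged, at E
            stack.append(["E", "D", 0b0110])
            stack.append(["E", "C", 0b1001])
        elif straight[pc] is None:   # exit block
            stack.pop()
        else:
            stack[-1][1] = straight[pc]
    return trace

print(run_warp())
```

Running it yields A, B, C, D, E, G with the masks from the figure: the C and D subsets execute serially, and the full 1111 mask is restored at E, exactly when the stack pops back to the bottom entry.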
(15)
More Complex Example
From Fung et al., “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
• Stack-based implementation for nested control flow
  – Stack entry RPC set to the IPDOM of the branch
• Reconvergence at the immediate post-dominator of the branch
(16)
Implementation
[Figure: SIMT core pipeline – I-Fetch, I-Buffer (pending warps), Decode, Issue, RF, scalar pipelines, D-Cache (“all hit?”), Writeback]
FromGPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores
[Figure: the per-warp reconvergence stack (Reconv. PC, Next PC, Active Mask) from the PDOM example]
• GPGPU-Sim model:
  – Implement the per-warp stack at the issue stage
  – Acquire the active mask and PC from the TOS
  – Scoreboard check prior to issue
  – Register writeback updates the scoreboard and the ready bit in the instruction buffer
  – When Next PC = RPC, pop the stack
• Implications for instruction fetch?
(17)
Implementation (2)
• warpPC (next instruction) is compared to the reconvergence PC
• On a branch
  – Can store the reconvergence PC as part of the branch instruction
  – The branch unit has the NextPC, TargetPC, and reconvergence PC to update the stack
• On reaching a reconvergence point
  – Pop the stack
  – Continue fetching from the NextPC of the next entry on the stack
(18)
Stack Based Reconvergence: Summary
• Compatible with the programming model
• Interactions with the scheduler
  – Uniform progress → efficiency issue
• Conservative
  – Can we do better, i.e., reconverge earlier?
• Issues?
  – Porting MIMD models
  – Interactions between the scheduler (forward progress) and the programming model (bulk-synchronous execution)
(19)
Revisiting SIMT vs. MIMD (1)
• Properties of current MIMD implementations
  – Independent progress of each thread
  – Progress expectations of the scheduler → fairness
  – Producer-consumer dependencies between threads
• In general, scheduling orders alone cannot guarantee forward progress
[Figure: loosely synchronized threads on MIMD cores (e.g., pthreads) vs. a SIMT datapath (e.g., PTX, HSA)]
(20)
Revisiting SIMT vs. MIMD (2)
• Properties of current SIMT implementations
  – Serialization of divergent thread execution
  – Reconvergence at the earliest point (e.g., the IPDOM)
  – Fairness: progress expectations of the scheduler
• The assumption of independent thread progress does not hold
  – Blocking at the reconvergence point
[Figure: loosely synchronized threads on MIMD cores (e.g., pthreads) vs. a SIMT datapath (e.g., PTX, HSA)]
(21)
SIMT Deadlock
• The successful (lock-acquiring) thread is blocked at the reconvergence point
• The waiting threads are blocked on the lock release → deadlock
• Progress on divergent threads is needed to continue execution
  – Fairness and progress demands on the warp scheduler
• A violation of (expected) programming-model behaviors?
Figure from T. Aamodt, W. Fung, and T. Rogers, “General-Purpose Graphics Processor Architectures,” Synthesis Lectures on Computer Architecture, Draft
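A toy model of the failure mode (illustrative only; a real spin-lock deadlock involves atomics and the per-warp stack, which are abstracted away here): if the stack scheme happens to run the spinning side of the divergent branch first, the lock release on the other, suspended side can never execute, so the spinning side never observes progress:

```python
# Sketch: why a spin lock can deadlock under stack-based reconvergence.
# Thread 0 holds the lock; thread 1 spins waiting for it. The stack scheme
# runs one divergent path to the reconvergence point before switching, so
# if the spinning path is scheduled first it can never finish: the lock
# release sits on the other (suspended) path, behind the reconvergence point.

def simulate(max_steps=100):
    lock_held = True          # thread 0 already owns the lock
    # Stack-based serialization: only the spin path executes; it must reach
    # the reconvergence point before the release path is allowed to run.
    for _ in range(max_steps):
        if not lock_held:     # thread 1 would acquire the lock and move on
            return "progress"
        # lock_held never changes: thread 0's release never gets scheduled
    return "deadlock"

print(simulate())  # deadlock
```

A MIMD scheduler with fairness guarantees would eventually run thread 0 and break the cycle; the stack scheme's serialization is what removes that guarantee.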
(22)
Goals
• Be able to use MIMD semantics → use existing algorithmic idioms
  – Maintain correctness
• Porting of existing multithreaded applications to GPUs
• Avoid assumptions about compiler optimizations or schedulers
(23)
Convergence Barriers: Basic Idea
[Figure: barrier hardware structures – named barriers Barrier 0, Barrier 1, …, Barrier n-1]
• Allocate a named barrier
• Add threads to the barrier
• Split the warp → two warps
• Warp splits reconverge at the named barrier
(24)
Convergence Barriers: Barrier Behavior
• Richer barrier semantics
• Thread states
• Complex conditions for clearing a barrier
• Target for compiler analysis
• Target for warp schedulers
(25)
Tracking Data Structures
• Barrier participation mask
• Barrier state
• Thread state: Ready, Yield, Blocked, Exited, …
• Thread active (only READY threads are active)
• PC
(26)
Convergence Barrier: Example -1 (1)
• An ADD instruction inserted by the compiler updates the barrier participation mask
• A named barrier, B0, is allocated
[Figure: per-warp state – barrier participation mask (warp-sized), barrier state, PC, thread active, thread state]
Figure from patent US 2016/0019066A1
(27)
Convergence Barrier: Example - 1 (2)
[Figure: warp split → warp 0 and warp 1; active vs. inactive threads]
• The scheduler can schedule either warp independently
Figure from patent US 2016/0019066A1
(28)
Convergence Barrier: Example – 1 (3)
warp 0 warp 1
• Warp 0 arrives at barrier B0
• Executes a WAIT instruction that changes the thread state of the active threads to Blocked
• Warp 0 waits
Figure from patent US 2016/0019066A1
(29)
Convergence Barrier: Example – 1 (4)
• Warp 1 arrives at B0 and executes the WAIT instruction
• Check the participation mask
• Update the barrier structure
• Release the barrier
Figure from patent US 2016/0019066A1
(30)
Nested Control Flow (1)
Add threads to barrier B0
Figure from patent US 2016/0019066A1
(31)
Nested Control Flow (2)
Warp split
Figure from patent US 2016/0019066A1
(32)
Nested Control Flow (3)
Add 4 active threads to barrier B1
Figure from patent US 2016/0019066A1
(33)
Nested Control Flow (4)
Warp split
Figure from patent US 2016/0019066A1
(34)
Nested Control Flow (5)
Reconverge via wait
Figure from patent US 2016/0019066A1
(35)
Nested Control Flow (6)
Wait
Figure from patent US 2016/0019066A1
(36)
Nested Control Flow (7)
All threads arrive
Figure from patent US 2016/0019066A1
(37)
Nested Control Flow (8)
Reconverge
Figure from patent US 2016/0019066A1
(38)
Iteration (1)
Warp split
• Compiler-inserted YIELD instructions
• Threads are suspended and do not participate in the barrier at B1
Figure from patent US 2016/0019066A1
(39)
Iteration (2)
Warp split
• Compiler-inserted YIELD instructions
• Threads are suspended and do not participate in the barrier at B1
Figure from patent US 2016/0019066A1
(40)
Iteration (3)
• The barrier clears with the active threads in warp 1
  – Threads in the yield state are ignored
• These threads may continue without the yielded threads (a missed reconvergence opportunity)
• Threads that have executed YIELD are placed in the yield state
  – Their execution is suspended
• Scheduler/compiler optimizations aim to minimize missed reconvergence opportunities
Figure from patent US 2016/0019066A1
(41)
Iteration (4)
• It is the compiler's responsibility to ensure divergent paths do not execute indefinitely
  – Periodically yield to the scheduler
• Compiler-scheduler cooperation
  – Heuristics for the insertion of YIELD instructions vs. conditions, e.g., timeouts
Figure from patent US 2016/0019066A1
(42)
Early Reconvergence
• Early exit: threads taking this path opt out of the barrier → opt-out instructions
  – Thread state = “EXITED”
• Early reconvergence for the remaining threads
Figure from patent US 2016/0019066A1
(43)
Behavior at the Barrier
[Flowchart: remove opted-out threads (thread state “EXITED”) and ignore yielded threads; if all remaining threads have arrived, clear the barrier; otherwise, return to the scheduler; if only yielded threads remain, clear the YIELD states and clear the barrier]
(44)
Non-Stack Based Reconvergence
• Close relationship between the scheduler and the compiler
  – To guarantee progress
• Mechanisms for decoupling execution constraints (e.g., blocking/waiting at reconvergence) from dependencies
  – Rely on the compiler or hardware (e.g., timers) to decouple dependencies
  – Performance cost via missed reconvergence opportunities
• A side effect is more scheduling flexibility
(45)
Can We Do Better?
• Warps are formed statically
• Key idea of dynamic warp formation
  – Given a pool of warps, how can they be merged?
• At a high level, what are the requirements?
(46)
Compaction Techniques
• Can we re-form warps so as to increase utilization?
• Basic idea: compaction
  – Re-form warps with threads that follow the same control-flow path
  – Increase the utilization of warps
• Two basic types of compaction techniques
  – Inter-warp compaction: group threads from different warps
  – Intra-warp compaction: group threads within a warp, changing the effective warp size
(47)
Inter-Warp Thread Compaction: Dynamic Warp Formation
Lecture slides and figures contributed from sources as noted
(48)
Goal
[Figure: warps 0–5 diverging at if (…) {…} else {…}; taken vs. not-taken threads – can we merge threads across warps?]
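The potential payoff can be quantified with a little arithmetic. This is a sketch of the best case only; the warp width of 4 and the masks are made up for illustration, and lane constraints (covered later) are ignored:

```python
# Sketch: the ideal payoff of inter-warp compaction. Given the taken-path
# masks of several 4-wide warps, count how many warps the taken threads
# would occupy after perfect compaction, ignoring lane constraints.

import math

def ideal_warps(masks, width=4):
    active = sum(bin(m).count("1") for m in masks)
    return math.ceil(active / width)

# Six warps, each with 2 of 4 threads on the taken path:
masks = [0b0101, 0b1010, 0b0011, 0b1100, 0b0110, 0b1001]
print(ideal_warps(masks))  # 3 full warps instead of 6 half-empty ones
```

Six half-utilized warps collapse into three fully utilized ones, halving the issue slots spent on that path; the rest of this section is about how close real hardware can get to that bound.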
(49)
Reading
• W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009, Section 4.
(50)
Formation
• How do we get from the statically formed warps of the kernel launch (recall the grid → warp mapping) to dynamic warps drawing threads from multiple warps?
• Constraints imposed by the register file organization
DWF: Example
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
[Figure: CFG with per-warp active masks – A x/1111 y/1111; B x/1110 y/0011; C x/1000 y/0010; D x/0110 y/0001; E x/1110 y/0011; F x/0001 y/1100; G x/1111 y/1111]
Baseline (per-warp serialization), issue slots over time: A A B B C C D D E E F F G G A A
Dynamic warp formation: A A B B C D E E F G G A A
• A new warp is created from scalar threads of both warp x and warp y executing at basic block D
(52)
How Does This Work?
• Criteria for merging
  – Same PC
  – Complementary sets of active threads in the two warps
  – Recall: many warps per TB, all executing the same code
• What information do we need to merge two warps?
  – Thread IDs and PCs
• Ideally, how would you find and merge warps?
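The merge test itself is cheap. A sketch of the lane-aware rule (the masks must be disjoint so that every thread keeps its home lane, and hence its register-file bank; the specific thread numbers below are illustrative):

```python
# Sketch: the DWF merge test. Two warps at the same PC can merge without a
# lane conflict only when their active masks are disjoint, i.e., each lane
# is used by at most one of the two warps.

def can_merge(pc_a, mask_a, pc_b, mask_b):
    return pc_a == pc_b and (mask_a & mask_b) == 0

def merge(mask_a, mask_b):
    return mask_a | mask_b

# Lanes 1,2 of warp X and lanes 0,3 of warp Y, both headed to block B:
print(can_merge("B", 0b0110, "B", 0b1001))   # True
print(format(merge(0b0110, 0b1001), "04b"))  # 1111
print(can_merge("B", 0b0110, "B", 0b0011))   # False: lane 1 conflicts
```

When the masks overlap, the hardware must either leave the thread in a separate forming warp or move it off its home lane, which is exactly the register-file complication the next slides address.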
DWF: Microarchitecture Implementation
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
[Figure: DWF microarchitecture – thread scheduler with warp update registers (T and NT), PC-Warp LUT (entries: OCC, PC, IDX, with hashing), warp pool (entries: TID × N, PC, Prio), warp allocator, and issue logic; I-Cache, Decode, per-lane register files RF 1–4 and ALUs 1–4 addressed by (TID, Reg#), Commit/Writeback]
• The PC-Warp LUT assists in aggregating threads: it identifies occupied lanes (OCC) and points (IDX) to the warp being formed in the warp pool
• Warps are formed dynamically in the warp pool
• After commit, look up the PC in the PC-Warp LUT and either merge into the forming warp or allocate a newly forming warp
• Lane aware
DWF: Microarchitecture Implementation
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
[Figure: the same DWF microarchitecture walking through an example – code: A: BEQ R2, B; C: …]
• Warp X (threads 1 2 3 4) and warp Y (threads 5 6 7 8) both execute basic block A
• At the branch, each warp's threads split between the targets B and C; the PC-Warp LUT tracks the lane-occupancy masks of the warps being formed
• Threads from X and Y headed to the same PC land in complementary lanes and merge, e.g., a new warp Z = 5 2 3 8 at B – no lane conflict
(55)
Resource Usage
• Ideally, we would like a small number of unique PCs in progress at a time → minimizes overhead
• Warp divergence will increase the number of unique PCs
  – Mitigate via warp scheduling
• Scheduling policies
  – FIFO
  – Program counter: address variation as a measure of divergence
  – Majority/Minority: serve the most common PC vs. help the stragglers
  – Post-dominator (catch up)
(56)
Hardware Consequences (1)
• Exposes implications of warps in the base design
  – Implications for register file access → lane-aware DWF
• Register bank conflicts
From Fung et al., “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
(57)
Hardware Consequences (2)
[Figure: register file organization – per-bank decoders addressed by (Warp ID, TID), a crossbar (Xbar) between the banks and the scalar pipelines, and writeback; lane-aware DWF vs. ignoring bank and lane assignments]
(58)
Relaxing the Implications of Warps
• Thread swizzling
  – Essentially, remap work to threads so as to create more opportunities for DWF → requires a deep understanding of algorithm behavior and data sets
• Lane swizzling in hardware
  – Provide limited connectivity between register banks and lanes → avoid full crossbars
(59)
Intra-Warp Thread Compaction: Cycle Compression
Lecture slides and figures contributed from sources as noted
(60)
Goals
• Improve utilization in divergent code via intra-warp compaction
• Become familiar with the architecture of Intel’s Gen integrated general purpose GPU architecture
(61)
Integrated GPUs: Intel HD Graphics
Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
(62)
Integrated GPUs: Intel HD Graphics
Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
Gen graphics processor
• 32-byte bidirectional ring
• Dedicated coherence signals
• Coherent distributed cache, shared with the GPU
  – Operates as a memory-side cache
• Shared physical memory
(63)
Inside the Gen9 EU
Thread 0
Architecture Register File (ARF)
Per thread register state
• Up to 7 threads
• 128 256-bit registers per thread (8-way SIMD for 32-bit data)
• Each thread executes a kernel
  – Threads may execute different kernels
  – Multi-instruction dispatch
Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
(64)
Operation (1)
• Dispatch 4 instructions from 4 threads
• Constraints between issue slots
• Supports both FP and integer operations; per cycle:
  – 4 32-bit FP operations
  – 8 16-bit integer operations
  – 8 16-bit FP operations
  – MAD operations each cycle
• 96 bytes/cycle read BW, 32 bytes/cycle write BW
• Divergence/reconvergence management
[Figure: SIMD-16, SIMD-8, and SIMD-32 instructions are transformed into SIMD-4 operations]
Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
(65)
Operation (2)
[Figure: a thread intermixes SIMD instructions of various lengths (SIMD-4, SIMD-8, SIMD-16), each transformed into SIMD-4 operations]
Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
(66)
Mapping the BSP Model
[Figure: CUDA-style grid – Grid 1 with blocks (0,0), (0,1), (1,0), (1,1); block (1,1) expanded into its threads]
• Map multiple threads to a SIMD instance executed by an EU thread
• All threads in a TB (workgroup) are mapped to the same EU thread (shared memory access)
Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
(67)
Slice Organization
From The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
Flexible partitioning
• Shared memory
• Data cache
• Buffers for accelerators
Shared memory
• 64 KB/slice
• Not coherent with other structures
(68)
Coherent Memory Hierarchy
From The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides
• The shared local memory is not coherent with this hierarchy
(69)
Baseline Microarchitecture Operation
[Figure: SIMT core pipeline – I-Fetch, I-Buffer (pending), Decode, Issue, RF, scalar pipelines, D-Cache (“all hit?”), Writeback]
(70)
Microarchitecture Operation
[Figure: the same pipeline, annotated:]
• Per-thread pre-fetch
• Per-thread decode: variable-width SIMD + execution masks
• Per-thread scoreboard check
• Thread arbitration, dual issue per 2 cycles, compaction control
• Operand fetch/swizzle – encode the swizzle in the RF access
• Instruction execution happens in waves of 4-wide operations (note: variable-width SIMD instructions)
  – A 16-way SIMD instruction executes in 4 cycles (cycles 0–3)
(71)
Divergence Assessment
[Figure: SIMD efficiency of coherent vs. divergent applications]
SIMD Efficiency = Average Lane Utilization = (average SIMD width actually used) / (physical SIMD width)
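Under the assumption that SIMD efficiency is the average number of active lanes per issued instruction divided by the SIMD width, it can be computed directly from execution masks. A sketch (the masks are made-up examples, not measured workloads):

```python
# Sketch: measuring SIMD efficiency from per-instruction execution masks,
# assuming efficiency = average active lanes per issued instruction / width.

def simd_efficiency(masks, width):
    active = sum(bin(m).count("1") for m in masks)
    return active / (len(masks) * width)

coherent = [0xFFFF] * 4                        # all 16 lanes active each issue
divergent = [0x00FF, 0xFF00, 0x000F, 0x0003]   # 8, 8, 4, 2 active lanes
print(simd_efficiency(coherent, 16))           # 1.0
print(simd_efficiency(divergent, 16))          # 22/64 = 0.34375
```

Coherent applications sit near 1.0; divergent applications like the second trace leave most lanes idle, which is the opportunity the compaction techniques below exploit.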
(72)
Basic Cycle Compression (1)
• Compress cycles by suppressing dispatch
  – Operand fetch
  – Instruction execution
• Actual operation depends on data types and execution cycles/op
• Complex MRF access patterns
• Note the power/energy savings
(73)
Basic Cycle Compression (2)
Cycle compression applies to
• Dispatch (complex MRF accesses can take multiple cycles)
• Predication
• Other cases with idle cycles
[Figure: the execution mask is derived from the instruction predicate mask, the dispatch mask, and the divergence state]
(74)
Basic Cycle Compression: MRF Access
[Figure: baseline register file organization for BCC]
(75)
Swizzle Cycle Compression
• Use a more flexible thread → lane assignment
  – Operand data flow?
• Compact threads to create idle cycles in the pipeline
(76)
Register File Access for BCC
[Figure: register R8 as two 32-bit-lane halves, R8.H0 and R8.H1]
(77)
32-bit Operand Mapping for SIMD16
ADD (16) R12 R8 R10 [EXEC MASK 0xFFFF]
• Implicitly uses pairs of registers for all 32-bit operands: R8, R9 → 16 × 32 = 512 bits
Translated into quad instructions:
ADD.Q0 (4) R12.H0 R8.H0 R10.H0
ADD.Q1 (4) R12.H1 R8.H1 R10.H1
ADD.Q2 (4) R13.H0 R9.H0 R11.H0
ADD.Q3 (4) R13.H1 R9.H1 R11.H1
[Figure: the register pair R8/R9, each 256 bits wide with halves H0 and H1]
(78)
Operand Flow (1)
ADD.Q1 (4) R12.H1 R8.H1 R10.H1
[Figure: operand flow – register pairs R8/R9, R10/R11, R12/R13 feeding two 4-lane, 128-bit ALUs over 128-bit buses]
(79)
Operand Flow (2)
[Figure: operand flow for ADD.Q1 (continued) – the H1 halves of R8 and R10 are read over the 128-bit buses]
(80)
Operand Flow (3)
[Figure: operand flow for the full sequence of quad instructions ADD.Q0–ADD.Q3]
(81)
Compressing Idle Cycles (1)
ADD (16) R12 R8 R10 [EXEC MASK 0xF0F0]
Quad instructions:
ADD.Q0 (4) R12.H0 R8.H0 R10.H0
ADD.Q1 (4) R12.H1 R8.H1 R10.H1
ADD.Q2 (4) R13.H0 R9.H0 R11.H0
ADD.Q3 (4) R13.H1 R9.H1 R11.H1
[Figure: with execution mask 0xF0F0, only the H1 halves of each register pair hold active operands for this instruction]
(82)
Compressing Idle Cycles (2)
ADD (16) R12 R8 R10 [EXEC MASK 0xF0F0]
[Figure: quads Q0 and Q2, whose mask slices are all zero, are suppressed; the remaining quads Q1 and Q3 are compressed into consecutive dispatch cycles]
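The suppression logic amounts to skipping the all-zero 4-bit slices of the execution mask. A sketch of the idea (an abstraction, not Gen hardware behavior):

```python
# Sketch of basic cycle compression: a SIMD-16 instruction dispatches as
# four 4-wide quads over four cycles; quads whose execution-mask slice is
# all zero are suppressed, and the surviving quads issue back-to-back.

def dispatch_quads(exec_mask, width=16, quad=4):
    issued = []
    for q in range(width // quad):
        slice_mask = (exec_mask >> (q * quad)) & ((1 << quad) - 1)
        if slice_mask:                      # suppress all-idle quads
            issued.append((q, slice_mask))
    return issued

# EXEC MASK 0xF0F0: lanes 4-7 (Q1) and 12-15 (Q3) are active.
quads = dispatch_quads(0xF0F0)
print(quads, "->", len(quads), "cycles instead of 4")  # [(1, 15), (3, 15)] -> 2
```

For mask 0xF0F0 only Q1 and Q3 dispatch, so the instruction completes in 2 cycles instead of 4. Note that a mask like 0x1111 (one active lane per quad) saves nothing under BCC, which is the motivation for swizzle cycle compression next.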
(83)
Swizzle Cycle Compression
• Use a more flexible thread → lane assignment
  – Operand data flow?
• Compact threads to create idle cycles in the pipeline
(84)
Distributing Data to Lanes (1)
[Figure: a 512-bit quad of 32-bit operands routed to the lane inputs]
• Restrict the swizzle patterns
  – Does not support all possible compression patterns
• Need fast computation of efficient swizzle settings
(85)
Distributing Data to Lanes (2)
[Figure: 512-bit quad feeding a 4-lane ALU]
• Each lane is connected to 32 bits of the 128-bit bus
• Each element of the quad is connected to 32 bits of a 128-bit segment
(86)
SCC Operation
• Can compact across quads
• Swizzle settings are computed overlapped with the RF access
• Pack the lane inputs into a quad
[Figure: RF for a single operand – 128 bits per cycle over cycles i … i+3 feeding the 4 lanes]
• Note the increased area and power/energy
• Note the need for inverse permutations
-
44
(87)
Compaction Opportunities
[Figure: SIMD-16 and SIMD-8 instructions with idle lanes in a thread's instruction stream; some active-lane patterns allow no further compaction]
• For K active threads, what is the maximum cycle savings for a SIMD-N instruction?
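One way to answer the slide's question, as a sketch assuming 4-wide dispatch quads and perfect intra-warp compaction: the instruction needs ceil(K/4) cycles instead of N/4, so the best-case saving is N/4 - ceil(K/4) cycles.

```python
# Sketch: maximum cycle savings under perfect compaction, assuming a
# SIMD-N instruction issues as N/4 4-wide quads and K threads are active.

import math

def max_cycle_savings(k_active, simd_n, quad=4):
    baseline = simd_n // quad                 # cycles without compaction
    compacted = math.ceil(k_active / quad) if k_active else 0
    return baseline - compacted

print(max_cycle_savings(5, 16))   # 4 - 2 = 2 cycles saved
print(max_cycle_savings(8, 8))    # 2 - 2 = 0: no compaction possible
```

The second case illustrates the figure's point: a SIMD-8 instruction with all 8 threads active has no idle quads to compress, no matter how the lanes are swizzled.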
(88)
Performance Savings
• Difference between saving cycles and saving time
  – When is saving cycles ≠ saving time?
(89)
Summary
• Multi-cycle warp/SIMD/work-group execution
• Optimize cycles per warp by compressing idle cycles
  – Rearrange idle cycles via swizzling to create opportunities
• Sensitivity to memory interface speeds
  – Memory-bound applications may experience limited benefit
(90)
Intra-Warp Compaction
[Figure: example CFG A–G executed by two thread blocks]
• Scope limited to within a warp
• Increasing the scope means increasing the warp size, explicitly or implicitly (treating multiple warps as a single warp)
(91)
Summary
• Control-flow divergence is a fundamental performance limiter for SIMT execution
• Dynamic warp formation is one way to mitigate its effects
  – We will look at several others
• Must balance a complex set of effects
  – Memory behaviors
  – Synchronization behaviors
  – Scheduler behaviors