Warp-Aware Trace Scheduling for GPUs
James Jablin (Brown)
Thomas Jablin (UIUC)
Onur Mutlu (CMU)
Maurice Herlihy (Brown)

Historical Trends in GFLOPS: CPUs vs. GPUs
[Figure: theoretical GFLOP/s (axis 0 to 3250) by year, 2001 to 2012. Series "NVIDIA GPU Single-Precision FP": GeForce 5800, GeForce 6800 Ultra, GeForce 7800 GTX, GeForce 8800 GTX, GeForce 280 GTX, GeForce 480 GTX, GeForce 580 GTX, GeForce 680 GTX. Series "Intel CPU Single-Precision FP": Northwood, Prescott, Woodcrest, Harpertown, Bloomfield, Westmere, Sandy Bridge.]
Reproduced from the NVIDIA CUDA C Programming Guide (Version 5.0)

Performance Pitfalls
Control flow can negatively affect performance.

Performance Pitfalls
Pipeline Stall - execution delay in an instruction pipeline to resolve a dependency

Hardware: CPU versus GPU
[Figure: side-by-side block diagrams of a CPU and a GPU, with components labeled Control, ALU, Cache, and DRAM.]
Reproduced from the NVIDIA CUDA C Programming Guide (Version 5.0)

[Figure, animated over 12 slides: instruction pipeline timing diagrams. Pipeline stages: Fetch, Decode, Execute, Write. Each diagram shows waiting instructions and completed instructions per clock cycle. Panel "With Branch Prediction": clock cycles 0 to 8. Panel "Without Branch Prediction": clock cycles 0 to 10, with a pipeline stall (bubble) labeled.]
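The cost of a stall in cycles can be sketched with a small model. This is a minimal illustration, assuming an ideal in-order pipeline that issues one instruction per cycle through the four stages named on the slides; the function name, the instruction count, and the stall count are hypothetical and are not taken from the deck.

```python
def pipeline_cycles(num_instructions, num_stages=4, stall_cycles=0):
    """Total cycles to drain an ideal in-order pipeline.

    The first instruction needs num_stages cycles; every later one
    completes one cycle after its predecessor. Each stall cycle
    (bubble) delays everything behind it by one cycle.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1) + stall_cycles


# Hypothetical numbers: five instructions, Fetch/Decode/Execute/Write.
print(pipeline_cycles(5))                  # 8
print(pipeline_cycles(5, stall_cycles=2))  # 10
```

In this model every bubble adds exactly one cycle to the total, which is why removing stalls translates directly into throughput.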

Performance Pitfalls
Pipeline Stall - execution delay in an instruction pipeline to resolve a dependency
Warp Divergence - threads within a warp take different paths and the different execution paths are serialized
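The serialization cost of warp divergence can be sketched with a toy model. This is a simplification, assuming a warp pays the full cost of a branch path whenever at least one of its threads takes that path; the function name and the cycle counts are hypothetical.

```python
def warp_branch_cycles(takes_branch, then_cycles, else_cycles):
    """Cycles one warp spends on an if/else under a simple SIMT model.

    takes_branch holds one boolean per thread in the warp. A path is
    executed (with inactive threads masked off) if any thread needs it,
    so a divergent warp pays for both paths back to back.
    """
    any_then = any(takes_branch)
    any_else = not all(takes_branch)
    return (then_cycles if any_then else 0) + (else_cycles if any_else else 0)


uniform = [True] * 32                              # every thread takes the branch
divergent = [lane % 2 == 0 for lane in range(32)]  # even lanes only

print(warp_branch_cycles(uniform, 10, 20))    # 10
print(warp_branch_cycles(divergent, 10, 20))  # 30
```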

Warp Divergence Example
[Figure, animated over 8 slides: control-flow graph with basic blocks A, B, C, and D. As the warp steps through the graph, the builds add the annotations "Warp Divergence!" (twice) and "Warp Reconverges!", and list the executed blocks A, B, and D in turn.]

Warp-Aware Trace Scheduling
Schedule instructions across basic block boundaries to expose additional ILP...
while managing and optimizing warp divergence.

Origins: Microcode Trace Scheduling
...generalizing local and disparate vertical-to-horizontal microcode compaction
Step - Description
1. Trace Selection - partition basic blocks into regions
2. Trace Formation - facilitate local scheduling, potentially adding nodes and edges
3. Local Scheduling - schedule instructions within each region
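The local scheduling step can be illustrated with list scheduling, a common choice for ordering instructions within a region. The slides do not say which local scheduler is used, so this is a sketch under stated assumptions: a single-issue machine, a priority of longest latency first, and hypothetical instruction names and latencies.

```python
def list_schedule(instrs, deps, latency):
    """Greedy list scheduling on a single-issue machine.

    instrs:  instruction names in original program order
    deps:    dict mapping an instruction to the set of instructions
             whose results it needs (must form an acyclic graph)
    latency: dict mapping an instruction to its latency in cycles
    Returns a dict mapping each instruction to its issue cycle.
    """
    issue = {}
    remaining = list(instrs)
    cycle = 0
    while remaining:
        # Ready: every predecessor has issued and its result is available.
        ready = [i for i in remaining
                 if all(p in issue and issue[p] + latency[p] <= cycle
                        for p in deps.get(i, ()))]
        if ready:
            # Longest latency first; ties go to original program order.
            pick = max(ready, key=lambda i: (latency[i], -instrs.index(i)))
            issue[pick] = cycle
            remaining.remove(pick)
        cycle += 1  # an empty ready list means a stall cycle
    return issue


# Hypothetical region: two loads feed an add; the mul is independent.
instrs = ["ld_a", "ld_b", "add", "mul"]
deps = {"add": {"ld_a", "ld_b"}}
latency = {"ld_a": 3, "ld_b": 3, "add": 1, "mul": 1}
print(list_schedule(instrs, deps, latency))
```

The independent `mul` fills a cycle that would otherwise be spent waiting on the loads. The larger the region, the more such filler instructions the scheduler can find.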

[Figure, animated over 10 slides: control-flow graph with basic blocks A through L. Edges are annotated with weights (values such as 100, 99, 92, 90, 87, 50, 13, 10, 8, and 1). The later builds highlight Trace #1, then Trace #2 and Trace #3.]
Annotate CFG
- dynamic profiling
- static branch prediction
while there are unvisited nodes
    loop
        Find the next unvisited node, with highest edge weight
        Add node to trace
    end loop
end while
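The selection loop can be sketched as follows. This is a minimal, forward-only version: it seeds each trace at the hottest unvisited block and then follows the heaviest edge to an unvisited successor. The seeding rule, the tie-breaking by name, and the four-block CFG are assumptions for illustration; the edges of the CFG in the slides cannot be recovered from the transcript.

```python
def select_traces(node_weight, edges):
    """Greedy trace selection over a weighted control-flow graph.

    node_weight: dict mapping a basic block to its execution count
    edges:       dict mapping (src, dst) to the edge weight
    Returns a list of traces, each a list of basic blocks.
    """
    succs = {}
    for (src, dst), weight in edges.items():
        succs.setdefault(src, []).append((weight, dst))

    unvisited = set(node_weight)
    traces = []
    while unvisited:
        # Seed: hottest unvisited block (name breaks ties deterministically).
        node = min(unvisited, key=lambda n: (-node_weight[n], n))
        trace = []
        while node is not None:
            trace.append(node)
            unvisited.discard(node)
            # Follow the heaviest edge that leads to an unvisited block.
            candidates = [(w, d) for w, d in succs.get(node, []) if d in unvisited]
            node = min(candidates, key=lambda c: (-c[0], c[1]))[1] if candidates else None
        traces.append(trace)
    return traces


# Hypothetical diamond: A branches to B (hot) or C (cold), both reach D.
node_weight = {"A": 100, "B": 90, "C": 10, "D": 100}
edges = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
print(select_traces(node_weight, edges))  # [['A', 'B', 'D'], ['C']]
```

The hot path becomes one long region for the scheduler, and the cold block is left in a trace of its own.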

[Figure, animated over 5 slides: PTX for basic blocks BB0, BB1, BB2, and BB3, before and after scheduling. CFG edge weights shown: 10, 90, 100, 100. Source line numbers are given per block.]

Before

BB0 (lines 10-13, ..., 21-22):
    ...
    mov.u32 %r11, %ctaid.y;
    add.s32 %r12, %r8, 1;
    mov.u32 %r1, %tid.y;
    mov.u32 %r2, %tid.x;
    ...
    setp.ne.s32 %p2, %r2, 0;
    @%p2 bra BB2;

BB1 (lines 34-38):
    ...
    mul.wide.s32 %rd13, %r4, 4;
    add.s64 %rd14, %rd2, %rd13;
    cvt.s64.s32 %rd7, %r6;
    add.s64 %rd8, %rd6, %rd7;
    bra.uni BB3;

BB2 (lines 39-44):
    BB2:
    ld.shared.f32 %f5, [%rd3];
    ld.shared.f32 %f6, [%rd3];
    mul.f32 %f7, %f5, %f6;
    st.shared.f32 [%r2], %f7;
    bra.uni BB3;

BB3 (lines 45-53, ...):
    BB3:
    mul.wide.s32 %rd15, %r3, 4;
    add.s64 %rd12, %rd1, %rd15;
    shl.b64 %rd16, %rd4, 6;
    mov.u64 %rd17, __param_0;
    add.s64 %rd18, %rd17, %rd16;
    mul.wide.s32 %rd19, %r2, 4;
    add.s64 %rd2, %rd18, %rd19;
    ld.global.f32 %f4, [%rd2];
    ...

After

Lines 48-53 move from BB3 into BB0, ahead of the conditional branch.

BB0 (lines 10-13, ..., 21, 48-53, 22):
    ...
    mov.u32 %r11, %ctaid.y;
    add.s32 %r12, %r8, 1;
    mov.u32 %r1, %tid.y;
    mov.u32 %r2, %tid.x;
    ...
    setp.ne.s32 %p2, %r2, 0;
    shl.b64 %rd16, %rd4, 6;
    mov.u64 %rd17, __param_0;
    add.s64 %rd18, %rd17, %rd16;
    mul.wide.s32 %rd19, %r2, 4;
    add.s64 %rd2, %rd18, %rd19;
    ld.global.f32 %f4, [%rd2];
    @%p2 bra BB2;

BB1 (lines 34-38) and BB2 (lines 39-44): unchanged.

BB3 (lines 45-47, ...):
    BB3:
    mul.wide.s32 %rd15, %r3, %rd8;
    add.s64 %rd12, %rd1, %rd15;
    ...
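One way to decide which instructions of a join block may move above the branch is a register-dependence check. This is a sketch, not the method used in the talk. It assumes the join block runs on every path, looks at register dependences only, treats memory operations as non-aliasing, and never moves instructions that have no destination register (such as stores). The `Instr` type, the function name, and the example blocks are hypothetical and only loosely modeled on the PTX above.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]     # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]   # registers read


def split_hoistable(join_block, side_blocks, branch_srcs=()):
    """Split join_block into (hoisted, kept) instruction lists.

    An instruction may move above the branch if no side block writes
    its sources, and no side block (or the branch) reads or writes its
    destination. Instructions that stay behind also block any later
    instruction that depends on them.
    """
    written = {i.dest for b in side_blocks for i in b if i.dest}
    read = {s for b in side_blocks for i in b for s in i.srcs} | set(branch_srcs)
    hoisted, kept = [], []
    for ins in join_block:
        safe = (ins.dest is not None
                and ins.dest not in written
                and ins.dest not in read
                and not any(s in written for s in ins.srcs))
        if safe:
            hoisted.append(ins)
        else:
            kept.append(ins)
            if ins.dest:
                written.add(ins.dest)
            read.update(ins.srcs)
    return hoisted, kept


# Hypothetical blocks: one side block defines %rd8, the other %f7.
bb1 = [Instr("cvt %rd7", "%rd7", ("%r6",)),
       Instr("add %rd8", "%rd8", ("%rd6", "%rd7"))]
bb2 = [Instr("mul %f7", "%f7", ("%f5", "%f6"))]
bb3 = [Instr("mul.wide %rd15", "%rd15", ("%r3", "%rd8")),
       Instr("shl %rd16", "%rd16", ("%rd4",)),
       Instr("add %rd18", "%rd18", ("%rd17", "%rd16")),
       Instr("ld %f4", "%f4", ("%rd18",))]

hoisted, kept = split_hoistable(bb3, [bb1, bb2], branch_srcs=("%p2",))
print([i.text for i in hoisted])  # ['shl %rd16', 'add %rd18', 'ld %f4']
print([i.text for i in kept])     # ['mul.wide %rd15']
```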

Comparing Speedup and IPC Using Dynamic Profiling
[Figure: bar chart with two series, Speedup and IPC, one pair of bars per kernel. Left axis: kernel speedup, 0.95× to 1.40×. Right axis: instructions executed per cycle (IPC), 1.00 to 1.40. Benchmarks: backprop, bfs, cfd, heartwall, hotspot, kmeans, lavaMD, leukocyte, lud, mummergpu, nn, nw, particlefilter(f), particlefilter(n), pathfinder, srad, streamcluster, plus GEOMEAN and HARMEAN.]
Backup Slides

| | Restricted [6] | General [6] | Boosting [36] | Deviant (GPU) |
|---|---|---|---|---|
| Scheduling Restrictions | Legal and Safe | Legal | none | excludes texture, shared and constant memory operations and all store instructions |
| Hardware Support | none | non-trapping instructions | shadow register file, shadow store buffer, and support for re-executing instructions | none |
| Exception Handling for Speculative Instructions | prohibited | ignored | supported | absent |
![Page 53: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/53.jpg)
GPU Programming Model
[Diagram: a Time axis with CPU and GPU columns. Host Code executes on the CPU and Device Code on the GPU, alternating in Cyclic Communication between CPU and GPU.]
![Page 54: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/54.jpg)
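GPU Programming Model
[Diagram: a Time axis with CPU and GPU columns. Host Code executes on the CPU and Device Code on the GPU, alternating in Cyclic Communication between CPU and GPU. The Device Code runs as a Grid of blocks: Block (0,0), Block (1,0), Block (0,1), Block (1,1).]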
![Page 55: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/55.jpg)
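GPU Programming Model
[Diagram: a Time axis with CPU and GPU columns. Host Code executes on the CPU and Device Code on the GPU, alternating in Cyclic Communication between CPU and GPU. The Device Code runs as a Grid of blocks: Block (0,0), Block (1,0), Block (0,1), Block (1,1). Block (0,1) is expanded into its threads: Thread (0,0), Thread (1,0), Thread (2,0), Thread (0,1), Thread (1,1), Thread (2,1).]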
![Page 56: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/56.jpg)
Characterizing the Grid...
[Diagram: the Grid, gridDim.x blocks wide and gridDim.y blocks tall.]
![Page 57: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/57.jpg)
Characterizing the Grid, Blocks...
[Diagram: the Grid, gridDim.x blocks wide and gridDim.y blocks tall. Block (0,1) is highlighted; it is blockDim.x threads wide and blockDim.y threads tall.]
![Page 58: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/58.jpg)
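Characterizing the Grid, Blocks...
[Diagram: the Grid, gridDim.x blocks wide and gridDim.y blocks tall. The highlighted block is Block (blockIdx.x, blockIdx.y); it is blockDim.x threads wide and blockDim.y threads tall.]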
![Page 59: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/59.jpg)
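Characterizing the Grid, Blocks, and Threads
[Diagram: the Grid, gridDim.x blocks wide and gridDim.y blocks tall. The highlighted block is Block (blockIdx.x, blockIdx.y); it is blockDim.x threads wide and blockDim.y threads tall. Thread (0,1) is highlighted inside it.]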
![Page 60: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/60.jpg)
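Characterizing the Grid, Blocks, and Threads
[Diagram: the Grid, gridDim.x blocks wide and gridDim.y blocks tall. The highlighted block is Block (blockIdx.x, blockIdx.y); it is blockDim.x threads wide and blockDim.y threads tall. The highlighted thread inside it is Thread (threadIdx.x, threadIdx.y).]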
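The index variables on these slides are typically combined into a grid-wide thread coordinate. The following is a minimal host-side sketch in plain C++ (no GPU needed) of that arithmetic. `Dim2`, `globalThreadCoord`, `globalLinearId`, and `checkSlideLayout` are hypothetical names introduced for illustration, and the row-major linear ID is one common convention, not something the slides prescribe.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's built-in index/dimension variables
// (gridDim, blockIdx, blockDim, threadIdx), restricted to x and y.
struct Dim2 { int x; int y; };

// Grid-wide (x, y) coordinate of a thread: offset of its block plus its
// position inside the block.
Dim2 globalThreadCoord(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx) {
    return { blockIdx.x * blockDim.x + threadIdx.x,
             blockIdx.y * blockDim.y + threadIdx.y };
}

// Row-major linear ID of that thread across the whole grid
// (an assumed convention chosen for this sketch).
int globalLinearId(Dim2 gridDim, Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx) {
    Dim2 g = globalThreadCoord(blockIdx, blockDim, threadIdx);
    int threadsPerGridRow = gridDim.x * blockDim.x;
    return g.y * threadsPerGridRow + g.x;
}

// Usage: the 2x2 grid of 3x2 blocks drawn on the programming-model slides.
void checkSlideLayout() {
    Dim2 gridDim{2, 2}, blockDim{3, 2};
    Dim2 c = globalThreadCoord({0, 1}, blockDim, {1, 1});  // Block (0,1), Thread (1,1)
    assert(c.x == 1 && c.y == 3);
    assert(globalLinearId(gridDim, {0, 0}, blockDim, {0, 0}) == 0);   // first thread
    assert(globalLinearId(gridDim, {1, 1}, blockDim, {2, 1}) == 23);  // last of 24 threads
}
```

Inside a real kernel the same expressions would read the built-ins directly, for example `blockIdx.x * blockDim.x + threadIdx.x`.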
![Page 61: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/61.jpg)
Warp Divergence Examples
Assuming one block of 128 threads...

| Example | Divergence? |
|---|---|
| if (threadIdx.x < 32) { } | |
![Page 62: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/62.jpg)
Warp Divergence Examples
Assuming one block of 128 threads...

| Example | Divergence? |
|---|---|
| if (threadIdx.x > 15) { } | |
| if (threadIdx.x < 32) { } | NO |
![Page 63: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/63.jpg)
Warp Divergence Examples
Assuming one block of 128 threads...

| Example | Divergence? |
|---|---|
| if (threadIdx.x > 15) { } | YES |
| if (threadIdx.x < 32) { } | NO |
| if (threadIdx.x > 65) { } | |
![Page 64: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/64.jpg)
Warp Divergence Examples
Assuming one block of 128 threads...

| Example | Divergence? |
|---|---|
| if (threadIdx.x > 15) { } | YES |
| if (threadIdx.x < 32) { } | NO |
| if (threadIdx.x > 65) { } | YES |
| if (blockIdx.x > 1) { } | |
![Page 65: Warp-Aware Trace Scheduling for GPUSomutlu/pub/warp-aware...Origins: Microcode Trace Scheduling...generalizing local and disparate vertical-to-horizontal microcode compaction 3. Local](https://reader033.fdocuments.in/reader033/viewer/2022051913/6003f9a96c76fa7928555bf5/html5/thumbnails/65.jpg)
Warp Divergence Examples
Assuming one block of 128 threads...

| Example | Divergence? |
|---|---|
| if (threadIdx.x > 15) { } | YES |
| if (threadIdx.x < 32) { } | NO |
| if (threadIdx.x > 65) { } | YES |
| if (blockIdx.x > 1) { } | NO |
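The YES/NO answers above can be checked mechanically. The following is a minimal host-side sketch in plain C++ (no GPU needed). It assumes a warp is 32 consecutive threads of a one-dimensional block, which the slide implies but does not state. `warpDiverges`, `Predicate`, and the four predicate functions are hypothetical names introduced for illustration.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // assumption: 32 consecutive threads form a warp

// A predicate receives the values a CUDA thread would read from
// threadIdx.x and blockIdx.x.
using Predicate = bool (*)(int threadIdxX, int blockIdxX);

// Returns true if, in at least one warp of the block, some threads take
// the branch while others do not.
bool warpDiverges(Predicate pred, int blockDimX, int blockIdxX = 0) {
    for (int warpStart = 0; warpStart < blockDimX; warpStart += kWarpSize) {
        int taken = 0, total = 0;
        for (int t = warpStart; t < warpStart + kWarpSize && t < blockDimX; ++t) {
            taken += pred(t, blockIdxX) ? 1 : 0;
            ++total;
        }
        if (taken != 0 && taken != total) return true;  // mixed outcome in this warp
    }
    return false;
}

// The four branch conditions from the slide.
bool gt15(int tx, int) { return tx > 15; }      // splits warp 0 at thread 16
bool lt32(int tx, int) { return tx < 32; }      // boundary falls between warps
bool gt65(int tx, int) { return tx > 65; }      // splits warp 2 at thread 66
bool blockGt1(int, int bx) { return bx > 1; }   // uniform within a block

// Usage: one block of 128 threads, as on the slide.
void checkSlideExamples() {
    assert(warpDiverges(gt15, 128));       // YES
    assert(!warpDiverges(lt32, 128));      // NO
    assert(warpDiverges(gt65, 128));       // YES
    assert(!warpDiverges(blockGt1, 128));  // NO
}
```

A condition diverges only when its true/false boundary falls inside a warp. `threadIdx.x < 32` splits the block exactly at a warp boundary, and `blockIdx.x` is identical for every thread in a block, so neither diverges.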