Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Ankit Sethia*, D. Anoushe Jamshidi, Scott Mahlke
University of Michigan
2
Graphics
Simulation
Linear Algebra
Data Analytics
Machine Learning
Computer Vision
All kinds of applications, both compute and memory intensive, are targeting GPUs
GPU usage is expanding
3
Performance variation of kernels
[Chart: % of peak IPC for compute intensive benchmarks (PF, kmeans-2, sgemm, leuko-2, tpacf, cutcp, histo-2, BP-1, stencil, mrig-3, histogram, sad-1, mri-q, lavaMD, leuko-3) and memory intensive benchmarks (leuko-1, BP-2, srad-1, srad-2, mrig-1, histo-3, spmv, mrig-2, histo-1, lbm, sad-2, mummer, particle, bfs, kmeans-1); y-axis 0-100%]
Memory intensive kernels saturate bandwidth and get lower performance
4
Impact of memory saturation - I
[Diagram: multiple SMs, each with FPUs, an LSU, and an L1 cache, all sharing the memory system]
• Memory intensive kernels serialize memory requests
• Critical to prioritize the order of requests from SMs
5
Impact of memory saturation
[Chart: fraction of peak IPC and fraction of cycles the LSU is stalled, for the same compute intensive and memory intensive benchmarks as above; y-axis 0-100%]
Significant stalls in the LSU correspond to low performance in memory intensive kernels
6
Impact of memory saturation - II
[Diagram: SMs with FPUs, LSU, and L1 connected to the memory system; warps W0 and W1 are queued at the LSU while W1's data already sits in the L1 cache blocks]
• Data is present in the cache, but the LSU can't access it
• Unable to feed enough data for processing
7
Increasing memory resources
[Chart: speedup (0-2x) of memory intensive kernels with large MSHRs + queues, full associativity, +20% frequency, and all combined]
Large # of MSHRs + full associativity + 20% bandwidth boost: UNBUILDABLE
8
During memory saturation:
• Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (MAS)
• Data present in the cache cannot be reused, as the data cache cannot accept any request: Cache Access Re-execution (CAR)
MAS + CAR = Mascar
9
Memory Aware Scheduling
[Diagram: during memory saturation, Warp 0, Warp 1, and Warp 2 each have outstanding memory requests]
Serving one request and switching to another warp (RR):
• No warp is ready to make forward progress
10
Memory Aware Scheduling
[Diagram: during memory saturation, Warp 0, Warp 1, and Warp 2 each have outstanding memory requests]
Serving one request and switching to another warp (RR):
• No warp is ready to make forward progress
Serving all requests from one warp and then switching to another (MAS):
• One warp is ready to begin computation early
GTO issues instructions from another warp whenever:
• There is no instruction in the i-buffer for that warp
• There is a dependency between instructions
This makes GTO similar to RR, as multiple warps may issue memory requests
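The RR-versus-MAS contrast can be sketched with a toy model (not the paper's simulator): assume every warp must issue the same number of memory requests, the memory system retires one request per cycle, and a warp can begin computing once all of its requests have returned.

```python
from collections import Counter

def first_compute_cycle(schedule):
    """Given the order in which requests are served (a list of warp ids),
    return the cycle at which the earliest warp has all its requests done."""
    need = Counter(schedule)          # total requests per warp
    seen = Counter()
    for cycle, warp in enumerate(schedule, start=1):
        seen[warp] += 1
        if seen[warp] == need[warp]:  # this warp's last request just finished
            return cycle
    return None

warps, reqs = 3, 4
rr  = [w for _ in range(reqs) for w in range(warps)]  # round-robin: 0,1,2,0,1,2,...
mas = [w for w in range(warps) for _ in range(reqs)]  # MAS: all of warp 0, then warp 1, ...

print(first_compute_cycle(rr))   # 10: no warp finishes until near the end
print(first_compute_cycle(mas))  # 4: warp 0 overlaps compute with warps 1-2's memory
```

Under RR-style interleaving the first warp cannot compute until cycle 10; serving one warp's requests back to back lets it start at cycle 4, which is the overlap MAS is after.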
11
MAS operation
[Flowchart: check whether the kernel is memory intensive (MSHRs or miss queue almost full).
  N → schedule in Equal Priority mode.
  Y → schedule in Memory Priority (MP) mode:
      assign a new owner warp (only the owner's requests can go beyond the L1);
      execute memory instructions only from the owner, while other warps can execute compute instructions;
      when the next instruction of the owner depends on an already issued load, assign a new owner warp.]
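The mode decision and ownership hand-off described above can be sketched in Python. The capacities, the 90% "almost full" threshold, and the round-robin hand-off are illustrative assumptions, not the paper's parameters.

```python
# Illustrative capacities -- assumptions, not Mascar's actual configuration.
MSHR_CAPACITY, MISS_Q_CAPACITY = 64, 8
ALMOST_FULL = 0.9  # "almost full" threshold (assumed)

def select_mode(mshrs_used, miss_q_used):
    """Enter Memory Priority (MP) mode when the memory system is saturating."""
    if (mshrs_used >= ALMOST_FULL * MSHR_CAPACITY or
            miss_q_used >= ALMOST_FULL * MISS_Q_CAPACITY):
        return "MP"   # only the owner warp may issue requests past the L1
    return "EP"       # Equal Priority: any warp may issue memory instructions

def step_mp(owner, warps, dependent_on_issued_load):
    """In MP mode, hand ownership to the next warp once the owner's next
    instruction depends on a load that is still in flight."""
    if dependent_on_issued_load:
        return warps[(warps.index(owner) + 1) % len(warps)]
    return owner

print(select_mode(mshrs_used=60, miss_q_used=2))  # MP: MSHRs almost full
print(select_mode(mshrs_used=10, miss_q_used=2))  # EP: memory not saturated
```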
12
Implementation of MAS
[Diagram: the scheduler receives ordered warps from Decode via the I-Buffer and Scoreboard; the Warp Readiness Checker (WRC) holds the owner warp id, a stall bit, and the memory saturation flag; the Warp Status Table (WST) records each warp's op type; Mem_Q and Comp_Q heads feed the issued warp from the RF]
• Divide warps into memory and compute warps in the ordered warps list
• Warp Readiness Checker (WRC): tests whether a warp should be allowed to issue memory instructions
• Warp Status Table (WST): decides whether the scheduler should schedule from a warp
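A minimal sketch of this bookkeeping: a WST entry per warp with an op type and stall bit, and a selection routine that enforces the MP-mode rule (memory instructions only from the owner, compute from anyone). Field names mirror the slide; the selection policy itself is a simplification, not the hardware's exact priority logic.

```python
from dataclasses import dataclass

@dataclass
class WSTEntry:
    op_type: str   # "mem" or "comp" -- type of the warp's next instruction
    stalled: bool  # set by the WRC when the warp may not issue

def pick_warp(wst, owner, saturated):
    """Return the id of a warp allowed to issue this cycle, else None."""
    ready = [w for w, e in wst.items() if not e.stalled]
    if saturated:
        # MP mode: memory instructions only from the owner warp;
        # other warps may still issue compute instructions.
        for w in ready:
            if wst[w].op_type == "mem" and w == owner:
                return w
        for w in ready:
            if wst[w].op_type == "comp":
                return w
        return None
    return ready[0] if ready else None  # EP mode: any ready warp

wst = {0: WSTEntry("mem", False), 1: WSTEntry("mem", False), 2: WSTEntry("comp", False)}
print(pick_warp(wst, owner=0, saturated=True))  # 0: owner's memory inst. issues
wst[1].stalled = True
print(pick_warp(wst, owner=1, saturated=True))  # 2: only the compute warp can issue
```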
13
During memory saturation:
• Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (MAS)
• Data present in the cache cannot be reused, as the data cache cannot accept any request: Cache Access Re-execution (CAR)
MAS + CAR = Mascar
14
Cache access re-execution
[Diagram: the Load Store Unit is backed by a re-execution queue holding W1 and W0; W2's access hits in the L1 cache while W1's data is already in the cache blocks]
Better than adding more MSHRs:
• More MSHRs cause faster saturation of memory
• They also cause faster thrashing of the data cache
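A sketch of the re-execution idea under a toy cache model: when the L1 cannot accept a miss (e.g. MSHRs full), the access is parked in a small queue instead of stalling the LSU, and is retried later, by which point the line may have arrived and the access becomes a hit under miss. The class, method names, and capacities are illustrative assumptions.

```python
from collections import deque

class LSUWithCAR:
    """Toy Load Store Unit with a Cache Access Re-execution (CAR) queue."""
    def __init__(self, reexec_capacity=4):
        self.cache = set()        # addresses currently resident in the L1
        self.mshrs_full = False   # back-pressure: the L1 cannot accept misses
        self.reexec_q = deque(maxlen=reexec_capacity)

    def access(self, addr):
        if addr in self.cache:
            return "hit"          # hits proceed even when MSHRs are full
        if self.mshrs_full:
            self.reexec_q.append(addr)  # park the miss instead of stalling
            return "queued"
        return "miss-sent"        # allocate an MSHR and go to memory

    def retry(self):
        """Re-execute parked accesses; still-blocked misses re-queue themselves."""
        pending = list(self.reexec_q)
        self.reexec_q.clear()
        return [self.access(a) for a in pending]

lsu = LSUWithCAR()
lsu.mshrs_full = True
print(lsu.access(0x100))  # queued: the L1 cannot accept the miss
lsu.cache.add(0x100)      # the line arrives while the request waits
print(lsu.retry())        # ['hit']: the data is reused without stalling the LSU
```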
15
Experimental methodology
• GPGPU-Sim 3.2.2 – GTX 480 architecture
• SMs – 15, 32 PEs/SM
• Schedulers – LRR, GTO, OWL and CCWS
• L1 cache – 32kB, 64 sets, 4-way, 64 MSHRs
• L2 cache – 768kB, 8-way, 6 partitions, 200 core cycle latency
• DRAM – 32 requests/partition, 440 core cycle latency
16
Performance of compute intensive kernels
[Chart: speedup w.r.t. RR (0-1.25x) of GTO, OWL, CCWS, and Mascar for PF, kmeans-2, sgemm, leuko-2, tpacf, cutcp, histo-2, BP-1, stencil, mrig-3, histogram, sad-1, mri-q, lavaMD, leuko-3, and their GMEAN]
Performance of compute intensive kernels is insensitive to scheduling policies
17
Performance of memory intensive kernels
[Chart: speedup w.r.t. RR (0-2x) of GTO, OWL, CCWS, MAS, and CAR on bandwidth intensive and cache sensitive kernels; bars for BP-2, histo-3, histo-1, and sad-2 are clipped at 3.0, 4.8, and 4.24]
Speedup over RR:
Scheduler             GTO   OWL   CCWS   Mascar
Bandwidth Intensive    4%    4%     4%      17%
Cache Sensitive       24%    4%    55%      56%
Overall               13%    4%    24%      34%
18
Conclusion
During memory saturation:
• Serialization of memory requests causes less overlap between memory accesses and compute: Memory Aware Scheduling (MAS) allows one warp to issue all its requests and begin computation early
• Data present in the cache cannot be reused, as the data cache cannot accept any request: Cache Access Re-execution (CAR) exploits more hit-under-miss opportunities through a re-execution queue
34% speedup, 12% energy savings
Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Questions?