Application-aware Memory System for Fair and Efficient Execution of Concurrent
GPGPU Applications
Adwait Jog (Penn State), Evgeny Bolotin (NVIDIA), Zvika Guz (NVIDIA, now at Samsung), Mike Parker (NVIDIA, now at Intel), Steve Keckler (NVIDIA / UT Austin), Mahmut Kandemir (Penn State), Chita Das (Penn State)
GPGPU Workshop @ ASPLOS 2014
Era of Throughput Architectures
GPUs are scaling: number of CUDA cores, DRAM bandwidth
GTX 275 (Tesla): 240 cores, 127 GB/sec
GTX 480 (Fermi): 448 cores, 139 GB/sec
GTX 780 Ti (Kepler): 2880 cores, 336 GB/sec
Prior Approach (Looking Back)
Execute one kernel at a time
Works great, if the kernel has enough parallelism
[Diagram: a single application occupying SM-1 through SM-X, connected through the interconnect and cache to memory]
Current Trend
What happens when kernels do not have enough threads?
Execute multiple kernels (from the same application/context) concurrently
Current architectures (Fermi, Kepler) support this feature
Future Trend (Looking Forward)
[Diagram: Application-1 on SM-1 through SM-A, Application-2 on SM-A+1 through SM-B, …, Application-N up to SM-X, all sharing the interconnect, cache, and memory]
We study execution of multiple kernels from multiple applications (contexts)
Why Multiple Applications (Contexts)?
Improves overall GPU throughput
Improves portability of older apps (with limited thread-scalability) to newer, scaled-up GPUs
Supports consolidation of multiple users' requests onto the same GPU
We study two application scenarios
1. One application runs alone on a 60-SM GPU (Alone_60)
2. Co-scheduling two apps, assuming equal partitioning: 30 SM + 30 SM
[Diagrams: (1) a single application on SM-1 through SM-60; (2) Application-1 on SM-1 through SM-30 and Application-2 on SM-31 through SM-60, sharing the interconnect, cache, and memory]
Metrics
Instruction Throughput (sum of IPCs): IPC(App1) + IPC(App2) + … + IPC(AppN)
Weighted Speedup (with co-scheduling): Speedup(App-i) = Co-scheduled IPC(App-i) / Alone IPC(App-i); Weighted Speedup = sum of speedups of ALL apps
Best case: Weighted Speedup = N (number of apps)
With destructive interference: Weighted Speedup can be between 0 and N
Time-slicing (running alone): Weighted Speedup = 1 (baseline)
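A minimal sketch of these two metrics, assuming per-application IPC values measured once co-scheduled and once alone; the helper names are illustrative, not from the paper:

```cpp
#include <numeric>
#include <vector>
#include <cstddef>

// Instruction throughput: sum of co-scheduled IPCs across all applications.
double instruction_throughput(const std::vector<double>& cosched_ipc) {
    return std::accumulate(cosched_ipc.begin(), cosched_ipc.end(), 0.0);
}

// Weighted speedup: sum over apps of (co-scheduled IPC / alone IPC).
// Best case is N (no interference); the time-slicing baseline is 1.
double weighted_speedup(const std::vector<double>& cosched_ipc,
                        const std::vector<double>& alone_ipc) {
    double ws = 0.0;
    for (std::size_t i = 0; i < cosched_ipc.size(); ++i)
        ws += cosched_ipc[i] / alone_ipc[i];
    return ws;
}
```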
Outline
Introduction and motivation
Positives and negatives of co-scheduling multiple applications
Understanding inefficiencies in memory-subsystem
Proposed DRAM scheduler for better performance and fairness
Evaluation
Conclusions
Positives of co-scheduling multiple apps
Weighted Speedup = 1.4, when HIST is concurrently executed with DGEMM
40% improvement over running alone (time-slicing)
Gain in weighted speedup (application throughput)
[Charts: weighted speedup of co-scheduled HIST+DGEMM vs. the Alone_60 baseline; speedup of HIST (Alone_60 vs. with DGEMM); speedup of DGEMM (Alone_60 vs. with HIST)]
Negatives of co-scheduling multiple apps (1): (A) Fairness
Unequal performance degradation indicates unfairness in the system
[Chart: HIST speedup over Alone_60 when co-scheduled with DGEMM vs. with GUPS]
Negatives of co-scheduling multiple apps (2): (B) Weighted Speedup (Application Throughput)
GAUSS+GUPS: only 2% improvement in weighted speedup over running alone
With destructive interference, weighted speedup can be between 0 and 2 (it can also drop below the baseline of 1)
[Chart: weighted speedup of HIST+DGEMM, HIST+GUPS, and GAUSS+GUPS vs. the baseline]
Summary: Positives and Negatives
[Chart: baseline weighted speedup per workload (stacked 1st App and 2nd App contributions) for hist_gauss, hist_gups, hist_bfs, hist_3ds, hist_dgemm, gauss_gups, gauss_bfs, gauss_3ds, gauss_dgemm, gups_bfs, gups_3ds, gups_dgemm, bfs_3ds, bfs_dgemm]
Highlighted workloads exhibit unfairness (imbalance in the red and green portions) and low throughput
Naïve coupling of two apps is probably not a good idea
Outline
Introduction and motivation
Positives and negatives of co-scheduling multiple applications
Understanding inefficiencies in memory-subsystem
Proposed DRAM scheduler for better performance and fairness
Evaluation
Conclusions
Primary Sources of Inefficiencies
Application interference at many levels: L2 caches, interconnect, DRAM (primary focus of this work)
[Diagram: Application-1 through Application-N running on partitioned SMs, sharing the interconnect, cache, and memory]
Bandwidth Distribution
Bandwidth-intensive applications (e.g. GUPS) take the majority of memory bandwidth
[Chart: percentage of peak DRAM bandwidth split into 1st App, 2nd App, Wasted-BW, and Idle-BW for each first application (HIST, GAUSS, GUPS, BFS, 3DS, DGEMM) paired with alone_30, alone_60, and each other application]
The red portion is the fraction of wasted DRAM cycles during which no data is transferred over the bus
Revisiting Fairness and Throughput
[Chart: baseline weighted speedup per workload, stacked 1st App and 2nd App contributions, for the 14 two-application workloads]
Imbalance in the green and red portions indicates unfairness
Current Memory Scheduling Schemes
Agnostic to the different requirements of memory requests coming from different applications
Leads to unfairness and sub-optimal performance
Primarily focus on improving DRAM efficiency
Commonly Employed Memory Scheduling Schemes
[Diagram: per-bank service timelines for requests R1, R2, R3 from App-1 and App-2, where R1/R2/R3 target Row-1/Row-2/Row-3]
Simple FCFS: serves requests in arrival order, causing frequent row switches and a low DRAM page hit rate
Out-of-order (FR-FCFS): prioritizes requests that hit the open row, giving a high DRAM page hit rate
Both schedulers are application agnostic! (App-2 suffers)
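As a rough illustration (not the paper's simulator code), the baseline FR-FCFS selection rule for a single DRAM bank can be sketched as follows; the Request and Bank structures are assumptions made for this example:

```cpp
#include <cstdint>
#include <deque>
#include <optional>

// Illustrative structures for one DRAM bank's request queue.
struct Request {
    int      app_id;   // issuing application (ignored by FR-FCFS)
    uint32_t row;      // DRAM row targeted by the request
    uint64_t arrival;  // arrival order; the queue keeps oldest first
};

struct Bank {
    std::optional<uint32_t> open_row;  // currently open row, if any
    std::deque<Request>     queue;     // pending requests, oldest first
};

// FR-FCFS: serve the oldest request that hits the open row (no row switch);
// if no request hits, fall back to the oldest request overall (FCFS).
std::optional<Request> pick_fr_fcfs(Bank& bank) {
    if (bank.queue.empty()) return std::nullopt;

    auto chosen = bank.queue.begin();  // default: oldest request
    for (auto it = bank.queue.begin(); it != bank.queue.end(); ++it) {
        if (bank.open_row && it->row == *bank.open_row) {
            chosen = it;  // oldest row hit wins
            break;
        }
    }

    Request r = *chosen;
    bank.queue.erase(chosen);
    bank.open_row = r.row;  // serving the request opens (or keeps) its row
    return r;
}
```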
Outline
Introduction and motivation
Positives and negatives of co-scheduling multiple applications
Understanding inefficiencies in memory-subsystem
Proposed DRAM scheduler for better performance and fairness
Evaluation
Conclusions
Proposal: FR-FCFS (Baseline) to FR-(RR)-FCFS (Proposed)
As an example of adding application-awareness: instead of FCFS ordering, schedule requests across applications in round-robin fashion, while preserving the page hit rates
Improves fairness and improves performance
Proposed Application-Aware FR-(RR)-FCFS Scheduler
[Diagram: per-bank service timelines for requests R1, R2, R3 (to Row-1, Row-2, Row-3) from App-1 and App-2, under baseline FR-FCFS and under the proposed FR-(RR)-FCFS]
App-2 is scheduled after App-1 in round-robin order
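A hedged sketch of the proposed idea, using the same assumed Request/Bank structures as the FR-FCFS example above: row hits are still served first (so the page hit rate is preserved), but when a row switch is unavoidable, the turn rotates over applications in round-robin order instead of going to the globally oldest request. This is one possible reading of FR-(RR)-FCFS, not the paper's actual implementation:

```cpp
#include <cstdint>
#include <deque>
#include <optional>

struct Request { int app_id; uint32_t row; uint64_t arrival; };

struct Bank {
    std::optional<uint32_t> open_row;  // currently open row, if any
    std::deque<Request>     queue;     // pending requests, oldest first
};

class FrRrFcfsScheduler {
public:
    explicit FrRrFcfsScheduler(int num_apps) : num_apps_(num_apps) {}

    std::optional<Request> pick(Bank& bank) {
        if (bank.queue.empty()) return std::nullopt;

        // 1. Row-hit-first, exactly as in baseline FR-FCFS: open-row hits
        //    still win, so the DRAM page hit rate is unchanged.
        for (auto it = bank.queue.begin(); it != bank.queue.end(); ++it) {
            if (bank.open_row && it->row == *bank.open_row)
                return serve(bank, it);
        }

        // 2. A row switch is unavoidable: rotate over applications in
        //    round-robin order and serve the oldest request of the app
        //    whose turn it is, instead of the globally oldest request.
        for (int i = 1; i <= num_apps_; ++i) {
            int app = (last_switch_app_ + i) % num_apps_;
            for (auto it = bank.queue.begin(); it != bank.queue.end(); ++it) {
                if (it->app_id == app) {
                    last_switch_app_ = app;
                    return serve(bank, it);
                }
            }
        }
        return serve(bank, bank.queue.begin());  // defensive fallback
    }

private:
    static std::optional<Request> serve(Bank& bank,
                                        std::deque<Request>::iterator it) {
        Request r = *it;
        bank.queue.erase(it);
        bank.open_row = r.row;  // serving the request opens its row
        return r;
    }

    int num_apps_;
    int last_switch_app_ = -1;  // last application granted a row switch
};
```

The actual scheduler may track per-application queues or per-bank round-robin state differently; the point of the sketch is only that application identity, not arrival order, breaks ties on row switches.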
DRAM Page Hit-Rates
[Chart: DRAM page hit rates per workload for FR-FCFS and FR-RR-FCFS]
Same page hit-rates as the baseline (FR-FCFS)
Outline
Introduction and motivation
Positives and negatives of co-scheduling multiple applications
Understanding inefficiencies in memory-subsystem
Proposed DRAM scheduler for better performance and fairness
Evaluation
Conclusions
Simulation Environment
GPGPU-Sim (v3.2.1)
Kernels from multiple applications are issued to different concurrent CUDA streams (see the sketch after this list)
14 two-application workloads considered with varying memory demands
Baseline configuration similar to a scaled-up GTX 480:
60 SMs, 32 SIMT lanes, 32 threads/warp
16KB L1 (4-way, 128B cache blocks) + 48KB shared memory per SM
6 memory partitions/channels (total bandwidth: 177.6 GB/sec)
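For context, a minimal CUDA sketch of how kernels from two different "applications" can be launched into separate concurrent streams; kernel_app1/kernel_app2 are hypothetical placeholders (not the paper's benchmarks), buffer contents are left uninitialized, and error checking is omitted:

```cpp
#include <cuda_runtime.h>

__global__ void kernel_app1(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void kernel_app2(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d1 = nullptr, *d2 = nullptr;
    cudaMalloc(&d1, n * sizeof(float));
    cudaMalloc(&d2, n * sizeof(float));

    // One stream per "application": kernels launched into different
    // streams may execute concurrently on the same GPU.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernel_app1<<<(n + 255) / 256, 256, 0, s1>>>(d1, n);
    kernel_app2<<<(n + 255) / 256, 256, 0, s2>>>(d2, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d1);
    cudaFree(d2);
    return 0;
}
```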
Improvement in Fairness
Fairness Index = max(r1, r2), where r1 = Speedup(app1) / Speedup(app2) and r2 = Speedup(app2) / Speedup(app1)
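A small sketch of this fairness index for the two-application case (the function name is illustrative):

```cpp
#include <algorithm>

// Fairness index for two co-scheduled applications: the larger of the two
// speedup ratios. A value of 1.0 means both apps slow down equally (fair);
// larger values mean one app degrades much more than the other.
double fairness_index(double speedup_app1, double speedup_app2) {
    return std::max(speedup_app1 / speedup_app2,
                    speedup_app2 / speedup_app1);
}
```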
[Chart: fairness index per workload for FR-FCFS vs. FR-RR-FCFS; lower is better]
On average 7% improvement (up to 49%) in fairness
Significantly reduces the negative impact of BW-sensitive applications (e.g. GUPS) on the overall fairness of the GPU system
Improvement in Performance (Normalized to FR-FCFS)
[Charts: per-workload instruction throughput and weighted speedup for FR-RR-FCFS, normalized to FR-FCFS]
On average 10% improvement (up to 64%) in instruction throughput and up to 7% improvement in weighted speedup
Significantly reduces the negative impact of BW-sensitive applications (e.g. GUPS) on the overall performance of the GPU system
Bandwidth Distribution with Proposed Scheduler
[Chart: percentage of peak DRAM bandwidth (1st App, 2nd App, Wasted-BW, Idle-BW) when HIST, GAUSS, and 3DS each run with GUPS, comparing alone_30, alone_60, FR-FCFS, and FR-RR-FCFS]
Lighter applications get a better share of DRAM bandwidth
Conclusions
Naïve coupling of applications is probably not a good idea: co-scheduled applications interfere in the memory subsystem, leading to sub-optimal performance and fairness
Current DRAM schedulers are agnostic to applications and treat all memory requests equally
Application-aware memory system is required for enhanced performance and superior fairness
Thank You!
Questions?