Critical Power Slope: Understanding the Runtime Effects of Frequency Scaling
Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100...
Transcript of Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100...
http://www.mummi.org
Using MuMMI to Model and Optimize Energy and Performance
Xingfu Wu and Valerie Taylor Texas A&M University
Scalable Tools Workshop 2015, Lake Tahoe, CA August 3, 2015
http://www.mummi.org
Outline
n Recent Development of MuMMI
n Power Measurement Tools
n Hardware Performance Counter Tools
n Performance Counter-based Modeling
n Performance Counter-guided Optimization
http://www.mummi.org
MuMMI (Multiple Metrics Modeling Infrastructure) Project
Multicore/Heterogeneous System for Execution
PAPI PowerPack Database
Application
E-AMOM
EMON API
http://www.mummi.org
MuMMI Database Schema
Application Executable Run Application Performance
Modules
Function Performance
Basic Unit Performance
Data Structure Performance
Inputs
Systems
Resource
Connection
Functions Module_Info
Control Flow
Compilers
Model Template
Function_Info
Library Model_Info
User
Counters
Sys_Comp
Sys_Comm
Power Coupling
Power Counters
http://www.mummi.org
Data Collection: MAIDE System Source code
MAIDE
Instrumented source code
Compiler
Instrumented executable
Performance, HW counters, Power and Energy Data
MuMMI Database
Call Graph Power and HW Counters
SOAP Server with Perl Script
http://www.mummi.org
Power Measurement Tools
n IBM EMON API on BlueGene P/Q (MonEQ)
n Intel RAPL
n NVIDA MLPM
n PowerMon2 (RENCI)
n PowerPack (VT)
http://www.mummi.org
PowerPack Schema (Virginia Tech)
Power sampling frequency: 1 sample per second
http://scape.cs.vt.edu/software/powerpack-2-0/
http://www.mummi.org
PowerPack
http://www.mummi.org
IBM EMON API
Power sampling frequency: ~2 samples per second
Source: IBM
http://www.mummi.org
IBM EMON API
http://www.mummi.org
IBM EMON API
0
200
400
600
800
1000
1200
1400
1600
1800
2000
0 10 20 30 40 50 60 70 80 90 100
Pow
er (W
)
Time (seconds)
Power per node card for GTC on 16384 nodes of ANL BGQ Mira
Node_Card
Chip_Core
DRAM
Network
SRAM
Optics
PCIexpress
Link_Chip_Core
http://www.mummi.org
IBM EMON API
0
10
20
30
40
50
60
0 10 20 30 40 50 60 70 80 90 100
Pow
er (W
)
Time (seconds)
Average Power per node for GTC on 16384 nodes of ANL BGQ Mira
Node Power CPU Power Memory Power Network
http://www.mummi.org
Performance Counter Tools
n Perf_events (Linux)
n HPM (IBM)
n perfmon (Linux)
n PAPI (UTK)
http://www.mummi.org
Outline
n Recent Development of MuMMI
n Power Measurement Tools
n Hardware Performance Counter Tools
n Performance Counter-based Modeling n Performance Counter-guided Optimization
http://www.mummi.org
Overview: Performance Counter-based Modeling
Runtime and Power Modeling
HPC Application
Application or function-level Performance
Counters (Ci)
Metric
Spearman PCA
PAPI
Application or Function-level Runtime and
Power (CPU,mem)
Predicted runtime
Predicted power (node, CPU, mem)
Counters
Model Regression
f(C1,C2,…,Cn)
Four metrics: runtime, node power, CPU power, memory power
Recommendations for Improvements
http://www.mummi.org
Four Models of Parallel eq3dyna on SystemG
http://www.mummi.org
Four Models of Parallel eq3dyna on ANL Mira
http://www.mummi.org
Prediction Error Rates on ANL Mira
http://www.mummi.org
Outline
n Recent Development of MuMMI
n Power Measurement Tools
n Hardware Performance Counter Tools
n Performance Counter-based Modeling
n Performance Counter-guided Optimization
http://www.mummi.org
Overview: Counter-Guided Optimizations
Ranking counters based on coefficient percentage
Rank(C11, C12, …,C1n), Rank(C21, C22, …,C2m)
Rank(C31, C32, …,C3s), Rank(C41, C42, …,C4r)
Runtime f1(C11, C12, …,C1n)
System Power f2(C21, C22, …,C2m)
Memory Power f4(C41, C42, …,C4r)
CPU Power f3(C31, C32, …,C3s)
Ranking counters with percentage (>1%)
(rank from the highest to the lowest)
C1, C2, …,Ck
Pair-wise spearman correlation analysis
Final counters (rank from the highest to the lowest) C1, C2, …,Cj (j < k)
http://www.mummi.org
Counter Ranking
Ranking counters based on coefficient percentage
Rank(C11, C12, …,C1n), Rank(C21, C22, …,C2m)
Rank(C31, C32, …,C3s), Rank(C41, C42, …,C4r)
Runtime f1(C11, C12, …,C1n)
System Power f2(C21, C22, …,C2m)
Memory Power f4(C41, C42, …,C4r)
CPU Power f3(C31, C32, …,C3s)
For example, given a parallel aerospace simulation PMLB:
Runtime TLB_IM: 64.29% TLB_DM: 14.03% L2_ICM: 10.49% L1_ICM: 9.75% L2_ICA: 1.40% BR_INS: 0.03% SR_INS: 0.01%
Node Power VEC_INS: 76.64% CA_SHR: 22.45% L1_TCM: 0.89% RES_STL: 0.02%
Memory Power VEC_INS: 83.91% CA_CLN: 13.74% BR_NTK: 0.98% L1_TCM: 0.92% RES_STL: 0.18% BR_TKN: 0.16% L1_ICA: 0.11%
CPU Power VEC_INS: 99.15% BR_NTK: 0.81% RES_STL: 0.04%
http://www.mummi.org
0
10
20
30
40
50
60
70
80
90
100
Runtime System Power CPU Power Memory Power
Coe
ffici
ent P
erce
ntag
e (%
)
Models
Counter Ranking for Original PMLB on SystemG
PAPI_L1_ICA
PAPI_BR_TKN
PAPI_CA_CLN
PAPI_BR_NTK
PAPI_RES_STL
PAPI_L1_TCM
PAPI_CA_SHR
PAPI_VEC_INS
PAPI_SR_INS
PAPI_BR_INS
PAPI_L2_ICA
PAPI_L1_ICM
PAPI_L2_ICM
PAPI_TLB_DM
PAPI_TLB_IM
http://www.mummi.org
Counter Ranking
Ranking counters based on coefficient percentage
Rank(C11, C12, …,C1n), Rank(C21, C22, …,C2m)
Rank(C31, C32, …,C3s), Rank(C41, C42, …,C4r)
Runtime f1(C11, C12, …,C1n)
System Power f2(C21, C22, …,C2m)
Memory Power f4(C41, C42, …,C4r)
CPU Power f3(C31, C32, …,C3s)
Ranking counters with percentage (>1%)
(from the highest to the lowest)
C1, C2, …,Ck Runtime TLB_IM: 64.29% TLB_DM: 14.03% L2_ICM:10.49% L1_ICM: 9.75% L2_ICA: 1.40%
Node Power VEC_INS: 76.64% CA_SHR: 22.45%
Memory Power VEC_INS: 83.91% CA_CLN: 13.74%
CPU Power VEC_INS: 99.15%
TLB_IM, VEC_INS, TLB_DM, L2_ICM, L1_ICM, L2_ICA, CA_SHR, CA_CLN
http://www.mummi.org
Correlation Analysis Using Pair-wise Spearman
n TLB_IM: Occurred in Runtime TLB_DM: Corr Value=0.89217296 : Occurred in Runtime BR_NTK: Corr Value=0.83305966 : Occurred in CPU, Memory L2_ICM: Corr Value=0.88451013 : Occurred in Runtime L1_ICM: Corr Value=0.96934866 : Occurred in Runtime L2_ICA: Corr Value=0.97044335 : Occurred in Runtime BR_TKN: Corr Value=0.88122605 : Occurred in Memory BR_INS: Corr Value=0.88122605 : Occurred in Runtime
n VEC_INS: Occurred in System, CPU, Memory
Final counters: TLB_IM and VEC_INS for optimization focus
http://www.mummi.org
16
32
64
128
256
1 2 4 8 16 32 64 128
Tim
e (s
)(log
2)
Number of Cores
Performance for PMLB with 128x128x128 on SystemG Original Optimized
http://www.mummi.org
250
260
270
280
290
300
310
320
330
340
350
1 2 4 8 16 32 64 128
Pow
er p
er n
ode
(W)
Number of Cores
Node Power Comparison on SystemG Original Optimized
http://www.mummi.org
4096
8192
16384
32768
65536
1 2 4 8 16 32 64 128
Ener
gy p
er n
ode
(J)(l
og2)
Number of Cores (log2)
Energy Comparison for PMLB on SystemG Original Optimized
http://www.mummi.org
0
10
20
30
40
50
60
70
80
90
100
Runtime System Power CPU Power Memory Power
Coe
ffici
ent P
erce
ntag
e (%
)
Models
Counter Ranking for Original PMLB on Mira
PAPI_FDV_INS
PAPI_FML_INS
PAPI_RES_STL
PAPI_VEC_INS
PAPI_FP_INS
PAPI_SR_INS
PAPI_BR_NTK
PAPI_BR_MSP
PAPI_L1_ICM
PAPI_HW_INT
http://www.mummi.org
4
8
16
32
64
128
256
Tim
e (s
) (lo
g2)
Number of Nodes X Number of Threads per Node
Performance Comparison on Mira Orignial Optimized
http://www.mummi.org
45
47
49
51
53
55
57
59
61
Pow
er p
er n
ode
(W)
Number of Nodes X Number of Threads per Node
System Power Comparison on Mira Orignial Optimized
http://www.mummi.org
256
512
1024
2048
4096
8192
16384
Ener
gy p
er N
ode
(I) (l
og2)
Number of Nodes X Number of Threads per Node
Energy Comparison for PMLB with 512x512x512 on Mira Orignial Optimized
http://www.mummi.org
Counter Ranking on SystemG
0
10
20
30
40
50
60
70
80
90
100
Runtime System Power CPU Power Memory Power
Coe
ffici
ent P
erce
ntag
e (%
)
Models
Counter Ranking for Original eq3dyna on SystemG
PAPI_SR_INS
PAPI_FDV_INS
PAPI_L2_STM
PAPI_FML_INS
PAPI_L1_TCA
PAPI_RES_STL
PAPI_TLB_DM
PAPI_L1_STM
PAPI_L2_TCW
PAPI_BR_NTK
PAPI_FP_INS
PAPI_L2_DCW
PAPI_L2_ICA
PAPI_L1_ICM
http://www.mummi.org
4000
8000
16000
32000
64000
128000
256000
512000
1 2 4 8 16 32 64 128 256
Ener
gy p
er N
ode
(J) (
log2
)
Number of Cores (log2)
Energy Comparison of eq3dyna on SystemG Original Optimized
http://www.mummi.org
Counter Ranking on ANL BGQ Mira
0
10
20
30
40
50
60
70
80
90
100
Runtime System Power CPU Power Memory Power
Coe
ffici
ent P
erce
ntag
e (%
)
Models
Counter Ranking for Original eq3dyna on ANL BGQ Mira
PAPI_L1_DCM
PAPI_SR_INS
PAPI_BR_NTK
PAPI_L1_STM
PAPI_RES_STL
PAPI_LD_INS
PAPI_BR_MSP
PAPI_VEC_INS
http://www.mummi.org
0
10000
20000
30000
40000
50000
60000
32x16 32x32 32x64 64x16 64x32 64x64 128x16 128x32 128x64 192x16 192x32 192x64 256x16 256x32 256x64
Ener
gy p
er n
ode
(J)
Number of Nodes x Nunber of Threads per node (max number of threads per core is 4)
Energy Comparison for eq3dyna with 100m on ANL BG/Q Mira Orignial Optimized
http://www.mummi.org
Summary and Future Work
n We used runtime and power models to identify the most important counters for optimization focus
n Our counters-guided optimizations save energy by u Eq3dyna:
Average 48.65% on Mira; 30.67% on SystemG u PMLB:
Average 18.28% on Mira; 11.28% on SystemG n We believe that the modeling and optimization
methodology can be applied to large-scale scientific applications on other architectures such as GPU and Xeon Phi by utilizing counters for GPU and Xeon Phi.