Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100...

36
http://www.mummi.org Using MuMMI to Model and Optimize Energy and Performance Xingfu Wu and Valerie Taylor Texas A&M University Scalable Tools Workshop 2015, Lake Tahoe, CA August 3, 2015

Transcript of Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100...

Page 1: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Using MuMMI to Model and Optimize Energy and Performance

Xingfu Wu and Valerie Taylor Texas A&M University

Scalable Tools Workshop 2015, Lake Tahoe, CA August 3, 2015

Page 2: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Outline

n Recent Development of MuMMI

n Power Measurement Tools

n Hardware Performance Counter Tools

n Performance Counter-based Modeling

n Performance Counter-guided Optimization

Page 3: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

MuMMI (Multiple Metrics Modeling Infrastructure) Project

Multicore/Heterogeneous System for Execution

PAPI PowerPack Database

Application

E-AMOM

EMON API

Page 4: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

MuMMI Database Schema

Application Executable Run Application Performance

Modules

Function Performance

Basic Unit Performance

Data Structure Performance

Inputs

Systems

Resource

Connection

Functions Module_Info

Control Flow

Compilers

Model Template

Function_Info

Library Model_Info

User

Counters

Sys_Comp

Sys_Comm

Power Coupling

Power Counters

Page 5: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Data Collection: MAIDE System Source code

MAIDE

Instrumented source code

Compiler

Instrumented executable

Performance, HW counters, Power and Energy Data

MuMMI Database

Call Graph Power and HW Counters

SOAP Server with Perl Script

Page 6: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Power Measurement Tools

n  IBM EMON API on BlueGene P/Q (MonEQ)

n  Intel RAPL

n  NVIDA MLPM

n  PowerMon2 (RENCI)

n  PowerPack (VT)

Page 7: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

PowerPack Schema (Virginia Tech)

Power sampling frequency: 1 sample per second

http://scape.cs.vt.edu/software/powerpack-2-0/

Page 8: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

PowerPack

Page 9: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

IBM EMON API

Power sampling frequency: ~2 samples per second

Source: IBM

Page 10: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

IBM EMON API

Page 11: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

IBM EMON API

0

200

400

600

800

1000

1200

1400

1600

1800

2000

0 10 20 30 40 50 60 70 80 90 100

Pow

er (W

)

Time (seconds)

Power per node card for GTC on 16384 nodes of ANL BGQ Mira

Node_Card

Chip_Core

DRAM

Network

SRAM

Optics

PCIexpress

Link_Chip_Core

Page 12: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

IBM EMON API

0

10

20

30

40

50

60

0 10 20 30 40 50 60 70 80 90 100

Pow

er (W

)

Time (seconds)

Average Power per node for GTC on 16384 nodes of ANL BGQ Mira

Node Power CPU Power Memory Power Network

Page 13: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Performance Counter Tools

n  Perf_events (Linux)

n  HPM (IBM)

n  perfmon (Linux)

n  PAPI (UTK)

Page 14: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Outline

n Recent Development of MuMMI

n Power Measurement Tools

n Hardware Performance Counter Tools

n Performance Counter-based Modeling n Performance Counter-guided Optimization

Page 15: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Overview: Performance Counter-based Modeling

Runtime and Power Modeling

HPC Application

Application or function-level Performance

Counters (Ci)

Metric

Spearman PCA

PAPI

Application or Function-level Runtime and

Power (CPU,mem)

Predicted runtime

Predicted power (node, CPU, mem)

Counters

Model Regression

f(C1,C2,…,Cn)

Four metrics: runtime, node power, CPU power, memory power

Recommendations for Improvements

Page 16: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Four Models of Parallel eq3dyna on SystemG

Page 17: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Four Models of Parallel eq3dyna on ANL Mira

Page 18: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Prediction Error Rates on ANL Mira

Page 19: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Outline

n Recent Development of MuMMI

n Power Measurement Tools

n Hardware Performance Counter Tools

n Performance Counter-based Modeling

n Performance Counter-guided Optimization

Page 20: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Overview: Counter-Guided Optimizations

Ranking counters based on coefficient percentage

Rank(C11, C12, …,C1n), Rank(C21, C22, …,C2m)

Rank(C31, C32, …,C3s), Rank(C41, C42, …,C4r)

Runtime f1(C11, C12, …,C1n)

System Power f2(C21, C22, …,C2m)

Memory Power f4(C41, C42, …,C4r)

CPU Power f3(C31, C32, …,C3s)

Ranking counters with percentage (>1%)

(rank from the highest to the lowest)

C1, C2, …,Ck

Pair-wise spearman correlation analysis

Final counters (rank from the highest to the lowest) C1, C2, …,Cj (j < k)

Page 21: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Counter Ranking

Ranking counters based on coefficient percentage

Rank(C11, C12, …,C1n), Rank(C21, C22, …,C2m)

Rank(C31, C32, …,C3s), Rank(C41, C42, …,C4r)

Runtime f1(C11, C12, …,C1n)

System Power f2(C21, C22, …,C2m)

Memory Power f4(C41, C42, …,C4r)

CPU Power f3(C31, C32, …,C3s)

For example, given a parallel aerospace simulation PMLB:

Runtime TLB_IM: 64.29% TLB_DM: 14.03% L2_ICM: 10.49% L1_ICM: 9.75% L2_ICA: 1.40% BR_INS: 0.03% SR_INS: 0.01%

Node Power VEC_INS: 76.64% CA_SHR: 22.45% L1_TCM: 0.89% RES_STL: 0.02%

Memory Power VEC_INS: 83.91% CA_CLN: 13.74% BR_NTK: 0.98% L1_TCM: 0.92% RES_STL: 0.18% BR_TKN: 0.16% L1_ICA: 0.11%

CPU Power VEC_INS: 99.15% BR_NTK: 0.81% RES_STL: 0.04%

Page 22: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

0

10

20

30

40

50

60

70

80

90

100

Runtime System Power CPU Power Memory Power

Coe

ffici

ent P

erce

ntag

e (%

)

Models

Counter Ranking for Original PMLB on SystemG

PAPI_L1_ICA

PAPI_BR_TKN

PAPI_CA_CLN

PAPI_BR_NTK

PAPI_RES_STL

PAPI_L1_TCM

PAPI_CA_SHR

PAPI_VEC_INS

PAPI_SR_INS

PAPI_BR_INS

PAPI_L2_ICA

PAPI_L1_ICM

PAPI_L2_ICM

PAPI_TLB_DM

PAPI_TLB_IM

Page 23: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Counter Ranking

Ranking counters based on coefficient percentage

Rank(C11, C12, …,C1n), Rank(C21, C22, …,C2m)

Rank(C31, C32, …,C3s), Rank(C41, C42, …,C4r)

Runtime f1(C11, C12, …,C1n)

System Power f2(C21, C22, …,C2m)

Memory Power f4(C41, C42, …,C4r)

CPU Power f3(C31, C32, …,C3s)

Ranking counters with percentage (>1%)

(from the highest to the lowest)

C1, C2, …,Ck Runtime TLB_IM: 64.29% TLB_DM: 14.03% L2_ICM:10.49% L1_ICM: 9.75% L2_ICA: 1.40%

Node Power VEC_INS: 76.64% CA_SHR: 22.45%

Memory Power VEC_INS: 83.91% CA_CLN: 13.74%

CPU Power VEC_INS: 99.15%

TLB_IM, VEC_INS, TLB_DM, L2_ICM, L1_ICM, L2_ICA, CA_SHR, CA_CLN

Page 24: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Correlation Analysis Using Pair-wise Spearman

n  TLB_IM: Occurred in Runtime TLB_DM: Corr Value=0.89217296 : Occurred in Runtime BR_NTK: Corr Value=0.83305966 : Occurred in CPU, Memory L2_ICM: Corr Value=0.88451013 : Occurred in Runtime L1_ICM: Corr Value=0.96934866 : Occurred in Runtime L2_ICA: Corr Value=0.97044335 : Occurred in Runtime BR_TKN: Corr Value=0.88122605 : Occurred in Memory BR_INS: Corr Value=0.88122605 : Occurred in Runtime

n  VEC_INS: Occurred in System, CPU, Memory

Final counters: TLB_IM and VEC_INS for optimization focus

Page 25: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

16

32

64

128

256

1 2 4 8 16 32 64 128

Tim

e (s

)(log

2)

Number of Cores

Performance for PMLB with 128x128x128 on SystemG Original Optimized

Page 26: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

250

260

270

280

290

300

310

320

330

340

350

1 2 4 8 16 32 64 128

Pow

er p

er n

ode

(W)

Number of Cores

Node Power Comparison on SystemG Original Optimized

Page 27: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

4096

8192

16384

32768

65536

1 2 4 8 16 32 64 128

Ener

gy p

er n

ode

(J)(l

og2)

Number of Cores (log2)

Energy Comparison for PMLB on SystemG Original Optimized

Page 28: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

0

10

20

30

40

50

60

70

80

90

100

Runtime System Power CPU Power Memory Power

Coe

ffici

ent P

erce

ntag

e (%

)

Models

Counter Ranking for Original PMLB on Mira

PAPI_FDV_INS

PAPI_FML_INS

PAPI_RES_STL

PAPI_VEC_INS

PAPI_FP_INS

PAPI_SR_INS

PAPI_BR_NTK

PAPI_BR_MSP

PAPI_L1_ICM

PAPI_HW_INT

Page 29: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

4

8

16

32

64

128

256

Tim

e (s

) (lo

g2)

Number of Nodes X Number of Threads per Node

Performance Comparison on Mira Orignial Optimized

Page 30: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

45

47

49

51

53

55

57

59

61

Pow

er p

er n

ode

(W)

Number of Nodes X Number of Threads per Node

System Power Comparison on Mira Orignial Optimized

Page 31: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

256

512

1024

2048

4096

8192

16384

Ener

gy p

er N

ode

(I) (l

og2)

Number of Nodes X Number of Threads per Node

Energy Comparison for PMLB with 512x512x512 on Mira Orignial Optimized

Page 32: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Counter Ranking on SystemG

0

10

20

30

40

50

60

70

80

90

100

Runtime System Power CPU Power Memory Power

Coe

ffici

ent P

erce

ntag

e (%

)

Models

Counter Ranking for Original eq3dyna on SystemG

PAPI_SR_INS

PAPI_FDV_INS

PAPI_L2_STM

PAPI_FML_INS

PAPI_L1_TCA

PAPI_RES_STL

PAPI_TLB_DM

PAPI_L1_STM

PAPI_L2_TCW

PAPI_BR_NTK

PAPI_FP_INS

PAPI_L2_DCW

PAPI_L2_ICA

PAPI_L1_ICM

Page 33: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

4000

8000

16000

32000

64000

128000

256000

512000

1 2 4 8 16 32 64 128 256

Ener

gy p

er N

ode

(J) (

log2

)

Number of Cores (log2)

Energy Comparison of eq3dyna on SystemG Original Optimized

Page 34: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Counter Ranking on ANL BGQ Mira

0

10

20

30

40

50

60

70

80

90

100

Runtime System Power CPU Power Memory Power

Coe

ffici

ent P

erce

ntag

e (%

)

Models

Counter Ranking for Original eq3dyna on ANL BGQ Mira

PAPI_L1_DCM

PAPI_SR_INS

PAPI_BR_NTK

PAPI_L1_STM

PAPI_RES_STL

PAPI_LD_INS

PAPI_BR_MSP

PAPI_VEC_INS

Page 35: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

0

10000

20000

30000

40000

50000

60000

32x16 32x32 32x64 64x16 64x32 64x64 128x16 128x32 128x64 192x16 192x32 192x64 256x16 256x32 256x64

Ener

gy p

er n

ode

(J)

Number of Nodes x Nunber of Threads per node (max number of threads per core is 4)

Energy Comparison for eq3dyna with 100m on ANL BG/Q Mira Orignial Optimized

Page 36: Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100 Runtime System Power CPU Power Memory Power ) Models Counter Ranking for Original

http://www.mummi.org

Summary and Future Work

n  We used runtime and power models to identify the most important counters for optimization focus

n  Our counters-guided optimizations save energy by u Eq3dyna:

Average 48.65% on Mira; 30.67% on SystemG u PMLB:

Average 18.28% on Mira; 11.28% on SystemG n  We believe that the modeling and optimization

methodology can be applied to large-scale scientific applications on other architectures such as GPU and Xeon Phi by utilizing counters for GPU and Xeon Phi.