Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100...

http://www.mummi.org

Using MuMMI to Model and Optimize Energy and Performance

Xingfu Wu and Valerie Taylor Texas A&M University

Scalable Tools Workshop 2015, Lake Tahoe, CA August 3, 2015


Outline

n Recent Development of MuMMI

n Power Measurement Tools

n Hardware Performance Counter Tools

n Performance Counter-based Modeling

n Performance Counter-guided Optimization


MuMMI (Multiple Metrics Modeling Infrastructure) Project

Multicore/Heterogeneous System for Execution

PAPI PowerPack Database

Application

E-AMOM

EMON API


MuMMI Database Schema

Application Executable Run Application Performance

Modules

Function Performance

Basic Unit Performance

Data Structure Performance

Inputs

Systems

Resource

Connection

Functions Module_Info

Control Flow

Compilers

Model Template

Function_Info

Library Model_Info

User

Counters

Sys_Comp

Sys_Comm

Power Coupling

Power Counters


Data Collection: MAIDE System Source code

MAIDE

Instrumented source code

Compiler

Instrumented executable

Performance, HW counters, Power and Energy Data

MuMMI Database

Call Graph Power and HW Counters

SOAP Server with Perl Script


Power Measurement Tools

n  IBM EMON API on BlueGene P/Q (MonEQ)

n  Intel RAPL

n  NVIDA MLPM

n  PowerMon2 (RENCI)

n  PowerPack (VT)


PowerPack Schema (Virginia Tech)

Power sampling frequency: 1 sample per second

http://scape.cs.vt.edu/software/powerpack-2-0/


PowerPack


IBM EMON API

Power sampling frequency: ~2 samples per second

Source: IBM


IBM EMON API


IBM EMON API

0

200

400

600

800

1000

1200

1400

1600

1800

2000

0 10 20 30 40 50 60 70 80 90 100

Pow

er (W

)

Time (seconds)

Power per node card for GTC on 16384 nodes of ANL BGQ Mira

Node_Card

Chip_Core

DRAM

Network

SRAM

Optics

PCIexpress

Link_Chip_Core


IBM EMON API

0

10

20

30

40

50

60

0 10 20 30 40 50 60 70 80 90 100

Pow

er (W

)

Time (seconds)

Average Power per node for GTC on 16384 nodes of ANL BGQ Mira

Node Power CPU Power Memory Power Network


Performance Counter Tools

n  Perf_events (Linux)

n  HPM (IBM)

n  perfmon (Linux)

n  PAPI (UTK)


Outline




n Performance Counter-based Modeling n Performance Counter-guided Optimization


Overview: Performance Counter-based Modeling

Runtime and Power Modeling

HPC Application

Application or function-level Performance

Counters (Ci)

Metric

Spearman PCA

PAPI

Application or Function-level Runtime and

Power (CPU,mem)

Predicted runtime

Predicted power (node, CPU, mem)

Counters

Model Regression

f(C1,C2,…,Cn)

Four metrics: runtime, node power, CPU power, memory power

Recommendations for Improvements


Four Models of Parallel eq3dyna on SystemG


Four Models of Parallel eq3dyna on ANL Mira


Prediction Error Rates on ANL Mira


Outline




n Performance Counter-based Modeling

n Performance Counter-guided Optimization


Overview: Counter-Guided Optimizations

Ranking counters based on coefficient percentage

Rank(C11, C12, …,C1n), Rank(C21, C22, …,C2m)

Rank(C31, C32, …,C3s), Rank(C41, C42, …,C4r)

Runtime f1(C11, C12, …,C1n)

System Power f2(C21, C22, …,C2m)

Memory Power f4(C41, C42, …,C4r)

CPU Power f3(C31, C32, …,C3s)

Ranking counters with percentage (>1%)

(rank from the highest to the lowest)

C1, C2, …,Ck

Pair-wise spearman correlation analysis

Final counters (rank from the highest to the lowest) C1, C2, …,Cj (j < k)


Counter Ranking








For example, given a parallel aerospace simulation PMLB:

Runtime TLB_IM: 64.29% TLB_DM: 14.03% L2_ICM: 10.49% L1_ICM: 9.75% L2_ICA: 1.40% BR_INS: 0.03% SR_INS: 0.01%

Node Power VEC_INS: 76.64% CA_SHR: 22.45% L1_TCM: 0.89% RES_STL: 0.02%

Memory Power VEC_INS: 83.91% CA_CLN: 13.74% BR_NTK: 0.98% L1_TCM: 0.92% RES_STL: 0.18% BR_TKN: 0.16% L1_ICA: 0.11%

CPU Power VEC_INS: 99.15% BR_NTK: 0.81% RES_STL: 0.04%


0

10

20

30

40

50

60

70

80

90

100

Runtime System Power CPU Power Memory Power

Coe

ffici

ent P

erce

ntag

e (%

)

Models

Counter Ranking for Original PMLB on SystemG

PAPI_L1_ICA

PAPI_BR_TKN

PAPI_CA_CLN

PAPI_BR_NTK

PAPI_RES_STL

PAPI_L1_TCM

PAPI_CA_SHR

PAPI_VEC_INS

PAPI_SR_INS

PAPI_BR_INS

PAPI_L2_ICA

PAPI_L1_ICM

PAPI_L2_ICM

PAPI_TLB_DM

PAPI_TLB_IM


Counter Ranking








Ranking counters with percentage (>1%)

(from the highest to the lowest)

C1, C2, …,Ck Runtime TLB_IM: 64.29% TLB_DM: 14.03% L2_ICM:10.49% L1_ICM: 9.75% L2_ICA: 1.40%

Node Power VEC_INS: 76.64% CA_SHR: 22.45%

Memory Power VEC_INS: 83.91% CA_CLN: 13.74%

CPU Power VEC_INS: 99.15%

TLB_IM, VEC_INS, TLB_DM, L2_ICM, L1_ICM, L2_ICA, CA_SHR, CA_CLN


Correlation Analysis Using Pair-wise Spearman

n  TLB_IM: Occurred in Runtime TLB_DM: Corr Value=0.89217296 : Occurred in Runtime BR_NTK: Corr Value=0.83305966 : Occurred in CPU, Memory L2_ICM: Corr Value=0.88451013 : Occurred in Runtime L1_ICM: Corr Value=0.96934866 : Occurred in Runtime L2_ICA: Corr Value=0.97044335 : Occurred in Runtime BR_TKN: Corr Value=0.88122605 : Occurred in Memory BR_INS: Corr Value=0.88122605 : Occurred in Runtime

n  VEC_INS: Occurred in System, CPU, Memory

Final counters: TLB_IM and VEC_INS for optimization focus


16

32

64

128

256

1 2 4 8 16 32 64 128

Tim

e (s

)(log

2)

Number of Cores

Performance for PMLB with 128x128x128 on SystemG Original Optimized


250

260

270

280

290

300

310

320

330

340

350

1 2 4 8 16 32 64 128

Pow

er p

er n

ode

(W)

Number of Cores

Node Power Comparison on SystemG Original Optimized


4096

8192

16384

32768

65536

1 2 4 8 16 32 64 128

Ener

gy p

er n

ode

(J)(l

og2)

Number of Cores (log2)

Energy Comparison for PMLB on SystemG Original Optimized


0

10

20

30

40

50

60

70

80

90

100


Coe

ffici

ent P

erce

ntag

e (%

)

Models

Counter Ranking for Original PMLB on Mira

PAPI_FDV_INS

PAPI_FML_INS

PAPI_RES_STL

PAPI_VEC_INS

PAPI_FP_INS

PAPI_SR_INS

PAPI_BR_NTK

PAPI_BR_MSP

PAPI_L1_ICM

PAPI_HW_INT


4

8

16

32

64

128

256

Tim

e (s

) (lo

g2)

Number of Nodes X Number of Threads per Node

Performance Comparison on Mira Orignial Optimized


45

47

49

51

53

55

57

59

61

Pow

er p

er n

ode

(W)


System Power Comparison on Mira Orignial Optimized


256

512

1024

2048

4096

8192

16384

Ener

gy p

er N

ode

(I) (l

og2)


Energy Comparison for PMLB with 512x512x512 on Mira Orignial Optimized


Counter Ranking on SystemG

0

10

20

30

40

50

60

70

80

90

100


Coe

ffici

ent P

erce

ntag

e (%

)

Models

Counter Ranking for Original eq3dyna on SystemG

PAPI_SR_INS

PAPI_FDV_INS

PAPI_L2_STM

PAPI_FML_INS

PAPI_L1_TCA

PAPI_RES_STL

PAPI_TLB_DM

PAPI_L1_STM

PAPI_L2_TCW

PAPI_BR_NTK

PAPI_FP_INS

PAPI_L2_DCW

PAPI_L2_ICA

PAPI_L1_ICM


4000

8000

16000

32000

64000

128000

256000

512000

1 2 4 8 16 32 64 128 256

Ener

gy p

er N

ode

(J) (

log2

)

Number of Cores (log2)

Energy Comparison of eq3dyna on SystemG Original Optimized


Counter Ranking on ANL BGQ Mira

0

10

20

30

40

50

60

70

80

90

100


Coe

ffici

ent P

erce

ntag

e (%

)

Models

Counter Ranking for Original eq3dyna on ANL BGQ Mira

PAPI_L1_DCM

PAPI_SR_INS

PAPI_BR_NTK

PAPI_L1_STM

PAPI_RES_STL

PAPI_LD_INS

PAPI_BR_MSP

PAPI_VEC_INS


0

10000

20000

30000

40000

50000

60000

32x16 32x32 32x64 64x16 64x32 64x64 128x16 128x32 128x64 192x16 192x32 192x64 256x16 256x32 256x64

Ener

gy p

er n

ode

(J)

Number of Nodes x Nunber of Threads per node (max number of threads per core is 4)

Energy Comparison for eq3dyna with 100m on ANL BG/Q Mira Orignial Optimized


Summary and Future Work

n  We used runtime and power models to identify the most important counters for optimization focus

n  Our counters-guided optimizations save energy by u Eq3dyna:

Average 48.65% on Mira; 30.67% on SystemG u PMLB:

Average 18.28% on Mira; 11.28% on SystemG n  We believe that the modeling and optimization

methodology can be applied to large-scale scientific applications on other architectures such as GPU and Xeon Phi by utilizing counters for GPU and Xeon Phi.

Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100...

Documents

Transcript of Using MuMMI to Model and Optimize Energy and Performance€¦ · 0 10 20 30 40 50 60 70 80 90 100...