Performance & Energy Optimization - OpenMP · 2015. 11. 28. · Md Abdullah Shahneous Bari Abid M....

Post on 03-Jan-2021


Performance & Energy Optimization @ OpenMP

Md Abdullah Shahneous Bari, Abid M. Malik, Millad Ghane, Ahmad Qawasmeh, Barbara M. Chapman

11/28/15

Layout of the talk

- Overview
- Motivation
- Factors that affect performance & energy optimization
- Experimental results
- Conclusion & future work

OpenMP

- De facto standard for shared-memory parallel programming
- Thread-based parallelism
- Mainly two kinds of parallelism
  - Regular parallelism (work-sharing constructs)
  - Irregular parallelism (task-based constructs)

Main Barrier Towards Exascale Computing…

- Power, power and power
- 20 MW power limit for exascale machines (DOE)
- Usually the processor vendors' concern
- But to reach exascale within that limit, the software stack has to chip in
- Any solution????

Power Constrained Computing (Overprovisioning)

- Usually, not all applications use the maximum node power all the time
- Capping the power at a lower limit
- Allows extra nodes to be added within a similar power budget

[Figure labels: Extra Node, Extra Compute Power]
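
The overprovisioning argument reduces to simple arithmetic: under a fixed power budget, a lower per-node cap lets more nodes run at once. The sketch below is a minimal illustration, assuming a hypothetical 10 kW budget and treating the cap as the whole node's draw (it ignores memory, network and cooling). The function name and the budget are illustrative assumptions; the 115 W and 85 W caps are borrowed from the power cap levels listed in the experimental details.

```python
def nodes_within_budget(budget_watts: float, cap_watts: float) -> int:
    """Number of nodes that fit in a power budget when each node is capped."""
    if budget_watts < 0 or cap_watts <= 0:
        raise ValueError("budget must be non-negative and cap must be positive")
    return int(budget_watts // cap_watts)


# Hypothetical 10 kW budget (an assumption, not a figure from the talk).
BUDGET_W = 10_000
baseline_nodes = nodes_within_budget(BUDGET_W, 115)  # uncapped-equivalent level
capped_nodes = nodes_within_budget(BUDGET_W, 85)     # lower cap per node
extra_nodes = capped_nodes - baseline_nodes

print(baseline_nodes, capped_nodes, extra_nodes)  # 86 117 31
```

Whether the extra nodes translate into extra throughput depends on how much each node slows down under the cap, which is the per-node question the rest of the talk addresses.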

Power Constrained Computing (Contd.)

- More focus on overall system-level performance
- Some related work:
  - Sarood et al. [1]
  - Patki et al. [2]
  - Rountree et al. [3]

1. Sarood, Osman, et al. "Optimizing power allocation to CPU and memory subsystems in overprovisioned HPC systems." Cluster Computing (CLUSTER), 2013 IEEE International Conference on. IEEE, 2013.

2. Patki, Tapasya, et al. "Exploring hardware overprovisioning in power-constrained, high performance computing." Proceedings of the 27th international ACM conference on International conference on supercomputing. ACM, 2013.

3. Rountree, Barry, et al. "Beyond DVFS: A first look at performance under a hardware-enforced power bound." Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International. IEEE, 2012.

Why OpenMP???

- Current issue: less focus on per-node performance
- Challenge: to reach peak throughput, per-node performance must be improved
- OpenMP is the most popular programming model for intra-node parallelism

Factors That Impact Work Sharing Parallelism…

- How many workers are working? ~ Number of threads
- How is the work scheduled? ~ Scheduling policy
- How much work are they given at one time? ~ Chunk size
- How is the data laid out for the workers? ~ Thread affinity
- What do the workers do during their break? ~ Wait policy
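
Each of these factors corresponds to a standard OpenMP environment variable, so a configuration can be chosen without recompiling. The sketch below shows one possible mapping; the helper name and the default `places`/`proc_bind` values are illustrative assumptions, not settings taken from the talk. Note that `OMP_SCHEDULE` only affects loops declared with `schedule(runtime)`.

```python
import os


def omp_environment(threads, schedule, chunk, wait_policy,
                    places="cores", proc_bind="close"):
    """Map the five tuning factors onto standard OpenMP environment variables."""
    return {
        "OMP_NUM_THREADS": str(threads),        # how many workers
        "OMP_SCHEDULE": f"{schedule},{chunk}",  # scheduling policy and chunk size
        "OMP_PLACES": places,                   # where threads may run
        "OMP_PROC_BIND": proc_bind,             # how threads are bound to places
        "OMP_WAIT_POLICY": wait_policy,         # "active" or "passive" waiting
    }


# Build an environment for a child process without touching our own.
env = dict(os.environ)
env.update(omp_environment(24, "guided", 8, "passive"))
print(env["OMP_SCHEDULE"])  # guided,8
```

The resulting dictionary could be handed to `subprocess.run(..., env=env)` to launch a benchmark binary under that configuration.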

Experimental Details

- Selected parameters
  - No. of threads (2, 4, 8, 16, 24, 32)
  - Scheduling policy (STATIC, DYNAMIC, GUIDED)
  - Chunk size (1, 8, 32, 64, 128, 256, 512)
  - Wait policy (active, passive)
  - Thread affinity (OMP_PLACES + OMP_PROC_BIND)
- Power cap levels
  - (55, 70, 85, 100, 115) W
- Technology used:
  - Intel RAPL (for power capping & energy measurement)
  - OMPT for kernel-level measurement
- Benchmark ~ NPB
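
To get a sense of the size of this parameter space, the listed values can be enumerated as a cross product. This is only a sketch: the slide does not say whether every combination was actually run, and the thread affinity settings are left out because their values are not listed.

```python
from itertools import product

THREADS = (2, 4, 8, 16, 24, 32)
SCHEDULES = ("STATIC", "DYNAMIC", "GUIDED")
CHUNK_SIZES = (1, 8, 32, 64, 128, 256, 512)
WAIT_POLICIES = ("active", "passive")
POWER_CAPS_W = (55, 70, 85, 100, 115)

# Every (threads, schedule, chunk, wait policy) combination.
configs = list(product(THREADS, SCHEDULES, CHUNK_SIZES, WAIT_POLICIES))
# Each configuration paired with each power cap level.
runs = list(product(POWER_CAPS_W, configs))

print(len(configs))  # 252 runtime configurations per power cap
print(len(runs))     # 1260 (power cap, configuration) pairs
```

Even without affinity, an exhaustive sweep means over a thousand runs per kernel, which helps explain the interest in modeling and dynamic adaptation mentioned under ongoing work.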

[Figure: bar chart of % Performance Improvement (y-axis, 0 to 100) per kernel (x-axis: NPB kernels from CG, EP, FT, IS, LU, MG, SP and UA), with one series per power cap: 55W, 70W, 85W, 100W, 115W]

Performance improvement using the best configuration compared to default across all kernels
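
The slides do not define how the improvement percentage is computed. A common definition for a lower-is-better metric such as execution time or energy is the relative reduction against the default configuration, sketched below as an assumption rather than the authors' stated formula. The sample values are hypothetical, not measurements from the talk.

```python
def percent_improvement(default_value: float, best_value: float) -> float:
    """Relative reduction of a lower-is-better metric, in percent."""
    if default_value <= 0:
        raise ValueError("default_value must be positive")
    return 100.0 * (default_value - best_value) / default_value


# Hypothetical values: default run takes 8.0 s, tuned run takes 6.0 s.
print(percent_improvement(8.0, 6.0))  # 25.0
# A tuned run that is worse than the default gives a negative value.
print(percent_improvement(4.0, 5.0))  # -25.0
```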

[Figure: bar chart of % Energy Improvement (y-axis, -20 to 100) per kernel (x-axis: NPB kernels from CG, EP, FT, IS, LU, MG, SP and UA), with one series per power cap: 55W, 70W, 85W, 100W, 115W]

Energy consumption improvement using the best configuration compared to default across all kernels
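
Energy numbers of this kind come from reading an energy counter before and after each kernel. The talk names Intel RAPL but not the interface used to access it; on Linux, one option is the powercap sysfs tree, which the sketch below assumes. The domain path (package 0) and the helper names are assumptions, reading the counter may require elevated privileges, and the wraparound handling assumes the counter restarts from zero after reaching its maximum range.

```python
from pathlib import Path

# Assumed location of the package-0 RAPL domain in Linux powercap sysfs.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(name: str) -> int:
    """Read an integer microjoule value from the RAPL domain directory."""
    return int((RAPL_DOMAIN / name).read_text())


def energy_delta_uj(start_uj: int, end_uj: int, max_range_uj: int) -> int:
    """Energy between two readings, allowing for at most one wraparound."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    return (max_range_uj - start_uj) + end_uj


def measure_joules(workload) -> float:
    """Run a callable and return the package energy it consumed, in joules."""
    max_range = read_uj("max_energy_range_uj")
    start = read_uj("energy_uj")
    workload()
    end = read_uj("energy_uj")
    return energy_delta_uj(start, end, max_range) / 1e6


# Arithmetic check with made-up readings (no hardware access needed).
print(energy_delta_uj(100, 900, 1000))  # 800
print(energy_delta_uj(900, 100, 1000))  # 200
```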

Execution time comparison among different configurations (an LU kernel)

[Figure: grouped bar chart. x-axis: Different Power Cap Levels (55W, 70W, 85W, 100W, 115W); y-axis: Execution Time (Sec), 0 to 0.06; series: Best Configuration, Default Configuration, Default Configuration Without Power Cap. Bar labels (threads, schedule, chunk size), in extraction order: Best Configuration 24,GUIDED,8; 24,DYNAMIC,8; 24,GUIDED,8; 24,GUIDED,8; 32,DYNAMIC,8. Default Configuration: 32,STATIC,1 at every level. Default Configuration Without Power Cap: 115W,32,STATIC,1.]

OpenMP ICVs on DRAM Power

- Developing a model for the power consumption of OpenMP applications

[Figure: three plots of Power (W) versus Data Size. Data Size X means the array size for the STREAM benchmark is 19,200,000*X. These results are based on the STREAM benchmark. Courtesy: Millad Ghane]

Impact of threads & scheduling policy in task-based parallelism

[Figure panels: UTS, FloorPlan. Courtesy: Ahmad Qawasmeh]

Ongoing Work

- Dynamic adaptation (APEX)
  - Active Harmony
- Modeling
- Across different software stacks (OpenMP runtimes)
  - OpenUH
  - GCC
  - Intel
- Across different hardware architectures
  - Intel Sandy Bridge
  - IBM POWER8

Future Work

- More concrete configuration selection
- DRAM capping
- Fine-grained (core-level) control
- Other energy-efficient techniques
  - DVFS, frequency modulation, etc.
- Combining it with inter-node (MPI) programming models for hybrid applications

Summary

- Overview
- Motivation
- Factors that affect performance & energy optimization
- Experimental results
- Conclusion & future work