Post on 03-Jan-2021
Performance & Energy Optimization @
Md Abdullah Shahneous Bari, Abid M. Malik, Millad Ghane, Ahmad Qawasmeh, Barbara M. Chapman
11/28/15
Layout of the talk
Ø Overview
Ø Motivation
Ø Factors that affect performance & energy optimization
Ø Experimental results
Ø Conclusion & future work
OpenMP
Ø De-facto standard for shared-memory parallel programming
Ø Thread-based parallelism
Ø Mainly two kinds of parallelism:
  Ø Regular parallelism (work-sharing constructs)
  Ø Irregular parallelism (task-based constructs)
Main Barrier Towards Exascale Computing…
Ø Power, power, and power
Ø 20 MW power limit for exascale machines (DOE)
Ø Usually a processor-vendor concern
Ø But to reach the exascale limit, the software stack has to chip in
Ø Any solution????
Power-Constrained Computing (Overprovisioning)
Ø Usually, not all applications use the maximum node power all the time
Ø Capping the power at a lower limit
Ø Allows extra nodes to be added within the same power budget
Extra node → extra compute power
Power-Constrained Computing (Contd.)
Ø More focus on overall system-level performance
Ø Some related work:
  Ø Sarood et al. [1]
  Ø Patki et al. [2]
  Ø Rountree et al. [3]
1. Sarood, Osman, et al. "Optimizing power allocation to CPU and memory subsystems in overprovisioned HPC systems." Cluster Computing (CLUSTER), 2013 IEEE International Conference on. IEEE, 2013.
2. Patki, Tapasya, et al. "Exploring hardware overprovisioning in power-constrained, high performance computing." Proceedings of the 27th International ACM Conference on Supercomputing. ACM, 2013.
3. Rountree, Barry, et al. "Beyond DVFS: A first look at performance under a hardware-enforced power bound." Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International. IEEE, 2012.
Why OpenMP???
Ø Current issue: less focus on per-node performance
Ø Challenge: to reach peak throughput, per-node performance must be improved
Ø OpenMP is the most popular language of choice for intra-node parallelism
Factors That Impact Work-Sharing Parallelism…
Ø How many workers are working? ~ Threads
Ø How is the work scheduled? ~ Scheduling policy
Ø How much work are they given at one time? ~ Chunk size
Ø How is the data laid out for the workers? ~ Thread affinity
Ø What do the workers do during their break? ~ Wait policy
Experimental Details
Ø Selected parameters:
  Ø No. of threads (2, 4, 8, 16, 24, 32)
  Ø Scheduling policy (STATIC, DYNAMIC, GUIDED)
  Ø Chunk size (1, 8, 32, 64, 128, 256, 512)
  Ø Wait policy (active, passive)
  Ø Thread affinity (OMP_PLACES + OMP_PROC_BIND)
Ø Power-cap levels: (55, 70, 85, 100, 115) W
Ø Used technology:
  Ø Intel RAPL (for power capping & energy measurement)
  Ø OMPT (for kernel-level measurement)
Ø Benchmark ~ NPB
[Figure: Performance improvement (%) using the best configuration compared to default, across all NPB kernels (CG, EP, FT, IS, LU, MG, SP, UA); series: 55W, 70W, 85W, 100W, 115W power caps]
[Figure: Energy consumption improvement (%) using the best configuration compared to default, across all NPB kernels; series: 55W, 70W, 85W, 100W, 115W power caps]
[Figure: Execution time (sec) comparison among different configurations for an LU kernel, across power-cap levels 55W–115W; series: best configuration, default configuration, default configuration without power cap. Bars are labeled with the (threads, scheduling policy, chunk size) of each configuration, e.g. (24, GUIDED, 8) at low caps and (32, STATIC, 1) at higher caps]
OpenMP ICVs on DRAM Power
Ø Developing a model for power consumption of OpenMP applications
[Figure: DRAM power (W) vs. data size, three panels. Data size X means the array size for the STREAM benchmark is 19,200,000*X; these results are based on the STREAM benchmark. Courtesy: Millad Ghane]
[Figure: Impact of threads & scheduling policy in task-based parallelism, on the UTS and FloorPlan benchmarks. Courtesy: Ahmad Qawasmeh]
Ongoing Work
Ø Dynamic adaptation (APEX)
  Ø Active Harmony
Ø Modeling
  Ø Across different software stacks (OpenMP runtimes):
    Ø OpenUH
    Ø GCC
    Ø Intel
  Ø Across different hardware architectures:
    Ø Intel Sandy Bridge
    Ø IBM POWER8
Future Work
Ø More concrete configuration selection
Ø DRAM capping
Ø Fine-grain (core-level) control
Ø Other energy-efficient techniques:
  Ø DVFS, frequency modulation, etc.
Ø Combining it with an inter-node (MPI) programming model for hybrid applications
Summary
Ø Overview
Ø Motivation
Ø Factors that affect performance & energy optimization
Ø Experimental results
Ø Conclusion & future work