Power and Energy Normalized Speedup Models for ...async.org.uk/presentations/ACSD_34.pdf ·...
-
Power and Energy Normalized Speedup Models for Heterogeneous Many Core Computing
Mohammed A. N. Al-hayanni¹, Ashur Rafiev², Rishad Shafik¹, Fei Xia¹
School of EEE¹ and CS², Newcastle University, Newcastle upon Tyne, NE1 7RU, UK
ACSD 2016
-
Outline
• Existing speedup models
• Motivation
• Extended heterogeneous speedup models
• Power consumption models
• Power and energy normalized speedup
• Experimental results and cross-validation
• Conclusions
-
Amdahl’s Law
• Fixed workload
  – 50% parallelizable ($P = 0.5$)
  – On a sequential processor (single core), takes 1 unit of time to complete
-
Amdahl’s Law
• With two cores…
  – Parallelizable part is distributed between the two cores
  – Total time 0.75
  – Speedup = 1/0.75 = 1.333
-
Amdahl’s Law
• With three cores…
  – Speedup = 1/0.667 = 1.5
-
Amdahl’s Law
• With ten cores…
  – Speedup = 1.82

$SP(N) = \frac{T(1)}{T(N)} = \frac{1}{(1-P) + \frac{P}{N}}$
-
Amdahl’s Law
• With ∞ cores…
  – Speedup = 2

$SP(N) = \frac{T(1)}{T(N)} = \frac{1}{(1-P) + \frac{P}{N}}$
$SP(\infty) = \frac{1}{1-P}$
-
Amdahl’s Law

$SP(N) = \frac{T(1)}{T(N)} = \frac{1}{(1-P) + \frac{P}{N}}$
$SP(\infty) = \frac{1}{1-P}$
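The worked numbers on the preceding slides (1.333, 1.5, 1.82 and the limit of 2 for P = 0.5) can be reproduced with a few lines of Python. This is a minimal sketch; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and do not come from the slides.

```python
def amdahl_speedup(p: float, n: float) -> float:
    """Amdahl's law: fixed workload, parallel fraction p, n identical cores."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # Sequential part takes (1 - p); parallel part is split over n cores.
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p: float) -> float:
    """Speedup as n grows without bound; unbounded when p == 1."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    for cores in (2, 3, 10):
        print(cores, round(amdahl_speedup(0.5, cores), 3))
    print("limit", amdahl_limit(0.5))
```

The limit shows why adding cores stops paying off under a fixed workload: the sequential half of the job still takes its full time.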
-
Gustafson’s Law
• Workload scales with computing facilities
  – 50% parallelizable ($P = 0.5$)
  – On a sequential processor (single core), takes 1 unit of time to complete workload $W(1)$ designed with a single core in mind

[Figure: OpenCL host operation, then OCL kernels for 480x270 pixels]
-
Gustafson’s Law
• With four cores…
  – Complete 960x540 image in the same time
  – 0.5 x 5 = 2.5 times the workload (speedup = 2.5)

[Figure: OpenCL host operation, then four parallel OCL kernels for 480x270 pixels each]
-
Gustafson’s Law
• With 16 cores…
  – Complete FHD image in the same time
  – Speedup = 0.5 x 17 = 8.5

[Figure: OpenCL host operation, then parallel OCL kernels for 480x270 pixels each, one per core]
-
Gustafson’s Law
• Speedup is the ratio between the improved workload and the workload before improvement
  – Calculated at fixed time

$W(1) = (1-P)W + PW$
$W(N) = (1-P)W + PNW$
$SP(N) = \frac{W(N)}{W(1)} = (1-P) + PN$
-
Gustafson’s Law

$W(1) = (1-P)W + PW$
$W(N) = (1-P)W + PNW$
$SP(N) = \frac{W(N)}{W(1)} = (1-P) + PN$
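The image-tiling example can be checked numerically. The sketch below assumes one 480x270 tile per core, as in the slides, and takes FHD to mean 1920x1080; the names `gustafson_speedup` and `TILE_PIXELS` are illustrative.

```python
def gustafson_speedup(p: float, n: int) -> float:
    """Gustafson's law: the parallel part of the workload grows with n at fixed time."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return (1.0 - p) + p * n


# Each core processes one 480x270 tile in the slide example.
TILE_PIXELS = 480 * 270

if __name__ == "__main__":
    for cores, (width, height) in ((4, (960, 540)), (16, (1920, 1080))):
        # The scaled image is exactly `cores` tiles large.
        assert cores * TILE_PIXELS == width * height
        print(cores, gustafson_speedup(0.5, cores))
```

With P = 0.5 the speedup grows linearly with the core count because the workload, not the execution time, is what scales.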
-
Sun-Ni Law
• Memory-bound speedup model
  – Parallel workload per core restricted by memory structure (multi-level caches, shared memory/interfaces, etc.)
  – One core’s workload capability restricted by $M$, the memory of one core; $N$ cores’ workload capability restricted by $N \times M$
  – For the $P$ part:

$W(1) = h(M)$
$W(N) = h(N \times M) = h(N \times h^{-1}(W(1)))$
-
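Sun-Ni Law
• Multi-core speedup is derived as follows:

$W(N) = (1-P)W(1) + P \times h(N \times h^{-1}(W(1)))$
$W(1) = (1-P)W(1) + P \times h(M)$
$T(N) = (1-P)W(1) + \frac{P \times h(N \times h^{-1}(W(1)))}{N}$ (time for the scaled workload on $N$ cores, at unit single-core speed)

$SP(N) = \frac{W(N)}{T(N)} = \frac{(1-P)W(1) + P \times h(N \times h^{-1}(W(1)))}{(1-P)W(1) + \frac{P \times h(N \times h^{-1}(W(1)))}{N}}$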
-
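Sun-Ni Law
• Trying to remove $W(1)$:

$SP(N) = \frac{(1-P)W(1) + P \times h(N \times h^{-1}(W(1)))}{(1-P)W(1) + \frac{P \times h(N \times h^{-1}(W(1)))}{N}}$

If $h(M) = aM^b$ with rational $a$ and $b$:

$h(N \times M) = N^b \times aM^b = N^b \times h(M)$, where $h(N \times h^{-1}(W(1))) = N^b \times h(h^{-1}(W(1))) = N^b \times W(1)$

$SP(N) = \frac{(1-P) + P \times g(N)}{(1-P) + \frac{P \times g(N)}{N}}$, with $g(N) = N^b$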
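The step that removes W(1) relies on a power-law link between memory and workload: if the workload that fits in memory M is a·M^b, then scaling memory by N scales the workload by N^b regardless of M. A quick numeric check, with the function names `h` and `g` and all constants chosen here purely for illustration:

```python
import math


def h(memory: float, a: float, b: float) -> float:
    """Assumed power-law workload that fits in a given amount of memory."""
    return a * memory ** b


def g(n: float, b: float) -> float:
    """Workload scaling factor when memory grows n-fold under the power law."""
    return n ** b


# Arbitrary illustrative constants, not taken from the slides.
a, b, m = 2.0, 1.5, 3.0
for n in (1, 2, 8, 16):
    # h(n * m) factorises into g(n) * h(m), so m (and hence W(1)) cancels out.
    assert math.isclose(h(n * m, a, b), g(n, b) * h(m, a, b))
```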
-
Sun-Ni Law
• Depending on $g(N)$
  – Sub-linear scaling (Amdahl’s if $g(N) = 1$)
  – Linear scaling (Gustafson’s if $g(N) = N$)
  – Super-linear scaling (if $g(N) > N$)
• If you have more memory than cores, and the problem is memory-bound, you can scale to a higher speedup than your cores allow for compute-bound problems
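The three regimes can be seen by plugging different scaling functions into one formula. A minimal sketch, assuming the memory-bound speedup has the form ((1−P) + P·g(N)) / ((1−P) + P·g(N)/N); the name `sun_ni_speedup` is illustrative.

```python
from typing import Callable


def sun_ni_speedup(p: float, n: int, g: Callable[[int], float]) -> float:
    """Memory-bound speedup with workload scaling function g."""
    scale = g(n)
    return ((1.0 - p) + p * scale) / ((1.0 - p) + p * scale / n)


if __name__ == "__main__":
    p, n = 0.5, 4
    print("g(N) = 1      ", sun_ni_speedup(p, n, lambda k: 1.0))       # Amdahl's case
    print("g(N) = N      ", sun_ni_speedup(p, n, lambda k: float(k)))  # Gustafson's case
    print("g(N) = N^(3/2)", sun_ni_speedup(p, n, lambda k: k ** 1.5))  # super-linear scaling
```

Passing `g` as a callable keeps the model open to any memory scaling law, not only the power-law form.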
-
Comparing the three models

[Figure: three panels plotting speed-up $S(k)$ against core count $k$ (both axes up to 20) for a range of $p$ values between 0 and 1]
1. Amdahl's Law: $S(k) = \frac{1}{(1-p) + \frac{p}{k}}$
2. Gustafson's Model: $S(k) = (1-p) + pk$
3. Sun and Ni's Model: $S(k) = \frac{(1-p) + p\,g(k)}{(1-p) + \frac{p\,g(k)}{k}}$, plotted with $g(k) = k^{3/2}$
-
Motivation
• Extend to a higher degree of core heterogeneity
• Extend to power/energy/efficiency and cover modes like dynamic voltage and frequency scaling (DVFS)
• Potential applications in run-time management systems of parallel systems
-
Existing Speedup Models and the Extended Model

Model            Homogeneity  Heterogeneity  Power  Amdahl  Gustafson  Sun and Ni
Amdahl           Yes          No             No     Yes     No         No
Gustafson        Yes          No             No     Yes     Yes        No
Sun and Ni       Yes          No             No     Yes     Yes        Yes
Hill-Marty       Yes          Simple         No     Yes     No         No
Hao and Xie      Yes          Simple         No     Yes     Yes        Yes
Woo and Lee      Yes          Simple         Yes    Yes     No         No
Sun and Chen     Yes          No             No     Yes     Yes        Yes
Extended Model   Yes          Normal         Yes    Yes     Yes        Yes
-
Heterogeneity
• Existing heterogeneity models include ‘asymmetric’ and ‘dynamic’ structures (b) ☹
• We extend to cover the normal form of core heterogeneity ☺
• Still iso-ISA and not fully general ☹
-
Heterogeneity
• Parallel computation may not all finish together (Amdahl’s)
-
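BCE performance equivalence
• Calculating the performance-equivalent number of BCEs
  – Based on the slowest (last to finish) core
  – $\max \alpha$, $\mathrm{avr}\,\alpha$, $\mathrm{opt}\,\alpha$ also possible

$N_\alpha = \min \alpha \cdot \sum_i N_i$
$\mathbf{N} = (N_1, N_2, \ldots, N_X)$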
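As a sketch of the slowest-core rule: when every core is held to the pace of the slowest core type, each core contributes min α base-core equivalents. The function name `n_alpha` and the example coefficients are hypothetical, and restricting the minimum to core types that are actually in use is an assumption made here.

```python
from typing import Sequence


def n_alpha(counts: Sequence[int], alphas: Sequence[float]) -> float:
    """Performance-equivalent number of BCEs when syncing on the slowest core."""
    if len(counts) != len(alphas):
        raise ValueError("counts and alphas must have the same length")
    # Assumption: only core types with at least one active core set the pace.
    used = [alpha for count, alpha in zip(counts, alphas) if count > 0]
    if not used:
        raise ValueError("at least one core must be in use")
    return min(used) * sum(counts)


if __name__ == "__main__":
    # Hypothetical platform: 4 cores at speed 1.0 and 4 cores at speed 2.0.
    print(n_alpha([4, 4], [1.0, 2.0]))
```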
-
Speedup extension: Amdahl’s
• Calculating the performance-equivalent number of BCEs
  – Based on the slowest (last to finish) core

$N_\alpha = \min \alpha \cdot \sum_i N_i$
$SP(\mathbf{N}) = \frac{1}{\frac{1-P}{\alpha_X} + \frac{P}{N_\alpha}}$ (Amdahl’s)
-
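Speedup extension: Amdahl’s
• Calculating the performance-equivalent number of BCEs
  – Based on the slowest (last to finish) core

$N_\alpha = \min \alpha \cdot \sum_i N_i$
$SP(\mathbf{N}) = \frac{1}{\frac{1-P}{\alpha_X} + \frac{P}{N_\alpha}}$ (Amdahl’s)

  – Sequential on fastest core, if $\alpha_X$ is fastest
  – Parallel synced to the slowest, in case of $\min \alpha$
  – Homogeneous form, for comparison: $SP(N) = \frac{T(1)}{T(N)} = \frac{1}{(1-P) + \frac{P}{N}}$; now in vector space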
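A minimal sketch of the heterogeneous Amdahl form, assuming the sequential part runs on one chosen core type and the parallel part is paced by the slowest core type in use. The function name, argument layout and example coefficients are hypothetical.

```python
from typing import Sequence


def extended_amdahl(p: float, counts: Sequence[int],
                    alphas: Sequence[float], seq_type: int) -> float:
    """Heterogeneous Amdahl speedup relative to one base core equivalent (BCE)."""
    used = [alpha for count, alpha in zip(counts, alphas) if count > 0]
    n_alpha = min(used) * sum(counts)  # performance-equivalent number of BCEs
    alpha_x = alphas[seq_type]         # speed of the core running the sequential part
    return 1.0 / ((1.0 - p) / alpha_x + p / n_alpha)


if __name__ == "__main__":
    # Hypothetical platform: 4 slow cores (alpha 1.0) and 4 fast cores (alpha 2.0),
    # with the sequential part placed on a fast core (type index 1).
    print(extended_amdahl(0.5, [4, 4], [1.0, 2.0], seq_type=1))
```

With a single core type and α = 1 the expression collapses to the classic homogeneous form.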
-
Speedup extension: Gustafson’s
• Gustafson’s speedup model extension
  – Again assuming sequential on a type $X$ core and parallel on $N_\alpha$
  – Sequential on fastest core, if $\alpha_X$ is fastest
  – Parallel synced to the slowest, in case of $\min \alpha$

Homogeneous: $SP(N) = (1-P) + PN$
Extended: $SP(\mathbf{N}) = (1-P)\alpha_X + P N_\alpha$
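The heterogeneous Gustafson form can be sketched the same way: in a fixed time the sequential phase completes work in proportion to the speed of its core, and the parallel phase in proportion to the performance-equivalent core count. The parameter names `alpha_x` and `n_alpha` and the example values are hypothetical.

```python
def extended_gustafson(p: float, alpha_x: float, n_alpha: float) -> float:
    """Heterogeneous Gustafson speedup relative to one base core equivalent (BCE).

    alpha_x: relative speed of the core that runs the sequential part.
    n_alpha: performance-equivalent number of BCEs for the parallel part.
    """
    return (1.0 - p) * alpha_x + p * n_alpha


if __name__ == "__main__":
    # Hypothetical values: sequential core twice as fast as a BCE, 8 equivalent BCEs.
    print(extended_gustafson(0.5, alpha_x=2.0, n_alpha=8.0))
    # Homogeneous check: alpha_x = 1 and n_alpha = N gives the classic form.
    print(extended_gustafson(0.5, alpha_x=1.0, n_alpha=4.0))
```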
-
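Speedup extension: Sun-Ni’s
• Extending Sun-Ni’s model
  – Again assuming sequential on a type $X$ core and parallel on $N_\alpha$
  – Sequential on fastest core, if $\alpha_X$ is fastest
  – Parallel synced to the slowest, in case of $\min \alpha$
  – Memory-bound function $g(N)$ for all cores

Homogeneous: $SP(N) = \frac{(1-P) + P\,g(N)}{(1-P) + \frac{P\,g(N)}{N}}$
Extended: $SP(\mathbf{N}) = \frac{(1-P) + P\,g(N)}{\frac{1-P}{\alpha_X} + \frac{P\,g(N)}{N_\alpha}}$

Reduces to extended Amdahl’s and Gustafson’s as expected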
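A sketch of the heterogeneous memory-bound form, taking the value of the scaling function as a plain number so that no particular memory law is assumed. The names `alpha_x`, `n_alpha` and `g_value` and the example values are hypothetical; the check shown covers only the reduction to the heterogeneous Amdahl form.

```python
def extended_sun_ni(p: float, alpha_x: float, n_alpha: float, g_value: float) -> float:
    """Heterogeneous memory-bound speedup.

    alpha_x: relative speed of the core running the sequential part.
    n_alpha: performance-equivalent number of BCEs for the parallel part.
    g_value: workload scaling factor g(N) for the parallel part.
    """
    scaled_work = (1.0 - p) + p * g_value
    scaled_time = (1.0 - p) / alpha_x + p * g_value / n_alpha
    return scaled_work / scaled_time


if __name__ == "__main__":
    # With g = 1 the workload is fixed and the heterogeneous Amdahl form results.
    print(extended_sun_ni(0.5, alpha_x=2.0, n_alpha=8.0, g_value=1.0))
    # A larger, purely illustrative g value models a memory-scaled workload.
    print(extended_sun_ni(0.5, alpha_x=2.0, n_alpha=8.0, g_value=8.0))
```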
-
Power
• Divide power into effective and idle
  – Effective power: power used by workload
  – Idle power includes both static power and active power that’s not used by workload

$W_{total} = W(N) + W_{idle}$
$W(N) = \frac{W_s T_s(N) + W_p T_p(N)}{T_s(N) + T_p(N)}$
$W_s = \beta_X W_1$, where $W_1$ is the power of one BCE
$W_p = N_\beta W_1$
$W_{idle} = N_i \cdot W_i$ when $N_i$ BCEs are idle
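One way to read the effective-power expression is as a time-weighted average of the power drawn in the sequential and parallel phases, with idle power added on top. The sketch below follows that reading; the function and parameter names and all example values are hypothetical.

```python
def effective_power(w_seq: float, t_seq: float, w_par: float, t_par: float) -> float:
    """Time-weighted average of sequential-phase and parallel-phase power."""
    total_time = t_seq + t_par
    if total_time <= 0:
        raise ValueError("total execution time must be positive")
    return (w_seq * t_seq + w_par * t_par) / total_time


def total_power(w_effective: float, n_idle: int, w_idle_per_bce: float) -> float:
    """Effective power plus the idle power of n_idle idle base core equivalents."""
    return w_effective + n_idle * w_idle_per_bce


if __name__ == "__main__":
    # Hypothetical numbers: 2 W for 0.25 time units, then 6 W for 0.0625 time units.
    w_eff = effective_power(2.0, 0.25, 6.0, 0.0625)
    print(w_eff, total_power(w_eff, n_idle=2, w_idle_per_bce=0.1))
```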
-
Effective power
• The βs are similar to the αs, but pertain to power
  – An $i$th-type core consumes $\beta_i W_1$ power and has $\alpha_i$ speed (speed of one core is 1; for throughput we deal with speedup, for power we deal with wattage and not ratios)
  – $N_\beta$ is the power-equivalent number of BCEs
  – For synchronizing on the slowest core:

$N_\beta = \min \alpha \cdot \sum_i \frac{\beta_i N_i}{\alpha_i}$
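Under slowest-core synchronization, a core of type i only needs to work at min α / α_i of its capacity, so its power share is scaled by the same ratio. A sketch under that assumption, with a hypothetical function name and coefficient values:

```python
from typing import Sequence


def n_beta(counts: Sequence[int], alphas: Sequence[float],
           betas: Sequence[float]) -> float:
    """Power-equivalent number of BCEs when syncing on the slowest core."""
    used = [(n, a, b) for n, a, b in zip(counts, alphas, betas) if n > 0]
    if not used:
        raise ValueError("at least one core must be in use")
    slowest = min(a for _, a, _ in used)
    # Each core type contributes beta_i * N_i, scaled down by its speed alpha_i.
    return slowest * sum(n * b / a for n, a, b in used)


if __name__ == "__main__":
    # Hypothetical platform: 4 cores (alpha 1, beta 1) and 4 cores (alpha 2, beta 4).
    print(n_beta([4, 4], [1.0, 2.0], [1.0, 4.0]))
```

With α_i = β_i = 1 the result equals the plain core count, matching the homogeneous case.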
-
Effective power
• Effective power formulas have been derived for all three types of models/laws
  – Can be viewed as results of ‘power scaling’ with PS functions:

$W(N) = SP(N) \cdot PS(N) \cdot W_1$

With $\alpha_i = \beta_i = 1, \forall i$, and $\mathbf{N} = N$, all models transform to homogeneous forms
-
Efficiency
• Power-normalized performance
  – IPS/Watt
• Energy per instruction
  – Joules/Instruction
• Energy-normalized performance
  – IPS/Joule
• All models in the paper
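The three efficiency metrics are simple ratios. The sketch below assumes a constant throughput and power over the run, with energy taken as power multiplied by elapsed time; the function names and the example figures are hypothetical.

```python
def ips_per_watt(ips: float, watts: float) -> float:
    """Power-normalized performance: instructions per second per watt."""
    return ips / watts


def joules_per_instruction(ips: float, watts: float) -> float:
    """Energy per instruction: watts are joules per second, divided by IPS."""
    return watts / ips


def ips_per_joule(ips: float, watts: float, seconds: float) -> float:
    """Energy-normalized performance: IPS over the energy used in the run."""
    return ips / (watts * seconds)


if __name__ == "__main__":
    # Hypothetical run: 2e9 instructions per second at 4 W for 10 seconds.
    print(ips_per_watt(2e9, 4.0))
    print(joules_per_instruction(2e9, 4.0))
    print(ips_per_joule(2e9, 4.0, 10.0))
```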
-
Experimental platform
-
System characterization
• Build parameters through experimentation
  – $W_{A7}$, $W_{A15}$, $W_{idle}$ (for the ‘whole’; did not try differentiating different $W_i$; cores not turned off even in $N_{A7} = 0$ and $N_{A15} = 0$ cases)
  – $\alpha_{A7}$, $\alpha_{A15}$, $\beta_{A7}$, $\beta_{A15}$
  – May be different for different apps/computations
  – CPU-heavy tasks mainly experimented in this initial study, $\min \alpha$ scheduling
  – log, sqrt, and integer arithmetic tested
-
Exploration with models
• Run models with various execution scenarios to investigate the effects of $P$, DVFS, core scaling, etc. (large database available from technical report, some example data in the paper)
http://async.org.uk/tech-reports/NCL-EEE-MICRO-TR-2016-198.pdf
-
Exploration with models
[Figure: model exploration plots for $P = 0.9$ and $P = 0.1$]
-
Cross-validation
• Amdahl’s with programmer-controlled $P$ values

[Figure annotations: Max 0.24%, Max 2.17%]
-
Conclusions
• Extended popular parallelization speedup models to cover a wider range of iso-ISA (or coefficient-equivalent ISA) heterogeneity
• Extended power models and efficiency models
• First cross-validation study successful
• Need to investigate even wider scopes of heterogeneity
• Need to study other speedup models (e.g. Downey’s)
• Need to investigate realistic memory subsystems (cache misses)
• Need to explore using these models for run-time management