Improving the Performance of the MILC Code on Intel Knights Landing: An Overview
IXPUG 2017 Fall Meeting
September 26th–28th, 2017, Texas Advanced Computing Center (TACC)
Austin, TX

MILC on KNL Working Group
1. Indiana University: S. Gottlieb, R. Li
2. University of Utah: C. DeTar
3. University of Arizona: D. Toussaint
4. Intel: K. Raman, A. Jha, D. Kalamkar, M. Tolubaeva, T. Phung, R. Malladi
5. Tata Consultancy Services (TCS): G. Bhaskar, P. Gaurav, J. Bhat
6. Jefferson Lab: B. Joó
7. LBL/NERSC: D. Doerfler
• This effort is supported by the Intel Parallel Computing Center at Indiana University
• And is a NERSC/NESAP Tier 1 code in the Cori Phase 2 (KNL) project
Impact of MILC QCD Simulations
• Measuring the fundamental parameters of the Standard Model of particle physics
• And looking for deviations which suggest physics NOT accounted for, i.e. New Physics!
• Method is to use Monte Carlo evaluation of the quantum mechanical path integral (sketched below)
• Plot shows results achieved over the last several years, using resources from multiple facilities
• As the physical grid (lattice) spacing decreases, the computational complexity increases
• Cori is helping with the calculations associated with a = 0.043 femtometers
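As a reminder of what is being computed, the standard lattice path-integral expectation value in textbook form (not taken verbatim from the slides):

    \langle O \rangle = \frac{1}{Z} \int \mathcal{D}U \; O[U] \, e^{-S[U]},
    \qquad Z = \int \mathcal{D}U \, e^{-S[U]}

Monte Carlo methods estimate this by sampling gauge configurations U with weight e^{-S[U]} and averaging O over the samples.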
[Figure: Ratio of decay constants of the K meson to the pion]
Benchmarking Platforms
• Endeavor – Intel
  – Intel's internal development cluster
  – Knights Landing (KNL) and Skylake (SKX) nodes
  – Intel OPA high-speed interconnect
• Cori – NERSC
  – Cray XC-40 architecture
  – Intel Haswell & KNL processors
• Edison – NERSC
  – Cray XC-30 architecture
  – Intel Ivy Bridge processors
• Theta – Argonne National Lab
  – Cray XC-40 with KNL nodes
• Stampede – TACC
  – Dell, Intel and Seagate
  – Intel KNL with OPA interconnect
MILC Computational Phases: Optimizations & Performance
Where Does MILC Spend Its Time?
• Representative runtime breakdown (single node)
• su3_rhmd_hisq application
[Figure: stacked-bar runtime breakdown in seconds (0–600) for a 16x16x16x16 lattice; configurations: Haswell 2 MPI + 16 OMP, Haswell MPI-only 32 ranks, Haswell OMP-only 32 threads, KNL MPI-only 64 ranks, KNL OMP-only 64 threads; phases: Staggered CG, FF, GF, FL, others]
Profiling Tools and Methods
• Intel VTune and Advisor
• Good ol' Linux prof and gprof
• Roofline analysis
  – e.g. used to look at how close we are to the KNL MCDRAM BW roofline
• Integrated Performance Monitoring (IPM) tool
• And, MILC has extensive timing support, in particular for functions known to be time consuming (time in sec. and GF/s reports)…
  – However, we found some timings didn't add up
• Needed additional timing for code not represented
• And this led to OpenMP being added to routines that were assumed to be trivial
  – e.g. Update_u() calls from the "trajectory update"
• Function speedup after threading was 42.5x
• Resulting in an overall 12% improvement to the trajectory update
OpenMP Enhancements in Baseline MILC

Directory      # of files total   # files with candidate loops   # of loops   # of loops modified   % of loops rewritten
generic              102                    43                       229               71                  31%
generic_ks           185                    42                       529               17                  3.2%
ks_imp_rhmc           20                     8                        36                4                  11%
• MILC abstracts loops with the FORALLSITES and FORALLFIELDSITES macros.
  – Convert to the FORALLxxx_OMP macros (the pattern involved is sketched below)
  – Candidate OpenMP loops need to be examined for OMP private variables and reductions
• Examination of all loops would be tedious, so we used the tools mentioned on the previous slide to help identify the loops with potentially the most impact
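A minimal sketch of the kind of conversion involved. The MILC FORALLSITES / FORALLxxx_OMP macros hide this pattern; the plain loop below is illustrative, not the actual macro expansion:

    /* Illustrative only: an OpenMP-threaded site loop. Each candidate
     * loop must be checked for variables that need to be private and
     * for reduction variables, as noted above. */
    #include <omp.h>

    double site_sum(const double *site_value, int n_sites) {
        double total = 0.0;   /* reduction variable */
        int i;                /* loop index must be private */
        #pragma omp parallel for private(i) reduction(+:total)
        for (i = 0; i < n_sites; i++) {
            total += site_value[i];   /* per-site work goes here */
        }
        return total;
    }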
Integrating the QPhiX Solver for Staggered Fermions (aka The Big Win)
• QPhiX, Staggered Fermion version, developed to improve vectorization and threading performance
• Staggered fermions vs. Wilson/Clover
  – Looking at a different "action" than Wilson/Clover
  – Multi-mass CG solver implementation
  – Uses a single right hand side
    → limits reuse and decreases arithmetic intensity
  – 3 complex values per grid point vs. 12
  – Primarily used with double-precision variables
• X-Y blocking for SIMD vectorization (layout sketched after the repo link below)
  – Data stored as arrays of structures of arrays
  – SoA in the X dimension
  – SIMD_width/SoA_length in the Y dimension
  – Enables efficient cache blocking of X-Z
[Figure: SIMD-blocking diagram — SoA length along x, SIMD width spanning y]
https://github.com/JeffersonLab/qphix
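A minimal sketch of an array-of-SoA layout for a staggered (3-component complex) field, assuming double precision; SOA_LENGTH and the type names are hypothetical, not the actual QPhiX types:

    /* Illustrative SoA-within-AoS layout. SOA_LENGTH consecutive x-sites
     * are packed together so SIMD lanes work on neighboring sites. */
    #define SOA_LENGTH 8              /* sites packed contiguously in x  */
    #define NCOLOR     3              /* 3 complex values per grid point */

    typedef struct {
        double re[NCOLOR][SOA_LENGTH];   /* real parts, vector-friendly  */
        double im[NCOLOR][SOA_LENGTH];   /* imaginary parts              */
    } su3_vector_block;

    /* The full field is then an ordinary array of these blocks, e.g.
     *   su3_vector_block field[NY][NX / SOA_LENGTH];
     * with SIMD_width/SOA_LENGTH blocks along y handled per vector op. */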
Multi-mass CG Solver (Single Node)
• The QPhiX solver provides a 1.5x speedup for the L=32 lattice
• For small lattice sizes, performance is limited by data remapping time
[Figures: MILC CG performance on KNL, baseline and with QPhiX; GFLOPs/sec vs. increasing MPI ranks / decreasing threads (1, 2, 4, 8, 16); curves for L=16 and L=32]
Gauge Force (Single Node)
• Gauge Force performance improvements are up to 4x for the L=32 lattice size
[Figures: MILC GF performance on KNL, baseline and with QPhiX; GFLOPs/sec vs. increasing MPI ranks / decreasing threads (1, 2, 4, 8, 16); curves for L=16 and L=32]
Skylake vs. KNL
• Improvements translate to the latest Xeon processor with AVX-512 as well
• Architectures behave differently with respect to rank/thread tradeoffs
[Figures: MILC CG (w/ QPhiX) and MILC GF (w/ QPhiX), L=32; GFLOPs/sec vs. increasing MPI ranks / decreasing threads (1, 2, 4, 8, 16); curves for SKX base, SKX, and KNL]
CG Performance on SKX+OPA (Multi-Node)
• Weak scaling runs on Intel Xeon® 6148 Gold + Intel® OPA on Intel's Endeavor cluster
• Requires a minimum of 2 ranks/node for best performance (1 rank per NUMA node)
• CG solver performance is limited by memory bandwidth on Xeons; hence, no improvement with QPhiX
• Parallel efficiency:
  – ~99% @ 64 nodes for 32^4 lattice volume per node
[Figures: Multi-mass CG, with QPhiX and baseline MILC, on Intel Xeon® 6148 Gold + Intel® OPA; lattice volume per node 32^4; GFLOPS/sec/node vs. log2(nodes), for 1, 2, 4, 8, 16 and 32 ranks/node]
GF Performance on SKX+OPA (Multi-Node)
• Weak scaling runs on Intel Xeon® 6148 Gold + Intel® OPA on Intel's Endeavor cluster
• ~3.5x node-level performance improvement with QPhiX
• Multiple ranks/node gives better performance for Gauge Force
• Parallel efficiency:
  – ~85% @ 64 nodes for 32^4 lattice volume per node
[Figures: Gauge Force, with QPhiX and baseline MILC, on Intel Xeon® 6148 Gold + Intel® OPA; lattice volume per node 32^4; GFLOPS/sec/node vs. log2(nodes), for 1, 2, 4, 8, 16 and 32 ranks/node]
MPI Messaging Characteristics, 16 KNL Nodes
• Message sizes vary with the number of ranks per node (RPN)
• As RPN decreases (see the sketch below)
  – Lattice size (volume) per rank increases -> message sizes increase
  – Surface-to-volume decreases -> total amount of data decreases (proportional to 1/N^4)
  – Same analogy with increasing lattice size (volume) per node
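As a back-of-the-envelope illustration of why fewer ranks per node means larger messages, here is a small C sketch. The 48 bytes/site (3 complex doubles) and the particular rank decomposition are assumptions for illustration, not MILC code:

    #include <stdio.h>

    /* Bytes exchanged across one face of a 4D local lattice, assuming
     * 3 complex doubles (48 bytes) per boundary site. Illustrative only. */
    static long face_bytes(long nx, long ny, long nz, long nt, int dir) {
        long dims[4] = {nx, ny, nz, nt};
        long sites = 1;
        for (int d = 0; d < 4; d++)
            if (d != dir) sites *= dims[d];   /* face = product of the other dims */
        return sites * 48;
    }

    int main(void) {
        /* e.g. a 32^4 node volume split over 64 ranks (local 8x8x8x32)
         * vs. kept on 1 rank (local 32^4) */
        printf("64 rpn, x-face: %ld bytes\n", face_bytes(8, 8, 8, 32, 0));
        printf(" 1 rpn, x-face: %ld bytes\n", face_bytes(32, 32, 32, 32, 0));
        return 0;
    }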
[Figures: MILC Multi-mass CG on 16 nodes, L=16 — (left) total time in seconds vs. ranks per node (64 down to 1), broken into Compute, Irecv, Isend, Allreduce, Wait; (right) number of messages (1e6, log scale) vs. message size (6K–256K bytes) for 1 to 64 ranks per node]
MPI Characteristics, 512 Nodes
• Caveat 1: at this scale, the "full" IPM instrumentation impacts absolute time, but relative times "should" be representative (I need more confidence that this is true)
• Caveat 2: results are from a single trial; there could be up to 10% variability
[Figures: total time on 512 nodes, L=16x32x32x32, vs. ranks per node (64 down to 1), broken into Compute, Irecv, Isend, Allreduce, Wait — QPhiX Multi-mass CG and MILC Multi-mass CG]
Cori (Cray Aries) Huge Pages Optimization
• The MPI message-rate micro-benchmark identified an issue where BW drops significantly when transitioning to the Rendezvous protocol
• Two solutions tried:
  – Move the Rendezvous transition to 64KB
  – Per Cray advice, use huge pages, 2M pages in this case
• This has a significant impact on performance when using a large number of MPI ranks per node
• Recommendation → use huge pages in communication-intensive codes with moderate message sizes
[Figures: SMB message-rate bandwidth on Cori/KNL, GB/s vs. message size (8 bytes–1M) — (left) for r=1 to r=64 ranks against the 1/2 BW line, annotated with the Eager and Rendezvous protocol regions; (right) r=8 baseline vs. r=8 with 2M huge pages and r=8 with eager=64K. Also: MILC CG solve time, 432 nodes, 72x72x72x144 lattice, seconds vs. ranks per node / threads per rank (64/1 to 1/64), for eager=8KB, eager=64KB, and with 2M huge pages]
Roofline Analysis
• Helps to focus on areas of code to target for optimization (the roofline relation is sketched below)
• MILC is known to be memory-BW bound, and the QPhiX version is near the roofline model, but the baseline version has higher AI and lower performance?
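For reference, the textbook roofline relation the plot is based on (standard form, my notation, not copied from the slide):

    P_{\text{attainable}} = \min\left( P_{\text{peak}},\; \mathrm{AI} \times BW \right)

where AI is the arithmetic intensity in FLOP/byte and BW is the relevant memory bandwidth (MCDRAM or DDR on KNL).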
[Figures: (left) roofline model for KNL; (right) MILC performance vs. roofline — MILC staggered CG solve roofline for KNL, 28x28x28x28 lattice, GFLOP/s vs. arithmetic intensity (FLOP/byte), showing the empirical MCDRAM and DDR rooflines and points for Baseline w/ MCDRAM, QPhiX w/ MCDRAM, and QPhiX w/ DDR]
Impact on Production Computing
The Rubber Hitting the Road
• Current Cori (KNL) production runs
  – 96x96x96x192 lattice on 128 nodes
  – ks_spectrum_hisq: calculation of meson and baryon spectra from a wide variety of sources and sinks
  – su3_clov_hisq: generates clover propagators and contracts meson and baryon two-point functions
• The QPhiX version of the multi-mass staggered fermion solver is being used
                        Staggered CG     Clover BiCG
Standard MILC (1)       40 GF/s          21–69 (2) GF/s
QPhiX (3)               52 GF/s          69 GF/s
QPhiX Improvement       1.3x             1x to 3.3x

(1) 64 ranks per node, 1 thread per rank, includes OpenMP improvements
(2) 256 nodes running a different problem
(3) 32 ranks per node, 8 threads per rank
Advantage of Using Cori's Burst Buffer for I/O
• I/O overhead is significantly reduced by using Cori's Burst Buffer (BB) subsystem
• Typical wall clock for a 128-node run is ~5 hours
• Using the nominal Lustre /scratch, 53 minutes are spent writing 61 temporary files
• Using the BB, I/O was reduced to 26 minutes
  – a 2x improvement
Current and Future Efforts
• Fermion Force calculation optimizations
• Fat Links calculation optimizations
• High-speed interconnect explorations
  – Multiple-endpoint implementation of OPA MPI
• Continue improvements to OpenMP
  – Roofline analysis shows that baseline MILC is perhaps not BW bound
• Investigating the Grid solver
• Continue working on the SciDAC/QOP version of the code
Discussion & Questions
Backup
CG Performance on SKX+OPA (Multi-Node)
• Weak scaling runs on Intel Xeon® 6148 Gold + Intel® OPA on Intel's Endeavor cluster
• Requires a minimum of 2 ranks/node for best performance (1 rank per NUMA node)
• The smaller volume (16^4 per node) becomes communication sensitive at larger node counts
• Parallel efficiency:
  – ~80% @ 64 nodes for 16^4 lattice volume per node
[Figures: Multi-mass CG, with QPhiX and baseline MILC, on Intel Xeon® 6148 Gold + Intel® OPA; lattice volume per node 16^4; GFLOPS/sec/node vs. log2(nodes), for 1, 2, 4, 8, 16 and 32 ranks/node]
GF Performance on SKX+OPA (Multi-Node)
• Weak scaling runs on Intel Xeon® 6148 Gold + Intel® OPA on Intel's Endeavor cluster
• ~3.5x node-level performance improvement with QPhiX
  – Not seeing LLC effects with QPhiX as seen with baseline MILC (need to investigate)
• Multiple ranks/node gives better performance for Gauge Force
• Parallel efficiency:
  – ~92% @ 64 nodes for 16^4 lattice volume per node
[Figures: Gauge Force, with QPhiX and baseline MILC, on Intel Xeon® 6148 Gold + Intel® OPA; lattice volume per node 16^4; GFLOPS/sec/node vs. log2(nodes), for 1, 2, 4, 8, 16 and 32 ranks/node]
Communication Profile
• Profile collected using the Intel® MPI Performance Snapshot tool (part of ITAC)
• The smaller lattice volume (16^4) spends a high % of time in MPI, as expected
[Figures: Compute vs. MPI time for the CG solver at 8 MPI ranks/node, % time vs. log2(nodes) — for 32^4 lattice volume per node, computation is 94.6 / 91.9 / 91.8 / 90.5 / 89.8 / 87.5% at 1, 4, 8, 16, 32, 64 nodes; for 16^4 per node, computation is 81.5 / 76.7 / 76.5 / 75.1 / 70.7% at 1, 4, 8, 16, 32 nodes, with the remainder in MPI]
MPI Function Summary
• Most of the MPI time is spent in MPI_Wait (i.e. P2P Send/Recv completion)
• Collective ops time (i.e. % MPI_Allreduce) increases with node count (the CG dot-product reduction that drives this is sketched below)
  – MPI_Allreduce is ~40% of MPI time at 64 nodes
  – Potential bottleneck at very large node counts
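The Allreduce time comes from the global dot products each CG iteration requires; a minimal sketch (illustrative, not MILC code):

    #include <mpi.h>

    /* Global dot product used inside each CG iteration (illustrative).
     * Every iteration needs at least one reduction like this, which is
     * why collective time grows with node count. */
    double global_dot(const double *a, const double *b, int n_local) {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n_local; i++)
            local += a[i] * b[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;
    }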
[Figure: MPI function summary for the CG solver, 16^4 lattice volume, % of MPI time (normalized) vs. log2(nodes) at 1, 4, 8, 16, 32 nodes — %MPI_Wait: 81.0 / 73.0 / 61.4 / 50.5 / 44.2; %MPI_Allreduce: 13.6 / 20.1 / 27.5 / 30.0 / 40.1; other: 3.7 / 5.8 / 9.8 / 18.0 / 14.5]
Staggered Multi-mass CG: Lattice Scaling Study
MILC baseline:
• 64 MPI ranks per node
Staggered QPhiX:
• 1 MPI rank on 1 node
• up to 16 ranks on multiple nodes
[Figure: QPhiX vs. MILC baseline, various MPI rank / OMP thread combinations, L=24]
Multi-mass CG: Weak Scaling on Cori
Benchmarks: Symanzik Gauge Force
MILC baseline:
• 64 MPI ranks per node
QOPQDP:
• 64 MPI ranks per node
QPhiX:
• 1 MPI rank with 1 node
• 16 ranks/node otherwise
[Figure: GF performance in GFLOPS/node, various lattice sizes per node L, 1 thread/core]
Benchmarks: HISQ Fermion Force
MILC baseline:
• 64 MPI ranks per node
QOPQDP:
• 64 MPI ranks per node
Modified algorithm vs. MILC FF:
• Speedup is due to a reduced number of calculations (FLOPs), down to 17% of baseline
[Figure: speedup of QOPQDP vs. number of ranks (1, 16, 64) for L=16, L=24, L=32]
Cori and Stampede
• Both use the same 68-core KNL SKUs
• Cori uses the Aries high-speed interconnect, Stampede uses Intel OmniPath (OPA)
  – Cray and Intel MPI respectively
• Cori uses Cray CNL, Stampede is ??? Linux
• Standard (non-QPhiX) version of MILC
  – Primary analysis is high-speed interconnect performance, using a relatively small lattice size per node
• Intel C for both, although 17.0.2.x vs. 17.0.0.x
• Limited to a maximum job size of 80 nodes on Stampede -> limited the scaling study to 64 nodes
OSU Single-rank Pt-2-Pt
• Single core, point-to-point between 2 nodes
• Latency is comparable at ~3.1 μs (but about 2x that of Haswell)
• Ping-pong is exactly that: a single message ping-ponged between 2 nodes (a minimal sketch follows below)
• Uni-directional is a "streaming" exchange with a window of size 64
• Bi-directional is also "streaming"
• Cori shows better small-message BW (message rate) and Stampede higher peak BW
[Figures: Ping-Pong, Uni-directional, and Bi-directional bandwidth, GB/s vs. message size (1K–4M bytes), Stampede vs. Cori; annotated with increasing message rate toward smaller messages]
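A minimal sketch of the ping-pong pattern between ranks 0 and 1 (illustrative; the actual OSU benchmarks also sweep message sizes, warm up, etc.):

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { MSG_BYTES = 1 << 20, ITERS = 100 };   /* 1 MB messages */
        static char buf[MSG_BYTES];
        memset(buf, 0, sizeof buf);

        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;
        if (rank == 0)   /* 2 messages of MSG_BYTES per iteration */
            printf("ping-pong BW: %.2f GB/s\n", 2.0 * ITERS * MSG_BYTES / dt / 1e9);
        MPI_Finalize();
        return 0;
    }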
OSU Multi-rank Pt-2-Pt
• As ranks per node (RPN) increase in this uni-directional test, Stampede's performance improves.
• Cori has a BW "ceiling" below 32 RPN that limits large-message BW and that is not observed with Stampede
  – Cray attributes this to a PCI latency issue between KNL and Aries; it can be mitigated by moving the BTE "put" protocol transition to a smaller message (default is 4MB)
• It looks as though Stampede could use some tuning in its transition to a large-message protocol to take advantage of its higher peak BW
[Figures: Stampede and Cori multi-rank bandwidth, GB/s vs. message size (64 bytes–4M), for 1 to 64 ranks per node against the 1/2 BW line; annotated with increasing message rate toward smaller messages]
SMB Multi-node, Multi-rank Stencil
• 16 nodes, 6 neighbors per rank (emulates a 3D stencil communication pattern)
• Measures bi-directional message rate (converted to BW)
• Here we see Stampede is able to achieve close to its peak bi-directional BW (25 GB/s)
  – However, something happens at 1MB message size, to be investigated
• Stampede could also use some tuning in its protocol transition to large messages (>64KB)
[Figures: Stampede and Cori SMB bandwidth, GB/s vs. message size (64 bytes–1M), for 1 to 64 ranks per node; annotated with increasing message rate toward smaller messages]
Weak Scaling: Number of Nodes
• Weak scaled: lattice size is 16x16x16x16 per node
• MPI/OpenMP tradeoff study at 1, 8, 16, 32 and 64 nodes
  – number of cores fixed at 64
  – MPI ranks/node (rpn) varied from 1 to 64
  – All results use 1 thread per core
[Figures: Cori and Stampede Multi-mass CG, L=16; GFLOP/s vs. number of nodes (2^N), for 1 to 64 ranks per node]
Weak Scaling: Ranks per Node
• Weak scaled: lattice size is 16x16x16x16 per node
• MPI/OpenMP tradeoff study at 1, 8, 16, 32 and 64 nodes
  – number of cores fixed at 64
  – MPI ranks/node (rpn) varied from 1 to 64
[Figures: Cori and Stampede Multi-mass CG, L=16; GFLOP/s vs. MPI ranks per node (log scale), for 1, 8, 16, 32 and 64 nodes]
Weak Scaling: Select MPI/OMP
• CG time and CG scaling for selected MPI/OpenMP combinations (scaling efficiency is defined below)
  – Ideal scaling would be a flat horizontal line
• Stampede scaling is better with higher RPN
• There is a scaling improvement on Cori going from 32 to 64 RPN, but I wouldn't read too much into it at such a small scale
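The scaling-efficiency axis is presumably the usual weak-scaling definition (my assumption; the slide does not define it):

    E(N) = \frac{T(1\ \text{node})}{T(N\ \text{nodes})}

so with fixed work per node, ideal weak scaling gives E(N) = 1, i.e. the flat horizontal line referred to above.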
[Figures: (left) Cori and Stampede CG performance, weak scaling, L=16 per node — time in seconds vs. number of nodes (1, 8, 16, 32, 64) for the 64/1, 8/8, and 2/32 ranks-per-node / threads-per-rank combinations; (right) CG weak-scaling efficiency vs. number of nodes (16, 32, 64) for the same combinations]