Posted on 28-Jan-2021
IMPROVING THE SCALABILITY OF CFD CODES
Francesco Gava, Ghislain Lartigue, Vincent Moureau — CNRS-CORIA
ICARUS* PROJECT — 05/06/2019

Context
- Objective: development of high-fidelity calculation tools for the design of hot engine parts (aerospace and automotive)
- Task: optimisation of code performance on HPC machines
- Motivation: next-generation (2020) machines will be massively parallel, and CFD codes are not ready to take full advantage of such supercomputers
- Funding: FUI (Fonds Unique Interministériel)

* Intensive Calculation for AeRo and automotive engines Unsteady Simulations
Presentation Plan

- Context
- CFD codes: general concepts; parallelism
- Review of parallelism paradigms
- Design choices for a hybrid code: motivation; MPI + OpenMP Fine Grain; MPI + OpenMP Coarse Grain
- Perspectives & Conclusions
Performance of the Top500 (source: top500.org)

The Top500 is a ranking of the 500 most powerful supercomputers in the world. There has been a change in the trend: performance now increases much more slowly. Physical limits of materials and of energy consumption are capping processor frequencies, and hence performance.
Multicore architectures
(Prepared by C. Batten, School of Electrical and Computer Engineering, Cornell University, 2005, retrieved Dec 12 2012, http://www.cls.cornell.edu/courses/ece5950/handouts/ece5950-overview.pdf)

Sequential performance is limited, but with more cores parallel performance can still increase.
- Almost all supercomputers use multicore processors.
- The number of cores per socket is ever increasing and ever more varied.

[Chart: Top500 system share by cores per socket, 11/2000 to 11/2018]
The memory hierarchy

- Mono-core: CPU → L1 cache → L2 cache → L3 cache → RAM.
- Multi-core: each CPU has its own L1 and L2 cache; the L3 cache is shared among all CPUs in front of the RAM.

Typical sizes and latencies:
- L1 cache: fastest — ~32 KB, 1 cycle
- L2 cache: faster — ~256 KB, 3 cycles
- L3 cache: fast — a few MB, ~10 cycles
- RAM: slow — many GB, 100+ cycles
The roofline model

Code performance can be limited by:
- processor speed (compute bound)
- memory access speed (memory bound)

[Chart: attainable performance (Gflops) vs arithmetic intensity (flops/byte) — a memory-bound slope on the left, a compute-bound plateau on the right]

In CFD solvers:
- fast computation
- high number of memory accesses
- large data sizes

The aim is to move towards the compute-bound region:
- exploit the memory hierarchy
- work on smaller data
- compute as much as possible on the same data
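The limit the model describes can be stated compactly; a minimal formulation (here \(P_{\mathrm{peak}}\) denotes the machine's peak compute rate and \(B\) its memory bandwidth):

```latex
P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; I \cdot B\bigr),
\qquad
I = \frac{\text{flops performed}}{\text{bytes moved}}
```

Increasing the arithmetic intensity \(I\) — reusing data already in cache — moves a code to the right on the roofline plot, towards the compute-bound plateau.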
Computational Fluid Dynamics

[Illustration: PRECCINSTA burner computed with YALES2]

Generally, a CFD code:
- solves the Navier-Stokes (and other) equations;
- relies on linear operators: fast computations (additions, ...) but many memory reads/writes, so it needs to exploit the memory hierarchy;
- uses a discretized domain: the finer the discretization, the higher the precision, but large meshes may not fit into RAM and take longer to compute — hence the use of parallel solvers.

The incompressible momentum equation:

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u}$$
From incompressible momentum to Poisson's equation

Solve the incompressible momentum equation for u with a prediction-correction method [1]:

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u}$$

Imposing the continuity equation

$$\nabla \cdot \mathbf{u}^{n+1} = 0$$

leads to Poisson's equation for the pressure,

$$\nabla^2 p^{n+1} = \mathrm{rhs},$$

which can be rewritten as a linear system

$$L p = b.$$

One must solve for p to obtain u.

[1] Chorin, A. J. (1967), "The numerical solution of the Navier-Stokes equations for an incompressible fluid", Bull. Am. Math. Soc., 73:928-931.
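The step from continuity to the Poisson equation can be made explicit; a standard sketch of the projection (the intermediate velocity \(\mathbf{u}^*\) and time step \(\Delta t\) are notation introduced here, not taken from the slides):

```latex
\mathbf{u}^* = \mathbf{u}^n + \Delta t\left[-(\mathbf{u}^n \cdot \nabla)\mathbf{u}^n + \nu \nabla^2 \mathbf{u}^n\right]
\quad \text{(prediction)}

\mathbf{u}^{n+1} = \mathbf{u}^* - \frac{\Delta t}{\rho}\,\nabla p^{n+1}
\quad \text{(correction)}

\nabla \cdot \mathbf{u}^{n+1} = 0
\;\Longrightarrow\;
\nabla^2 p^{n+1} = \frac{\rho}{\Delta t}\,\nabla \cdot \mathbf{u}^*
```

The last line supplies the \(\mathrm{rhs}\) of the Poisson equation; once discretized, it becomes the linear system \(Lp = b\).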
Poisson's equation and the Conjugate Gradient method

The linear system $Lp = b$ has to be solved for p. This can be done with an iterative method: let $r_k$ be the residual at iteration k, and iterate along the direction $d_k$, conjugate to $r_k$, until convergence.

The Conjugate Gradient method:

$$
\begin{aligned}
& r_0 = b - L p_0, \quad d_0 = r_0, \quad k = 0 \\
& \varepsilon = \text{convergence criterion}, \quad \mathrm{err} = \|r_0\|_\infty \\
& \textbf{while } \mathrm{err} > \varepsilon \\
& \quad \alpha_k = \frac{r_k^T r_k}{d_k^T L d_k} \\
& \quad p_{k+1} = p_k + \alpha_k d_k \\
& \quad r_{k+1} = r_k - \alpha_k L d_k, \quad \mathrm{err} = \|r_{k+1}\|_\infty \\
& \quad \beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k} \\
& \quad d_{k+1} = r_{k+1} + \beta_k d_k \\
& \quad k = k + 1 \\
& \textbf{end while} \\
& \textbf{return } p_k \text{ as the result}
\end{aligned}
$$

This method is trivial without parallelism.
Parallel computation and domain decomposition

Large problems cannot be computed by a single process, so domain decomposition is used to divide the problem among many processes:
- more memory available
- more computational power
- but communication is needed: data on boundary nodes have to be exchanged between processes.

Computation on a domain node i:

$$\nabla \phi_i = \sum_{j \in N_i} f(\phi_i, \phi_j, M_{ij})$$

- Inside the domain: node i needs the contribution of all its neighbour nodes, and all surrounding nodes belong to the domain — no problem.
- On a domain boundary: node i still needs the contribution of all its neighbour nodes, but some of them do not belong to the domain — the process must communicate with its neighbours.
Parallel Conjugate Gradient method

The same Conjugate Gradient algorithm, run in parallel, requires communication:
- 4 COLLECTIVE communications per iteration: one for each scalar product (3) and one for the norm;
- a POINT-TO-POINT communication to compute $L d_k$.
YALES2 structure

Domain decomposition divides the problem among many processes, but each decomposed subdomain is still too big to fit into the L3 cache. YALES2 therefore uses a Double Domain Decomposition: each subdomain is split into small groups of elements (EL_GRPs) which fit into L3, possibly L2.

[Diagram: the grid of proc #1 split into el_grps, with internal communicators between el_grps and external communicators towards procs #2 and #3]

- Data on nodes shared between processes have to be exchanged; YALES2 has a dedicated data structure for this: the external communicator (EC).
- Data on nodes shared between EL_GRPs on the same process also have to be exchanged; YALES2 has a dedicated data structure for this: the internal communicator (IC).

In YALES2, boundary nodes are duplicated: a partial value is computed on each side, and the total value is computed on the internal communicator.
YALES2 Internal Communicator

The internal communicator is an array used to compute the contribution of each GROUP on a shared node:
- nodes on boundaries between groups are duplicated;
- each el_grp computes its own contribution;
- contributions are added on the internal communicator;
- the total value is possibly copied back onto the el_grp nodes.
YALES2 External Communicator

The external communicator is an array used to exchange the contribution of each PROCESS on a shared node, via SEND and RECV ECs towards each partner process:
- nodes on boundaries between processes are duplicated;
- each el_grp computes its own contribution;
- contributions are added on the internal communicator;
- the total value is copied onto the external send communicator and sent to the partner process;
- the value received on the external receive communicator is added to the internal communicator;
- the final value is possibly copied back onto the el_grp nodes.
Data exchange

There are mainly 3 (and a half) ways to achieve parallelism:
- MPI (Message Passing): old but (almost) gold
- OpenMP (Shared Memory): easy but (very) limited
- PGAS (Partitioned Global Address Space): promising but (too) new — not treated here
- MPI + OpenMP (Hybrid)
MPI: Message Passing Interface

[Diagram: distributed-memory nodes (CPU + L1/L2/L3 + RAM) connected through a network]

- Relies on a message-exchange paradigm (often through the network)
- The most common paradigm: very well tested, a lot of support
- Fairly easy to implement; can be used on any platform
- Works best on distributed-memory systems; does not take advantage of shared memory
- Does not scale to high numbers of processes: collective communications are a bottleneck, so it cannot fully exploit huge supercomputers

Currently YALES2 uses MPI.
OpenMP: Shared-Memory parallelism

[Diagram: shared-memory node — CPUs with private L1/L2 caches and a shared L3 cache in front of RAM]

- Relies on memory shared among cores
- Very common, well tested, good support
- Extremely easy to implement (fine grain)
- Can be used ONLY on shared-memory architectures; cannot go beyond the cores of a NUMA domain, so it must be used together with another paradigm (MPI, ...)
- The overhead of pragmas is not negligible
- Hard to achieve full parallelisation (Amdahl's law)
Hybrid MPI + OpenMP: motivations

- MPI codes do not scale indefinitely; threading reduces the number of MPI ranks and hence improves scalability.
- MPI alone cannot take full advantage of multicore architectures; OpenMP can exploit the shared memory.

https://nvidia-gpugenius.highspot.com/viewer/5bf5139e659e9366ed606a3e?iid=5bf5134ac714335696ba3410
Performance measurement setup

All the following measurements were obtained on the Myria supercomputer at CRIANN:
- Processor: bi-socket Broadwell (2 x 14 cores @ 2.4 GHz, 128 GB DDR4 RAM @ 2.4 GHz)
- Network: low-latency, high-bandwidth Intel Omni-Path (100 Gbit/s)
- MPI library: Intel MPI 2017.1.132 (others give similar results)
- Test case: incompressible, non-reactive PRECCINSTA burner (computed with YALES2)
MPI scalability (real-case scenario)

[Chart: normalized wall-clock time (WCT, lower is better) vs number of cores (10 to 10000); reference: 14M elements on 28 cores; curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI]

- Strong scalability: constant global work => linearly decreasing WCT.
- Weak scalability: constant work per process => constant WCT.
- The deviation from ideal is mainly due to communications.
MPI scalability limits: collective communications

[Chart: MPI_ALLREDUCE time (us) vs number of cores (0 to 3000); curves: MPI_28PPN, MPI_1PPN, MPI_2PPN, MPI_4PPN; PPN = Processes Per Node]

Reducing the number of communicating processes per node (PPN), whilst keeping the same number of cores for computation, reduces the cost of collective communications.
MPI + OpenMP Fine Grain

[Diagram: fork-join model — the master thread forks threads 2-4 around each parallel region, then joins]

- Objective: a larger domain per MPI rank => fewer MPI ranks => less communication; the work is divided among threads.
- Based on the fork-join model:
  - simple pragmas around loops;
  - work inside loops is shared by all threads;
  - work outside loops, and communication, is done by the master thread only.
MPI + OpenMP Fine Grain: domain decomposition

[Diagram: without OpenMP vs with OpenMP Fine Grain — threads #1-#4 share the EL_GRPs of one larger MPI domain]

- Larger MPI domains, fewer MPI ranks
- Threads share the work on EL_GRPs
- One must take care of data races
MPI + OpenMP Fine Grain (base version)

- Processes have a larger domain; threads share the work on groups of elements.
- Communication is done by the master thread only.
- Only loops with independent iterations are parallelised: no concurrency issues, but not much is parallelised.

[Charts: runtime breakdown and percentage runtime breakdown, OpenMP vs sequential vs ideal scaling — 80% of the code is parallelised, but with 7 threads only 40% of the runtime is executed in parallel]
MPI + OpenMP Fine Grain (base version): in-socket strong scaling

[Chart: in-socket speedup vs number of cores (0 to 16); reference: 1.7M elements, 1 core, MPI; curves: IDEAL, MPI, OMP_FG (base)]

Worse scalability than MPI.
MPI + OpenMP Fine Grain (base version): real-case scaling

[Chart: normalized WCT (lower is better) vs number of cores; reference: 14M elements, 28 cores; curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG (base)]

- Starts with considerable overhead
- Slightly better scalability
- Globally, no improvement with respect to MPI
MPI + OpenMP Fine Grain (improved version)

- Processes have a larger domain; threads share the work on groups of elements; communication is done by the master thread.
- Loops with concurrent iterations are also parallelised: almost everything is parallelised, at the cost of some overhead to avoid concurrency.

[Charts: percentage runtime breakdown and runtime breakdown of strong scaling, OpenMP vs sequential vs ideal scaling — 95% of the code is parallelised, and with 7 threads 80% of the runtime is executed in parallel]
MPI + OpenMP Fine Grain: in-socket strong scaling

[Chart: in-socket speedup vs number of cores; reference: 1.7M elements, 1 core, MPI; curves: IDEAL, MPI, OMP_FG (base), OMP_FG]

- Better scalability than the base version
- Still worse scalability than MPI
MPI + OpenMP Fine Grain: real-case scaling

[Chart: normalized WCT (lower is better) vs number of cores; reference: 14M elements, 28 cores; curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG]

- Starts with considerable overhead
- Better scalability
- Globally, no improvement with respect to MPI
OpenMP Fine Grain limits: minimum amount of work

[Chart: OpenMP scalability — time (us) vs loop iterations (1 to 100000), for 0 to 14 threads, with a linear fit y = 0.0027x on the 0-thread case; "sequential (no OpenMP pragmas)" vs "sequential (with OpenMP pragmas)"]

There is an overhead due to OpenMP: a minimum amount of work per OpenMP region is needed to see any gain.
OpenMP Fine Grain limits: fork-join overhead

[Chart: fork-join overhead — time (us) vs number of threads (0 to 16), for loop iteration counts from 1 to 1000]

The overhead:
- is independent of the amount of work;
- increases with the number of threads;
- imposes a minimum amount of work to be effective.
OpenMP Fine Grain limits: data races

[Diagram: updates of a shared node value on the internal communicator (IC)]

- Without OpenMP, the value is updated sequentially in the IC: no problem.
- With OpenMP, two threads may update the same IC entry at the same time: there is no guarantee of data coherency — a data race.
- Solution: an augmented IC. One additional non-concurrent copy is made in order to avoid data races on the IC; the IC is then updated in parallel without concurrency, but the non-concurrent copy is an additional cost.
MPI + OpenMP Fine Grain: recap and conclusions

[Diagram: fork-join model — the master thread forks threads 2-4 around each parallel region]

- Objective: a larger domain per MPI rank => fewer MPI ranks => less communication; work divided among threads.
- Based on the fork-join model: simple pragmas around loops; work inside loops is shared by all threads; work outside loops, and communication, is done by the master thread only.
- Conclusions:
  - The minimum amount of work needed per OpenMP region does not allow complete parallelisation of the code.
  - Threading scalability is limited by Amdahl's law.
  - Fewer MPI ranks allow better overall scalability anyway.
  - The overhead of OpenMP pragmas and of data-concurrency handling prevents better performance than MPI.
MPI + OpenMP Coarse Grain

[Diagram: one long-lived team of threads per MPI rank, instead of fork-join around each loop]

- Objective:
  - get rid of the fork-join overhead and of sequential computation;
  - substitute MPI ranks by threads: fewer MPI ranks, so collective communication is less expensive.
- The entire code is inside one OpenMP region:
  - the entire code must be thread-safe (extremely hard to code and debug);
  - threads do all the work and the communication.

This is FASTER for large numbers of processes.
MPI + OpenMP Coarse Grain: collective communications

[Chart: MPI_ALLREDUCE time (us) vs number of cores (0 to 3000); curves: MPI_28PPN, MPI_1PPN, MPI_2PPN, MPI_4PPN, MPI_2PPN+CG, MPI_4PPN+CG]

- The total cost is the MPI cost plus the OpenMP cost.
- The OpenMP cost does not increase with the number of cores: it is a constant function of the number of threads.
- The OpenMP cost can be reduced with better algorithms (work in progress).
- The net result is a gain over pure MPI at high core counts.
MPI + OpenMP Coarse Grain: domain decomposition

[Diagram: without OpenMP vs with OpenMP Coarse Grain — threads #1-#3 each own part of one MPI domain]

- Threads substitute MPI ranks: fewer MPI ranks.
- The entire work is done in parallel.
- Threads must communicate.
MPI + OpenMP Coarse Grain: in-socket strong scaling

[Chart: in-socket speedup vs number of cores; reference: 1.7M elements, 1 core, MPI; curves: IDEAL, MPI, OMP_FG, OMP_CG]

- Same scalability as MPI
- Better scalability than Fine Grain
MPI + OpenMP Coarse Grain: real-case scaling

[Chart: normalized WCT (lower is better) vs number of cores; reference: 14M elements, 28 cores; curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG, OMP_CG]

Catastrophic performance.
MPI + OpenMP Coarse Grain limits: point-to-point communications

[Chart: ALL2ALL via non-blocking P2P communication on one node — MPI vs MPI_2PPN+CG]

- The coarse-grain version is 40 times slower.
- MPI implementations allow multithreading but serialize the calls internally, so it is impossible to attain any performance this way.
MPI + OpenMP Coarse Grain: recap and conclusions

[Diagram: one long-lived team of threads per MPI rank]

- Objective: get rid of the fork-join overhead and of sequential computation; substitute MPI ranks by threads, so there are fewer MPI ranks and collective communication is less expensive.
- The entire code is inside one OpenMP region: the entire code must be thread-safe (extremely hard to code and debug); threads do all the work and the communication.
- Conclusions:
  - OpenMP Coarse Grain allows complete parallelisation of the code.
  - Same performance as MPI overall, with an improvement in collective communications.
  - MPI implementations prevent fully multithreaded concurrent MPI calls: point-to-point communications kill performance.
Perspectives: MPI + OpenMP

- Solve the point-to-point communication problem for Coarse Grain.
- Be smarter in the domain decomposition: minimize the number of neighbours on different ranks.
- Funnel all communication to one thread: more communications for the designated threads, idle time for the non-communicating threads, more synchronization points.
- The MPI 4 standard may introduce the endpoints concept: fully multithreaded MPI (hopefully), but one must wait for the libraries to implement it.
Perspectives: MPI + MPI-3 and GASPI (PGAS)

MPI + MPI-3:
- MPI-3 allows the creation of shared-memory windows inside a node.
- Same solution as OpenMP CG for collective communications: starts from the 1PPN curve.
- Performance must still be verified; synchronization may be expensive (?).
- No problem for P2P communications.

GASPI (PGAS):
- An alternative to MPI: uses RMA instead of messages and is fully multithreaded, so it should solve the P2P problem.
- Can be combined with MPI, which remains useful for complex collectives.
- Not supported on all machines.

[Chart: MPI_ALLREDUCE time (us) vs number of cores; curves: MPI_28PPN, MPI_1PPN, MPI_2PPN, MPI_4PPN, MPI_2PPN+CG, MPI_4PPN+CG]
Conclusions

- MPI has reached its scalability limits on modern architectures; hybrid codes could improve performance.
- It is not easy to write a performant hybrid MPI + OpenMP code: OpenMP Fine Grain suffers from the fork-join overhead and Amdahl's law, while OpenMP Coarse Grain is limited by the MPI implementations on P2P communications.
- Other hybrid solutions are worth exploring: MPI + MPI-3, (MPI +) GASPI + OpenMP, ...