Posted on 28-Jan-2021
IMPROVING THE SCALABILITY OF CFD CODES
Francesco Gava, Ghislain Lartigue, Vincent Moureau — CNRS-CORIA
ICARUS* PROJECT — 05/06/2019

Context
- Objective: development of high-fidelity calculation tools for the design of hot engine parts (aerospace and automotive)
- Task: optimisation of code performance on HPC machines
- Motivation: next-generation (2020) machines will be massively parallel, and CFD codes are not ready to take full advantage of such supercomputers
- Funding: FUI (Fonds Unique Interministériel)

* Intensive Calculation for AeRo and automotive engines Unsteady Simulations
Presentation Plan

- Context
- CFD codes: general concepts; parallelism
- Review of parallelism paradigms
- Design choices for a hybrid code: motivation; MPI + OpenMP Fine Grain; MPI + OpenMP Coarse Grain
- Perspectives & Conclusions
Performance of the Top500 (source: top500.org)

The Top500 is a ranking of the 500 most powerful supercomputers in the world. There has been a change in the trend: performance now increases much more slowly. Physical limits of materials and of energy consumption are capping processor frequencies, and hence performance.
Multicore architectures
(Prepared by C. Batten, School of Electrical and Computer Engineering, Cornell University, 2005, retrieved Dec 12 2012, http://www.cls.cornell.edu/courses/ece5950/handouts/ece5950-overview.pdf)

Sequential performance is limited, but with more cores parallel performance can still increase.
- Almost all supercomputers use multicore processors.
- The number of cores per socket is ever increasing and ever more varied.

[Chart: Top500 system share by cores per socket, 11/2000 to 11/2018]
The memory hierarchy

- Mono-core: CPU → L1 cache → L2 cache → L3 cache → RAM.
- Multi-core: each CPU has its own L1 and L2 cache; the L3 cache is shared among all CPUs in front of the RAM.

Typical sizes and latencies:
- L1 cache: fastest — ~32 KB, 1 cycle
- L2 cache: faster — ~256 KB, 3 cycles
- L3 cache: fast — a few MB, ~10 cycles
- RAM: slow — many GB, 100+ cycles
The roofline model

Code performance can be limited by:
- processor speed (compute bound)
- memory access speed (memory bound)

[Chart: attainable performance (Gflops) vs arithmetic intensity (flops/byte) — a memory-bound slope on the left, a compute-bound plateau on the right]

In CFD solvers:
- fast computation
- high number of memory accesses
- large data sizes

The aim is to move towards the compute-bound region:
- exploit the memory hierarchy
- work on smaller data
- compute as much as possible on the same data
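The limit the model describes can be stated compactly; a minimal formulation (here \(P_{\mathrm{peak}}\) denotes the machine's peak compute rate and \(B\) its memory bandwidth):

```latex
P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; I \cdot B\bigr),
\qquad
I = \frac{\text{flops performed}}{\text{bytes moved}}
```

Increasing the arithmetic intensity \(I\) — reusing data already in cache — moves a code to the right on the roofline plot, towards the compute-bound plateau.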
Computational Fluid Dynamics

[Illustration: PRECCINSTA burner computed with YALES2]

Generally, a CFD code:
- solves the Navier-Stokes (and other) equations;
- relies on linear operators: fast computations (additions, ...) but many memory reads/writes, so it needs to exploit the memory hierarchy;
- uses a discretized domain: the finer the discretization, the higher the precision, but large meshes may not fit into RAM and take longer to compute — hence the use of parallel solvers.

The incompressible momentum equation:

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u}$$
From incompressible momentum to Poisson's equation

Solve the incompressible momentum equation for u with a prediction-correction method [1]:

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u}$$

Imposing the continuity equation

$$\nabla \cdot \mathbf{u}^{n+1} = 0$$

leads to Poisson's equation for the pressure,

$$\nabla^2 p^{n+1} = \mathrm{rhs},$$

which can be rewritten as a linear system

$$L p = b.$$

One must solve for p to obtain u.

[1] Chorin, A. J. (1967), "The numerical solution of the Navier-Stokes equations for an incompressible fluid", Bull. Am. Math. Soc., 73:928-931.
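The step from continuity to the Poisson equation can be made explicit; a standard sketch of the projection (the intermediate velocity \(\mathbf{u}^*\) and time step \(\Delta t\) are notation introduced here, not taken from the slides):

```latex
\mathbf{u}^* = \mathbf{u}^n + \Delta t\left[-(\mathbf{u}^n \cdot \nabla)\mathbf{u}^n + \nu \nabla^2 \mathbf{u}^n\right]
\quad \text{(prediction)}

\mathbf{u}^{n+1} = \mathbf{u}^* - \frac{\Delta t}{\rho}\,\nabla p^{n+1}
\quad \text{(correction)}

\nabla \cdot \mathbf{u}^{n+1} = 0
\;\Longrightarrow\;
\nabla^2 p^{n+1} = \frac{\rho}{\Delta t}\,\nabla \cdot \mathbf{u}^*
```

The last line supplies the \(\mathrm{rhs}\) of the Poisson equation; once discretized, it becomes the linear system \(Lp = b\).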
Poisson's equation and the Conjugate Gradient method

The linear system $Lp = b$ has to be solved for p. This can be done with an iterative method: let $r_k$ be the residual at iteration k, and iterate along the direction $d_k$, conjugate to $r_k$, until convergence.

The Conjugate Gradient method:

$$
\begin{aligned}
& r_0 = b - L p_0, \quad d_0 = r_0, \quad k = 0 \\
& \varepsilon = \text{convergence criterion}, \quad \mathrm{err} = \|r_0\|_\infty \\
& \textbf{while } \mathrm{err} > \varepsilon \\
& \quad \alpha_k = \frac{r_k^T r_k}{d_k^T L d_k} \\
& \quad p_{k+1} = p_k + \alpha_k d_k \\
& \quad r_{k+1} = r_k - \alpha_k L d_k, \quad \mathrm{err} = \|r_{k+1}\|_\infty \\
& \quad \beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k} \\
& \quad d_{k+1} = r_{k+1} + \beta_k d_k \\
& \quad k = k + 1 \\
& \textbf{end while} \\
& \textbf{return } p_k \text{ as the result}
\end{aligned}
$$

This method is trivial without parallelism.
Parallel computation and domain decomposition

Large problems cannot be computed by a single process, so domain decomposition is used to divide the problem among many processes:
- more memory available
- more computational power
- but communication is needed: data on boundary nodes have to be exchanged between processes.

Computation on a domain node i:

$$\nabla \phi_i = \sum_{j \in N_i} f(\phi_i, \phi_j, M_{ij})$$

- Inside the domain: node i needs the contribution of all its neighbour nodes, and all surrounding nodes belong to the domain — no problem.
- On a domain boundary: node i still needs the contribution of all its neighbour nodes, but some of them do not belong to the domain — the process must communicate with its neighbours.
Parallel Conjugate Gradient method

The same Conjugate Gradient algorithm, run in parallel, requires communication:
- 4 COLLECTIVE communications per iteration: one for each scalar product (3) and one for the norm;
- a POINT-TO-POINT communication to compute $L d_k$.
YALES2 structure

Domain decomposition divides the problem among many processes, but each decomposed subdomain is still too big to fit into the L3 cache. YALES2 therefore uses a Double Domain Decomposition: each subdomain is split into small groups of elements (EL_GRPs) which fit into L3, possibly L2.

[Diagram: the grid of proc #1 split into el_grps, with internal communicators between el_grps and external communicators towards procs #2 and #3]

- Data on nodes shared between processes have to be exchanged; YALES2 has a dedicated data structure for this: the external communicator (EC).
- Data on nodes shared between EL_GRPs on the same process also have to be exchanged; YALES2 has a dedicated data structure for this: the internal communicator (IC).

In YALES2, boundary nodes are duplicated: a partial value is computed on each side, and the total value is computed on the internal communicator.
YALES2 Internal Communicator

The internal communicator is an array used to compute the contribution of each GROUP on a shared node:
- nodes on boundaries between groups are duplicated;
- each el_grp computes its own contribution;
- contributions are added on the internal communicator;
- the total value is possibly copied back onto the el_grp nodes.
YALES2 External Communicator

The external communicator is an array used to exchange the contribution of each PROCESS on a shared node, via SEND and RECV ECs towards each partner process:
- nodes on boundaries between processes are duplicated;
- each el_grp computes its own contribution;
- contributions are added on the internal communicator;
- the total value is copied onto the external send communicator and sent to the partner process;
- the value received on the external receive communicator is added to the internal communicator;
- the final value is possibly copied back onto the el_grp nodes.
Data exchange

There are mainly 3 (and a half) ways to achieve parallelism:
- MPI (Message Passing): old but (almost) gold
- OpenMP (Shared Memory): easy but (very) limited
- PGAS (Partitioned Global Address Space): promising but (too) new — not treated here
- MPI + OpenMP (Hybrid)
MPI: Message Passing Interface

[Diagram: distributed-memory nodes (CPU + L1/L2/L3 + RAM) connected through a network]

- Relies on a message-exchange paradigm (often through the network)
- The most common paradigm: very well tested, a lot of support
- Fairly easy to implement; can be used on any platform
- Works best on distributed-memory systems; does not take advantage of shared memory
- Does not scale to high numbers of processes: collective communications are a bottleneck, so it cannot fully exploit huge supercomputers

Currently YALES2 uses MPI.
OpenMP: Shared-Memory parallelism

[Diagram: shared-memory node — CPUs with private L1/L2 caches and a shared L3 cache in front of RAM]

- Relies on memory shared among cores
- Very common, well tested, good support
- Extremely easy to implement (fine grain)
- Can be used ONLY on shared-memory architectures; cannot go beyond the cores of a NUMA domain, so it must be used together with another paradigm (MPI, ...)
- The overhead of pragmas is not negligible
- Hard to achieve full parallelisation (Amdahl's law)
Hybrid MPI + OpenMP: motivations

- MPI codes do not scale indefinitely; threading reduces the number of MPI ranks and hence improves scalability.
- MPI alone cannot take full advantage of multicore architectures; OpenMP can exploit the shared memory.

https://nvidia-gpugenius.highspot.com/viewer/5bf5139e659e9366ed606a3e?iid=5bf5134ac714335696ba3410
Performance measurement setup

All the following measurements were obtained on the Myria supercomputer at CRIANN:
- Processor: bi-socket Broadwell (2 x 14 cores @ 2.4 GHz, 128 GB DDR4 RAM @ 2.4 GHz)
- Network: low-latency, high-bandwidth Intel Omni-Path (100 Gbit/s)
- MPI library: Intel MPI 2017.1.132 (others give similar results)
- Test case: incompressible, non-reactive PRECCINSTA burner (computed with YALES2)
MPI scalability (real-case scenario)

[Chart: normalized wall-clock time (WCT, lower is better) vs number of cores (10 to 10000); reference: 14M elements on 28 cores; curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI]

- Strong scalability: constant global work => linearly decreasing WCT.
- Weak scalability: constant work per process => constant WCT.
- The deviation from ideal is mainly due to communications.
MPI scalability limits: collective communications

[Chart: MPI_ALLREDUCE time (us) vs number of cores (0 to 3000); curves: MPI_28PPN, MPI_1PPN, MPI_2PPN, MPI_4PPN; PPN = Processes Per Node]

Reducing the number of communicating processes per node (PPN), whilst keeping the same number of cores for computation, reduces the cost of collective communications.
MPI + OpenMP Fine Grain

[Diagram: fork-join model — the master thread forks threads 2-4 around each parallel region, then joins]

- Objective: a larger domain per MPI rank => fewer MPI ranks => less communication; the work is divided among threads.
- Based on the fork-join model:
  - simple pragmas around loops;
  - work inside loops is shared by all threads;
  - work outside loops, and communication, is done by the master thread only.
MPI + OpenMP Fine Grain: domain decomposition

[Diagram: without OpenMP vs with OpenMP Fine Grain — threads #1-#4 share the EL_GRPs of one larger MPI domain]

- Larger MPI domains, fewer MPI ranks
- Threads share the work on EL_GRPs
- One must take care of data races
MPI + OpenMP Fine Grain (base version)

- Processes have a larger domain; threads share the work on groups of elements.
- Communication is done by the master thread only.
- Only loops with independent iterations are parallelised: no concurrency issues, but not much is parallelised.

[Charts: runtime breakdown and percentage runtime breakdown, OpenMP vs sequential vs ideal scaling — 80% of the code is parallelised, but with 7 threads only 40% of the runtime is executed in parallel]
MPI + OpenMP Fine Grain (base version): in-socket strong scaling

[Chart: in-socket speedup vs number of cores (0 to 16); reference: 1.7M elements, 1 core, MPI; curves: IDEAL, MPI, OMP_FG (base)]

Worse scalability than MPI.
MPI + OpenMP Fine Grain (base version): real-case scaling

[Chart: normalized WCT (lower is better) vs number of cores; reference: 14M elements, 28 cores; curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG (base)]

- Starts with considerable overhead
- Slightly better scalability
- Globally, no improvement with respect to MPI
MPI + OpenMP Fine Grain (improved version)

- Processes have a larger domain; threads share the work on groups of elements; communication is done by the master thread.
- Loops with concurrent iterations are also parallelised: almost everything is parallelised, at the cost of some overhead to avoid concurrency.

[Charts: percentage runtime breakdown and runtime breakdown of strong scaling, OpenMP vs sequential vs ideal scaling — 95% of the code is parallelised, and with 7 threads 80% of the runtime is executed in parallel]
MPI + OpenMP Fine Grain: in-socket strong scaling

[Chart: in-socket speedup vs number of cores; reference: 1.7M elements, 1 core, MPI; curves: IDEAL, MPI, OMP_FG (base), OMP_FG]

- Better scalability than the base version
- Still worse scalability than MPI
MPI + OpenMP Fine Grain: real-case scaling

[Chart: normalized WCT (lower is better) vs number of cores; reference: 14M elements, 28 cores; curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG]

- Starts with considerable overhead
- Better scalability
- Globally, no improvement with respect to MPI
OpenMP Fine Grain limits: minimum amount of work

[Chart: OpenMP scalability — time (us) vs loop iterations (1 to 100000), for 0 to 14 threads, with a linear fit y = 0.0027x on the 0-thread case; "sequential (no OpenMP pragmas)" vs "sequential (with OpenMP pragmas)"]

There is an overhead due to OpenMP: a minimum amount of work per OpenMP region is needed to see any gain.
OpenMP Fine Grain limits: fork-join overhead

[Chart: fork-join overhead — time (us) vs number of threads (0 to 16), for loop iteration counts from 1 to 1000]

The overhead:
- is independent of the amount of work;
- increases with the number of threads;
- imposes a minimum amount of work to be effective.
OpenMP Fine Grain limits: data races

[Diagram: updates of a shared node value on the internal communicator (IC)]

- Without OpenMP, the value is updated sequentially in the IC: no problem.
- With OpenMP, two threads may update the same IC entry at the same time: there is no guarantee of data coherency — a data race.
- Solution: an augmented IC. One additional non-concurrent copy is made in order to avoid data races on the IC; the IC is then updated in parallel without concurrency, but the non-concurrent copy is an additional cost.
MPI + OpenMP Fine Grain: recap and conclusions

[Diagram: fork-join model — the master thread forks threads 2-4 around each parallel region]

- Objective: a larger domain per MPI rank => fewer MPI ranks => less communication; work divided among threads.
- Based on the fork-join model: simple pragmas around loops; work inside loops is shared by all threads; work outside loops, and communication, is done by the master thread only.
- Conclusions:
  - The minimum amount of work needed per OpenMP region does not allow complete parallelisation of the code.
  - Threading scalability is limited by Amdahl's law.
  - Fewer MPI ranks allow better overall scalability anyway.
  - The overhead of OpenMP pragmas and of data-concurrency handling prevents better performance than MPI.
MPI + OpenMP Coarse Grain

[Diagram: one long-lived team of threads per MPI rank, instead of fork-join around each loop]

- Objective:
  - get rid of the fork-join overhead and of sequential computation;
  - substitute MPI ranks by threads: fewer MPI ranks, so collective communication is less expensive.
- The entire code is inside one OpenMP region:
  - the entire code must be thread-safe (extremely hard to code and debug);
  - threads do all the work and the communication.

This is FASTER for large numbers of processes.
MPI + OpenMP Coarse Grain: collective communications

[Chart: MPI_ALLREDUCE time (us) vs number of cores (0 to 3000); curves: MPI_28PPN, MPI_1PPN, MPI_2PPN, MPI_4PPN, MPI_2PPN+CG, MPI_4PPN+CG]

- The total cost is the MPI cost plus the OpenMP cost.
- The OpenMP cost does not increase with the number of cores: it is a constant function of the number of threads.
- The OpenMP cost can be reduced with better algorithms (work in progress).
- The net result is a gain over pure MPI at high core counts.
MPI + OpenMP Coarse Grain: domain decomposition

[Diagram: without OpenMP vs with OpenMP Coarse Grain — threads #1-#3 each own part of one MPI domain]

- Threads substitute MPI ranks: fewer MPI ranks.
- The entire work is done in parallel.
- Threads must communicate.
MPI + OpenMP Coarse Grain: in-socket strong scaling

[Chart: in-socket speedup vs number of cores; reference: 1.7M elements, 1 core, MPI; curves: IDEAL, MPI, OMP_FG, OMP_CG]

- Same scalability as MPI
- Better scalability than Fine Grain
MPI + OpenMP Coarse Grain: real-case scaling

[Chart: normalized WCT (lower is better) vs number of cores; reference: 14M elements, 28 cores; curves: IDEAL WEAK, IDEAL STRONG, 14M, 110M, 870M, MPI, OMP_FG, OMP_CG]

Catastrophic performance.
MPI + OpenMP Coarse Grain limits: point-to-point communications

[Chart: ALL2ALL via non-blocking P2P communication on one node — MPI vs MPI_2PPN+CG]

- The coarse-grain version is 40 times slower.
- MPI implementations allow multithreading but serialize the calls internally, so it is impossible to attain any performance this way.
MPI + OpenMP Coarse Grain: recap and conclusions

[Diagram: one long-lived team of threads per MPI rank]

- Objective: get rid of the fork-join overhead and of sequential computation; substitute MPI ranks by threads, so there are fewer MPI ranks and collective communication is less expensive.
- The entire code is inside one OpenMP region: the entire code must be thread-safe (extremely hard to code and debug); threads do all the work and the communication.
- Conclusions:
  - OpenMP Coarse Grain allows complete parallelisation of the code.
  - Same performance as MPI overall, with an improvement in collective communications.
  - MPI implementations prevent fully multithreaded concurrent MPI calls: point-to-point communications kill performance.
Perspectives: MPI + OpenMP

- Solve the point-to-point communication problem for Coarse Grain.
- Be smarter in the domain decomposition: minimize the number of neighbours on different ranks.
- Funnel all communication to one thread: more communications for the designated threads, idle time for the non-communicating threads, more synchronization points.
- The MPI 4 standard may introduce the endpoints concept: fully multithreaded MPI (hopefully), but one must wait for the libraries to implement it.
Perspectives: MPI + MPI-3 and GASPI (PGAS)

MPI + MPI-3:
- MPI-3 allows the creation of shared-memory windows inside a node.
- Same solution as OpenMP CG for collective communications: starts from the 1PPN curve.
- Performance must still be verified; synchronization may be expensive (?).
- No problem for P2P communications.

GASPI (PGAS):
- An alternative to MPI: uses RMA instead of messages and is fully multithreaded, so it should solve the P2P problem.
- Can be combined with MPI, which remains useful for complex collectives.
- Not supported on all machines.

[Chart: MPI_ALLREDUCE time (us) vs number of cores; curves: MPI_28PPN, MPI_1PPN, MPI_2PPN, MPI_4PPN, MPI_2PPN+CG, MPI_4PPN+CG]
Conclusions

- MPI has reached its scalability limits on modern architectures; hybrid codes could improve performance.
- It is not easy to write a performant hybrid MPI + OpenMP code: OpenMP Fine Grain suffers from the fork-join overhead and Amdahl's law, while OpenMP Coarse Grain is limited by the MPI implementations on P2P communications.
- Other hybrid solutions are worth exploring: MPI + MPI-3, (MPI +) GASPI + OpenMP, ...