
Designing High-Performance and Scalable Collectives for the Many-core Era: The MVAPICH2 Approach

S. Chakraborty, M. Bayatpour, J. Hashmi, H. Subramoni and D. K. Panda
The Ohio State University

E-mail: {chakraborty.52, bayatpour.1, hashmi.29, subramoni.1, panda.2}@osu.edu

IXPUG'18 Presentation

Presenter: Jahanzeb Hashmi


Parallel Programming Models Overview

[Figure: three abstract machine models, each with processes P1, P2, P3 — the Shared Memory Model (one shared memory; e.g., SHMEM, DSM), the Distributed Memory Model (a private memory per process; e.g., MPI, the Message Passing Interface), and the Partitioned Global Address Space (PGAS) model (private memories plus a logical shared memory; e.g., Global Arrays, UPC, Chapel, X10, CAF, ...).]

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• Programming models offer various communication primitives:
  – Point-to-point (between a pair of processes/threads)
  – Remote Memory Access (directly access the memory of another process)
  – Collectives (group communication)


Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: middleware co-design stack — Application Kernels/Applications; Programming Models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.); Communication Library or Runtime for Programming Models (point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance); Networking Technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path); Multi-/Many-core Architectures; Accelerators (GPU and FPGA). Middleware co-design exposes opportunities and challenges across these layers, targeting performance, scalability, and resilience.]


Why Collective Communication Matters?

[Figure: percentage of communication time spent in collective operations (%) for MPI-FFT, VASP, AMG, COSMO, Graph500, MiniFE, MILC, DL-POLY, and HOOMD-blue, ranging from roughly 23% to 99%; the dominant collectives per application include Alltoall, Bcast, and Allreduce. Source: http://www.hpcadvisorycouncil.com]

• HPC Advisory Council (HPCAC) MPI application profiles
• Most application profiles show that the majority of communication time is spent in collective operations
• Optimizing collective communication directly impacts scientific applications, leading to accelerated scientific discovery


Are Collective Designs in MPI Ready for the Many-core Era?

Why do different algorithms of even a dense collective such as Alltoall fail to achieve the theoretical peak bandwidth offered by the system?

[Figure: effective bandwidth (MB/s) of Alltoall algorithms (Bruck's, Recursive-Doubling, Scatter-Destination, Pairwise-Exchange) versus message size (1 B to 4 MB) on a single KNL 7250 in cache mode with 64 MPI processes, using MVAPICH2-2.3rc1; the best algorithm reaches only ~35% of the theoretical peak.]


Broad Challenges due to Architectural Advances

• Exploiting the high concurrency and high bandwidth offered by modern architectures
• Designing "zero-copy" and "contention-free" collective communication
• Efficient hardware offloading for better overlap of communication and computation

How does MVAPICH2, as an MPI library, tackle these challenges and provide optimal collective designs for emerging multi-/many-core systems?


Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,875 organizations in 86 countries
  – More than 464,000 (>0.46 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '17 ranking):
    • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
    • 9th, 556,104-core (Oakforest-PACS) in Japan
    • 12th, 368,928-core (Stampede2) at TACC
    • 17th, 241,108-core (Pleiades) at NASA
    • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering TOP500 systems for over a decade


Architecture of MVAPICH2 Software Family

[Figure: high-performance parallel programming models — Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI+X (MPI + PGAS + OpenMP/Cilk) — sit on top of a high-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis. The runtime supports modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) with transport protocols (RC, XRC, UD, DC), modern features (UMR, ODP, SR-IOV, multi-rail), shared-memory transport mechanisms (CMA, IVSHMEM, XPMEM*), and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU) with features such as MCDRAM*, NVLink*, and CAPI*. (* upcoming)]


MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC clouds
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC

Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications


Agenda

• Exploiting the high concurrency and high bandwidth offered by modern architectures for MPI collectives design
  – Point-to-point
  – Direct shared-memory
  – Data Partitioned Multi-Leader (DPML)
• Designing "zero-copy" and "contention-free" collective communication
  – Contention-aware designs
  – True zero-copy collectives
• Hardware offloading for better communication and computation overlap
  – SHARP-based offloaded collectives
  – CORE-Direct-based non-blocking collectives


Collective Designs based on Point-to-point Primitives

• Commonly used approach in implementing collectives
• Easy to express algorithms in message-passing semantics
• A naïve Broadcast could be a series of "send" operations from the root to all the non-root processes
• Relies on the implementation of point-to-point primitives
• Limited by the overheads exposed by these primitives
  – Tag matching
  – Rendezvous hand-shake


A Naïve Example of MPI_Bcast Using MPI_Send/MPI_Recv

/* Naïve MPI_Bcast built from point-to-point calls.
 * root = 0; the message size is larger than the eager threshold,
 * so every transfer goes through the rendezvous protocol. */
if (rank == 0) {
    for (int i = 1; i < n; i++) {       /* n = number of processes */
        MPI_Send(buf, count, MPI_BYTE, i, tag, MPI_COMM_WORLD);
    }
} else {
    MPI_Recv(buf, count, MPI_BYTE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

[Figure: timeline of the naïve broadcast — the root issues MPI_Send to each non-root process in turn while the receivers block in MPI_Recv; application skew, tag matching, and library overhead delay each transfer, and each receiver wastes CPU cycles waiting for the sender's rendezvous request.]

• Overheads of a handshake for each rendezvous message transfer
• Is there a better way?


Direct Shared Memory based Collectives

• A large shared-memory region
  – Collective algorithms are realized by shared-memory copies and synchronizations
  – Good performance for small messages by exploiting cache locality
  – Avoids the overheads associated with MPI point-to-point implementations
• Requires one additional copy for each transfer
• Performance degradation for large-message communication
  – memcpy() is the dominant cost for large messages
• Most MPI libraries use some variant of direct SHMEM collectives
• Reduction collectives perform even worse with the SHMEM-based design because of compute + memcpy (a minimal copy-in/copy-out sketch follows the figure below)

[Figure: personalized all-to-all latency versus message size (1 KB to 4 MB) on KNL with 64 processes, comparing the direct SHMEM and point-to-point based designs.]
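As a minimal sketch of the copy-in/copy-out pattern behind such direct SHMEM designs (not the MVAPICH2 implementation, which uses its own internal shared region and synchronization), the fragment below emulates an intra-node broadcast over an MPI-3 shared-memory window; shmem_bcast_sketch, user_buf, and bytes are illustrative names.

#include <mpi.h>
#include <string.h>

/* Sketch only: copy-in/copy-out broadcast of 'bytes' bytes from node-local
 * rank 0 through a shared segment owned by that rank. */
void shmem_bcast_sketch(void *user_buf, size_t bytes)
{
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    MPI_Win win;
    char *seg;
    MPI_Win_allocate_shared(node_rank == 0 ? (MPI_Aint)bytes : 0, 1,
                            MPI_INFO_NULL, node_comm, &seg, &win);
    MPI_Aint seg_size; int disp_unit;
    MPI_Win_shared_query(win, 0, &seg_size, &disp_unit, &seg);  /* rank 0's segment */

    MPI_Win_fence(0, win);                              /* open access epoch    */
    if (node_rank == 0) memcpy(seg, user_buf, bytes);   /* copy-in by the root  */
    MPI_Win_fence(0, win);                              /* data visible to all  */
    if (node_rank != 0) memcpy(user_buf, seg, bytes);   /* copy-out by the rest */
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
}

The two memcpy() calls are exactly the two copies noted above, which is why this approach wins for small, cache-resident messages but loses to single-copy schemes once memcpy() dominates.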


Data Partitioning based Multi-Leader (DPML) Designs

• Hierarchical algorithms delegate a lot of computation to the "node leader"
  – The leader process is responsible for inter-node reductions while intra-node non-root processes wait for the leader
• Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in modern multi-/many-core processors
• DPML – a new solution for MPI_Allreduce
• Takes advantage of the parallelism offered by
  – Multi-/many-core architectures
  – High throughput and high-end features offered by InfiniBand and Omni-Path
• Multiple partitions of the reduction vectors for an arbitrary number of leaders (a simplified sketch appears after the reference below)

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.
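The core DPML idea can be approximated with plain MPI calls as below; this is a hypothetical sketch, not the MVAPICH2 code, which uses shared memory and tuned partition sizes rather than nested collectives. The names dpml_allreduce_sketch, nleaders, sendbuf, and recvbuf are placeholders, count is assumed divisible by nleaders, and nleaders is assumed not to exceed the number of ranks per node.

#include <mpi.h>

/* Hypothetical data-partitioned multi-leader Allreduce on doubles. */
void dpml_allreduce_sketch(const double *sendbuf, double *recvbuf,
                           int count, int nleaders)
{
    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);
    int is_leader = (node_rank < nleaders);
    /* leader i of every node joins inter-node communicator i */
    MPI_Comm_split(MPI_COMM_WORLD, is_leader ? node_rank : MPI_UNDEFINED,
                   0, &leader_comm);

    int chunk = count / nleaders;
    for (int l = 0; l < nleaders; l++)   /* reduce partition l onto leader l (intra-node) */
        MPI_Reduce(sendbuf + l * chunk, recvbuf + l * chunk, chunk,
                   MPI_DOUBLE, MPI_SUM, l, node_comm);
    if (is_leader)                       /* each leader reduces its partition across nodes */
        MPI_Allreduce(MPI_IN_PLACE, recvbuf + node_rank * chunk, chunk,
                      MPI_DOUBLE, MPI_SUM, leader_comm);
    for (int l = 0; l < nleaders; l++)   /* redistribute every partition intra-node */
        MPI_Bcast(recvbuf + l * chunk, chunk, MPI_DOUBLE, l, node_comm);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}

All nleaders partitions are reduced concurrently instead of funneling the whole vector through a single node leader, which is where the extra parallelism comes from.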


Performance of DPML MPI_Allreduce on Different Networks and Architectures

[Figure: MPI_Allreduce latency (us, lower is better) versus message size (8 KB to 256 KB) for MVAPICH2-2.2GA, DPML, and IMPI on XEON + IB (64 nodes, 28 PPN) and KNL + Omni-Path (64 nodes, 64 PPN).]

• 2X improvement over MVAPICH2 at 256 KB on XEON + IB
• Higher benefits of DPML as the message size increases
• Benefits of DPML sustained on KNL + Omni-Path even at 4096 processes
• With 32 KB messages, 4X improvement over MVAPICH2


Scalability of DPML Allreduce on Stampede2-KNL (10,240 Processes)

[Figure: OSU micro-benchmark MPI_Allreduce latency (us, lower is better) versus message size (4 B to 4 KB and 8 KB to 256 KB) at 64 PPN, comparing MVAPICH2, DPML, and IMPI.]

• For MPI_Allreduce latency with 32 KB messages, the DPML design reduces the latency by 2.4X
• Available in MVAPICH2-X 2.3b


Performance Benefits of DPML Allreduce on the MiniAMR Kernel

[Figure: MiniAMR mesh refinement time (s, lower is better) versus number of processes for MVAPICH2, DPML, and IMPI on KNL + Omni-Path (32 PPN) and XEON + Omni-Path (28 PPN).]

• For the MiniAMR application with 4096 processes, DPML reduces the latency by 2.4X on the KNL + Omni-Path cluster
• On XEON + Omni-Path with 1792 processes, DPML reduces the latency by 1.5X


Agenda

• Exploiting the high concurrency and high bandwidth offered by modern architectures for MPI collectives design
  – Point-to-point
  – Direct shared-memory
  – Data Partitioned Multi-Leader (DPML)
• Designing "zero-copy" and "contention-free" collective communication
  – Contention-aware designs
  – True zero-copy collectives
• Hardware offloading for better communication and computation overlap
  – SHARP-based offloaded collectives
  – CORE-Direct-based non-blocking collectives


How Does Kernel-assisted "Zero-copy" Work?

[Figure: two intra-node transfer paths between an MPI sender and an MPI receiver — (a) shared memory (SHMEM): the send buffer is copied into a shared-memory region and then into the receive buffer (two copies, no system-call overhead, better for small messages); (b) kernel-assisted copy: the kernel copies directly from the send buffer to the receive buffer (a single copy, but with system-call overhead, better for large messages).]
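The single-copy path in the figure maps to the Cross Memory Attach (CMA) system calls on Linux. The sketch below is a minimal illustration, not MVAPICH2 code: a receiver pulls len bytes directly out of a sender's buffer, assuming the sender's pid and remote buffer address have already been exchanged, e.g., through a small shared-memory control region; cma_read is an illustrative name.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>          /* process_vm_readv(), Linux 3.2+ (CMA) */

/* Receiver side: one kernel-mediated copy straight out of the sender's
 * address space, with no intermediate shared-memory staging. */
static ssize_t cma_read(pid_t sender_pid, void *remote_addr,
                        void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    /* One system call per transfer is the price of the single copy. */
    return process_vm_readv(sender_pid, &local, 1, &remote, 1, 0);
}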


A Variety of Available "Zero"-Copy Mechanisms

                      LiMIC          KNEM           CMA                     XPMEM
Permission check      Not supported  Supported      Supported               Supported
Availability          Kernel module  Kernel module  Included in Linux 3.2+  Kernel module
memcpy() invoked at   Kernel-space   Kernel-space   Kernel-space            User-space
memcpy() granularity  Page size      Page size      Page size               Any size

MPI library support:
                  LiMIC  KNEM  CMA  XPMEM
MVAPICH2          √      x     √    √ (upcoming release)
Open MPI 2.1.0    x      √     √    √
Intel MPI 2017    x      x     √    x
Cray MPI          x      x     √    √

Cross Memory Attach (CMA) is the most widely supported kernel-assisted mechanism


Direct Kernel-assisted (CMA-based) Collectives

• Direct algorithm designs based on the kernel-assisted zero-copy mechanism
  – "Map" application buffer pages inside the kernel
  – Issue "Put" or "Get" operations directly on the application buffers (see the sketch below)
• Good performance for large messages
  – Avoids the unnecessary copy overheads of SHMEM
• Performance depends on the communication pattern of the collective primitive
• Does not offer "zero-copy" for reduction collectives

What about contention?

[Figure: CMA-based personalized all-to-all latency (us) versus message size (1 KB to 4 MB) on KNL with 64 processes — the direct kernel-assisted design is >50% better than the direct SHMEM and point-to-point based designs for large messages.]
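To illustrate how a collective can be layered directly on such Put/Get operations, the sketch below shows the data movement of a flat CMA-based scatter using process_vm_writev(); it is a hypothetical fragment, not the library's code. cma_scatter_root, child_pid[], and child_addr[] are placeholders for the children's pids and exported receive-buffer addresses, which a real implementation would exchange during communicator setup.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>          /* process_vm_writev(), Linux 3.2+ (CMA) */

/* Root writes chunk i of sendbuf straight into child i's receive buffer ("Put"). */
static void cma_scatter_root(char *sendbuf, size_t chunk, int nprocs,
                             const pid_t *child_pid, void *const *child_addr)
{
    for (int i = 1; i < nprocs; i++) {
        struct iovec local  = { .iov_base = sendbuf + (size_t)i * chunk,
                                .iov_len  = chunk };
        struct iovec remote = { .iov_base = child_addr[i], .iov_len = chunk };
        process_vm_writev(child_pid[i], &local, 1, &remote, 1, 0);
    }
    /* A real design still needs a flag or barrier so children know the data arrived. */
}

Note that every kernel copy here targets a distinct child process; how such one-to-all and all-to-all patterns behave under contention is exactly what the next slide examines.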


Impact of Collective Communication Pattern on CMA Collectives

[Figure: latency (us) versus message size (1 KB to 4 MB) for PPN counts from 2 to 64 under three traffic patterns — all-to-all (each reader targets a different process): good scalability, with no increase in latency as PPN grows; one-to-all targeting the same process and the same buffer: poor scalability, >100x worse; one-to-all targeting the same process but different buffers: also poor scalability, >100x worse.]

Contention is at the process level


Contention-aware CMA Collectives

[Figure: latency (us) versus message size on KNL with 64 processes for MPI_Scatter, MPI_Bcast, and MPI_AlltoAll, comparing MVAPICH2, Intel MPI, Open MPI, and the proposed contention-aware design (~5x better for Scatter); the proposed design uses SHMEM for small messages and CMA for large messages.]

• Up to 5x and 2x improvement for MPI_Scatter and MPI_Bcast on KNL
• For Bcast, improvements are obtained for large messages only (p-1 copies with CMA versus p copies with shared memory)
• AlltoAll large-message performance is bound by the system bandwidth (5%-20% improvement)
• Fallback to SHMEM for small messages

S. Chakraborty, H. Subramoni, and D. K. Panda, Contention-Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster '17, Best Paper Finalist


Multi-Node Scalability Using Two-Level Algorithms

[Figure: latency (us) versus message size on KNL clusters of 2 nodes (128 procs), 4 nodes (256 procs), and 8 nodes (512 procs), comparing MVAPICH2, Intel MPI, Open MPI, and the tuned CMA design (~2.5x to ~17x better).]

• Significantly faster intra-node communication
• New two-level collective designs can be composed (see the sketch below)
• 4x-17x improvement in 8-node Scatter and Gather compared to default MVAPICH2

Can we have zero-copy "Reduction" collectives with this approach? Do you see the problem here?
1. Contention "avoidance" – not removal
2. Reduction requires extra copies
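A two-level collective of this kind can be composed from sub-communicators; the fragment below is a hypothetical sketch of the structure for Gather, not the tuned implementation. two_level_gather_sketch, node_buf, and recvbuf are placeholder names, all nodes are assumed to run the same PPN, and the result at the root is laid out node by node, so a production design still has to map it back to communicator rank order.

#include <mpi.h>

/* Two-level MPI_Gather: intra-node gather onto the node leader (where the
 * contention-aware CMA design plugs in), then an inter-node gather among
 * the leaders onto the global root. node_buf must hold one 'count'-byte
 * block per local rank on the leaders. */
void two_level_gather_sketch(const void *sendbuf, void *node_buf,
                             void *recvbuf, int count)
{
    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    /* node leaders (local rank 0) form their own communicator */
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   0, &leader_comm);

    /* Level 1: intra-node gather to the leader. */
    MPI_Gather(sendbuf, count, MPI_BYTE, node_buf, count, MPI_BYTE, 0, node_comm);
    /* Level 2: leaders gather the per-node blocks onto the global root. */
    if (node_rank == 0)
        MPI_Gather(node_buf, count * node_size, MPI_BYTE,
                   recvbuf, count * node_size, MPI_BYTE, 0, leader_comm);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}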


Shared Address Space (XPMEM-based) Collectives

• Offload reduction computation and communication to peer MPI ranks
  – Every peer has direct "load/store" access to other peers' buffers (illustrated in the sketch below)
  – Multiple pseudo-roots independently carry out reductions for the intra- and inter-node phases
  – Directly put the reduced data into the root's receive buffer
• True "zero-copy" design for Allreduce and Reduce
  – No copies are required during the entire duration of the reduction operation
  – Scalable to multiple nodes
• Zero contention overheads, as memory copies happen in "user space"

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.
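A rough illustration of the load/store access that makes this possible is shown below, assuming the XPMEM user-space API (xpmem_make / xpmem_get / xpmem_attach) provided by the XPMEM kernel module; bufA, localB, and len are placeholders, the exported segment id is assumed to be shipped to the peer out of band, and error handling plus cleanup (xpmem_detach / xpmem_release / xpmem_remove) are omitted. The actual MVAPICH2 design layers its reduction scheduling on top of such attachments.

#include <xpmem.h>            /* XPMEM user-space API */

/* Peer A: expose 'len' bytes of its reduction buffer to other ranks. */
xpmem_segid_t segid = xpmem_make(bufA, len, XPMEM_PERMIT_MODE, (void *)0666);
/* ... ship 'segid' to peer B out of band (e.g., via a shared-memory region) ... */

/* Peer B: map A's buffer into its own address space and reduce with loads/stores. */
xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, (void *)0666);
struct xpmem_addr addr = { .apid = apid, .offset = 0 };
double *remoteA = (double *)xpmem_attach(addr, len, NULL);
for (size_t i = 0; i < len / sizeof(double); i++)
    localB[i] += remoteA[i];  /* pure load/store: no memcpy, no kernel copy */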


Shared Address Space (XPMEM)-based Collectives Design

[Figure: OSU_Allreduce and OSU_Reduce latency (us) versus message size (16 KB to 4 MB) on Broadwell with 256 processes, comparing MVAPICH2, Intel MPI, and MVAPICH2-XPMEM.]

• "Shared address space"-based true zero-copy reduction collective designs in MVAPICH2
• Offloaded computation/communication to peer ranks in the reduction collective operation
• Up to 4X improvement for 4 MB Reduce and up to 1.8X improvement for 4 MB Allreduce

Will be available in an upcoming MVAPICH2-X release


Application-Level Benefits of XPMEM-Based Collectives

[Figure: execution time (s) versus number of processes for CNTK AlexNet training (Broadwell, default batch size, 50 iterations, ppn = 28) and the MiniAMR kernel (Broadwell, ppn = 16), comparing Intel MPI, MVAPICH2, and MVAPICH2-XPMEM.]

• Up to 20% benefit over IMPI for CNTK DNN training using Allreduce
• Up to 27% benefit over IMPI and up to 15% improvement over MVAPICH2 for the MiniAMR application kernel


Agenda

• Exploiting the high concurrency and high bandwidth offered by modern architectures for MPI collectives design
  – Point-to-point
  – Direct shared-memory
  – Data Partitioned Multi-Leader (DPML)
• Designing "zero-copy" and "contention-free" collective communication
  – Contention-aware designs
  – True zero-copy collectives
• Hardware offloading for better communication and computation overlap
  – SHARP-based offloaded collectives
  – CORE-Direct-based non-blocking collectives


Concept of Non-blocking Collectives

• Application processes schedule the collective operation
• Check periodically if the operation is complete
• Overlap of computation and communication => better performance (usage pattern sketched below)
• Catch: who will progress the communication?

[Figure: each application process schedules the operation with a communication support entity, continues with its computation, and periodically checks whether the operation has completed.]
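The usage pattern, sketched below with MPI_Iallreduce, is the same for all non-blocking collectives: post the operation, interleave independent computation with occasional MPI_Test calls to give the library a chance to progress, and call MPI_Wait only when the result is actually needed. sendbuf, recvbuf, count, nsteps, and do_independent_work() are placeholders; a complete MPI_Ialltoall version appears in the backup slides.

MPI_Request req;
MPI_Iallreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM,
               MPI_COMM_WORLD, &req);          /* schedule the operation    */

int done = 0;
for (int step = 0; step < nsteps && !done; step++) {
    do_independent_work(step);                 /* computation that overlaps */
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* check + progress library  */
}
MPI_Wait(&req, MPI_STATUS_IGNORE);             /* recvbuf now valid         */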


Offloading with the Scalable Hierarchical Aggregation Protocol (SHArP)

• Management and execution of MPI operations in the network by using SHArP
• Manipulation of data while it is being transferred in the switch network
• SHArP provides an abstraction to realize the reduction operation
• Defines Aggregation Nodes (AN), an Aggregation Tree, and Aggregation Groups
• AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
• Uses RC for communication between ANs and between an AN and hosts in the Aggregation Tree*

[Figure: physical network topology and the corresponding logical SHArP tree.]

* Bloch et al., Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction


SHArP-based Blocking Allreduce Collective Designs in MVAPICH2

[Figure: execution time (s) of the MiniAMR mesh refinement phase and the average DDOT Allreduce time of HPCG at (4, 28), (8, 28), and (16, 28) (number of nodes, PPN), comparing MVAPICH2 and MVAPICH2-SHArP; up to 13% and 12% improvement, respectively.]

SHArP support is available since MVAPICH2 2.3a

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.


Performance of the NUMA-aware SHArP Design on a XEON + IB Cluster

[Figure: OSU micro-benchmark Allreduce latency (us) versus message size (4 B to 4 KB) and HPCG DDOT timing (s) versus number of processes (56 to 448) on 16 nodes with 28 PPN (lower is better), comparing MVAPICH2, the proposed NUMA-aware design, and MVAPICH2 + SHArP.]

• As the message size decreases, the benefits of the socket-based design increase
• The NUMA-aware design reduces latency by up to 23% for the DDOT phase of HPCG and by up to 40% for micro-benchmarks


SHArP-based Non-Blocking Allreduce in MVAPICH2: MPI_Iallreduce Benchmark

[Figure: MPI_Iallreduce latency (us, lower is better) and communication/computation overlap (%, higher is better) versus message size (4 to 128 bytes) on 8 nodes with 1 PPN (PPN: processes per node), comparing MVAPICH2 and MVAPICH2-SHArP; up to 2.3x benefit.]

• Complete offload of the Allreduce collective operation to the "switch"
  – Higher overlap of communication and computation

Available since MVAPICH2 2.3a


NIC-Offload-based Non-blocking Collectives Using CORE-Direct

• Mellanox CORE-Direct technology allows offloading the collective communication to the network adapter
• MVAPICH2 supports CORE-Direct-based offloading of non-blocking collectives
  – Covers all the non-blocking collectives
  – Enabled by configure and runtime parameters
• A CORE-Direct-based MPI_Ibcast design improves the performance of the High Performance Linpack (HPL) benchmark (see the sketch below)

Available since MVAPICH2-X 2.2a
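The HPL co-design boils down to a look-ahead pattern along the lines of the hedged sketch below; panel_buf, panel_bytes, panel_root, row_comm, and update_trailing_matrix() are placeholders, not the actual HPL source. The panel broadcast is posted as MPI_Ibcast so that, when offloaded to the adapter by CORE-Direct, the trailing-matrix update proceeds while the broadcast completes in hardware.

/* Look-ahead sketch: overlap the panel broadcast with independent compute. */
MPI_Request req;
MPI_Ibcast(panel_buf, panel_bytes, MPI_BYTE, panel_root, row_comm, &req);

update_trailing_matrix();                /* work that does not need the panel      */

MPI_Wait(&req, MPI_STATUS_IGNORE);       /* panel available for next factorization */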


Co-designing HPL with CORE-Direct and Performance Benefits

[Figure: normalized HPL performance versus problem size (N) as a percentage of total memory (10 to 70), and throughput (GFlops) with memory consumption (%) versus system size (64 to 512 processes), comparing HPL-Offload, HPL-1ring, and HPL-Host (higher is better).]

• HPL performance comparison with 512 processes: HPL-Offload consistently offers higher throughput than HPL-1ring and HPL-Host, improving peak throughput by up to 4.5% for large problem sizes
• HPL-Offload surpasses the peak throughput of HPL-1ring with significantly smaller problem sizes and run-times

K. Kandalla, H. Subramoni, J. Vienne, S. Pai Raikar, K. Tomko, S. Sur, and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HOTI 2011


Concluding Remarks

• Many-core nodes will be the foundation blocks of emerging Exascale systems
• Communication mechanisms and runtimes need to be re-designed to take advantage of the high concurrency offered by many-core architectures
• Presented a set of novel designs for collective communication primitives in MPI that address several of these challenges
• Demonstrated the performance benefits of the proposed designs on a variety of multi-/many-core architectures and high-speed networks
• Some of these designs are already available in the MVAPICH2 libraries
• The new designs will be available in upcoming MVAPICH2 releases


Thank You!
hashmi.29@osu.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/


Collective Communication in MVAPICH2

[Figure: blocking and non-blocking collective communication primitives in MPI, built from multi-/many-core-aware designs. Communication-only paths combine intra-node communication (point-to-point over SHMEM, LiMIC, CMA, XPMEM; direct shared memory; direct kernel-assisted via CMA, XPMEM, LiMIC) with inter-node communication (point-to-point, hardware multicast, SHARP, RDMA). Communication + computation paths are designed for performance and overlap.]


Breakdown of a CMA Read Operation

• CMA relies on the get_user_pages() kernel function
• It takes a page-table lock on the target process
• Lock contention increases with the number of concurrent readers
• Over 90% of the total time is spent in lock contention
• One-to-all communication on Broadwell, profiled using ftrace
• Lock contention is the root cause of the performance degradation
  – Present in other kernel-assisted schemes such as KNEM and LiMIC as well

[Figure: latency (us) of CMA reads of 10, 50, and 100 pages with no contention, 14 concurrent readers, and 28 concurrent readers, broken down into system call, permission check, acquire locks, copy data, and pin pages.]


How do I write applications with non-blocking collectives?

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    ...
    MPI_Request req;
    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE, recvbuf, count, MPI_DOUBLE,
                  MPI_COMM_WORLD, &req);       /* sendbuf/recvbuf set up earlier */

    /* Computation that does not depend on the result of the Alltoall */
    int flag;
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);  /* check if complete (non-blocking) */

    /* More computation that does not depend on the result of the Alltoall */
    MPI_Wait(&req, MPI_STATUS_IGNORE);         /* wait until complete (blocking)   */

    MPI_Finalize();
}


XPMEM-based Shared Address Space Collectives

• A set of design slides will be added in this section based on the following paper accepted for IPDPS '18:

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018