
Page 1

Designing High-Performance and Scalable Collectives for the Many-core Era: The MVAPICH2 Approach

S. Chakraborty, M. Bayatpour, J. Hashmi, H. Subramoni and DK Panda, The Ohio State University

E-mail: {chakraborty.52, bayatpour.1, hashmi.29, subramoni.1, panda.2}@osu.edu

IXPUG '18 Presentation

Presenter: Jahanzeb Hashmi

Page 2

Parallel Programming Models Overview

[Diagram: three programming models side by side]
• Shared Memory Model (SHMEM, DSM): processes P1, P2, P3 operate on a single shared memory
• Distributed Memory Model (MPI, the Message Passing Interface): each process has its own memory
• Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, ...): per-process memories combined into a logical shared memory

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems, e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• Programming models offer various communication primitives (a minimal MPI sketch follows below):
  – Point-to-point (between a pair of processes/threads)
  – Remote Memory Access (directly access the memory of another process)
  – Collectives (group communication)
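
The following is a minimal sketch, not from the slides, showing all three classes of primitives in standard MPI; it assumes it is run with at least two processes:

/* Point-to-point, RMA, and collective primitives in one small program. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, my_val, peer_val = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    my_val = rank;

    /* Point-to-point: explicit send/recv between a pair of processes */
    if (rank == 0)
        MPI_Send(&my_val, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&peer_val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Remote Memory Access: directly read another process's exposed memory */
    MPI_Win_create(&my_val, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 1)
        MPI_Get(&peer_val, 1, MPI_INT, 0 /* target rank */, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);
    MPI_Win_free(&win);

    /* Collective: group communication involving every rank */
    MPI_Bcast(&my_val, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}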

Page 3

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

[Layered co-design diagram]
• Application kernels/applications
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication library or runtime for programming models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
• Networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path), multi-/many-core architectures, and accelerators (GPU and FPGA)

Middleware co-design presents opportunities and challenges across the various layers: performance, scalability, and resilience.

Page 4

Why Collective Communication Matters

[Bar chart: percentage of communication time spent in collective operations per application, with the dominant collectives annotated above each bar (source: http://www.hpcadvisorycouncil.com). Reading the bars in order: MPI-FFT 99% (Alltoall, Bcast); VASP 58% (Alltoall, Allreduce, Alltoall); AMG 21% (Allreduce, Bcast, Alltoall); COSMO 31% (Allreduce); Graph500 95% (Allreduce, Bcast); MiniFE 63% (Allreduce); MILC 23% (Allreduce); DL-POLY 27% (Allreduce); HOOMD-blue 65% (Allreduce, Bcast)]

• HPC Advisory Council (HPCAC) MPI application profiles
• Most application profiles showed the majority of communication time spent in collective operations
• Optimizing collective communication directly impacts scientific applications, leading to accelerated scientific discovery

Page 5

Are Collective Designs in MPI Ready for the Many-core Era?

[Plot: effective bandwidth (MB/s) vs. message size (1 B to 4 MB) for the Bruck's, Recursive-Doubling, Scatter-Destination, and Pairwise-Exchange Alltoall algorithms against the theoretical peak, on a single KNL 7250 in cache mode with 64 MPI processes using MVAPICH2-2.3rc1; the algorithms reach only ~35% of peak]

Why do the different algorithms of even a dense collective such as Alltoall fail to achieve the theoretical peak bandwidth offered by the system?

Page 6

Broad Challenges Due to Architectural Advances
• Exploiting the high concurrency and high bandwidth offered by modern architectures
• Designing "zero-copy" and "contention-free" collective communication
• Efficient hardware offloading for better overlap of communication and computation

How does MVAPICH2, as an MPI library, tackle these challenges and provide optimal collective designs for emerging multi-/many-cores?

Page 7

Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,875 organizations in 86 countries
  – More than 464,000 (>0.46 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '17 ranking):
    • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 9th: 556,104-core Oakforest-PACS in Japan
    • 12th: 368,928-core Stampede2 at TACC
    • 17th: 241,108-core Pleiades at NASA
    • 48th: 76,032-core Tsubame 2.5 at Tokyo Institute of Technology
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade

Page 8

Architecture of the MVAPICH2 Software Family

[Architecture diagram]
• High-performance parallel programming models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); hybrid MPI+X (MPI + PGAS + OpenMP/Cilk)
• High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
  – Transport protocols: RC, XRC, UD, DC
  – Modern features: UMR, ODP, SR-IOV, multi-rail; MCDRAM*, NVLink*, CAPI*, XPMEM* (* upcoming)
  – Transport mechanisms (shared memory): CMA, IVSHMEM

Page 9

MVAPICH2 Software Family

High-performance parallel programming libraries:
• MVAPICH2: support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X: advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
• MVAPICH2-GDR: optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt: high-performance and scalable MPI for hypervisor- and container-based HPC clouds
• MVAPICH2-EA: energy-aware and high-performance MPI
• MVAPICH2-MIC: optimized MPI for clusters with Intel KNC

Microbenchmarks:
• OMB: microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools:
• OSU INAM: network monitoring, profiling, and analysis for clusters, with MPI and scheduler integration
• OEMT: utility to measure the energy consumption of MPI applications

Page 10

Agenda
• Exploiting the high concurrency and high bandwidth offered by modern architectures for MPI collectives design
  – Point-to-point
  – Direct shared-memory
  – Data Partitioned Multi-Leader (DPML)
• Designing "zero-copy" and "contention-free" collective communication
  – Contention-aware designs
  – True zero-copy collectives
• Hardware offloading for better communication and computation overlap
  – SHARP-based offloaded collectives
  – CORE-Direct-based non-blocking collectives

Page 11

Collective Designs Based on Point-to-point Primitives
• A commonly used approach for implementing collectives
• Easy to express algorithms in message-passing semantics
• A naive Broadcast could be a series of "send" operations from the root to all the non-root processes
• Relies on the implementation of the point-to-point primitives
• Limited by the overheads exposed by these primitives:
  – Tag matching
  – Rendezvous handshake

Page 12

// A naive Bcast: root = 0, message size above the eager threshold,
// so every transfer goes through the rendezvous protocol
if (rank == 0) {
    for (int i = 1; i < n; i++) {
        MPI_Send(buf, count, MPI_BYTE, i, tag, MPI_COMM_WORLD);
    }
} else {
    MPI_Recv(buf, count, MPI_BYTE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

A Naive Example of MPI_Bcast Using MPI_Send/MPI_Recv

[Timeline diagram: the root issues one MPI_Send per receiver in sequence while each receiver sits in MPI_Recv; application skew and library overhead (tag matching) delay the transfers, and each receiver wastes CPU cycles waiting for the sender's rendezvous request]

• Overheads of the handshake for each rendezvous message transfer
• Is there a better way?

Page 13

Direct Shared Memory Based Collectives
• A large shared-memory region
  – Collective algorithms are realized through shared-memory copies and synchronizations
  – Good performance for small messages by exploiting cache locality
  – Avoids the overheads associated with MPI point-to-point implementations
• Requires one additional copy for each transfer
• Performance degrades for large-message communication
  – memcpy() is the dominant cost for large messages
• Most MPI libraries use some variant of direct SHMEM collectives (a sketch of the copy-in/copy-out pattern follows below)
• Reduction collectives perform even worse with SHMEM-based designs because of compute + memcpy
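
To make the copy-in/copy-out pattern concrete, here is a minimal, hypothetical sketch of a direct shared-memory broadcast (not MVAPICH2's actual implementation, and with barriers standing in for the real flag-based synchronization):

/* Hypothetical direct shared-memory broadcast: one copy into the
 * shared region by the root, one copy out by every other rank,
 * i.e., two copies per transfer. */
#include <mpi.h>
#include <string.h>

void shmem_bcast(void *buf, size_t len, int rank, MPI_Comm comm,
                 void *shared_region /* mapped into every rank */)
{
    if (rank == 0)
        memcpy(shared_region, buf, len);   /* copy-in by the root */

    MPI_Barrier(comm);                     /* data is now visible */

    if (rank != 0)
        memcpy(buf, shared_region, len);   /* copy-out by non-roots */

    MPI_Barrier(comm);                     /* region may be reused */
}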

[Plot: Personalized All-to-All latency vs. message size (1 KB to 4 MB) on KNL with 64 processes, comparing the Direct SHMEM and point-to-point designs]

Page 14

Data Partitioning Based Multi-Leader (DPML) Designs
• Hierarchical algorithms delegate a lot of the computation to the "node leader"
  – The leader process is responsible for inter-node reductions while the intra-node non-root processes wait for the leader
• Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in modern multi-/many-core processors
• DPML: a new solution for MPI_Allreduce that takes advantage of the parallelism offered by
  – Multi-/many-core architectures
  – The high throughput and high-end features offered by InfiniBand and Omni-Path
• Multiple partitions of the reduction vectors for an arbitrary number of leaders (a simplified sketch follows below)

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.
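
As a rough illustration of the idea, the sketch below is our own simplification, not the published DPML algorithm: each of num_leaders leaders per node reduces a disjoint partition of the vector, spreading both the reduction compute and the inter-node traffic across leaders. It assumes count is divisible by num_leaders and a uniform number of ranks per node.

/* Simplified multi-leader allreduce sketch (illustrative only). */
#include <mpi.h>

void multi_leader_allreduce(double *sendbuf, double *recvbuf, int count,
                            int num_leaders, MPI_Comm comm)
{
    int rank, node_rank;
    MPI_Comm node_comm, leader_comm;
    MPI_Comm_rank(comm, &rank);

    /* Group the ranks that share a node. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One communicator per leader slot, connecting the i-th leader of
     * every node (non-leaders get leader_comm == MPI_COMM_NULL). */
    int is_leader = (node_rank < num_leaders);
    MPI_Comm_split(comm, is_leader ? node_rank : MPI_UNDEFINED, rank,
                   &leader_comm);

    int part = count / num_leaders;

    /* Step 1: intra-node reduction of each partition onto its leader. */
    for (int i = 0; i < num_leaders; i++)
        MPI_Reduce(sendbuf + i * part, recvbuf + i * part, part,
                   MPI_DOUBLE, MPI_SUM, i /* leader for partition i */,
                   node_comm);

    /* Step 2: each leader allreduces its own partition across nodes. */
    if (is_leader)
        MPI_Allreduce(MPI_IN_PLACE, recvbuf + node_rank * part, part,
                      MPI_DOUBLE, MPI_SUM, leader_comm);

    /* Step 3: broadcast the fully reduced partitions within the node. */
    for (int i = 0; i < num_leaders; i++)
        MPI_Bcast(recvbuf + i * part, part, MPI_DOUBLE, i, node_comm);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}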

Page 15

Performance of DPML MPI_Allreduce on Different Networks and Architectures

[Plots: MPI_Allreduce latency (us, lower is better) vs. message size (8 KB to 256 KB) for MVAPICH2-2.2GA, DPML, and Intel MPI. Left: Xeon + InfiniBand, 64 nodes, 28 PPN, where DPML is 2X better at 256 KB. Right: KNL + Omni-Path, 64 nodes, 64 PPN, where DPML is 4X better]

• 2X improvement over MVAPICH2 at 256 KB
• Higher benefits of DPML as the message size increases
• Benefits of DPML sustained on KNL + Omni-Path even at 4,096 processes
• With 32 KB messages, 4X improvement over MVAPICH2

Page 16

Scalability of DPML Allreduce on Stampede2-KNL (10,240 Processes)

[Plots: OSU microbenchmark MPI_Allreduce latency (us, lower is better) vs. message size (4 B to 4 KB, and 8 KB to 256 KB) for MVAPICH2, DPML, and Intel MPI at 64 PPN]

• For MPI_Allreduce latency with 32 KB messages, the DPML design reduces latency by 2.4X
• Available in MVAPICH2-X 2.3b

Page 17

Performance Benefits of DPML Allreduce on the MiniAMR Kernel

[Plots: MiniAMR mesh refinement time (s, lower is better) vs. number of processes for Intel MPI, MVAPICH2, and DPML, on KNL + Omni-Path (32 PPN; 512, 1024, 1280 processes) and Xeon + Omni-Path (28 PPN; 448, 896, 1792 processes)]

• For the MiniAMR application with 4,096 processes, DPML reduces latency by 2.4X on a KNL + Omni-Path cluster
• On Xeon + Omni-Path with 1,792 processes, DPML reduces latency by 1.5X

Page 18

Agenda
• Exploiting the high concurrency and high bandwidth offered by modern architectures for MPI collectives design
  – Point-to-point
  – Direct shared-memory
  – Data Partitioned Multi-Leader (DPML)
• Designing "zero-copy" and "contention-free" collective communication
  – Contention-aware designs
  – True zero-copy collectives
• Hardware offloading for better communication and computation overlap
  – SHARP-based offloaded collectives
  – CORE-Direct-based non-blocking collectives

Page 19

How Kernel-assisted "Zero-copy" Works

[Diagrams: MPI sender and receiver with their send and receive buffers.
Shared memory (SHMEM): data moves through a shared-memory region, requiring two copies; no system-call overhead; better for small messages.
Kernel-assisted copy: the kernel copies directly between the two application buffers, requiring a single copy; incurs system-call overhead; better for large messages]
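
On Linux, Cross Memory Attach (CMA) exposes this single-copy path through the process_vm_readv()/process_vm_writev() system calls. A minimal sketch of a one-copy "get" from a peer process follows (error handling elided; the pid/address exchange is assumed to have happened earlier, e.g., during MPI initialization):

/* Copy len bytes directly from a peer process's send buffer into our
 * receive buffer: one system call, one kernel-performed copy. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

ssize_t cma_get(void *recv_buf, pid_t peer_pid, void *peer_send_buf,
                size_t len)
{
    struct iovec local  = { .iov_base = recv_buf,      .iov_len = len };
    struct iovec remote = { .iov_base = peer_send_buf, .iov_len = len };

    return process_vm_readv(peer_pid, &local, 1, &remote, 1, 0);
}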

Page 20

A Variety of Available "Zero-copy" Mechanisms

                      LiMIC          KNEM           CMA                      XPMEM
Permission check      Not supported  Supported      Supported                Supported
Availability          Kernel module  Kernel module  Included in Linux 3.2+   Kernel module
memcpy() invoked at   Kernel space   Kernel space   Kernel space             User space
memcpy() granularity  Page size      Page size      Page size                Any size

MPI library support:

                      LiMIC          KNEM           CMA                      XPMEM
MVAPICH2              yes            no             yes                      yes (upcoming release)
Open MPI 2.1.0        no             yes            yes                      yes
Intel MPI 2017        no             no             yes                      no
Cray MPI              no             no             yes                      yes

Cross Memory Attach (CMA) is a widely supported kernel-assisted mechanism.

Page 21

Direct Kernel-assisted (CMA-based) Collectives
• Direct algorithm designs based on a kernel-assisted zero-copy mechanism
  – "Map" the application buffer pages inside the kernel
  – Issue "Put" or "Get" operations directly on the application buffers
• Good performance for large messages
  – Avoids the unnecessary copy overheads of SHMEM
• Performance depends on the communication pattern of the collective primitive
• Does not offer "zero-copy" for reduction collectives

What about contention?

[Plot: CMA-based Personalized All-to-All latency (us) vs. message size (1 KB to 4 MB) on KNL with 64 processes, comparing Direct SHMEM, direct kernel-assisted, and point-to-point designs; the direct kernel-assisted design is >50% better]

Page 22

Impact of the Collective Communication Pattern on CMA Collectives

[Plots: latency (us) vs. message size (1 KB to 4 MB) at PPN 2 through 64. With different source processes (all-to-all pattern), scalability is good and latency does not increase with PPN; with the same source process and the same buffer, and with the same source process and different buffers (one-to-all patterns), scalability is poor and latency is >100x worse]

• All-to-All: good scalability; One-to-All: poor scalability
• Contention is at the process level (the same-buffer and different-buffer cases behave alike)

Page 23

Contention-aware CMA Collectives

[Plots: latency (us) vs. message size on KNL with 64 processes for MVAPICH2, Intel MPI, Open MPI, and the proposed design: MPI_Scatter (1 KB to 8 MB; ~5x better), MPI_Bcast, and MPI_Alltoall (1 KB to 4 MB); the proposed design uses SHMEM below a message-size threshold and CMA above it]

• Up to 5x and 2x improvement for MPI_Scatter and MPI_Bcast on KNL
• For Bcast, improvements are obtained for large messages only (p-1 copies with CMA vs. p copies with shared memory)
• Alltoall large-message performance is bound by the system bandwidth (5%-20% improvement)
• Fallback to SHMEM for small messages (a dispatch sketch follows below)

S. Chakraborty, H. Subramoni, and D. K. Panda, Contention-Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster '17, Best Paper Finalist.
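
The small-message fallback amounts to a size-based dispatch between the two transports. The sketch below is our own illustration; the threshold and names (cma_threshold, bcast_path) are hypothetical rather than MVAPICH2's tuned parameters:

/* Illustrative dispatch between the SHMEM and CMA paths for an
 * intra-node collective; names and threshold are hypothetical. */
#include <stddef.h>

enum bcast_path { BCAST_SHMEM, BCAST_CMA };

enum bcast_path choose_bcast_path(size_t msg_size, size_t cma_threshold)
{
    /* Small messages: copy through the shared region (cache friendly,
     * no system-call cost). Large messages: single-copy CMA, where the
     * contention-aware design additionally schedules reads to avoid
     * page-table lock contention on the source process. */
    return (msg_size < cma_threshold) ? BCAST_SHMEM : BCAST_CMA;
}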

Page 24

Multi-Node Scalability Using Two-Level Algorithms

[Plots: latency (us) vs. message size (1 KB to 4 MB) on KNL for MVAPICH2, Intel MPI, Open MPI, and the tuned CMA design: 2 nodes/128 processes (~2.5x better), 4 nodes/256 processes (~3.2x better), and 8 nodes/512 processes (~4x and ~17x better)]

• Significantly faster intra-node communication
• New two-level collective designs can be composed
• 4x-17x improvement in 8-node Scatter and Gather compared to default MVAPICH2

Can we have zero-copy "Reduction" collectives with this approach? Do you see the problem here?
1. Contention "avoidance", not removal
2. Reduction requires extra copies

Page 25

Shared Address Space (XPMEM-based) Collectives
• Offload the reduction computation and communication to peer MPI ranks
  – Every peer has direct "load/store" access to the other peers' buffers
  – Multiple pseudo-roots independently carry out the intra- and inter-node reductions
  – The reduced data is put directly into the root's receive buffer
• True "zero-copy" design for Allreduce and Reduce
  – No copies are required during the entire duration of the reduction operation
  – Scalable to multiple nodes
• Zero contention overhead, as the memory copies happen in "user space" (an XPMEM sketch follows below)

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.
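
The underlying load/store access is provided by the XPMEM kernel module. The following minimal sketch, based on the public XPMEM user API (error handling elided), shows how one rank might expose a buffer and a peer attach to it; after the attach, the peer reads and reduces the data with plain loads and stores, with no intermediate copies:

#include <stddef.h>
#include <xpmem.h>

/* Exposing rank: publish a segment covering buf and share the returned
 * segid with peers (e.g., via MPI_Allgather during setup). */
xpmem_segid_t expose_buffer(void *buf, size_t size)
{
    return xpmem_make(buf, size, XPMEM_PERMIT_MODE, (void *)0666);
}

/* Peer rank: map the exposed segment into its own address space and
 * obtain a direct load/store view of the remote buffer. */
void *attach_buffer(xpmem_segid_t segid, size_t size)
{
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE,
                                  (void *)0666);
    struct xpmem_addr addr = { .apid = apid, .offset = 0 };
    return xpmem_attach(addr, size, NULL);
}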

Page 26

Shared Address Space (XPMEM)-based Collectives Design

[Plots: latency (us) vs. message size (16 KB to 4 MB) on Broadwell with 256 processes for MVAPICH2, Intel MPI, and MVAPICH2-XPMEM: OSU_Allreduce (up to 1.8X better, with 37% and 50% gains marked at intermediate sizes) and OSU_Reduce (up to 4X better)]

• "Shared address space"-based true zero-copy reduction collective designs in MVAPICH2
• Computation/communication offloaded to the peer ranks in the reduction collective operation
• Up to 4X improvement for 4 MB Reduce and up to 1.8X improvement for 4 MB Allreduce
• Will be available in an upcoming MVAPICH2-X release

Page 27

Application-Level Benefits of XPMEM-based Collectives

[Plots: execution time (s, lower is better) vs. number of processes for Intel MPI, MVAPICH2, and MVAPICH2-XPMEM: CNTK AlexNet training (Broadwell, default batch size, 50 iterations, 28 PPN; 28-224 processes) and MiniAMR (Broadwell, 16 PPN; 16-256 processes)]

• Up to 20% benefit over Intel MPI for CNTK DNN training using Allreduce
• Up to 27% benefit over Intel MPI and up to 15% improvement over MVAPICH2 for the MiniAMR application kernel

Page 28

Agenda
• Exploiting the high concurrency and high bandwidth offered by modern architectures for MPI collectives design
  – Point-to-point
  – Direct shared-memory
  – Data Partitioned Multi-Leader (DPML)
• Designing "zero-copy" and "contention-free" collective communication
  – Contention-aware designs
  – True zero-copy collectives
• Hardware offloading for better communication and computation overlap
  – SHARP-based offloaded collectives
  – CORE-Direct-based non-blocking collectives

Page 29

Concept of Non-blocking Collectives
• Application processes schedule the collective operation
• They check periodically whether the operation is complete
• Overlap of computation and communication => better performance
• Catch: who will progress the communication?

[Diagram: each application process schedules the operation with a communication support entity, continues its computation, and periodically checks whether the operation is complete]

Page 30

Offloading with the Scalable Hierarchical Aggregation Protocol (SHArP)
• Management and execution of MPI operations in the network using SHArP
  – Manipulation of the data while it is being transferred in the switch network
• SHArP provides an abstraction to realize the reduction operation
  – Defines Aggregation Nodes (AN), an Aggregation Tree, and Aggregation Groups
• AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
• Uses RC for communication between ANs, and between ANs and hosts, in the Aggregation Tree*

[Diagrams: physical network topology* and logical SHArP tree*]

* Bloch et al., Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction.

Page 31

SHArP-based Blocking Allreduce Collective Designs in MVAPICH2

[Bar charts: execution time (s) vs. (number of nodes, PPN) at (4,28), (8,28), and (16,28) for MVAPICH2 and MVAPICH2-SHArP: average DDOT Allreduce time of HPCG and mesh refinement time of MiniAMR, with 12%-13% improvement]

• SHArP support is available since MVAPICH2 2.3a

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.

Page 32

Performance of the NUMA-aware SHArP Design on a Xeon + IB Cluster

[Plots, lower is better. OSU microbenchmark: Allreduce latency (us) vs. message size (4 B to 4 KB) on 16 nodes, 28 PPN, for MVAPICH2, the proposed NUMA-aware design, and MVAPICH2+SHArP, with up to 40% improvement. HPCG: DDOT timing (s) vs. number of processes (56, 224, 448) on 16 nodes, 28 PPN, with up to 23% improvement]

• As the message size decreases, the benefits of the socket-based design increase
• The NUMA-aware design reduces latency by up to 23% for the DDOT phase of HPCG and by up to 40% for microbenchmarks

Page 33

SHArP-based Non-blocking Allreduce in MVAPICH2

[Plots: MPI_Iallreduce benchmark at 1 PPN*, 8 nodes, message sizes 4 B to 128 B, for MVAPICH2 and MVAPICH2-SHArP: latency (us, lower is better), where MVAPICH2-SHArP is 2.3x better, and overlap (%, higher is better)]

• Complete offload of the Allreduce collective operation to the "switch"
  – Higher overlap of communication and computation
• Available since MVAPICH2 2.3a

*PPN: Processes Per Node
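
Benchmarks of this kind typically issue the non-blocking allreduce, run independent computation, and then wait. The sketch below is our own simplified version (standard MPI; compute_for() is a hypothetical busy-work helper):

/* Sketch of an overlap measurement for MPI_Iallreduce: issue the
 * non-blocking collective, do independent work, then wait. With
 * hardware offload (e.g., SHArP), the reduction progresses in the
 * network while compute_for() runs on the host. */
#include <mpi.h>

void compute_for(double usec);       /* hypothetical busy-work helper */

double overlapped_allreduce(double *in, double *out, int count,
                            double compute_us, MPI_Comm comm)
{
    MPI_Request req;
    double t0 = MPI_Wtime();

    MPI_Iallreduce(in, out, count, MPI_DOUBLE, MPI_SUM, comm, &req);
    compute_for(compute_us);         /* independent computation */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    return (MPI_Wtime() - t0) * 1e6; /* total time in microseconds;
                                        comparing it against the pure
                                        compute and pure communication
                                        times gives the overlap % */
}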

Page 34

NIC-offload-based Non-blocking Collectives Using CORE-Direct
• The Mellanox CORE-Direct technology allows offloading collective communication to the network adapter
• MVAPICH2 supports CORE-Direct-based offloading of non-blocking collectives
  – Covers all the non-blocking collectives
  – Enabled by configure-time and runtime parameters
• The CORE-Direct-based MPI_Ibcast design improves the performance of the High Performance Linpack (HPL) benchmark (see the usage sketch below)

Available since MVAPICH2-X 2.2a
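
At the application level the offload is transparent: the code calls the standard non-blocking broadcast and overlaps independent work before waiting, as in this minimal sketch (standard MPI; progressing the broadcast on the NIC is the runtime's job):

/* Standard non-blocking broadcast usage; when CORE-Direct offload is
 * enabled in the runtime, the NIC progresses the broadcast while the
 * host executes the independent work below. */
#include <mpi.h>

void ibcast_with_overlap(double *panel, int count, int root, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Ibcast(panel, count, MPI_DOUBLE, root, comm, &req);

    /* ... independent computation, e.g., updating a trailing matrix
     * block that does not depend on `panel` ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* panel is now valid everywhere */
}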

Page 35

Co-designing HPL with CORE-Direct, and Performance Benefits

[Plot: normalized HPL performance (higher is better) vs. problem size N (10%-70% of total memory) for HPL-Offload, HPL-1ring, and HPL-Host at 512 processes; HPL-Offload is up to 4.5% better at large problem sizes]

HPL performance comparison with 512 processes: HPL-Offload consistently offers higher throughput than HPL-1ring and HPL-Host, improving peak throughput by up to 4.5% for large problem sizes.

[Plot: throughput (GFlops) and memory consumption (%) vs. system size (64-512 processes) for HPL-Offload, HPL-1ring, and HPL-Host]

HPL-Offload surpasses the peak throughput of HPL-1ring with significantly smaller problem sizes and run times!

K. Kandalla, H. Subramoni, J. Vienne, S. Pai Raikar, K. Tomko, S. Sur, and DK Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HOTI 2011.

Page 36

Concluding Remarks
• Many-core nodes will be the foundation blocks of emerging exascale systems
• Communication mechanisms and runtimes need to be re-designed to take advantage of the high concurrency offered by many-cores
• Presented a set of novel designs for collective communication primitives in MPI that address several challenges
• Demonstrated the performance benefits of the proposed designs on a variety of multi-/many-cores and high-speed networks
• Some of these designs are already available in the MVAPICH2 libraries
• The new designs will be available in upcoming MVAPICH2 libraries

Page 37

Thank You!
[email protected]

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/

Page 38

Collective Communication in MVAPICH2

[Taxonomy diagram: blocking and non-blocking collective communication primitives in MPI]
• Multi-/many-core aware designs (communication only)
  – Intra-node communication: point-to-point (SHMEM, LiMIC, CMA, XPMEM); direct shared memory; direct kernel-assisted (CMA, XPMEM, LiMIC)
  – Inter-node communication: point-to-point; hardware multicast; SHARP; RDMA
• Designed for performance and overlap (communication + computation)

Page 39

Breakdown of a CMA Read Operation
• CMA relies on the get_user_pages() function
• It takes a page-table lock on the target process
• Lock contention increases with the number of concurrent readers
• Over 90% of the total time is spent in lock contention
• One-to-all communication on Broadwell, profiled using ftrace
• Lock contention is the root cause of the performance degradation
  – Present in other kernel-assisted schemes such as KNEM and LiMIC as well

[Stacked bar chart: latency (us, up to ~2000) for reads of 10, 50, and 100 pages under no contention, 14 readers, and 28 readers; time broken down into system call, permission check, lock acquisition, data copy, and page pinning]

Page 40

How Do I Write Applications with Non-blocking Collectives?

#include <mpi.h>

int main(int argc, char **argv)
{
    enum { COUNT = 16 };             /* elements sent to each rank */
    double sendbuf[COUNT * 64], recvbuf[COUNT * 64]; /* up to 64 ranks */
    MPI_Request req;
    int done = 0;

    MPI_Init(&argc, &argv);

    MPI_Ialltoall(sendbuf, COUNT, MPI_DOUBLE, recvbuf, COUNT, MPI_DOUBLE,
                  MPI_COMM_WORLD, &req);

    /* Computation that does not depend on the result of the Alltoall */

    MPI_Test(&req, &done, MPI_STATUS_IGNORE); /* check if complete (non-blocking) */

    /* Computation that does not depend on the result of the Alltoall */

    MPI_Wait(&req, MPI_STATUS_IGNORE);        /* wait till complete (blocking) */

    MPI_Finalize();
    return 0;
}

Page 41

XPMEM-based Shared Address Space Collectives
• A set of design slides will be added in this section based on the following paper accepted for IPDPS '18:

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.