Designing High-Performance and Scalable Collectives for the Many-core Era: The MVAPICH2 Approach

S. Chakraborty, M. Bayatpour, J. Hashmi, H. Subramoni and D. K. Panda
The Ohio State University
E-mail: {chakraborty.52, bayatpour.1, hashmi.29, subramoni.1, panda.2}@osu.edu

IXPUG '18 Presentation
Presenter: Jahanzeb Hashmi
Network Based Computing Laboratory
Parallel Programming Models Overview

• Shared Memory Model (SHMEM, DSM): processes P1, P2, P3 share a single memory
• Distributed Memory Model (MPI, the Message Passing Interface): each process has its own memory
• Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, …): per-process memories combined into a logical shared memory

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems, e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• Programming models offer various communication primitives
  – Point-to-point (between a pair of processes/threads)
  – Remote Memory Access (directly access the memory of another process)
  – Collectives (group communication)
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

The software stack, top to bottom:
• Application kernels/applications
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication library or runtime for programming models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
• Networking technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path)
• Multi-/many-core architectures and accelerators (GPU and FPGA)

Middleware co-design opportunities and challenges exist across the various layers: performance, scalability, resilience.
Why Collective Communication Matters?

HPC Advisory Council (HPCAC) MPI application profiles (http://www.hpcadvisorycouncil.com) report the percentage of communication time spent in collective operations:

Application   % in Collectives   Dominant Collectives
MPI-FFT       99                 Alltoall, Bcast
VASP          58                 Alltoall, Allreduce
AMG           21                 Allreduce, Bcast, Alltoall
COSMO         31                 Allreduce
Graph500      95                 Allreduce, Bcast
MiniFE        63                 Allreduce
MILC          23                 Allreduce
DL-POLY       27                 Allreduce
HOOMD-blue    65                 Allreduce, Bcast

• Most application profiles show the majority of communication time spent in collective operations
• Optimizing collective communication directly impacts scientific applications, leading to accelerated scientific discovery
Are Collective Designs in MPI Ready for the Many-core Era?

Why do different algorithms of even a dense collective such as Alltoall not achieve the theoretical peak bandwidth offered by the system?

[Figure: Effective bandwidth (MB/s) vs. message size (1 B to 4 MB) for Alltoall algorithms (Bruck's, Recursive-Doubling, Scatter-Destination, Pairwise-Exchange) against the theoretical peak, on a single KNL 7250 in cache mode with 64 MPI processes using MVAPICH2-2.3rc1. All algorithms reach only ~35% of peak.]
Broad Challenges due to Architectural Advances

• Exploiting the high concurrency and high bandwidth offered by modern architectures
• Designing "zero-copy" and "contention-free" collective communication
• Efficient hardware offloading for better overlap of communication and computation

How does MVAPICH2, as an MPI library, tackle these challenges and provide optimal collective designs for emerging multi-/many-cores?
Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,875 organizations in 86 countries
  – More than 464,000 (>0.46 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '17 ranking):
    • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 9th: 556,104-core Oakforest-PACS in Japan
    • 12th: 368,928-core Stampede2 at TACC
    • 17th: 241,108-core Pleiades at NASA
    • 48th: 76,032-core Tsubame 2.5 at Tokyo Institute of Technology
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering TOP500 systems for over a decade
Architecture of MVAPICH2 Software Family

High-performance parallel programming models:
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid MPI+X (MPI + PGAS + OpenMP/Cilk)

High-performance and scalable communication runtime with diverse APIs and mechanisms:
• Point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path):
• Transport protocols: RC, XRC, UD, DC
• Modern features: UMR, ODP, SR-IOV, Multi-Rail

Support for modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU):
• Transport mechanisms: shared memory, CMA, IVSHMEM
• Modern features: MCDRAM*, NVLink*, CAPI*, XPMEM*

* Upcoming
MVAPICH2 Software Family

High-performance parallel programming libraries:
• MVAPICH2: Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X: Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
• MVAPICH2-GDR: Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt: High-performance and scalable MPI for hypervisor- and container-based HPC cloud
• MVAPICH2-EA: Energy-aware and high-performance MPI
• MVAPICH2-MIC: Optimized MPI for clusters with Intel KNC

Microbenchmarks:
• OMB: Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools:
• OSU INAM: Network monitoring, profiling, and analysis for clusters, with MPI and scheduler integration
• OEMT: Utility to measure the energy consumption of MPI applications
Agenda

• Exploiting high concurrency and high bandwidth offered by modern architectures for MPI collectives design
  – Point-to-point
  – Direct shared-memory
  – Data Partitioned Multi-Leader (DPML)
• Designing "zero-copy" and "contention-free" collective communication
  – Contention-aware designs
  – True zero-copy collectives
• Hardware offloading for better communication and computation overlap
  – SHARP-based offloaded collectives
  – CORE-Direct-based non-blocking collectives
Collective Designs based on Point-to-point Primitives

• Commonly used approach for implementing collectives
• Easy to express algorithms in message-passing semantics
• A naïve broadcast could be a series of "send" operations from the root to all the non-root processes
• Relies on the implementation of the point-to-point primitives
• Limited by the overheads exposed by these primitives
  – Tag matching
  – Rendezvous handshake
A Naïve Example of MPI_Bcast using MPI_Send/MPI_Recv

// root = 0
// msg-size > eager-size (rendezvous path)
if (rank == 0) {
    for (int i = 1; i < n; i++) {
        MPI_Send(buf, count, MPI_BYTE, i, tag, MPI_COMM_WORLD);
    }
} else {
    MPI_Recv(buf, count, MPI_BYTE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

• The root issues n-1 serialized sends; each non-root posts a matching receive
• The rendezvous handshake overhead is paid for every message transfer
• Tag matching adds per-message library overhead
• Application skew and library overhead: each receiver wastes CPU cycles waiting for the sender's request
• Is there a better way?
Direct Shared Memory based Collectives

• A large shared-memory region
  – Collective algorithms are realized by shared-memory copies and synchronizations
  – Good performance for small messages by exploiting cache locality
  – Avoids overheads associated with MPI point-to-point implementations
• Requires one additional copy for each transfer
• Performance degradation for large-message communication
  – memcpy() is the dominant cost for large messages
• Most MPI libraries use some variant of Direct SHMEM collectives
• Reduction collectives perform even worse with the SHMEM-based design because of compute + memcpy

[Figure: Personalized All-to-All latency on KNL (64 processes), Direct SHMEM vs. point-to-point, for message sizes 1K to 4M]
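The extra-copy cost is easy to see in a toy model of the copy-in/copy-out scheme. This is a pure-Python sketch that only counts memcpy-equivalent operations; the staging region and rank count are illustrative, not MVAPICH2 code:

```python
def shmem_bcast(root_buf, n_ranks):
    """Model of a Direct SHMEM broadcast: the root copies its buffer into a
    shared staging region (copy-in), then every non-root rank copies the data
    out into its own receive buffer (one copy-out per receiver)."""
    copies = 0
    staging = bytes(root_buf)        # copy-in by the root
    copies += 1
    recv_bufs = []
    for _ in range(n_ranks - 1):     # copy-out by each receiver
        recv_bufs.append(bytes(staging))
        copies += 1
    return recv_bufs, copies

bufs, copies = shmem_bcast(b"payload", n_ranks=4)
assert all(b == b"payload" for b in bufs)
assert copies == 4  # 1 copy-in + 3 copy-outs; a single-copy scheme needs only 3
```

The model shows why large messages suffer: every byte crosses memory twice, so memcpy() dominates as the message grows.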
Data Partitioning based Multi-Leader (DPML) Designs

• Hierarchical algorithms delegate a lot of computation to the "node leader"
  – The leader process is responsible for inter-node reductions while intra-node non-root processes wait for the leader
• Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in modern multi-/many-core processors
• DPML: a new solution for MPI_Allreduce that takes advantage of the parallelism offered by
  – Multi-/many-core architectures
  – The high throughput and high-end features offered by InfiniBand and Omni-Path
• Multiple partitions of the reduction vectors for an arbitrary number of leaders

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.
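A minimal sketch of the data-partitioning idea follows. This pure-Python helper only models how a reduction vector can be split into one contiguous partition per leader; it is an illustration of the concept, not MVAPICH2's actual partitioning code:

```python
def dpml_partitions(vector_len, n_leaders):
    """Split the reduction vector into one contiguous partition per leader,
    so leaders can reduce their partitions concurrently instead of funneling
    the whole vector through a single node-leader."""
    base, extra = divmod(vector_len, n_leaders)
    bounds, lo = [], 0
    for leader in range(n_leaders):
        hi = lo + base + (1 if leader < extra else 0)  # spread the remainder
        bounds.append((lo, hi))
        lo = hi
    return bounds

# 10-element vector, 4 leaders: partition sizes differ by at most one element
assert dpml_partitions(10, 4) == [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each leader then performs the intra- and inter-node reduction for its own slice, which is where the extra parallelism of many-core nodes is recovered.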
Performance of DPML MPI_Allreduce on Different Networks and Architectures

[Figure: MPI_Allreduce latency (us) vs. message size (8K to 256K), comparing MVAPICH2-2.2GA, DPML, and IMPI. Lower is better.]

XEON + IB (64 nodes, 28 PPN):
• 2X improvement over MVAPICH2 at 256K
• Higher benefits of DPML as the message size increases

KNL + Omni-Path (64 nodes, 64 PPN):
• Benefits of DPML sustained on KNL + Omni-Path even at 4,096 processes
• With 32 Kbytes, 4X improvement over MVAPICH2
Scalability of DPML Allreduce on Stampede2-KNL (10,240 Processes)

[Figure: MPI_Allreduce latency (us) vs. message size (4 B to 4K, and 8K to 256K), OSU Micro Benchmark (64 PPN), comparing MVAPICH2, DPML, and IMPI. Lower is better.]

• For MPI_Allreduce latency with 32 Kbytes, the DPML design reduces latency by 2.4X
• Available in MVAPICH2-X 2.3b
Performance Benefits of DPML AllReduce on the MiniAMR Kernel

[Figure: Mesh refinement time (s) vs. number of processes on KNL + Omni-Path (32 PPN) and XEON + Omni-Path (28 PPN), comparing MVAPICH2, DPML, and IMPI. Lower is better.]

• For the MiniAMR application with 4,096 processes, DPML reduces latency by 2.4X on the KNL + Omni-Path cluster
• On XEON + Omni-Path with 1,792 processes, DPML reduces latency by 1.5X
Agenda

• Exploiting high concurrency and high bandwidth offered by modern architectures for MPI collectives design
  – Point-to-point
  – Direct shared-memory
  – Data Partitioned Multi-Leader (DPML)
• Designing "zero-copy" and "contention-free" collective communication
  – Contention-aware designs
  – True zero-copy collectives
• Hardware offloading for better communication and computation overlap
  – SHARP-based offloaded collectives
  – CORE-Direct-based non-blocking collectives
How Kernel-assisted "Zero-copy" Works

Shared memory (SHMEM):
• Requires two copies (send buffer to the shared-memory region, then shared-memory region to the receive buffer)
• No system-call overhead
• Better for small messages

Kernel-assisted copy:
• Requires a single copy (the kernel copies directly from the sender's buffer to the receiver's buffer)
• System-call overhead
• Better for large messages
A Variety of Available "Zero-copy" Mechanisms

                      LiMIC          KNEM           CMA                     XPMEM
Permission check      Not supported  Supported      Supported               Supported
Availability          Kernel module  Kernel module  Included in Linux 3.2+  Kernel module
memcpy() invoked at   Kernel-space   Kernel-space   Kernel-space            User-space
memcpy() granularity  Page size      Page size      Page size               Any size

MPI library support:
                  LiMIC  KNEM  CMA  XPMEM
MVAPICH2          √      x     √    √ (upcoming release)
OpenMPI 2.1.0     x      √     √    √
IntelMPI 2017     x      x     √    x
CrayMPI           x      x     √    √

Cross Memory Attach (CMA) is the most widely supported kernel-assisted mechanism.
Direct Kernel-assisted (CMA-based) Collectives

• Direct algorithm designs based on a kernel-assisted zero-copy mechanism
  – "Map" application buffer pages inside the kernel
  – Issue "Put" or "Get" operations directly on the application buffers
• Good performance for large messages
  – Avoids the unnecessary copy overheads of SHMEM
• Performance depends on the communication pattern of the collective primitive
• Does not offer "zero-copy" for reduction collectives
• What about contention?

[Figure: CMA-based Personalized All-to-All latency (us) on KNL (64 processes) for message sizes 1K to 4M; the direct kernel-assisted design is >50% better than Direct SHMEM and point-to-point for large messages]
Impact of Collective Communication Pattern on CMA Collectives

[Figure: Latency (us) vs. message size (1K to 4M) at PPN = 2, 4, 8, 16, 32, 64 under three patterns: All-to-All across different processes (good scalability; no increase with PPN), One-to-All reading the same process and same buffer (poor scalability, >100x worse), and One-to-All reading the same process but different buffers (poor scalability, >100x worse)]

• Since different buffers of the same process degrade just like a single buffer, the contention is at the process level
Contention-aware CMA Collectives

[Figure: Latency (us) vs. message size on KNL (64 processes) for MPI_Scatter, MPI_Bcast, and MPI_Alltoall, comparing MVAPICH2, IntelMPI, OpenMPI, and the proposed design; the proposed design uses SHMEM below a message-size threshold and CMA above it]

• Up to 5x and 2x improvement for MPI_Scatter and MPI_Bcast on KNL
• For Bcast, improvements are obtained for large messages only (p-1 copies with CMA vs. p copies with shared memory)
• Alltoall large-message performance is bound by system bandwidth (5%-20% improvement)
• Fallback to SHMEM for small messages

S. Chakraborty, H. Subramoni, and D. K. Panda, Contention-Aware Kernel-Assisted MPI Collectives for Multi-/Many-core Systems, IEEE Cluster '17, Best Paper Finalist.
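The SHMEM-for-small / CMA-for-large fallback amounts to a per-message dispatch on size. A minimal sketch of that selection logic follows; the 16 KB threshold here is a hypothetical tuning value for illustration, not MVAPICH2's actual crossover point:

```python
def pick_mechanism(msg_size, threshold=16 * 1024):
    """Choose the intra-node transport for a collective step:
    shared-memory copy-in/copy-out for small messages (no system call),
    CMA single-copy for large messages (the saved copy amortizes the
    system-call and page-pinning cost)."""
    return "SHMEM" if msg_size < threshold else "CMA"

assert pick_mechanism(1024) == "SHMEM"     # small: avoid syscall overhead
assert pick_mechanism(1 << 20) == "CMA"    # large: avoid the extra memcpy
```

In the real library such thresholds are architecture- and collective-specific and are typically set by tuning tables rather than a single constant.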
Multi-Node Scalability Using Two-Level Algorithms

[Figure: Latency (us) vs. message size on KNL at 2 nodes/128 procs (~2.5x better), 4 nodes/256 procs (~3.2x better), and 8 nodes/512 procs (~4x, up to ~17x better), comparing MVAPICH2, IntelMPI, OpenMPI, and Tuned CMA]

• Significantly faster intra-node communication
• New two-level collective designs can be composed
• 4x-17x improvement in 8-node Scatter and Gather compared to default MVAPICH2

Can we have zero-copy "reduction" collectives with this approach? Do you see the problem here?
1. Contention "avoidance", not removal
2. Reduction requires extra copies
Shared Address Space (XPMEM-based) Collectives

• Offload reduction computation and communication to peer MPI ranks
  – Every peer has direct "load/store" access to other peers' buffers
  – Multiple pseudo-roots independently carry out reductions for the intra- and inter-node phases
  – Directly put the reduced data into the root's receive buffer
• True "zero-copy" design for Allreduce and Reduce
  – No copies required during the entire duration of the reduction operation
  – Scalable to multiple nodes
• Zero contention overheads, as memory copies happen in "user-space"

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.
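A toy single-address-space model of the pseudo-root reduction is sketched below (pure Python; the partition count and buffers are illustrative, and the real design operates concurrently on XPMEM-mapped peer pages rather than Python lists):

```python
def xpmem_style_reduce(buffers, n_pseudo_roots):
    """Each pseudo-root reduces one partition of every rank's buffer
    directly into the result, with no intermediate staging copies:
    a direct load from each peer's buffer, a direct store into the
    destination."""
    n = len(buffers[0])
    result = [0] * n
    chunk = (n + n_pseudo_roots - 1) // n_pseudo_roots
    for leader in range(n_pseudo_roots):
        lo, hi = leader * chunk, min((leader + 1) * chunk, n)
        for buf in buffers:          # direct load from a peer's mapped buffer
            for i in range(lo, hi):  # direct store into the root's buffer
                result[i] += buf[i]
    return result

ranks = [[r + i for i in range(8)] for r in range(4)]  # 4 ranks, 8 elements
assert xpmem_style_reduce(ranks, n_pseudo_roots=2) == [sum(c) for c in zip(*ranks)]
```

Because every load and store targets the application buffers themselves, the model has zero staging copies, which is the property the XPMEM design exploits.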
Shared Address Space (XPMEM)-based Collectives Design

[Figure: OSU_Allreduce and OSU_Reduce latency (us) vs. message size (16K to 4M) on Broadwell with 256 processes, comparing MVAPICH2, IntelMPI, and MVAPICH2-XPMEM; annotated improvements of 37% and 50% at intermediate message sizes. Lower is better.]

• "Shared address space"-based true zero-copy reduction collective designs in MVAPICH2
• Computation/communication is offloaded to peer ranks in the reduction collective operation
• Up to 4X improvement for 4M Reduce and up to 1.8X improvement for 4M AllReduce
• Will be available in the upcoming MVAPICH2-X release
Application-Level Benefits of XPMEM-based Collectives

[Figure: Execution time (s) vs. number of processes for CNTK AlexNet training (Broadwell, B.S = default, 50 iterations, 28 PPN) and the MiniAMR kernel (Broadwell, 16 PPN), comparing IntelMPI, MVAPICH2, and MVAPICH2-XPMEM]

• Up to 20% benefit over IMPI for CNTK DNN training using AllReduce
• Up to 27% benefit over IMPI and up to 15% improvement over MVAPICH2 for the MiniAMR application kernel
Agenda

• Exploiting high concurrency and high bandwidth offered by modern architectures for MPI collectives design
  – Point-to-point
  – Direct shared-memory
  – Data Partitioned Multi-Leader (DPML)
• Designing "zero-copy" and "contention-free" collective communication
  – Contention-aware designs
  – True zero-copy collectives
• Hardware offloading for better communication and computation overlap
  – SHARP-based offloaded collectives
  – CORE-Direct-based non-blocking collectives
Concept of Non-blocking Collectives

• Application processes schedule the collective operation
• They check periodically whether the operation is complete
• Overlap of computation and communication => better performance
• Catch: who will progress the communication?

(Diagram: each application process hands the operation off to a communication support entity, continues its own computation, and periodically checks whether the communication has completed.)
Offloading with Scalable Hierarchical Aggregation Protocol (SHArP)

• Management and execution of MPI operations in the network by using SHArP
  – Manipulation of data while it is being transferred in the switch network
• SHArP provides an abstraction to realize the reduction operation
  – Defines Aggregation Nodes (AN), an Aggregation Tree, and Aggregation Groups
• The AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
• Uses RC for communication between ANs, and between an AN and the hosts in the Aggregation Tree*

(Figures: physical network topology and the logical SHArP tree.)

* Bloch et al., Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction
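The aggregation-tree reduction can be modeled in a few lines. This pure-Python sketch only captures the dataflow (the tree shape and vectors are illustrative; the real protocol runs in switch hardware, not software):

```python
def sharp_style_allreduce(tree, leaf_vectors):
    """Toy model of in-network aggregation: each aggregation node (switch)
    reduces the vectors arriving from its children before forwarding one
    combined vector up the tree, so hosts never exchange full data pairwise.
    `tree` maps a node to its children; leaves are hosts carrying vectors."""
    def reduce_up(node):
        children = tree.get(node, [])
        if not children:                     # a host contributes its own vector
            return leaf_vectors[node]
        partials = [reduce_up(c) for c in children]
        return [sum(vals) for vals in zip(*partials)]  # reduction in the "switch"
    return reduce_up("root")

tree = {"root": ["sw0", "sw1"], "sw0": ["h0", "h1"], "sw1": ["h2", "h3"]}
vecs = {"h0": [1, 2], "h1": [3, 4], "h2": [5, 6], "h3": [7, 8]}
assert sharp_style_allreduce(tree, vecs) == [16, 20]
```

The key property the model shows: the volume leaving each subtree is one vector regardless of how many hosts it contains, which is what makes the hardware offload scalable.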
SHArP-based Blocking Allreduce Collective Designs in MVAPICH2

[Figure: Execution time (s) vs. (number of nodes, PPN) for the mesh refinement time of MiniAMR (13% improvement) and the average DDOT Allreduce time of HPCG (12% improvement), comparing MVAPICH2 and MVAPICH2-SHArP]

• SHArP support is available since MVAPICH2 2.3a

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.
Performance of the NUMA-aware SHArP Design on a XEON + IB Cluster

[Figure: OSU Micro Benchmark latency (us) vs. message size (4 B to 4K; 16 nodes, 28 PPN) and HPCG DDOT timing (s) vs. number of processes (16 nodes, 28 PPN), comparing MVAPICH2, the proposed NUMA-aware design, and MVAPICH2 + SHArP. Lower is better.]

• As the message size decreases, the benefits of the socket-based design increase
• The NUMA-aware design reduces latency by up to 23% for the DDOT phase of HPCG and up to 40% for micro-benchmarks
SHArP-based Non-blocking Allreduce in MVAPICH2

[Figure: MPI_Iallreduce benchmark on 8 nodes at 1 PPN (processes per node), vs. message size (4 to 128 bytes), comparing MVAPICH2 and MVAPICH2-SHArP: pure communication latency (us; lower is better, up to 2.3x improvement) and communication/computation overlap (%; higher is better)]

• Complete offload of the Allreduce collective operation to the "switch"
  – Higher overlap of communication and computation
• Available since MVAPICH2 2.3a
NIC-offload-based Non-blocking Collectives using CORE-Direct

• Mellanox CORE-Direct technology allows offloading the collective communication to the network adapter
• MVAPICH2 supports CORE-Direct-based offloading of non-blocking collectives
  – Covers all the non-blocking collectives
  – Enabled by configure and runtime parameters
• The CORE-Direct-based MPI_Ibcast design improves the performance of the High Performance Linpack (HPL) benchmark
• Available since MVAPICH2-X 2.2a
Co-designing HPL with CORE-Direct, and the Performance Benefits

[Figure: Normalized HPL performance vs. problem size (N) as % of total memory, with 512 processes, comparing HPL-Offload, HPL-1ring, and HPL-Host. Higher is better.]
• HPL-Offload consistently offers higher throughput than HPL-1ring and HPL-Host; it improves peak throughput by up to 4.5% for large problem sizes

[Figure: Throughput (GFlops) and memory consumption (%) vs. system size (64 to 512 processes) for HPL-Offload, HPL-1ring, and HPL-Host]
• HPL-Offload surpasses the peak throughput of HPL-1ring with significantly smaller problem sizes and run-times

K. Kandalla, H. Subramoni, J. Vienne, S. Pai Raikar, K. Tomko, S. Sur, and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HOTI 2011.
Concluding Remarks

• Many-core nodes will be the foundation blocks of emerging Exascale systems
• Communication mechanisms and runtimes need to be re-designed to take advantage of the high concurrency offered by many-cores
• Presented a set of novel designs for collective communication primitives in MPI that address several of these challenges
• Demonstrated the performance benefits of the proposed designs on a variety of multi-/many-cores and high-speed networks
• Some of these designs are already available in the MVAPICH2 libraries; the new designs will be available in upcoming MVAPICH2 releases
Thank You!
hashmi.29@osu.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Collective Communication in MVAPICH2

Blocking and non-blocking collective communication primitives in MPI, with multi-/many-core-aware designs:
• Communication-only designs (designed for performance)
  – Intra-node: point-to-point (SHMEM, LiMIC, CMA, XPMEM); direct shared memory; direct kernel-assisted (CMA, XPMEM, LiMIC)
  – Inter-node: point-to-point; hardware multicast; SHARP; RDMA
• Communication + computation designs (designed for overlap)
Breakdown of a CMA Read Operation

• CMA relies on the get_user_pages() function
• It takes a page-table lock on the target process
• Lock contention increases with the number of concurrent readers
• Over 90% of total time is spent in lock contention
• One-to-all communication on Broadwell, profiled using ftrace
• Lock contention is the root cause of the performance degradation
  – Present in other kernel-assisted schemes such as KNEM and LiMIC as well

[Figure: Latency (us) breakdown of a CMA read (system call, permission check, acquire locks, copy data, pin pages) for 10/50/100 pages under no contention, 14 readers, and 28 readers]
How do I write applications with non-blocking collectives?

int main(void)
{
    MPI_Init(NULL, NULL);
    .....
    MPI_Ialltoall(..., &request);
    /* Computation that does not depend on the result of the Alltoall */
    MPI_Test(&request, &flag, MPI_STATUS_IGNORE);  /* Check if complete (non-blocking) */
    /* More computation that does not depend on the result of the Alltoall */
    MPI_Wait(&request, MPI_STATUS_IGNORE);  /* Wait till complete (blocking) */
    ...
    MPI_Finalize();
}
XPMEM-based Shared Address Space Collectives

• A set of design slides will be added in this section, based on the following paper accepted for IPDPS '18:

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.