
Designing High-Performance and Scalable Collectives for the Many-core Era: The MVAPICH2 Approach

S. Chakraborty, M. Bayatpour, J. Hashmi, H. Subramoni and D. K. Panda
The Ohio State University

E-mail: {chakraborty.52, bayatpour.1, hashmi.29, subramoni.1, panda.2}@osu.edu

IXPUG'18 Presentation

Presenter: Jahanzeb Hashmi


Parallel Programming Models Overview

[Figure: three abstract machine models, each with processes P1, P2, P3 — the Shared Memory Model (one shared memory; e.g., SHMEM, DSM), the Distributed Memory Model (a private memory per process; e.g., MPI, the Message Passing Interface), and the Partitioned Global Address Space (PGAS) model (private memories plus a logical shared memory; e.g., Global Arrays, UPC, Chapel, X10, CAF, ...).]

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• Programming models offer various communication primitives:
  – Point-to-point (between a pair of processes/threads)
  – Remote Memory Access (directly access the memory of another process)
  – Collectives (group communication)


Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: middleware co-design stack — Application Kernels/Applications; Programming Models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.); Communication Library or Runtime for Programming Models (point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance); Networking Technologies (InfiniBand, 40/100 GigE, Aries, and Omni-Path); Multi-/Many-core Architectures; Accelerators (GPU and FPGA). Middleware co-design exposes opportunities and challenges across these layers, targeting performance, scalability, and resilience.]


Why Collective Communication Matters?

[Figure: percentage of communication time spent in collective operations (%) for MPI-FFT, VASP, AMG, COSMO, Graph500, MiniFE, MILC, DL-POLY, and HOOMD-blue, ranging from roughly 23% to 99%; the dominant collectives per application include Alltoall, Bcast, and Allreduce. Source: http://www.hpcadvisorycouncil.com]

• HPC Advisory Council (HPCAC) MPI application profiles
• Most application profiles show that the majority of communication time is spent in collective operations
• Optimizing collective communication directly impacts scientific applications, leading to accelerated scientific discovery


Are Collective Designs in MPI Ready for the Many-core Era?

Why do different algorithms of even a dense collective such as Alltoall fail to achieve the theoretical peak bandwidth offered by the system?

[Figure: effective bandwidth (MB/s) of Alltoall algorithms (Bruck's, Recursive-Doubling, Scatter-Destination, Pairwise-Exchange) versus message size (1 B to 4 MB) on a single KNL 7250 in cache mode with 64 MPI processes, using MVAPICH2-2.3rc1; the best algorithm reaches only ~35% of the theoretical peak.]


Broad Challenges due to Architectural Advances

• Exploiting the high concurrency and high bandwidth offered by modern architectures
• Designing "zero-copy" and "contention-free" collective communication
• Efficient hardware offloading for better overlap of communication and computation

How does MVAPICH2, as an MPI library, tackle these challenges and provide optimal collective designs for emerging multi-/many-core systems?


Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,875 organizations in 86 countries
  – More than 464,000 (>0.46 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '17 ranking):
    • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
    • 9th, 556,104-core (Oakforest-PACS) in Japan
    • 12th, 368,928-core (Stampede2) at TACC
    • 17th, 241,108-core (Pleiades) at NASA
    • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering TOP500 systems for over a decade


Architecture of MVAPICH2 Software Family

[Figure: high-performance parallel programming models — Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI+X (MPI + PGAS + OpenMP/Cilk) — sit on top of a high-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis. The runtime supports modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) with transport protocols (RC, XRC, UD, DC), modern features (UMR, ODP, SR-IOV, multi-rail), shared-memory transport mechanisms (CMA, IVSHMEM, XPMEM*), and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU) with features such as MCDRAM*, NVLink*, and CAPI*. (* upcoming)]


MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC clouds
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC

Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications


Agenda

• Exploiting the high concurrency and high bandwidth offered by modern architectures for MPI collectives design
  – Point-to-point
  – Direct shared-memory
  – Data Partitioned Multi-Leader (DPML)
• Designing "zero-copy" and "contention-free" collective communication
  – Contention-aware designs
  – True zero-copy collectives
• Hardware offloading for better communication and computation overlap
  – SHARP-based offloaded collectives
  – CORE-Direct-based non-blocking collectives


Collective Designs based on Point-to-point Primitives

• Commonly used approach in implementing collectives
• Easy to express algorithms in message-passing semantics
• A naïve Broadcast could be a series of "send" operations from the root to all the non-root processes
• Relies on the implementation of point-to-point primitives
• Limited by the overheads exposed by these primitives
  – Tag matching
  – Rendezvous hand-shake


A Naïve Example of MPI_Bcast Using MPI_Send/MPI_Recv

/* Naïve MPI_Bcast built from point-to-point calls.
 * root = 0; the message size is larger than the eager threshold,
 * so every transfer goes through the rendezvous protocol. */
if (rank == 0) {
    for (int i = 1; i < n; i++) {       /* n = number of processes */
        MPI_Send(buf, count, MPI_BYTE, i, tag, MPI_COMM_WORLD);
    }
} else {
    MPI_Recv(buf, count, MPI_BYTE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

[Figure: timeline of the naïve broadcast — the root issues MPI_Send to each non-root process in turn while the receivers block in MPI_Recv; application skew, tag matching, and library overhead delay each transfer, and each receiver wastes CPU cycles waiting for the sender's rendezvous request.]

• Overheads of a handshake for each rendezvous message transfer
• Is there a better way?


Direct Shared Memory based Collectives

• A large shared-memory region
  – Collective algorithms are realized by shared-memory copies and synchronizations
  – Good performance for small messages by exploiting cache locality
  – Avoids the overheads associated with MPI point-to-point implementations
• Requires one additional copy for each transfer
• Performance degradation for large-message communication
  – memcpy() is the dominant cost for large messages
• Most MPI libraries use some variant of direct SHMEM collectives
• Reduction collectives perform even worse with the SHMEM-based design because of compute + memcpy (a minimal copy-in/copy-out sketch follows the figure below)

[Figure: personalized all-to-all latency versus message size (1 KB to 4 MB) on KNL with 64 processes, comparing the direct SHMEM and point-to-point based designs.]
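As a minimal sketch of the copy-in/copy-out pattern behind such direct SHMEM designs (not the MVAPICH2 implementation, which uses its own internal shared region and synchronization), the fragment below emulates an intra-node broadcast over an MPI-3 shared-memory window; shmem_bcast_sketch, user_buf, and bytes are illustrative names.

#include <mpi.h>
#include <string.h>

/* Sketch only: copy-in/copy-out broadcast of 'bytes' bytes from node-local
 * rank 0 through a shared segment owned by that rank. */
void shmem_bcast_sketch(void *user_buf, size_t bytes)
{
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    MPI_Win win;
    char *seg;
    MPI_Win_allocate_shared(node_rank == 0 ? (MPI_Aint)bytes : 0, 1,
                            MPI_INFO_NULL, node_comm, &seg, &win);
    MPI_Aint seg_size; int disp_unit;
    MPI_Win_shared_query(win, 0, &seg_size, &disp_unit, &seg);  /* rank 0's segment */

    MPI_Win_fence(0, win);                              /* open access epoch    */
    if (node_rank == 0) memcpy(seg, user_buf, bytes);   /* copy-in by the root  */
    MPI_Win_fence(0, win);                              /* data visible to all  */
    if (node_rank != 0) memcpy(user_buf, seg, bytes);   /* copy-out by the rest */
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
}

The two memcpy() calls are exactly the two copies noted above, which is why this approach wins for small, cache-resident messages but loses to single-copy schemes once memcpy() dominates.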


Data Partitioning based Multi-Leader (DPML) Designs

• Hierarchical algorithms delegate a lot of computation to the "node leader"
  – The leader process is responsible for inter-node reductions while intra-node non-root processes wait for the leader
• Existing designs for MPI_Allreduce do not take advantage of the vast parallelism available in modern multi-/many-core processors
• DPML – a new solution for MPI_Allreduce
• Takes advantage of the parallelism offered by
  – Multi-/many-core architectures
  – High throughput and high-end features offered by InfiniBand and Omni-Path
• Multiple partitions of the reduction vectors for an arbitrary number of leaders (a simplified sketch appears after the reference below)

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.
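The core DPML idea can be approximated with plain MPI calls as below; this is a hypothetical sketch, not the MVAPICH2 code, which uses shared memory and tuned partition sizes rather than nested collectives. The names dpml_allreduce_sketch, nleaders, sendbuf, and recvbuf are placeholders, count is assumed divisible by nleaders, and nleaders is assumed not to exceed the number of ranks per node.

#include <mpi.h>

/* Hypothetical data-partitioned multi-leader Allreduce on doubles. */
void dpml_allreduce_sketch(const double *sendbuf, double *recvbuf,
                           int count, int nleaders)
{
    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);
    int is_leader = (node_rank < nleaders);
    /* leader i of every node joins inter-node communicator i */
    MPI_Comm_split(MPI_COMM_WORLD, is_leader ? node_rank : MPI_UNDEFINED,
                   0, &leader_comm);

    int chunk = count / nleaders;
    for (int l = 0; l < nleaders; l++)   /* reduce partition l onto leader l (intra-node) */
        MPI_Reduce(sendbuf + l * chunk, recvbuf + l * chunk, chunk,
                   MPI_DOUBLE, MPI_SUM, l, node_comm);
    if (is_leader)                       /* each leader reduces its partition across nodes */
        MPI_Allreduce(MPI_IN_PLACE, recvbuf + node_rank * chunk, chunk,
                      MPI_DOUBLE, MPI_SUM, leader_comm);
    for (int l = 0; l < nleaders; l++)   /* redistribute every partition intra-node */
        MPI_Bcast(recvbuf + l * chunk, chunk, MPI_DOUBLE, l, node_comm);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}

All nleaders partitions are reduced concurrently instead of funneling the whole vector through a single node leader, which is where the extra parallelism comes from.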


Performance of DPML MPI_Allreduce on Different Networks and Architectures

[Figure: MPI_Allreduce latency (us, lower is better) versus message size (8 KB to 256 KB) for MVAPICH2-2.2GA, DPML, and IMPI on XEON + IB (64 nodes, 28 PPN) and KNL + Omni-Path (64 nodes, 64 PPN).]

• 2X improvement over MVAPICH2 at 256 KB on XEON + IB
• Higher benefits of DPML as the message size increases
• Benefits of DPML sustained on KNL + Omni-Path even at 4096 processes
• With 32 KB messages, 4X improvement over MVAPICH2


Scalability of DPML Allreduce on Stampede2-KNL (10,240 Processes)

[Figure: OSU micro-benchmark MPI_Allreduce latency (us, lower is better) versus message size (4 B to 4 KB and 8 KB to 256 KB) at 64 PPN, comparing MVAPICH2, DPML, and IMPI.]

• For MPI_Allreduce latency with 32 KB messages, the DPML design reduces the latency by 2.4X
• Available in MVAPICH2-X 2.3b


Performance Benefits of DPML Allreduce on the MiniAMR Kernel

[Figure: MiniAMR mesh refinement time (s, lower is better) versus number of processes for MVAPICH2, DPML, and IMPI on KNL + Omni-Path (32 PPN) and XEON + Omni-Path (28 PPN).]

• For the MiniAMR application with 4096 processes, DPML reduces the latency by 2.4X on the KNL + Omni-Path cluster
• On XEON + Omni-Path with 1792 processes, DPML reduces the latency by 1.5X


Agenda

• Exploiting the high concurrency and high bandwidth offered by modern architectures for MPI collectives design
  – Point-to-point
  – Direct shared-memory
  – Data Partitioned Multi-Leader (DPML)
• Designing "zero-copy" and "contention-free" collective communication
  – Contention-aware designs
  – True zero-copy collectives
• Hardware offloading for better communication and computation overlap
  – SHARP-based offloaded collectives
  – CORE-Direct-based non-blocking collectives


How Does Kernel-assisted "Zero-copy" Work?

[Figure: two intra-node transfer paths between an MPI sender and an MPI receiver — (a) shared memory (SHMEM): the send buffer is copied into a shared-memory region and then into the receive buffer (two copies, no system-call overhead, better for small messages); (b) kernel-assisted copy: the kernel copies directly from the send buffer to the receive buffer (a single copy, but with system-call overhead, better for large messages).]
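The single-copy path in the figure maps to the Cross Memory Attach (CMA) system calls on Linux. The sketch below is a minimal illustration, not MVAPICH2 code: a receiver pulls len bytes directly out of a sender's buffer, assuming the sender's pid and remote buffer address have already been exchanged, e.g., through a small shared-memory control region; cma_read is an illustrative name.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>          /* process_vm_readv(), Linux 3.2+ (CMA) */

/* Receiver side: one kernel-mediated copy straight out of the sender's
 * address space, with no intermediate shared-memory staging. */
static ssize_t cma_read(pid_t sender_pid, void *remote_addr,
                        void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    /* One system call per transfer is the price of the single copy. */
    return process_vm_readv(sender_pid, &local, 1, &remote, 1, 0);
}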


A Variety of Available "Zero"-Copy Mechanisms

                      LiMIC          KNEM           CMA                     XPMEM
Permission check      Not supported  Supported      Supported               Supported
Availability          Kernel module  Kernel module  Included in Linux 3.2+  Kernel module
memcpy() invoked at   Kernel-space   Kernel-space   Kernel-space            User-space
memcpy() granularity  Page size      Page size      Page size               Any size

MPI library support:
                  LiMIC  KNEM  CMA  XPMEM
MVAPICH2          √      x     √    √ (upcoming release)
Open MPI 2.1.0    x      √     √    √
Intel MPI 2017    x      x     √    x
Cray MPI          x      x     √    √

Cross Memory Attach (CMA) is the most widely supported kernel-assisted mechanism


Direct Kernel-assisted (CMA-based) Collectives

• Direct algorithm designs based on the kernel-assisted zero-copy mechanism
  – "Map" application buffer pages inside the kernel
  – Issue "Put" or "Get" operations directly on the application buffers (see the sketch below)
• Good performance for large messages
  – Avoids the unnecessary copy overheads of SHMEM
• Performance depends on the communication pattern of the collective primitive
• Does not offer "zero-copy" for reduction collectives

What about contention?

[Figure: CMA-based personalized all-to-all latency (us) versus message size (1 KB to 4 MB) on KNL with 64 processes — the direct kernel-assisted design is >50% better than the direct SHMEM and point-to-point based designs for large messages.]
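To illustrate how a collective can be layered directly on such Put/Get operations, the sketch below shows the data movement of a flat CMA-based scatter using process_vm_writev(); it is a hypothetical fragment, not the library's code. cma_scatter_root, child_pid[], and child_addr[] are placeholders for the children's pids and exported receive-buffer addresses, which a real implementation would exchange during communicator setup.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>          /* process_vm_writev(), Linux 3.2+ (CMA) */

/* Root writes chunk i of sendbuf straight into child i's receive buffer ("Put"). */
static void cma_scatter_root(char *sendbuf, size_t chunk, int nprocs,
                             const pid_t *child_pid, void *const *child_addr)
{
    for (int i = 1; i < nprocs; i++) {
        struct iovec local  = { .iov_base = sendbuf + (size_t)i * chunk,
                                .iov_len  = chunk };
        struct iovec remote = { .iov_base = child_addr[i], .iov_len = chunk };
        process_vm_writev(child_pid[i], &local, 1, &remote, 1, 0);
    }
    /* A real design still needs a flag or barrier so children know the data arrived. */
}

Note that every kernel copy here targets a distinct child process; how such one-to-all and all-to-all patterns behave under contention is exactly what the next slide examines.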


Impact of Collective Communication Pattern on CMA Collectives

[Figure: latency (us) versus message size (1 KB to 4 MB) for PPN counts from 2 to 64 under three traffic patterns — all-to-all (each reader targets a different process): good scalability, with no increase in latency as PPN grows; one-to-all targeting the same process and the same buffer: poor scalability, >100x worse; one-to-all targeting the same process but different buffers: also poor scalability, >100x worse.]

Contention is at the process level


Contention-aware CMA Collectives

[Figure: latency (us) versus message size on KNL with 64 processes for MPI_Scatter, MPI_Bcast, and MPI_AlltoAll, comparing MVAPICH2, Intel MPI, Open MPI, and the proposed contention-aware design (~5x better for Scatter); the proposed design uses SHMEM for small messages and CMA for large messages.]

• Up to 5x and 2x improvement for MPI_Scatter and MPI_Bcast on KNL
• For Bcast, improvements are obtained for large messages only (p-1 copies with CMA versus p copies with shared memory)
• AlltoAll large-message performance is bound by the system bandwidth (5%-20% improvement)
• Fallback to SHMEM for small messages

S. Chakraborty, H. Subramoni, and D. K. Panda, Contention-Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster '17, Best Paper Finalist


Multi-Node Scalability Using Two-Level Algorithms

[Figure: latency (us) versus message size on KNL clusters of 2 nodes (128 procs), 4 nodes (256 procs), and 8 nodes (512 procs), comparing MVAPICH2, Intel MPI, Open MPI, and the tuned CMA design (~2.5x to ~17x better).]

• Significantly faster intra-node communication
• New two-level collective designs can be composed (see the sketch below)
• 4x-17x improvement in 8-node Scatter and Gather compared to default MVAPICH2

Can we have zero-copy "Reduction" collectives with this approach? Do you see the problem here?
1. Contention "avoidance" – not removal
2. Reduction requires extra copies
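A two-level collective of this kind can be composed from sub-communicators; the fragment below is a hypothetical sketch of the structure for Gather, not the tuned implementation. two_level_gather_sketch, node_buf, and recvbuf are placeholder names, all nodes are assumed to run the same PPN, and the result at the root is laid out node by node, so a production design still has to map it back to communicator rank order.

#include <mpi.h>

/* Two-level MPI_Gather: intra-node gather onto the node leader (where the
 * contention-aware CMA design plugs in), then an inter-node gather among
 * the leaders onto the global root. node_buf must hold one 'count'-byte
 * block per local rank on the leaders. */
void two_level_gather_sketch(const void *sendbuf, void *node_buf,
                             void *recvbuf, int count)
{
    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    /* node leaders (local rank 0) form their own communicator */
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   0, &leader_comm);

    /* Level 1: intra-node gather to the leader. */
    MPI_Gather(sendbuf, count, MPI_BYTE, node_buf, count, MPI_BYTE, 0, node_comm);
    /* Level 2: leaders gather the per-node blocks onto the global root. */
    if (node_rank == 0)
        MPI_Gather(node_buf, count * node_size, MPI_BYTE,
                   recvbuf, count * node_size, MPI_BYTE, 0, leader_comm);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}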


Shared Address Space (XPMEM-based) Collectives

• Offload reduction computation and communication to peer MPI ranks
  – Every peer has direct "load/store" access to other peers' buffers (illustrated in the sketch below)
  – Multiple pseudo-roots independently carry out reductions for the intra- and inter-node phases
  – Directly put the reduced data into the root's receive buffer
• True "zero-copy" design for Allreduce and Reduce
  – No copies are required during the entire duration of the reduction operation
  – Scalable to multiple nodes
• Zero contention overheads, as memory copies happen in "user space"

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.
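A rough illustration of the load/store access that makes this possible is shown below, assuming the XPMEM user-space API (xpmem_make / xpmem_get / xpmem_attach) provided by the XPMEM kernel module; bufA, localB, and len are placeholders, the exported segment id is assumed to be shipped to the peer out of band, and error handling plus cleanup (xpmem_detach / xpmem_release / xpmem_remove) are omitted. The actual MVAPICH2 design layers its reduction scheduling on top of such attachments.

#include <xpmem.h>            /* XPMEM user-space API */

/* Peer A: expose 'len' bytes of its reduction buffer to other ranks. */
xpmem_segid_t segid = xpmem_make(bufA, len, XPMEM_PERMIT_MODE, (void *)0666);
/* ... ship 'segid' to peer B out of band (e.g., via a shared-memory region) ... */

/* Peer B: map A's buffer into its own address space and reduce with loads/stores. */
xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE, (void *)0666);
struct xpmem_addr addr = { .apid = apid, .offset = 0 };
double *remoteA = (double *)xpmem_attach(addr, len, NULL);
for (size_t i = 0; i < len / sizeof(double); i++)
    localB[i] += remoteA[i];  /* pure load/store: no memcpy, no kernel copy */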


Shared Address Space (XPMEM)-based Collectives Design

[Figure: OSU_Allreduce and OSU_Reduce latency (us) versus message size (16 KB to 4 MB) on Broadwell with 256 processes, comparing MVAPICH2, Intel MPI, and MVAPICH2-XPMEM.]

• "Shared address space"-based true zero-copy reduction collective designs in MVAPICH2
• Offloaded computation/communication to peer ranks in the reduction collective operation
• Up to 4X improvement for 4 MB Reduce and up to 1.8X improvement for 4 MB Allreduce

Will be available in an upcoming MVAPICH2-X release


Application-Level Benefits of XPMEM-Based Collectives

[Figure: execution time (s) versus number of processes for CNTK AlexNet training (Broadwell, default batch size, 50 iterations, ppn = 28) and the MiniAMR kernel (Broadwell, ppn = 16), comparing Intel MPI, MVAPICH2, and MVAPICH2-XPMEM.]

• Up to 20% benefit over IMPI for CNTK DNN training using Allreduce
• Up to 27% benefit over IMPI and up to 15% improvement over MVAPICH2 for the MiniAMR application kernel


Agenda

• Exploiting the high concurrency and high bandwidth offered by modern architectures for MPI collectives design
  – Point-to-point
  – Direct shared-memory
  – Data Partitioned Multi-Leader (DPML)
• Designing "zero-copy" and "contention-free" collective communication
  – Contention-aware designs
  – True zero-copy collectives
• Hardware offloading for better communication and computation overlap
  – SHARP-based offloaded collectives
  – CORE-Direct-based non-blocking collectives


Concept of Non-blocking Collectives

• Application processes schedule the collective operation
• Check periodically if the operation is complete
• Overlap of computation and communication => better performance (usage pattern sketched below)
• Catch: who will progress the communication?

[Figure: each application process schedules the operation with a communication support entity, continues with its computation, and periodically checks whether the operation has completed.]
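The usage pattern, sketched below with MPI_Iallreduce, is the same for all non-blocking collectives: post the operation, interleave independent computation with occasional MPI_Test calls to give the library a chance to progress, and call MPI_Wait only when the result is actually needed. sendbuf, recvbuf, count, nsteps, and do_independent_work() are placeholders; a complete MPI_Ialltoall version appears in the backup slides.

MPI_Request req;
MPI_Iallreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM,
               MPI_COMM_WORLD, &req);          /* schedule the operation    */

int done = 0;
for (int step = 0; step < nsteps && !done; step++) {
    do_independent_work(step);                 /* computation that overlaps */
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* check + progress library  */
}
MPI_Wait(&req, MPI_STATUS_IGNORE);             /* recvbuf now valid         */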


Offloading with the Scalable Hierarchical Aggregation Protocol (SHArP)

• Management and execution of MPI operations in the network by using SHArP
• Manipulation of data while it is being transferred in the switch network
• SHArP provides an abstraction to realize the reduction operation
• Defines Aggregation Nodes (AN), an Aggregation Tree, and Aggregation Groups
• AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
• Uses RC for communication between ANs and between an AN and hosts in the Aggregation Tree*

[Figure: physical network topology and the corresponding logical SHArP tree.]

* Bloch et al., Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction


SHArP-based Blocking Allreduce Collective Designs in MVAPICH2

[Figure: execution time (s) of the MiniAMR mesh refinement phase and the average DDOT Allreduce time of HPCG at (4, 28), (8, 28), and (16, 28) (number of nodes, PPN), comparing MVAPICH2 and MVAPICH2-SHArP; up to 13% and 12% improvement, respectively.]

SHArP support is available since MVAPICH2 2.3a

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, Supercomputing '17.


Performance of the NUMA-aware SHArP Design on a XEON + IB Cluster

[Figure: OSU micro-benchmark Allreduce latency (us) versus message size (4 B to 4 KB) and HPCG DDOT timing (s) versus number of processes (56 to 448) on 16 nodes with 28 PPN (lower is better), comparing MVAPICH2, the proposed NUMA-aware design, and MVAPICH2 + SHArP.]

• As the message size decreases, the benefits of the socket-based design increase
• The NUMA-aware design reduces latency by up to 23% for the DDOT phase of HPCG and by up to 40% for micro-benchmarks


SHArP-based Non-Blocking Allreduce in MVAPICH2: MPI_Iallreduce Benchmark

[Figure: MPI_Iallreduce latency (us, lower is better) and communication/computation overlap (%, higher is better) versus message size (4 to 128 bytes) on 8 nodes with 1 PPN (PPN: processes per node), comparing MVAPICH2 and MVAPICH2-SHArP; up to 2.3x benefit.]

• Complete offload of the Allreduce collective operation to the "switch"
  – Higher overlap of communication and computation

Available since MVAPICH2 2.3a


NIC-Offload-based Non-blocking Collectives Using CORE-Direct

• Mellanox CORE-Direct technology allows offloading the collective communication to the network adapter
• MVAPICH2 supports CORE-Direct-based offloading of non-blocking collectives
  – Covers all the non-blocking collectives
  – Enabled by configure and runtime parameters
• A CORE-Direct-based MPI_Ibcast design improves the performance of the High Performance Linpack (HPL) benchmark (see the sketch below)

Available since MVAPICH2-X 2.2a
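The HPL co-design boils down to a look-ahead pattern along the lines of the hedged sketch below; panel_buf, panel_bytes, panel_root, row_comm, and update_trailing_matrix() are placeholders, not the actual HPL source. The panel broadcast is posted as MPI_Ibcast so that, when offloaded to the adapter by CORE-Direct, the trailing-matrix update proceeds while the broadcast completes in hardware.

/* Look-ahead sketch: overlap the panel broadcast with independent compute. */
MPI_Request req;
MPI_Ibcast(panel_buf, panel_bytes, MPI_BYTE, panel_root, row_comm, &req);

update_trailing_matrix();                /* work that does not need the panel      */

MPI_Wait(&req, MPI_STATUS_IGNORE);       /* panel available for next factorization */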


Co-designing HPL with CORE-Direct and Performance Benefits

[Figure: normalized HPL performance versus problem size (N) as a percentage of total memory (10 to 70), and throughput (GFlops) with memory consumption (%) versus system size (64 to 512 processes), comparing HPL-Offload, HPL-1ring, and HPL-Host (higher is better).]

• HPL performance comparison with 512 processes: HPL-Offload consistently offers higher throughput than HPL-1ring and HPL-Host, improving peak throughput by up to 4.5% for large problem sizes
• HPL-Offload surpasses the peak throughput of HPL-1ring with significantly smaller problem sizes and run-times

K. Kandalla, H. Subramoni, J. Vienne, S. Pai Raikar, K. Tomko, S. Sur, and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HOTI 2011


Concluding Remarks

• Many-core nodes will be the foundation blocks of emerging Exascale systems
• Communication mechanisms and runtimes need to be re-designed to take advantage of the high concurrency offered by many-core architectures
• Presented a set of novel designs for collective communication primitives in MPI that address several of these challenges
• Demonstrated the performance benefits of the proposed designs on a variety of multi-/many-core architectures and high-speed networks
• Some of these designs are already available in the MVAPICH2 libraries
• The new designs will be available in upcoming MVAPICH2 releases


Thank You!
hashmi.29@osu.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/


Collective Communication in MVAPICH2

[Figure: blocking and non-blocking collective communication primitives in MPI, built from multi-/many-core-aware designs. Communication-only paths combine intra-node communication (point-to-point over SHMEM, LiMIC, CMA, XPMEM; direct shared memory; direct kernel-assisted via CMA, XPMEM, LiMIC) with inter-node communication (point-to-point, hardware multicast, SHARP, RDMA). Communication + computation paths are designed for performance and overlap.]


Breakdown of a CMA Read Operation

• CMA relies on the get_user_pages() kernel function
• It takes a page-table lock on the target process
• Lock contention increases with the number of concurrent readers
• Over 90% of the total time is spent in lock contention
• One-to-all communication on Broadwell, profiled using ftrace
• Lock contention is the root cause of the performance degradation
  – Present in other kernel-assisted schemes such as KNEM and LiMIC as well

[Figure: latency (us) of CMA reads of 10, 50, and 100 pages with no contention, 14 concurrent readers, and 28 concurrent readers, broken down into system call, permission check, acquire locks, copy data, and pin pages.]


How do I write applications with non-blocking collectives?

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    ...
    MPI_Request req;
    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE, recvbuf, count, MPI_DOUBLE,
                  MPI_COMM_WORLD, &req);       /* sendbuf/recvbuf set up earlier */

    /* Computation that does not depend on the result of the Alltoall */
    int flag;
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);  /* check if complete (non-blocking) */

    /* More computation that does not depend on the result of the Alltoall */
    MPI_Wait(&req, MPI_STATUS_IGNORE);         /* wait until complete (blocking)   */

    MPI_Finalize();
}


XPMEM-based Shared Address Space Collectives

• A set of design slides will be added in this section based on the following paper accepted for IPDPS '18:

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018