Building Efficient HPC Clouds with MVAPICH2 and OpenStack over SR-IOV-enabled Heterogeneous Clusters

Talk at OpenStack Summit 2018
Vancouver, Canada

by

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]   http://www.cse.ohio-state.edu/~panda

Xiaoyi Lu, The Ohio State University
E-mail: [email protected]   http://www.cse.ohio-state.edu/~luxi
HPC Meets Cloud Computing

• Cloud Computing is widely adopted in industry computing environments
• Cloud Computing provides high resource utilization and flexibility
• Virtualization is the key technology that enables Cloud Computing
• An Intersect360 study shows cloud is the fastest growing class of HPC
• HPC meets Cloud: the convergence of Cloud Computing and HPC
Drivers of Modern HPC Cluster and Cloud Architecture

• Multi-core/many-core technologies, accelerators
• Large-memory nodes
• Solid State Drives (SSDs), NVM, parallel filesystems, object storage clusters
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Single Root I/O Virtualization (SR-IOV)

[Figure: high-performance interconnects – InfiniBand (with SR-IOV), <1 usec latency, 200 Gbps bandwidth; multi-/many-core processors; SSDs, object storage clusters; large-memory nodes (up to 2 TB); example systems: SDSC Comet, TACC Stampede]
Single Root I/O Virtualization (SR-IOV)

• Single Root I/O Virtualization (SR-IOV) is providing new opportunities to design HPC clouds with very low overhead
• Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver changes are needed
• Each VF can be dedicated to a single VM through PCI pass-through; the VF then appears inside the guest as an ordinary device (see the verbs sketch after this list)
• Works with 10/40 GigE and InfiniBand
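As an illustration of what PCI pass-through gives the guest, here is a minimal sketch (not from the talk) that enumerates the IB devices visible inside a VM with the standard libibverbs API; an SR-IOV VF shows up like any other HCA, so no guest-side code changes are required. Build with -libverbs.

```c
/* Minimal sketch: inside the VM, the SR-IOV VF is seen as a regular IB device,
 * so the ordinary verbs API can enumerate and query it. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list || num == 0) {
        fprintf(stderr, "No RDMA devices visible in this guest\n");
        return 1;
    }
    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        struct ibv_device_attr attr;
        if (ctx && ibv_query_device(ctx, &attr) == 0)
            printf("%s: %d port(s), max_qp=%d\n",
                   ibv_get_device_name(list[i]),
                   attr.phys_port_cnt, attr.max_qp);
        if (ctx)
            ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}
```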
Background: IVShmem and SR-IOV (paper excerpt)

4. Performance comparisons between IVShmem-backed and native-mode MPI libraries, using HPC applications

The evaluation results indicate that IVShmem can improve point-to-point and collective operations by up to 193% and 91%, respectively. The application execution time can be decreased by up to 96%, compared to SR-IOV. The results further show that IVShmem introduces only small overheads compared with the native environment.

The rest of the paper is organized as follows. Section 2 provides an overview of IVShmem, SR-IOV, and InfiniBand. Section 3 describes our prototype design and evaluation methodology. Section 4 presents the performance analysis results using micro-benchmarks and applications, scalability results, and comparison with native mode. We discuss the related work in Section 5, and conclude in Section 6.

2 Background

Inter-VM Shared Memory (IVShmem) (e.g., Nahanni) [15] provides zero-copy access to data in the shared memory of co-resident VMs on the KVM platform. IVShmem is designed and implemented mainly in the system-call layer, and its interfaces are visible to user-space applications as well. As shown in Figure 2(a), IVShmem contains three components: the guest kernel driver, the modified QEMU supporting the PCI device, and the POSIX shared memory region on the host OS. The shared memory region is allocated by host POSIX operations and mapped into the QEMU process address space. The mapped memory in QEMU can be used by guest applications by being remapped to user space in guest VMs. Evaluation results illustrate that both micro-benchmarks and HPC applications can achieve better performance with IVShmem support.
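The host-side POSIX shared-memory region described above can be illustrated with a short sketch. This is only an example under assumptions: the region name and size are made up for illustration and are not the names Nahanni/QEMU actually use. Link with -lrt on older glibc.

```c
/* Host-side sketch of a POSIX shared-memory region of the kind IVShmem
 * exposes to guests ("/ivshmem_demo" is an illustrative name only). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/ivshmem_demo";     /* appears as /dev/shm/ivshmem_demo */
    const size_t size = 1 << 20;            /* 1 MB region */

    int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    if (fd < 0 || ftruncate(fd, size) != 0) {
        perror("shm_open/ftruncate");
        return 1;
    }
    /* QEMU maps the same region into its address space; a guest driver then
     * remaps it into guest user space, giving co-resident VMs zero-copy access. */
    char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(mem, "hello from the host");     /* visible to every process mapping the region */

    munmap(mem, size);
    close(fd);
    /* shm_unlink(name) is left out so other processes can still attach. */
    return 0;
}
```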
[Fig. 2. Overview of Inter-VM Shmem and SR-IOV communication mechanisms: (a) Inter-VM Shmem mechanism [15] — per-guest QEMU user space, guest kernel PCI device, and mmap regions backed by /dev/shm/<name> on the host, with a shared-memory fd and eventfds; (b) SR-IOV mechanism [22] — guest OSes with VF drivers, a hypervisor with the PF driver, I/O MMU, and SR-IOV hardware exposing Virtual Functions and a Physical Function over PCI Express]
Single Root I/O Virtualization (SR-IOV) is a PCI Express (PCIe) standard which specifies native I/O virtualization capabilities in PCIe adapters. As shown in Figure 2(b), SR-IOV allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs). Each virtual device can be dedicated to a single VM through PCI pass-through, which allows each VM to directly access the corresponding VF. Hence, SR-IOV is a hardware-based approach to I/O virtualization.
Broad Challenges of Building Efficient HPC Clouds

• Virtualization support with virtual machines and containers
  – KVM, Docker, Singularity, etc.
• Communication coordination among optimized communication channels on clouds
  – SR-IOV, IVShmem, IPC-Shm, CMA, etc.
• Locality-aware communication
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable collective communication
  – Offload; non-blocking; topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  – Multiple end-points per node
• NUMA-aware communication for nested virtualization
• Integrated support for GPGPUs and accelerators
• Fault-tolerance/resiliency
  – Migration support with virtual machines
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI+OpenMP, MPI+UPC, MPI+OpenSHMEM, MPI+UPC++, CAF, ...)
• Energy-awareness
• Co-design with resource management and scheduling systems on clouds
  – OpenStack, Slurm, etc.
Approaches to Build HPC Clouds

• MVAPICH2-Virt with SR-IOV and IVSHMEM
  – Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM
  – SLURM alone, SLURM + OpenStack
• Neuroscience Applications on HPC Clouds
• Big Data Libraries on Cloud
  – RDMA-Hadoop, OpenStack Swift
Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,900 organizations in 86 countries
  – More than 469,000 (>0.46 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '17 ranking)
    • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
    • 9th, 556,104-core (Oakforest-PACS) in Japan
    • 12th, 368,928-core (Stampede2) at TACC
    • 17th, 241,108-core (Pleiades) at NASA
    • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

• Redesign MVAPICH2 to make it virtual-machine aware
  – SR-IOV shows near-to-native performance for inter-node point-to-point communication
  – IVSHMEM offers shared-memory-based data access across co-resident VMs
  – Locality Detector: maintains the locality information of co-resident virtual machines
  – Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively (see the sketch after this slide)
• Supports deployment with OpenStack

[Figure: host environment with hypervisor, PF driver, and an InfiniBand adapter exposing a Physical Function and two Virtual Functions; Guest 1 and Guest 2 each run an MPI process in user space with a VF driver and PCI device in kernel space; co-resident guests communicate over the IV-Shmem channel backed by /dev/shm, and inter-node traffic goes over the SR-IOV channel]

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014
J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015
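The Locality Detector / Communication Coordinator idea can be summarized with a small sketch. The function and type names below are hypothetical illustrations, not MVAPICH2 internals; the point is simply that co-resident ranks take the shared-memory path while everything else goes over the SR-IOV VF.

```c
/* Hypothetical sketch of locality-aware channel selection. */
#include <stdio.h>

typedef enum { CHANNEL_IVSHMEM, CHANNEL_SRIOV } channel_t;

/* Toy locality table: host id of the VM that runs each rank
 * (in MVAPICH2-Virt this information is maintained by the Locality Detector). */
static const int host_of_rank[] = { 0, 0, 1, 1 };
static const int my_host_id = 0;

static channel_t select_channel(int dest_rank)
{
    /* Co-resident VM: shared-memory (IVSHMEM) channel, zero-copy. */
    if (host_of_rank[dest_rank] == my_host_id)
        return CHANNEL_IVSHMEM;
    /* Different host: go through the SR-IOV virtual function. */
    return CHANNEL_SRIOV;
}

int main(void)
{
    for (int r = 0; r < 4; r++)
        printf("rank %d -> %s\n", r,
               select_channel(r) == CHANNEL_IVSHMEM ? "IVSHMEM" : "SR-IOV");
    return 0;
}
```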
Application-Level Performance on Chameleon

[Charts: execution time of SPEC MPI2007 applications (milc, leslie3d, pop2, GAPgeofem, zeusmp2, lu) and of Graph500 problem sizes (scale, edge factor) for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]

• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 procs
Execute Live Migration with SR-IOV Device
High Performance SR-IOV enabled VM Migration Support in MVAPICH2

[Figure: migration framework with an external controller, network suspend trigger, read-to-migrate detector, network reactive notifier, and migration-done detector; source and target hosts each run MPI processes in guest VMs with SR-IOV VFs, IVShmem, InfiniBand and Ethernet adapters]

J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017

• Migration with an SR-IOV device has to handle the challenges of detaching/re-attaching the virtualized IB device and the IB connections
• Consists of an SR-IOV enabled IB cluster and an external Migration Controller
• Multiple parallel libraries notify MPI applications during migration (detach/re-attach SR-IOV/IVShmem, migrate VMs, report migration status)
• Handles IB connection suspension and reactivation
• Proposes Progress Engine (PE) and Migration-Thread (MT) based designs to optimize VM migration and MPI application performance (a PE-style sketch follows below)
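A rough, hypothetical sketch of the Progress Engine (PE) based handling described above: every helper name here is a placeholder rather than an MVAPICH2 function, and the real framework coordinates with the external controller and handles pending IB operations in far more detail.

```c
/* Hypothetical PE-style migration handling; all helpers are placeholders. */
#include <stdbool.h>
#include <stdio.h>

static bool check_migration_trigger(void) { return false; }  /* set by the controller */
static void suspend_ib_channel(void)  { puts("suspend IB, detach VF/IVShmem"); }
static bool migration_done(void)      { return true; }        /* "migration done" detector */
static void resume_ib_channel(void)   { puts("re-attach VF/IVShmem, re-establish IB"); }
static void poll_network(void)        { /* normal send/recv progress */ }

/* Each pass of the MPI progress engine also polls for a migration request,
 * so communication is quiesced before the VM (with its SR-IOV VF) moves. */
void progress_engine_iteration(void)
{
    if (check_migration_trigger()) {
        suspend_ib_channel();            /* pending operations are parked */
        while (!migration_done())
            ;                            /* VM is migrated by the controller */
        resume_ib_channel();             /* connections reactivated on the target host */
    }
    poll_network();
}

int main(void) { progress_engine_iteration(); return 0; }
```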
Performance Evaluation of VM Migration Framework

[Charts: breakdown of VM migration time (Set POST_MIGRATION, Add IVSHMEM, Attach VF, Migration, Remove IVSHMEM, Detach VF, Set PRE_MIGRATION) for TCP, IPoIB, and RDMA schemes; total migration time for 2/4/8/16 VMs with the sequential and the proposed migration frameworks]

• Compared with TCP, the RDMA scheme reduces the total migration time by 20%
• Total time is dominated by the `Migration' step; times of the other steps are similar across the different schemes
• The proposed migration framework reduces migration time by up to 51%
Performance Evaluation of VM Migration Framework: Pt2Pt Latency

[Charts: point-to-point latency and Bcast latency (4 VMs * 2 procs/VM) versus message size (1 byte - 1 MB) for PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA]

• A VM is migrated from one machine to another while the benchmark is running inside
• The proposed MT-based designs perform slightly worse than the PE-based designs because of lock/unlock overhead
• No benefit from MT because no computation is involved
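For reference, point-to-point latency of the kind plotted above is usually measured with a simple ping-pong loop (in the style of the OSU micro-benchmarks). A minimal sketch, to be run with two ranks:

```c
/* Minimal ping-pong latency sketch; mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000, size = 4096;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n", size,
               (t1 - t0) * 1e6 / (2.0 * iters));   /* half of the round-trip time */
    free(buf);
    MPI_Finalize();
    return 0;
}
```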
Performance Evaluation of VM Migration Framework: NAS and Graph500

[Charts: execution time of NAS benchmarks (LU.C, EP.C, IS.C, MG.C, CG.C) and Graph500 problem sizes for PE, MT-worst, MT-typical, and NM]

• 8 VMs in total; 1 VM carries out migration while the application is running
• MT-worst and PE incur some overhead compared with NM (no migration)
• MT-typical allows migration to be completely overlapped with computation
Overview of Container-based Virtualization

• Container-based technologies (e.g., Docker) provide lightweight virtualization solutions
• Container-based virtualization: containers share the host kernel

[Figure: comparison of a VM stack and a container stack]
Container-based Design: Issues, Challenges, and Approaches

• What are the performance bottlenecks when running MPI applications with multiple containers per host in an HPC cloud?
• Can we propose a new design to overcome the bottlenecks in such a container-based HPC cloud?
• Can an optimized design deliver near-native performance for different container deployment scenarios?
• Locality-aware design to enable CMA and shared-memory channels for MPI communication across co-resident containers (a CMA sketch follows below)

J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters. ICPP, 2016
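The CMA channel mentioned in the last bullet relies on the Linux process_vm_readv()/process_vm_writev() system calls, which copy data directly between two address spaces on the same kernel. A self-contained sketch follows; for simplicity the process reads from its own address space, whereas across containers the peer process's PID would be used (and the containers must share a PID namespace or have the required ptrace capability).

```c
/* Cross Memory Attach (CMA) sketch: a single syscall moves data between
 * address spaces without an intermediate shared-memory staging copy. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    char src[64] = "payload copied via CMA";
    char dst[64] = {0};

    struct iovec local  = { .iov_base = dst, .iov_len = sizeof(dst) };
    struct iovec remote = { .iov_base = src, .iov_len = sizeof(src) };

    /* "Remote" pid is our own pid here, purely for illustration. */
    ssize_t n = process_vm_readv(getpid(), &local, 1, &remote, 1, 0);
    if (n < 0) { perror("process_vm_readv"); return 1; }

    printf("copied %zd bytes: %s\n", n, dst);
    return 0;
}
```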
Application-Level Performance on Docker with MVAPICH2

[Charts: execution time of NAS Class D benchmarks (MG, FT, EP, LU, CG) and Graph500 BFS execution time (scale 20, edge factor 16; 1 container * 16 procs, 2 containers * 8 procs, 4 containers * 4 procs) for Container-Def, Container-Opt, and Native]

• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% reduction of execution time for NAS and Graph500, respectively
• Compared to Native, less than 9% and 5% overhead for NAS and Graph500, respectively
Singularity Performance on Different Processor Architectures

[Charts: MPI point-to-point bandwidth versus message size on Haswell and KNL for Singularity-Intra-Node, Native-Intra-Node, Singularity-Inter-Node, and Native-Inter-Node]

• MPI point-to-point bandwidth
• On both Haswell and KNL, less than 7% overhead for the Singularity solution
• KNL shows worse intra-node performance than Haswell because of its lower CPU frequency, the complex cluster modes, and the cost of maintaining cache coherence
• On KNL, inter-node outperforms intra-node beyond roughly 256 Kbytes, as the Omni-Path interconnect outperforms shared-memory-based transfers for large message sizes
Singularity Performance on Haswell with InfiniBand

[Charts: Graph500 BFS execution time (scale/edge-factor 22,16 - 26,20) and NAS Class D execution time (CG, EP, FT, IS, LU, MG) for Singularity and Native]

• 512 processes across 32 Haswell nodes
• Singularity delivers near-native performance, with less than 7% overhead on Haswell with InfiniBand

J. Zhang, X. Lu, D. K. Panda. Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? UCC 2017. (Best Student Paper Award)
Nested Virtualization: Containers over Virtual Machines

• Useful for live migration, application sandboxing, legacy system integration, software deployment, etc.
• Performance issues because of the redundant call stacks (two-layer virtualization) and isolated physical resources

[Figure: hardware, host OS/hypervisor, a RedHat VM and an Ubuntu VM, each running a Docker Engine with two containers (app stack + bins/libs)]
Multiple Communication Paths in Nested Virtualization

• Different VM placements introduce multiple communication paths at the container level:
  1. Intra-VM Intra-Container (e.g., across core 4 and core 5)
  2. Intra-VM Inter-Container (e.g., across core 13 and core 14)
  3. Inter-VM Inter-Container (e.g., across core 6 and core 12)
  4. Inter-Node Inter-Container (e.g., across core 15 and a core on a remote node)

[Figure: two-socket node (NUMA 0 and NUMA 1, cores 0-15, memory controllers, QPI) hosting VM 0 and VM 1 with Containers 0-3, annotated with the four paths]
Overview of Proposed Design in MVAPICH2

[Figure: Two-Layer Locality Detector (Container Locality Detector, VM Locality Detector, Nested Locality Combiner) feeding a Two-Layer NUMA-Aware Communication Coordinator, which picks among the CMA channel, the shared memory (SHM) channel, and the network (HCA) channel on the two-socket node shown earlier]

• Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as those in co-resident VMs
• Two-Layer NUMA-Aware Communication Coordinator: leverages nested locality information, NUMA architecture information, and message information to select the appropriate communication channel (an application-level analogy is sketched below)

J. Zhang, X. Lu, D. K. Panda. Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand. VEE, 2017
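The two-layer detector is internal to the MPI library, but the general idea of discovering which ranks share memory can be illustrated at the application level with MPI-3's MPI_Comm_split_type. This is only an analogy, not the MVAPICH2 implementation:

```c
/* Group ranks that can reach each other through shared memory; co-resident
 * containers in one VM end up in the same "shared" communicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, shared_rank, shared_size;
    MPI_Comm shared;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shared);
    MPI_Comm_rank(shared, &shared_rank);
    MPI_Comm_size(shared, &shared_size);

    printf("world rank %d is local rank %d of %d co-resident ranks\n",
           world_rank, shared_rank, shared_size);

    MPI_Comm_free(&shared);
    MPI_Finalize();
    return 0;
}
```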
Inter-VM Inter-Container Pt2Pt (Intra-Socket)

[Charts: latency and bandwidth versus message size (1 byte - 1 MB) for Default, 1Layer, 2Layer, Native (w/o CMA), and Native]

• 1Layer has similar performance to the Default design
• Compared with 1Layer, 2Layer delivers up to 84% and 184% improvement for latency and bandwidth, respectively
Inter-VM Inter-Container Pt2Pt (Inter-Socket)

[Charts: latency and bandwidth versus message size (1 byte - 1 MB) for Default, 1Layer, 2Layer, Basic-Hybrid, Enhanced-Hybrid, Native (w/o CMA), and Native]

• 1Layer has similar performance to the Default design
• 2Layer has near-native performance for small messages, but clear overhead for large messages
• Compared to 2Layer, the hybrid design brings up to 42% and 25% improvement for latency and bandwidth, respectively
Application-level Evaluations

[Charts: Graph500 BFS execution time (scale/edge-factor 22,20 - 28,16) and NAS Class D execution time (IS, MG, EP, FT, CG, LU) for Default, 1Layer, and 2Layer-Enhanced-Hybrid]

• 256 processes across 64 containers on 16 nodes
• Compared with Default, the enhanced-hybrid design reduces execution time by up to 16% (28,16) and 10% (LU) for Graph500 and NAS, respectively
• Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit
SLURM SPANK Plugin based Design

[Figure: mpirun_vm launches an MPI job across VMs; Slurmctld and Slurmd load the SPANK plugin (VM Config Reader, VM Launcher, VM Reclaimer), exchange register/run job-step requests and replies, reclaim the VMs, and release the hosts; a vmhostfile describes the VMs]

• VM Configuration Reader: registers all VM configuration options and sets them in the job control environment so that they are visible to all allocated nodes
• VM Launcher: sets up VMs on each allocated node
  – File-based lock to detect occupied VFs and exclusively allocate a free VF
  – Assigns a unique ID to each IVSHMEM device and dynamically attaches it to each VM
• VM Reclaimer: tears down VMs and reclaims resources

(A skeletal SPANK plugin is sketched below.)
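For readers unfamiliar with SPANK, a plugin skeleton looks roughly as follows. The plugin name and the placeholder comments are illustrative only; the real SLURM-V logic (VF locking, IVSHMEM attachment, VM boot/teardown) is merely hinted at here.

```c
/* Skeletal SPANK plugin in the spirit of the design above. */
#include <slurm/spank.h>

SPANK_PLUGIN(mpi_vm_launcher, 1);   /* plugin name and version */

/* Called on each node when the plugin is loaded for a job step. */
int slurm_spank_init(spank_t sp, int ac, char **av)
{
    slurm_info("mpi_vm_launcher: reading VM configuration options");
    /* VM Configuration Reader: export VM options to the job environment
     * (e.g., via spank_setenv) so all allocated nodes can see them. */
    return ESPANK_SUCCESS;
}

int slurm_spank_user_init(spank_t sp, int ac, char **av)
{
    /* VM Launcher: lock a free VF, attach the IVSHMEM device, boot the VM. */
    return ESPANK_SUCCESS;
}

int slurm_spank_exit(spank_t sp, int ac, char **av)
{
    /* VM Reclaimer: tear down VMs and release VF/IVSHMEM resources. */
    return ESPANK_SUCCESS;
}
```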
SLURM SPANK Plugin with OpenStack based Design

• VM Configuration Reader: registers the VM options
• VM Launcher, VM Reclaimer: offloaded to the underlying OpenStack infrastructure
  – PCI whitelist to pass a free VF through to the VM
  – Nova extended to enable IVSHMEM when launching the VM

[Figure: mpirun_vm and Slurmctld/Slurmd load the SPANK VM Config Reader; the OpenStack daemon handles the requests to launch and reclaim VMs and returns control; hosts are released after the VMs are reclaimed]

J. Zhang, X. Lu, S. Chakraborty, D. K. Panda. SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem. Euro-Par, 2016
Application-Level Performance on Chameleon (Graph500)

[Charts: Graph500 BFS execution time for VM versus Native under three allocation policies: Exclusive Allocations - Sequential Jobs (EASJ), Shared-host Allocations - Concurrent Jobs (SACJ), and Exclusive Allocations - Concurrent Jobs (EACJ)]

• 32 VMs across 8 nodes, 6 cores/VM
• EASJ: compared to Native, less than 4% overhead with 128 procs
• SACJ, EACJ: also minor overhead when running NAS as a concurrent job with 64 procs
NeuroScience Meets HPC Cloud

• Easy and fast discovery is the key!
• Easy-to-use and high-performance technology is the key!

[Figure: The Brain Connectome meets Virtualization and Cloud Computing. Illustration of a set of fascicles (white matter bundles) obtained by using a tractography algorithm. Fascicles are grouped together conforming white matter tracts (shown with different colors here) connecting different cortical areas of the human brain. LiFE [1] (Linear Fascicle Evaluation) is an approach to predict diffusion measurements in brain connectomes.]

[1] https://github.com/francopestilli/life
MPI-based LiFE for Brain Health: Initial Design using the MVAPICH2 MPI Library

• Identified the computationally intensive tasks as the matrix-by-vector products w = M^T y and y = M w
• The computationally intensive functions have been parallelized using MPI by dividing the task among multiple MPI processes (see the sketch below)
• The implementation uses MVAPICH2 [2] from the OSU team

[1] https://github.com/francopestilli/life   [2] http://mvapich.cse.ohio-state.edu/

[Figure: computation of w = M^T y and of y = M w using 2 MPI processes]
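A hedged sketch of the parallelization strategy described above: each rank owns a block of rows of M, so y = M w needs only the local rows, while w = M^T y is formed from per-rank partial sums combined with MPI_Allreduce (the Reduce/Allreduce step visible in the time breakup on the next slide). Dimensions and the dense layout are illustrative only; LiFE's M is sparse and far larger.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int rows_loc = 100, cols = 50;      /* each rank owns 100 rows of M */
    double *Mloc  = malloc(rows_loc * cols * sizeof(double));
    double *w     = malloc(cols * sizeof(double));
    double *y_loc = malloc(rows_loc * sizeof(double));
    double *w_out = malloc(cols * sizeof(double));
    double *w_par = calloc(cols, sizeof(double));

    for (int i = 0; i < rows_loc * cols; i++) Mloc[i] = 1.0;
    for (int j = 0; j < cols; j++) w[j] = 1.0;

    /* y = M w: each rank computes only its local block of rows. */
    for (int i = 0; i < rows_loc; i++) {
        y_loc[i] = 0.0;
        for (int j = 0; j < cols; j++)
            y_loc[i] += Mloc[i * cols + j] * w[j];
    }

    /* w = M^T y: local partial products, then a global sum across ranks. */
    for (int i = 0; i < rows_loc; i++)
        for (int j = 0; j < cols; j++)
            w_par[j] += Mloc[i * cols + j] * y_loc[i];
    MPI_Allreduce(w_par, w_out, cols, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("w_out[0] = %g (computed by %d ranks)\n", w_out[0], nprocs);

    free(Mloc); free(w); free(y_loc); free(w_out); free(w_par);
    MPI_Finalize();
    return 0;
}
```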
Design and Evaluation with MVAPICH2: Single Node with MPI on Intel Knights Landing (KNL)

• Evaluation on TACC Stampede2 KNL nodes (Intel Xeon Phi KNL CPUs, 68 cores, 96 GB memory per node)
• Up to 8.7x speedup

[Charts: speedup on a single Stampede2 node versus number of MPI processes (1-64), and time breakup on Stampede2 (NNLS, Load, MM, Reduce, Gather, Bcast)]

MPI-LiFE software is available from http://neurohpc.cse.ohio-state.edu
Docker-containerized version; can run from laptops to clusters
Design and Evaluation of the LiFE Code with MVAPICH2-Virt + Docker

• Evaluation on Chameleon with Docker (Intel Haswell CPUs, 24 cores, 128 GB memory per node)
• Up to 5.5x speedup on Chameleon

[Chart: speedup on a single node with Docker on Chameleon for Native, 1 Container, and 2 Containers]
The High-Performance Big Data (HiBD) Project

• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users base: 285 organizations from 34 countries
• More than 26,400 downloads from the project site

Available for InfiniBand and RoCE; also runs on Ethernet
Available for x86 and OpenPOWER
Support for Singularity and Docker
Overview of RDMA-Hadoop-Virt Architecture

• Virtualization-aware modules in all four main Hadoop components:
  – HDFS: Virtualization-aware Block Management to improve fault-tolerance
  – YARN: extensions to the Container Allocation Policy to reduce network traffic
  – MapReduce: extensions to the Map Task Scheduling Policy to reduce network traffic
  – Hadoop Common: Topology Detection Module for automatic topology detection
• Communications in HDFS, MapReduce, and RPC go through RDMA-based designs over SR-IOV enabled InfiniBand

[Figure: Big Data applications (CloudBurst, MR-MS Polygraph, others) on HBase/MapReduce/YARN/HDFS over Hadoop Common, running on virtual machines, bare-metal nodes, and containers, with the Topology Detection Module and the policy extensions highlighted]

S. Gugnani, X. Lu, D. K. Panda. Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds. CloudCom, 2016.
Evaluation with Applications

[Charts: execution time of CloudBurst and Self-Join in Default Mode and Distributed Mode for RDMA-Hadoop versus RDMA-Hadoop-Virt]

• 14% and 24% improvement in Default Mode for CloudBurst and Self-Join, respectively
• 30% and 55% improvement in Distributed Mode for CloudBurst and Self-Join, respectively
OpenStack Swift Overview

• Distributed cloud-based object storage service
• Deployed as part of an OpenStack installation
• Can also be deployed as a standalone storage solution
• Worldwide data access via the Internet
  – HTTP-based (a minimal client GET sketch follows after this slide)
• Architecture
  – Multiple Object Servers: store the data
  – A few Proxy Servers: act as a proxy for all requests
  – Ring: handles metadata
• Usage
  – Input/output source for Big Data applications (most common use case)
  – Software/data backup
  – Storage of VM/Docker images
• Based on traditional TCP sockets communication

[Figure: Swift architecture — a client sends a PUT or GET request (PUT/GET /v1/<account>/<container>/<object>) to a Proxy Server, which consults the Ring and forwards it to Object Servers, each managing local disks]
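To make the request path concrete, here is a minimal client-side sketch of a Swift GET over HTTP using libcurl; the endpoint, account/container/object names, and the token value are placeholders (a real client first obtains the token from Keystone). Build with -lcurl.

```c
/* Minimal Swift GET sketch following the /v1/<account>/<container>/<object> path. */
#include <curl/curl.h>
#include <stdio.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    /* Proxy server endpoint + object path (placeholder values). */
    curl_easy_setopt(curl, CURLOPT_URL,
                     "http://proxy.example.org:8080/v1/AUTH_demo/mycontainer/myobject");

    /* Swift authenticates each request with an X-Auth-Token header. */
    struct curl_slist *hdrs = curl_slist_append(NULL, "X-Auth-Token: <token>");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);

    CURLcode rc = curl_easy_perform(curl);   /* response body goes to stdout by default */
    if (rc != CURLE_OK)
        fprintf(stderr, "GET failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```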
Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds

• Challenges
  – The proxy server is a bottleneck for large-scale deployments
  – Object upload/download operations are network intensive
  – Can an RDMA-based approach provide benefits?
• Design
  – Re-designed Swift architecture for improved scalability and performance; two proposed designs:
    • Client-Oblivious Design (D1): no changes required on the client side
    • Metadata Server-based Design (D2): direct communication between clients and object servers, bypassing the proxy server
  – RDMA-based communication framework for accelerating networking performance
  – High-performance I/O framework to provide maximum overlap between communication and I/O

S. Gugnani, X. Lu, and D. K. Panda. Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud. Accepted at CCGrid '17, May 2017
Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds

[Charts: time breakup of GET and PUT (communication, I/O, hashsum, other) for Swift, Swift-X (D1), and Swift-X (D2); GET latency versus object size (1 MB - 4 GB) for Swift, Swift-X (D2), and Swift-X (D1), showing up to a 66% reduction]

• Up to 66% reduction in GET latency
• Communication time reduced by up to 3.8x for PUT and up to 2.8x for GET
Available Appliances on Chameleon Cloud*

Appliance: CentOS 7 KVM SR-IOV
Description: Chameleon bare-metal image customized with the KVM hypervisor and a recompiled kernel to enable SR-IOV over InfiniBand. https://www.chameleoncloud.org/appliances/3/

Appliance: MPI bare-metal cluster complex appliance (based on Heat)
Description: Deploys an MPI cluster composed of bare-metal instances using the MVAPICH2 library over InfiniBand. https://www.chameleoncloud.org/appliances/29/

Appliance: MPI + SR-IOV KVM cluster (based on Heat)
Description: Deploys an MPI cluster of KVM virtual machines using the MVAPICH2-Virt implementation, configured with SR-IOV for high-performance communication over InfiniBand. https://www.chameleoncloud.org/appliances/28/

Appliance: CentOS 7 SR-IOV RDMA-Hadoop
Description: Built from the CentOS 7 appliance and additionally contains the RDMA-Hadoop library with SR-IOV. https://www.chameleoncloud.org/appliances/17/

• Through these available appliances, users and researchers can easily deploy HPC clouds to perform experiments and run jobs with
  – High-performance SR-IOV + InfiniBand
  – High-performance MVAPICH2 library over bare-metal InfiniBand clusters
  – High-performance MVAPICH2 library with virtualization support over SR-IOV enabled KVM clusters
  – High-performance Hadoop with RDMA-based enhancements support

[*] Only includes appliances contributed by the OSU NowLab
Conclusions

• MVAPICH2-Virt over SR-IOV-enabled InfiniBand is an efficient approach to build HPC clouds
  – Standalone, OpenStack, Slurm, and Slurm + OpenStack
  – Supports virtual machine migration with SR-IOV InfiniBand devices
  – Supports virtual machines, containers (Docker and Singularity), and nested virtualization
• Very little overhead with virtualization; near-native performance at the application level
• Much better performance than Amazon EC2
• MVAPICH2-Virt is available for building HPC clouds (http://mvapich.cse.ohio-state.edu)
  – SR-IOV, IVSHMEM, Docker support, OpenStack
• Neuroscience applications can benefit from technologies on HPC clouds
• Big Data analytics stacks such as RDMA-Hadoop can benefit from cloud-aware designs
• Appliances for MVAPICH2-Virt and RDMA-Hadoop are available for building HPC clouds
• SR-IOV/container support and appliances for other MVAPICH2 libraries (MVAPICH2-X, MVAPICH2-GDR, ...) and RDMA-Spark/Memcached
The 2nd International BoF on
Building Efficient Clouds for HPC, Big Data, and Deep Learning Middleware and Applications (HPC Cloud BoF)

HPC Cloud BoF 2018 will be held with ISC '18, Frankfurt, Germany, June 27, 2018 (Wednesday), 1:45-2:45 pm

HPC Cloud BoF 2017 was held in conjunction with SC '17
http://sc17.supercomputing.org/presentation/?id=bof165&sess=sess357
Thank You!

{panda,luxi}@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/

MVAPICH/MVAPICH2
http://mvapich.cse.ohio-state.edu/

http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~luxi