Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds...

48
Building Efficient HPC Clouds with MVAPICH2 and OpenStack over SR-IOV-enabled Heterogeneous Clusters Talk at OpenStack Summit 2018 Vancouver, Canada by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: [email protected] http://www.cse.ohio-state.edu/~panda Xiaoyi Lu The Ohio State University E-mail: [email protected] http://www.cse.ohio-state.edu/~luxi

Transcript of Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds...

Page 1: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

BuildingEfficientHPCCloudswithMVAPICH2andOpenStackoverSR-IOV-enabledHeterogeneousClusters

TalkatOpenStackSummit2018

Vancouver,Canada

by

DhabaleswarK.(DK)PandaTheOhioStateUniversity

E-mail:[email protected]://www.cse.ohio-state.edu/~panda

XiaoyiLuTheOhioStateUniversity

E-mail:[email protected]://www.cse.ohio-state.edu/~luxi

Page 2: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 2NetworkBasedComputingLaboratory

•  CloudComputingwidelyadoptedinindustrycomputingenvironment

•  CloudComputingprovideshighresourceutilizationandflexibility

•  VirtualizationisthekeytechnologytoenableCloudComputing

•  Intersect360studyshowscloudisthefastestgrowingclassofHPC

•  HPCMeetsCloud:TheconvergenceofCloudComputingandHPC

HPCMeetsCloudComputing

Page 3: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 3NetworkBasedComputingLaboratory

DriversofModernHPCClusterandCloudArchitecture

•  Multi-core/many-coretechnologies,Accelerators

•  Largememorynodes

•  SolidStateDrives(SSDs),NVM,ParallelFilesystems,ObjectStorageClusters

•  RemoteDirectMemoryAccess(RDMA)-enablednetworking(InfiniBandandRoCE)

•  SingleRootI/OVirtualization(SR-IOV)

HighPerformanceInterconnects–InfiniBand(withSR-IOV)

<1useclatency,200GbpsBandwidth>Multi-/Many-core

Processors

SSDs,ObjectStorageClusters

Largememorynodes(Upto2TB)

Cloud Cloud SDSCComet TACCStampede

Page 4: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 4NetworkBasedComputingLaboratory

•  SingleRootI/OVirtualization(SR-IOV)isprovidingnewopportunitiestodesignHPCcloudwithverylittlelowoverhead

SingleRootI/OVirtualization(SR-IOV)

•  Allowsasinglephysicaldevice,oraPhysicalFunction(PF),topresentitselfasmultiplevirtualdevices,orVirtualFunctions(VFs)

•  VFsaredesignedbasedontheexistingnon-virtualizedPFs,noneedfordriverchange

•  EachVFcanbededicatedtoasingleVMthroughPCIpass-through

•  Workwith10/40GigEandInfiniBand

4. Performance comparisons between IVShmem backed and native mode MPI li-braries, using HPC applications

The evaluation results indicate that IVShmem can improve point to point and collectiveoperations by up to 193% and 91%, respectively. The application execution time can bedecreased by up to 96%, compared to SR-IOV. The results further show that IVShmemjust brings small overheads, compared with native environment.

The rest of the paper is organized as follows. Section 2 provides an overview ofIVShmem, SR-IOV, and InfiniBand. Section 3 describes our prototype design and eval-uation methodology. Section 4 presents the performance analysis results using micro-benchmarks and applications, scalability results, and comparison with native mode. Wediscuss the related work in Section 5, and conclude in Section 6.

2 BackgroundInter-VM Shared Memory (IVShmem) (e.g. Nahanni) [15] provides zero-copy accessto data on shared memory of co-resident VMs on KVM platform. IVShmem is designedand implemented mainly in system calls layer and its interfaces are visible to user spaceapplications as well. As shown in Figure 2(a), IVShmem contains three components:the guest kernel driver, the modified QEMU supporting PCI device, and the POSIXshared memory region on the host OS. The shared memory region is allocated by hostPOSIX operations and mapped to QEMU process address space. The mapped memoryin QEMU can be used by guest applications by being remapped to user space in guestVMs. Evaluation results illustrate that both micro-benchmarks and HPC applicationscan achieve better performance with IVShmem support.

Qemu Userspace

Guest 1

Userspace

kernelPCI Device

mmap region

Qemu Userspace

Guest 2

Userspace

kernel

mmap region

Qemu Userspace

Guest 3

Userspace

kernelPCI Device

mmap region

/dev/shm/<name>

PCI Device

Host

mmap mmap mmap

shared mem fd

eventfds

(a) Inter-VM Shmem Mechanism [15]

Guest 1Guest OS

VF Driver

Guest 2Guest OS

VF Driver

Guest 3Guest OS

VF Driver

Hypervisor PF Driver

I/O MMU

SR-IOV Hardware

Virtual Function

Virtual Function

Virtual Function

Physical Function

PCI Express

(b) SR-IOV Mechanism [22]

Fig. 2. Overview of Inter-VM Shmem and SR-IOV Communication Mechanisms

Single Root I/O Virtualization (SR-IOV) is a PCI Express (PCIe) standard whichspecifies the native I/O virtualization capabilities in PCIe adapters. As shown in Fig-ure 2(b), SR-IOV allows a single physical device, or a Physical Function (PF), to presentitself as multiple virtual devices, or Virtual Functions (VFs). Each virtual device can bededicated to a single VM through the PCI pass-through, which allows each VM to di-rectly access the corresponding VF. Hence, SR-IOV is a hardware-based approach to

Page 5: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 5NetworkBasedComputingLaboratory

•  VirtualizationSupportwithVirtualMachinesandContainers–  KVM,Docker,Singularity,etc.

•  CommunicationcoordinationamongoptimizedcommunicationchannelsonClouds–  SR-IOV,IVShmem,IPC-Shm,CMA,etc.

•  Locality-awarecommunication•  Scalabilityformilliontobillionprocessors

–  Supportforhighly-efficientinter-nodeandintra-nodecommunication(bothtwo-sidedandone-sided)

•  ScalableCollectivecommunication–  Offload;Non-blocking;Topology-aware

•  Balancingintra-nodeandinter-nodecommunicationfornextgenerationnodes(128-1024cores)–  Multipleend-pointspernode

•  NUMA-awarecommunicationfornestedvirtualization•  IntegratedSupportforGPGPUsandAccelerators•  Fault-tolerance/resiliency

–  Migrationsupportwithvirtualmachines•  QoSsupportforcommunicationandI/O•  SupportforHybridMPI+PGASprogramming(MPI+OpenMP,MPI+UPC,MPI+OpenSHMEM,MPI+UPC++,CAF,…)•  Energy-Awareness•  Co-designwithresourcemanagementandschedulingsystemsonClouds

–  OpenStack,Slurm,etc.

BroadChallengesofBuildingEfficientHPCClouds

Page 6: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 6NetworkBasedComputingLaboratory

•  MVAPICH2-VirtwithSR-IOVandIVSHMEM–  Standalone,OpenStack

•  SR-IOV-enabledVMMigrationSupportinMVAPICH2

•  MVAPICH2withContainers(DockerandSingularity)

•  MVAPICH2withNestedVirtualization(ContaineroverVM)

•  MVAPICH2-VirtonSLURM–  SLURMalone,SLURM+OpenStack

•  NeuroscienceApplicationsonHPCClouds

•  BigDataLibrariesonCloud–  RDMA-Hadoop,OpenStackSwift

ApproachestoBuildHPCClouds

Page 7: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 7NetworkBasedComputingLaboratory

OverviewoftheMVAPICH2Project•  HighPerformanceopen-sourceMPILibraryforInfiniBand,Omni-Path,Ethernet/iWARP,andRDMAoverConvergedEthernet(RoCE)

–  MVAPICH(MPI-1),MVAPICH2(MPI-2.2andMPI-3.1),Startedin2001,Firstversionavailablein2002

–  MVAPICH2-X(MPI+PGAS),Availablesince2011

–  SupportforGPGPUs(MVAPICH2-GDR)andMIC(MVAPICH2-MIC),Availablesince2014

–  SupportforVirtualization(MVAPICH2-Virt),Availablesince2015–  SupportforEnergy-Awareness(MVAPICH2-EA),Availablesince2015

–  SupportforInfiniBandNetworkAnalysisandMonitoring(OSUINAM)since2015

–  Usedbymorethan2,900organizationsin86countries

–  Morethan469,000(>0.46million)downloadsfromtheOSUsitedirectly

–  EmpoweringmanyTOP500clusters(Nov‘17ranking)•  1st,10,649,600-core(SunwayTaihuLight)atNationalSupercomputingCenterinWuxi,China

•  9th,556,104cores(Oakforest-PACS)inJapan

•  12th,368,928-core(Stampede2)atTACC

•  17th,241,108-core(Pleiades)atNASA

•  48th,76,032-core(Tsubame2.5)atTokyoInstituteofTechnology

–  AvailablewithsoftwarestacksofmanyvendorsandLinuxDistros(RedHatandSuSE)

–  http://mvapich.cse.ohio-state.edu

•  EmpoweringTop500systemsforoveradecade

Page 8: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 8NetworkBasedComputingLaboratory

•  RedesignMVAPICH2tomakeitvirtualmachineaware–  SR-IOVshowsneartonative

performanceforinter-nodepointtopointcommunication

–  IVSHMEMofferssharedmemorybaseddataaccessacrossco-residentVMs

–  LocalityDetector:maintainsthelocalityinformationofco-residentvirtualmachines

–  CommunicationCoordinator:selectsthecommunicationchannel(SR-IOV,IVSHMEM)adaptively

•  SupportdeploymentwithOpenStack

OverviewofMVAPICH2-VirtwithSR-IOVandIVSHMEM

Host Environment

Guest 1

Hypervisor PF Driver

Infiniband Adapter

Physical Function

user space

kernel space

MPI proc

PCI Device

VF Driver

Guest 2user space

kernel space

MPI proc

PCI Device

VF Driver

Virtual Function

Virtual Function

/dev/shm/

IV-SHM

IV-Shmem Channel

SR-IOV Channel

J.Zhang,X.Lu,J.Jose,R.Shi,D.K.Panda.CanInter-VMShmemBenefitMPIApplicationsonSR-IOVbasedVirtualizedInfiniBandClusters?Euro-Par,2014

J.Zhang,X.Lu,J.Jose,R.Shi,M.Li,D.K.Panda.HighPerformanceMPILibraryoverSR-IOVEnabledInfiniBandClusters.HiPC,2014

J.Zhang,X.Lu,M.Arnold,D.K.Panda.MVAPICH2overOpenStackwithSR-IOV:AnEfficientApproachtoBuildHPCClouds.CCGrid,2015

Page 9: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 9NetworkBasedComputingLaboratory

0

50

100

150

200

250

300

350

400

milc leslie3d pop2 GAPgeofem zeusmp2 lu

ExecutionTime(s)

MV2-SR-IOV-Def

MV2-SR-IOV-Opt

MV2-Native

1% 9.5%

0

1000

2000

3000

4000

5000

6000

22,20 24,10 24,16 24,20 26,10 26,16

ExecutionTime(m

s)

ProblemSize(Scale,Edgefactor)

MV2-SR-IOV-Def

MV2-SR-IOV-Opt

MV2-Native 2%

•  32VMs,6Core/VM

•  ComparedtoNative,2-5%overheadforGraph500with128Procs

•  ComparedtoNative,1-9.5%overheadforSPECMPI2007with128Procs

Application-LevelPerformanceonChameleon

SPECMPI2007 Graph500

5%

Page 10: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 10NetworkBasedComputingLaboratory

•  MVAPICH2-VirtwithSR-IOVandIVSHMEM–  Standalone,OpenStack

•  SR-IOV-enabledVMMigrationSupportinMVAPICH2

•  MVAPICH2withContainers(DockerandSingularity)

•  MVAPICH2withNestedVirtualization(ContaineroverVM)

•  MVAPICH2-VirtonSLURM–  SLURMalone,SLURM+OpenStack

•  NeuroscienceApplicationsonHPCClouds

•  BigDataLibrariesonCloud–  RDMA-Hadoop,OpenStackSwift

ApproachestoBuildHPCClouds

Page 11: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 11NetworkBasedComputingLaboratory

ExecuteLiveMigrationwithSR-IOVDevice

Page 12: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 12NetworkBasedComputingLaboratory

HighPerformanceSR-IOVenabledVMMigrationSupportinMVAPICH2

MPI

Host

Guest VM1

Hypervisor

IBAdapterIVShmem

Network Suspend Trigger

Read-to-Migrate Detector

Network ReactiveNotifier

ŏ

Controller

MPIGuest VM2

MPI

Host

Guest VM1

VF /SR-IOV

Hypervisor

IBAdapter IVShmemEthernet

Adapter

MPIGuest VM2

EthernetAdapter

Migration Done

Detector

VF /SR-IOV

VF /SR-IOV

VF /SR-IOV

Migration Trigger

J.Zhang,X.Lu,D.K.Panda.High-PerformanceVirtualMachineMigrationFrameworkforMPIApplicationsonSR-IOVenabledInfiniBandClusters.IPDPS,2017

•  MigrationwithSR-IOVdevicehastohandlethechallengesofdetachment/re-attachmentofvirtualizedIBdeviceandIBconnection

•  ConsistofSR-IOVenabledIBClusterandExternalMigrationController

•  MultipleparallellibrariestonotifyMPIapplicationsduringmigration(detach/reattachSR-IOV/IVShmem,migrateVMs,migrationstatus)

•  HandletheIBconnectionsuspendingandreactivating

•  ProposeProgressengine(PE)andmigrationthreadbased(MT)designtooptimizeVMmigrationandMPIapplicationperformance

Page 13: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 13NetworkBasedComputingLaboratory

•  ComparedwiththeTCP,theRDMAschemereducesthetotalmigrationtimeby20%

•  Totaltimeisdominatedby`Migration’time;Timesonotherstepsaresimilaracrossdifferentschemes

•  Proposedmigrationframeworkcouldreduceupto51%migrationtime

PerformanceEvaluationofVMMigrationFramework

0

0.5

1

1.5

2

2.5

3

TCP IPoIB RDMA

Times(s)

SetPOST_MIGRATION AddIVSHMEMAttachVF MigrationRemoveIVSHMEM DetachVFSetPRE_MIGRATION

BreakdownofVMmigration

0

10

20

30

40

2VM 4VM 8VM 16VM

Time(s)

SequentialMigrationFramework

ProposedMigrationFramework

MultipleVMMigrationTime

Page 14: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 14NetworkBasedComputingLaboratory

Bcast(4VMs*2Procs/VM)

•  MigrateaVMfromonemachinetoanotherwhilebenchmarkisrunninginside

•  ProposedMT-baseddesignsperformslightlyworsethanPE-baseddesignsbecauseoflock/unlock

•  NobenefitfromMTbecauseofNOcomputationinvolved

PerformanceEvaluationofVMMigrationFrameworkPt2PtLatency

0

50

100

150

200

1 4 16 64 256 1K 4K 16K 64K 256K 1M

Latency(us)

MessageSize(bytes)

PE-IPoIB PE-RDMA MT-IPoIB MT-RDMA

0

100

200

300

400

500

600

700

1 4 16 64 256 1K 4K 16K 64K 256K 1M

2Laten

cy(u

s)

MessageSize(bytes)

PE-IPoIB

PE-RDMA

MT-IPoIB

MT-RDMA

Page 15: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 15NetworkBasedComputingLaboratory

Graph500

•  8VMsintotaland1VMcarriesoutmigrationduringapplicationrunning

•  ComparedwithNM,MT-worstandPEincursomeoverheadcomparedwithNM

•  MT-typicalallowsmigrationtobecompletelyoverlappedwithcomputation

PerformanceEvaluationofVMMigrationFrameworkNAS

0

20

40

60

80

100

120

LU.C EP.C IS.C MG.C CG.C

ExecutionTime(s)

PE

MT-worst

MT-typical

NM

0.0

0.1

0.1

0.2

0.2

0.3

0.3

0.4

0.4

20,10 20,16 20,20 22,10

ExecutionTime(s)

PE

MT-worst

MT-typical

NM

Page 16: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 16NetworkBasedComputingLaboratory

•  MVAPICH2-VirtwithSR-IOVandIVSHMEM–  Standalone,OpenStack

•  SR-IOV-enabledVMMigrationSupportinMVAPICH2

•  MVAPICH2withContainers(DockerandSingularity)

•  MVAPICH2withNestedVirtualization(ContaineroverVM)

•  MVAPICH2-VirtonSLURM–  SLURMalone,SLURM+OpenStack

•  NeuroscienceApplicationsonHPCClouds

•  BigDataLibrariesonCloud–  RDMA-Hadoop,OpenStackSwift

ApproachestoBuildHPCClouds

Page 17: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 17NetworkBasedComputingLaboratory

•  Container-basedtechnologies(e.g.,Docker)providelightweightvirtualizationsolutions

•  Container-basedvirtualization–sharehostkernelbycontainers

OverviewofContainers-basedVirtualization

VM1Container1

Page 18: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 18NetworkBasedComputingLaboratory

•  WhataretheperformancebottleneckswhenrunningMPIapplicationsonmultiplecontainersperhostinHPCcloud?

•  Canweproposeanewdesigntoovercomethebottleneckonsuchcontainer-basedHPCcloud?

•  Canoptimizeddesigndelivernear-nativeperformancefordifferentcontainerdeploymentscenarios?

•  Locality-awarebaseddesigntoenableCMAandSharedmemorychannelsforMPIcommunicationacrossco-residentcontainers

Containers-basedDesign:Issues,Challenges,andApproaches

J.Zhang,X.Lu,D.K.Panda.HighPerformanceMPILibraryforContainer-basedHPCCloudonInfiniBandClusters.ICPP,2016

Page 19: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 19NetworkBasedComputingLaboratory

0

10

20

30

40

50

60

70

80

90

100

MG.D FT.D EP.D LU.D CG.D

ExecutionTime(s)

Container-Def

Container-Opt

Native

•  64Containersacross16nodes,pining4CoresperContainer

•  ComparedtoContainer-Def,upto11%and73%ofexecutiontimereductionforNASandGraph500

•  ComparedtoNative,lessthan9%and5%overheadforNASandGraph500

Application-LevelPerformanceonDockerwithMVAPICH2

Graph500 NAS

11%

0

50

100

150

200

250

300

1Cont*16P 2Conts*8P 4Conts*4P

BFSExecutionTime(m

s)

Scale,Edgefactor(20,16)

Container-Def

Container-Opt

Native

73%

Page 20: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 20NetworkBasedComputingLaboratory

0100020003000400050006000700080009000

1K 4K 16K 64K 256K 1M

Band

width

(MB/

s)

Mesage Size ( bytes)

Singularity-Intra-Node

Native-Intra-Node

Singularity-Inter-Node

Native-Inter-Node

SingularityPerformanceonDifferentProcessorArchitectures

•  MPIpoint-to-pointBandwidth

•  OnbothHaswellandKNL,lessthan7%overheadforSingularitysolution

•  Worseintra-nodeperformancethanHaswellbecauselowCPUfrequency,complexclustermode,andcostmaintainingcachecoherence

•  KNL-Inter-nodeperformsbetterthanintra-nodecaseafteraround256Kbytes,asOmni-Pathinterconnectoutperformssharedmemory-basedtransferforlargemessagesize

BWonHaswell BWonKNL

7%

02000400060008000

10000120001400016000

1K 4K 16K 64K 256K 1M

Band

width

(MB/

s)

Mesage Size ( bytes)

Singularity-Intra-NodeNative-Intra-NodeSingularity-Inter-NodeNative-Inter-Node

Page 21: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 21NetworkBasedComputingLaboratory

0

500

1000

1500

2000

2500

3000

22,16 22,20 24,16 24,20 26,16 26,20

Exec

utio

n Tim

e (m

s) Singularity

Native

SingularityPerformanceonHaswellwithInfiniBand

0

50

100

150

200

250

300

CG EP FT IS LU MG

Exec

utio

n Tim

e (s) Singularity

Native

ClassDNAS Graph500

•  512processorsacross32Haswellnodes•  Singularitydeliversnear-nativeperformance,lessthan7%overheadonHaswell

withInfiniBand

7%

J.Zhang,X.Lu,D.K.Panda.IsSingularity-basedContainerTechnologyReadyforRunningMPIApplicationsonHPCClouds?UCC2017.(BestStudentPaperAward)

Page 22: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 22NetworkBasedComputingLaboratory

•  MVAPICH2-VirtwithSR-IOVandIVSHMEM–  Standalone,OpenStack

•  SR-IOV-enabledVMMigrationSupportinMVAPICH2

•  MVAPICH2withContainers(DockerandSingularity)

•  MVAPICH2withNestedVirtualization(ContaineroverVM)

•  MVAPICH2-VirtonSLURM–  SLURMalone,SLURM+OpenStack

•  NeuroscienceApplicationsonHPCClouds

•  BigDataLibrariesonCloud–  RDMA-Hadoop,OpenStackSwift

ApproachestoBuildHPCClouds

Page 23: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 23NetworkBasedComputingLaboratory

NestedVirtualization:ContainersoverVirtualMachines

•  Usefulforlivemigration,sandboxapplication,legacysystemintegration,softwaredeployment,etc.

•  Performanceissuesbecauseoftheredundantcallstacks(two-layervirtualization)andisolatedphysicalresources

Hardware

Host OSHypervisor

Redhat

VM1

Docker Engine

Container2

bins/libs

AppStack

Container1

bins/libs

AppStack

Ubuntu

VM2

Docker Engine

Container4

bins/libs

AppStack

Container3

bins/libs

AppStack

Page 24: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 24NetworkBasedComputingLaboratory

MultipleCommunicationPathsinNestedVirtualization

1.  Intra-VMIntra-Container(acrosscore4andcore5)

2.  Intra-VMInter-Container(acrosscore13andcore14)

3.  Inter-VMInter-Container(acrosscore6andcore12)

4.  Inter-NodeInter-Container(acrosscore15andthecoreonremotenode)

QPIMemo

ryCo

ntroll

er

core 4 core 5 core 6 core 7

NUMA 0

VM 0Container 0

Memo

ryCo

ntroll

ercore 0 core 1 core 2 core 3 core 8 core 9 core 10 core 11

core 12 core 13 core 14 core 15

NUMA 1

VM 1Container 2Container 1 Container 3

1 23 ...4

•  DifferentVMplacementsintroducemultiplecommunicationpathsoncontainerlevel

Page 25: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 25NetworkBasedComputingLaboratory

OverviewofProposedDesigninMVAPICH2

QPI

Mem

ory

Cont

rolle

r

core 4 core 5 core 6 core 7

NUMA 0

VM 0Container 0

Mem

ory

Cont

rolle

rcore 0 core 1 core 2 core 3 core 8 core 9 core 10 core 11

core 12 core 13 core 14 core 15

VM 1Container 2Container 1 Container 3

1 23 4

Two-Layer NUMA Aware Communication Coordinator

NUMA 1

Container Locality Detector VM Locality Detector

Nested Locality Combiner

Two-Layer Locality Detector

CMAChannel

SHared Memory (SHM) Channel

Network (HCA)Channel

Two-LayerLocalityDetector:DynamicallydetectingMPIprocessesintheco-residentcontainersinsideoneVMaswellastheonesintheco-residentVMs

Two-LayerNUMAAwareCommunicationCoordinator:Leveragenestedlocalityinfo,NUMAarchitectureinfoandmessagetoselectappropriatecommunicationchannel

J.Zhang,X.Lu,D.K.Panda.DesigningLocalityandNUMAAwareMPIRuntimeforNestedVirtualizationbasedHPCCloudwithSR-IOVEnabledInfiniBand,VEE,2017

Page 26: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 26NetworkBasedComputingLaboratory

Inter-VMInter-ContainerPt2Pt(Intra-Socket)

0

50

100

150

200

250

1 4 16 64 256 1K 4K 16K 64K 256K 1M

Late

ncy

(us)

Message Size (bytes)

Default1Layer2Layer

Native(w/o CMA)Native

0 0.5

1 1.5

2 2.5

3 3.5

4

1 4 16 64 256 1K 4K

0

2000

4000

6000

8000

10000

12000

14000

16000

1 4 16 64 256 1K 4K 16K 64K 256K 1M

Band

wid

th (M

B/s)

Message Size (bytes)

Default1Layer2Layer

Native(w/o CMA)Native

•  1LayerhassimilarperformancetotheDefault•  Comparedwith1Layer,2Layerdeliversupto84%and184%improvementfor

latencyandBW

Latency BW

Page 27: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 27NetworkBasedComputingLaboratory

Inter-VMInter-ContainerPt2Pt(Inter-Socket)

0

50

100

150

200

250

300

1 4 16 64 256 1K 4K 16K 64K 256K 1M

Late

ncy

(us)

Message Size (bytes)

Default1Layer2Layer

Basic-HybridEnhanced-Hybrid

Native(w/o CMA)Native

0 0.5

1 1.5

2 2.5

3 3.5

4

1 4 16 64 256 1K 4K

0

2000

4000

6000

8000

10000

12000

14000

16000

1 4 16 64 256 1K 4K 16K 64K 256K 1M

Band

wid

th (M

B/s)

Message Size (bytes)

Default1Layer2Layer

Basic-HybridEnhanced-Hybrid

Native(w/o CMA)Native

Latency BW•  1-LayerhassimilarperformancetotheDefault•  2-Layerhasnear-nativeperformanceforsmallmsg,butclearoverheadonlargemsg•  Comparedto2-Layer,Hybriddesignbringsupto42%and25%improvementfor

latencyandBW,respectively

Page 28: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 28NetworkBasedComputingLaboratory

Application-levelEvaluations

0

1

2

3

4

5

6

7

8

9

10

22,20 24,16 24,20 24,24 26,16 26,20 26,24 28,16

BFS

Exe

cutio

n Ti

me

(s)

Default1Layer

2Layer-Enhanced-Hybrid

0

20

40

60

80

100

120

140

160

180

IS MG EP FT CG LU

Exec

utio

n Ti

me

(s)

Default1Layer

2Layer-Enhanced-Hybrid

•  256processesacross64containerson16nodes•  ComparedwithDefault,enhanced-hybriddesignreducesupto16%(28,16)and10%(LU)of

executiontimeforGraph500andNAS,respectively•  Comparedwiththe1Layercase,enhanced-hybriddesignalsobringsupto12%(28,16)and6%

(LU)performancebenefit

ClassDNASGraph500

Page 29: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 29NetworkBasedComputingLaboratory

•  MVAPICH2-VirtwithSR-IOVandIVSHMEM–  Standalone,OpenStack

•  SR-IOV-enabledVMMigrationSupportinMVAPICH2

•  MVAPICH2withContainers(DockerandSingularity)

•  MVAPICH2withNestedVirtualization(ContaineroverVM)

•  MVAPICH2-VirtonSLURM–  SLURMalone,SLURM+OpenStack

•  NeuroscienceApplicationsonHPCClouds

•  BigDataLibrariesonCloud–  RDMA-Hadoop,OpenStackSwift

ApproachestoBuildHPCClouds

Page 30: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 30NetworkBasedComputingLaboratory

loadSPANK

reclaimVMs

registerjobstepreply

registerjobstepreq

Slurmctld Slurmd Slurmd

releasehosts

runjobstepreq

runjobstepreply

mpirun_vm

MPIJobacrossVMs

VMConfigReader

loadSPANKVMLauncher

loadSPANKVMReclaimer

•  VMConfigurationReader–RegisterallVMconfigurationoptions,setinthejobcontrolenvironmentsothattheyarevisibletoallallocatednodes.

•  VMLauncher–SetupVMsoneachallocatednodes.-FilebasedlocktodetectoccupiedVFandexclusivelyallocatefreeVF

-AssignauniqueIDtoeachIVSHMEManddynamicallyattachtoeachVM

•  VMReclaimer–TeardownVMsandreclaimresources

SLURMSPANKPluginbasedDesign

MPIMPI

vmhostfile

Page 31: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 31NetworkBasedComputingLaboratory

•  VMConfigurationReader–VMoptionsregister

•  VMLauncher,VMReclaimer–OffloadtounderlyingOpenStackinfrastructure-PCIWhitelisttopassthroughfreeVFtoVM

-ExtendNovatoenableIVSHMEMwhenlaunchingVM

SLURMSPANKPluginwithOpenStackbasedDesign

J.Zhang,X.Lu,S.Chakraborty,D.K.Panda.SLURM-V:ExtendingSLURMforBuildingEfficientHPCCloudwithSR-IOVandIVShmem.Euro-Par,2016

reclaimVMs

registerjobstepreply

registerjobstepreq

Slurmctld Slurmd

releasehosts

launchVM

mpirun_vm

loadSPANKVMConfigReader

MPI

VMhostfile

OpenStackdaemon

requestlaunchVMVMLauncher

return

requestreclaimVMVMReclaimer

return

......

......

......

......

Page 32: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 32NetworkBasedComputingLaboratory

•  32VMsacross8nodes,6Core/VM

•  EASJ-ComparedtoNative,lessthan4%overheadwith128Procs

•  SACJ,EACJ–Alsominoroverhead,whenrunningNASasconcurrentjobwith64Procs

Application-LevelPerformanceonChameleon(Graph500)

Exclusive Allocations Sequential Jobs

0

500

1000

1500

2000

2500

3000

24,16 24,20 26,10

BFSExecutionTime(m

s)

ProblemSize(Scale,Edgefactor)

VM

Native

0

50

100

150

200

250

22,10 22,16 22,20BF

SExecutionTime(m

s)

ProblemSize(Scale,Edgefactor)

VMNative

0

50

100

150

200

250

2210 2216 2220

BFSExecutionTime(m

s)

ProblemSize(Scale,Edgefactor)

VM

Native

Shared-host Allocations Concurrent Jobs

Exclusive Allocations Concurrent Jobs

4%

Page 33: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 33NetworkBasedComputingLaboratory

•  MVAPICH2-VirtwithSR-IOVandIVSHMEM–  Standalone,OpenStack

•  SR-IOV-enabledVMMigrationSupportinMVAPICH2

•  MVAPICH2withContainers(DockerandSingularity)

•  MVAPICH2withNestedVirtualization(ContaineroverVM)

•  MVAPICH2-VirtonSLURM–  SLURMalone,SLURM+OpenStack

•  NeuroscienceApplicationsonHPCClouds

•  BigDataLibrariesonCloud–  RDMA-Hadoop,OpenStackSwift

ApproachestoBuildHPCClouds

Page 34: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 34NetworkBasedComputingLaboratory

•  EasyandFastDiscoveryisthekey!

NeuroScienceMeetsHPCCloud

1https://github.com/francopestilli/life

The Brain Connectome. Illustration of a set of fascicles (white matter bundles) obtained by using a tractography algorithm. Fascicles are grouped together conforming white matter tracts (shown with different colors here) connecting different cortical areas of the human brain. LiFE1 (Linear Fascicle Evaluation) is an approach to predict diffusion measurements in brain connectomes.

Virtualization

CloudComputing

•  Easy-to-useandHigh-PerformanceTechnologyisthekey!

Page 35: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 35NetworkBasedComputingLaboratory

•  Identifiedcomputationallyintensivetasksasthecomputationsofmatrixbyvectorproducts–  w=MTyandy=Mw

•  ThecomputationallyintensivefunctionshavebeenparallelizedusingMPIbydividingthetaskamongmultipleMPIprocesses

•  ImplementationusesMVAPICH22,fromOSUteam

MPI-basedLiFEforBrainHealth:InitialDesignusingMVAPICH2MPILibrary

1https://github.com/francopestilli/life 2http://mvapich.cse.ohio-state.edu/

Computation of w = MTy using 2 MPI processes Computation of y = Mw using 2 MPI processes

Page 36: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 36NetworkBasedComputingLaboratory

DesignandEvaluationwithMVAPICH2:SingleNodewithMPIonIntelKnightsLanding(KNL)

•  EvaluationonTACCStampedeKNL(IntelXeonPhiKNLCPUs,68cores,96GBmemorypernode)•  Upto8.7xspeedup

0

2

4

6

8

10

1 2 4 8 16 24 28 32 64

SpeedUp

MPIProcesses

SpeedUponaSingleNodeStampede2

0

1000

2000

3000

4000

5000

1 2 4 8 16 32 64

ExecutionTime(s)

MPIProcesses

TimeBreakupEvaluationonStampede2

NNLS Load MM

Reduce Gather Bcast

MPI-LiFEsoftwareisavailablefromhttp://neurohpc.cse.ohio-state.eduDocker-containerizedversion,Canrunfromlaptoptoclusters

Page 37: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 37NetworkBasedComputingLaboratory

DesignandEvaluationofLiFEcodewithMVAPICH2-Virt+Docker

•  EvaluationonChameleonwithDocker(IntelHaswellCPUs,24cores,128GBmemorypernode)•  Upto5.5xspeeduponChameleon

0

1

2

3

4

5

6

1 2 4 8 16 24

SpeedUp

Nodes

SpeedUponSingleNodewithDocker(Chameleon)

Native 1Container 2Containers

Page 38: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 38NetworkBasedComputingLaboratory

•  MVAPICH2-VirtwithSR-IOVandIVSHMEM–  Standalone,OpenStack

•  SR-IOV-enabledVMMigrationSupportinMVAPICH2

•  MVAPICH2withContainers(DockerandSingularity)

•  MVAPICH2withNestedVirtualization(ContaineroverVM)

•  MVAPICH2-VirtonSLURM–  SLURMalone,SLURM+OpenStack

•  NeuroscienceApplicationsonHPCClouds

•  BigDataLibrariesonCloud–  RDMA-Hadoop,OpenStackSwift

ApproachestoBuildHPCClouds

Page 39: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 39NetworkBasedComputingLaboratory

•  RDMAforApacheSpark

•  RDMAforApacheHadoop2.x(RDMA-Hadoop-2.x)–  PluginsforApache,Hortonworks(HDP)andCloudera(CDH)Hadoopdistributions

•  RDMAforApacheHBase

•  RDMAforMemcached(RDMA-Memcached)

•  RDMAforApacheHadoop1.x(RDMA-Hadoop)

•  OSUHiBD-Benchmarks(OHB)–  HDFS,Memcached,HBase,andSparkMicro-benchmarks

•  http://hibd.cse.ohio-state.edu

•  UsersBase:285organizationsfrom34countries

•  Morethan26,400downloadsfromtheprojectsite

TheHigh-PerformanceBigData(HiBD)Project

AvailableforInfiniBandandRoCEAlsorunonEthernet

Availableforx86andOpenPOWER

SupportforSingularityandDocker

Page 40: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 40NetworkBasedComputingLaboratory

OverviewofRDMA-Hadoop-VirtArchitecture•  Virtualization-awaremodulesinallthefour

mainHadoopcomponents:–  HDFS:Virtualization-awareBlockManagement

toimprovefault-tolerance

–  YARN:ExtensionstoContainerAllocationPolicytoreducenetworktraffic

–  MapReduce:ExtensionstoMapTaskSchedulingPolicytoreducenetworktraffic

–  HadoopCommon:TopologyDetectionModuleforautomatictopologydetection

•  CommunicationsinHDFS,MapReduce,andRPCgothroughRDMA-baseddesignsoverSR-IOVenabledInfiniBand

HDFS

YARN

Had

oop

Com

mon

MapReduce

HBase Others

Virtual Machines Bare-Metal nodes Containers

Big Data Applications

Topo

logy

Det

ectio

n M

odul

e

Map Task Scheduling Policy Extension

Container Allocation Policy Extension

CloudBurst MR-MS Polygraph Others

Virtualization Aware Block Management

S.Gugnani,X.Lu,D.K.Panda.DesigningVirtualization-awareandAutomaticTopologyDetectionSchemesforAcceleratingHadooponSR-IOV-enabledClouds.CloudCom,2016.

Page 41: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 41NetworkBasedComputingLaboratory

EvaluationwithApplications

–  14%and24%improvementwithDefaultModeforCloudBurstandSelf-Join

–  30%and55%improvementwithDistributedModeforCloudBurstandSelf-Join

0

20

40

60

80

100

DefaultMode DistributedMode

EXEC

UTIONTIM

ECloudBurst

RDMA-Hadoop RDMA-Hadoop-Virt

050100150200250300350400

DefaultMode DistributedMode

EXEC

UTIONTIM

E

Self-Join

RDMA-Hadoop RDMA-Hadoop-Virt

30%reduction

55%reduction

Page 42: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 42NetworkBasedComputingLaboratory

•  DistributedCloud-basedObjectStorageService

•  DeployedaspartofOpenStackinstallation

•  Canbedeployedasstandalonestoragesolutionaswell

•  WorldwidedataaccessviaInternet

–  HTTP-based

•  Architecture

–  MultipleObjectServers:Tostoredata

–  FewProxyServers:Actasaproxyforallrequests

–  Ring:Handlesmetadata

•  Usage–  Input/outputsourceforBigDataapplications(mostcommonuse

case)

–  Software/Databackup

–  StorageofVM/Dockerimages

•  BasedontraditionalTCPsocketscommunication

OpenStackSwiftOverview

SendPUTorGETrequestPUT/GET/v1/<account>/<container>/<object>

ProxyServer

ObjectServer

ObjectServer

ObjectServer

Ring

Disk1

Disk2

Disk1

Disk2

Disk1

Disk2

SwiftArchitecture

Page 43: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 43NetworkBasedComputingLaboratory

•  Challenges

–  Proxyserverisabottleneckforlargescaledeployments

–  Objectupload/downloadoperationsnetworkintensive

–  CananRDMA-basedapproachbenefit?

•  Design

–  Re-designedSwiftarchitectureforimprovedscalabilityandperformance;Twoproposeddesigns:

•  Client-ObliviousDesign:Nochangesrequiredontheclientside

•  MetadataServer-basedDesign:Directcommunicationbetweenclientandobjectservers;bypassproxyserver

–  RDMA-basedcommunicationframeworkforacceleratingnetworkingperformance

–  High-performanceI/OframeworktoprovidemaximumoverlapbetweencommunicationandI/O

Swift-X:AcceleratingOpenStackSwiftwithRDMAforBuildingEfficientHPCClouds

S.Gugnani,X.Lu,andD.K.Panda,Swift-X:AcceleratingOpenStackSwiftwithRDMAforBuildinganEfficientHPCCloud,acceptedatCCGrid’17,May2017

Client-ObliviousDesign(D1)

MetadataServer-basedDesign(D2)

Page 44: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 44NetworkBasedComputingLaboratory

0

5

10

15

20

25

SwiftPUT

Swift-X(D1)PUT

Swift-X(D2)PUT

SwiftGET

Swift-X(D1)GET

Swift-X(D2)GET

LATENCY

(S)

TIMEBREAKUPOFGETANDPUT

Communication I/O Hashsum Other

Swift-X:AcceleratingOpenStackSwiftwithRDMAforBuildingEfficientHPCClouds

0

5

10

15

20

1MB 4MB 16MB 64MB256MB 1GB 4GB

LATENCY

(s)

OBJECTSIZE

GETLATENCYEVALUATION

Swift Swift-X(D2) Swift-X(D1)

Reducedby66%

•  Upto66%reductioninGETlatency•  Communicationtimereducedbyupto3.8xforPUTandupto2.8xforGET

Page 45: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 45NetworkBasedComputingLaboratory

AvailableAppliancesonChameleonCloud*

Appliance Description

CentOS7KVMSR-IOV

Chameleonbare-metalimagecustomizedwiththeKVMhypervisorandarecompiledkerneltoenableSR-IOVoverInfiniBand.https://www.chameleoncloud.org/appliances/3/

MPIbare-metalclustercomplex

appliance(BasedonHeat)

ThisappliancedeploysanMPIclustercomposedofbaremetalinstancesusingtheMVAPICH2libraryoverInfiniBand.https://www.chameleoncloud.org/appliances/29/

MPI+SR-IOVKVMcluster(Basedon

Heat)

ThisappliancedeploysanMPIclusterofKVMvirtualmachinesusingtheMVAPICH2-VirtimplementationandconfiguredwithSR-IOVforhigh-performancecommunicationoverInfiniBand.https://www.chameleoncloud.org/appliances/28/

CentOS7SR-IOVRDMA-Hadoop

TheCentOS7SR-IOVRDMA-HadoopapplianceisbuiltfromtheCentOS7applianceandadditionallycontainsRDMA-HadooplibrarywithSR-IOV.https://www.chameleoncloud.org/appliances/17/

•  Throughtheseavailableappliances,usersandresearcherscaneasilydeployHPCcloudstoperformexperimentsandrunjobswith–  High-PerformanceSR-IOV+InfiniBand

–  High-PerformanceMVAPICH2Libraryoverbare-metalInfiniBandclusters

–  High-PerformanceMVAPICH2LibrarywithVirtualizationSupportoverSR-IOVenabledKVMclusters

–  High-PerformanceHadoopwithRDMA-basedEnhancementsSupport[*]OnlyincludeappliancescontributedbyOSUNowLab

Page 46: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 46NetworkBasedComputingLaboratory

•  MVAPICH2-VirtoverSR-IOV-enabledInfiniBandisanefficientapproachtobuildHPCClouds–  Standalone,OpenStack,Slurm,andSlurm+OpenStack

–  SupportVirtualMachineMigrationwithSR-IOVInfiniBanddevices

–  SupportVirtualMachine,Container(DockerandSingularity),andNestedVirtualization

•  Verylittleoverheadwithvirtualization,nearnativeperformanceatapplicationlevel

•  MuchbetterperformancethanAmazonEC2

•  MVAPICH2-VirtisavailableforbuildingHPCClouds(http://mvapich.cse.ohio-state.edu)–  SR-IOV,IVSHMEM,Dockersupport,OpenStack

•  NeuroscienceapplicationscanbenefitfromtechnologiesonHPCclouds

•  BigDataanalyticsstackssuchasRDMA-Hadoopcanbenefitfromcloud-awaredesigns

•  AppliancesforMVAPICH2-VirtandRDMA-HadoopareavailableforbuildingHPCClouds

•  SR-IOV/containersupportandappliancesforotherMVAPICH2libraries(MVAPICH2-X,MVAPICH2-GDR,...)andRDMA-Spark/Memcached

Conclusions

Page 47: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 47NetworkBasedComputingLaboratory

The2ndInternationalBoFon

BuildingEfficientCloudsforHPC,BigData,andDeepLearningMiddlewareandApplications(HPCCloudBoF)

HPCCloudBoF2018willbeheldwithISC‘18,Frankfurt,GermanyJune27,2018(Wednesday)1:45-2:45pm

HPCCloudBoF2017washeldinconjunctionwithSC’17

http://sc17.supercomputing.org/presentation/?id=bof165&sess=sess357

Page 48: Building Efficient HPC Clouds with MVAPICH2 and OpenStack ...€¦ · Building Efficient HPC Clouds with MVAPICH2 and OpenStack ... • Virtualization Support with Virtual Machines

OpenStackSummit2018 48NetworkBasedComputingLaboratory

ThankYou!

Network-BasedComputingLaboratoryhttp://nowlab.cse.ohio-state.edu/

TheHigh-PerformanceBigDataProjecthttp://hibd.cse.ohio-state.edu/

MVAPICH/MVAPICH2http://mvapich.cse.ohio-state.edu/

{panda,luxi}@cse.ohio-state.edu

http://www.cse.ohio-state.edu/~panda

http://www.cse.ohio-state.edu/~luxi