Building Efficient HPC Clouds with MVAPICH2 and OpenStack over SR-IOV-enabled Heterogeneous Clusters

Talk at OpenStack Summit 2018
Vancouver, Canada

by

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]   http://www.cse.ohio-state.edu/~panda

Xiaoyi Lu, The Ohio State University
E-mail: [email protected]   http://www.cse.ohio-state.edu/~luxi
HPC Meets Cloud Computing

• Cloud Computing is widely adopted in industry computing environments
• Cloud Computing provides high resource utilization and flexibility
• Virtualization is the key technology that enables Cloud Computing
• An Intersect360 study shows cloud is the fastest growing class of HPC
• HPC meets Cloud: the convergence of Cloud Computing and HPC
Drivers of Modern HPC Cluster and Cloud Architecture

• Multi-core/many-core technologies, accelerators
• Large-memory nodes
• Solid State Drives (SSDs), NVM, parallel filesystems, object storage clusters
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Single Root I/O Virtualization (SR-IOV)

[Figure: high-performance interconnects – InfiniBand (with SR-IOV), <1 usec latency, 200 Gbps bandwidth; multi-/many-core processors; SSDs, object storage clusters; large-memory nodes (up to 2 TB); example systems: SDSC Comet, TACC Stampede]
Single Root I/O Virtualization (SR-IOV)

• Single Root I/O Virtualization (SR-IOV) is providing new opportunities to design HPC clouds with very low overhead
• Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver changes are needed
• Each VF can be dedicated to a single VM through PCI pass-through; the VF then appears inside the guest as an ordinary device (see the verbs sketch after this list)
• Works with 10/40 GigE and InfiniBand
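As an illustration of what PCI pass-through gives the guest, here is a minimal sketch (not from the talk) that enumerates the IB devices visible inside a VM with the standard libibverbs API; an SR-IOV VF shows up like any other HCA, so no guest-side code changes are required. Build with -libverbs.

```c
/* Minimal sketch: inside the VM, the SR-IOV VF is seen as a regular IB device,
 * so the ordinary verbs API can enumerate and query it. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list || num == 0) {
        fprintf(stderr, "No RDMA devices visible in this guest\n");
        return 1;
    }
    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        struct ibv_device_attr attr;
        if (ctx && ibv_query_device(ctx, &attr) == 0)
            printf("%s: %d port(s), max_qp=%d\n",
                   ibv_get_device_name(list[i]),
                   attr.phys_port_cnt, attr.max_qp);
        if (ctx)
            ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}
```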
Background: IVShmem and SR-IOV (paper excerpt)

4. Performance comparisons between IVShmem-backed and native-mode MPI libraries, using HPC applications

The evaluation results indicate that IVShmem can improve point-to-point and collective operations by up to 193% and 91%, respectively. The application execution time can be decreased by up to 96%, compared to SR-IOV. The results further show that IVShmem introduces only small overheads compared with the native environment.

The rest of the paper is organized as follows. Section 2 provides an overview of IVShmem, SR-IOV, and InfiniBand. Section 3 describes our prototype design and evaluation methodology. Section 4 presents the performance analysis results using micro-benchmarks and applications, scalability results, and comparison with native mode. We discuss the related work in Section 5, and conclude in Section 6.

2 Background

Inter-VM Shared Memory (IVShmem) (e.g., Nahanni) [15] provides zero-copy access to data in the shared memory of co-resident VMs on the KVM platform. IVShmem is designed and implemented mainly in the system-call layer, and its interfaces are visible to user-space applications as well. As shown in Figure 2(a), IVShmem contains three components: the guest kernel driver, the modified QEMU supporting the PCI device, and the POSIX shared memory region on the host OS. The shared memory region is allocated by host POSIX operations and mapped into the QEMU process address space. The mapped memory in QEMU can be used by guest applications by being remapped to user space in guest VMs. Evaluation results illustrate that both micro-benchmarks and HPC applications can achieve better performance with IVShmem support.
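The host-side POSIX shared-memory region described above can be illustrated with a short sketch. This is only an example under assumptions: the region name and size are made up for illustration and are not the names Nahanni/QEMU actually use. Link with -lrt on older glibc.

```c
/* Host-side sketch of a POSIX shared-memory region of the kind IVShmem
 * exposes to guests ("/ivshmem_demo" is an illustrative name only). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/ivshmem_demo";     /* appears as /dev/shm/ivshmem_demo */
    const size_t size = 1 << 20;            /* 1 MB region */

    int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    if (fd < 0 || ftruncate(fd, size) != 0) {
        perror("shm_open/ftruncate");
        return 1;
    }
    /* QEMU maps the same region into its address space; a guest driver then
     * remaps it into guest user space, giving co-resident VMs zero-copy access. */
    char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(mem, "hello from the host");     /* visible to every process mapping the region */

    munmap(mem, size);
    close(fd);
    /* shm_unlink(name) is left out so other processes can still attach. */
    return 0;
}
```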
[Fig. 2. Overview of Inter-VM Shmem and SR-IOV communication mechanisms: (a) Inter-VM Shmem mechanism [15] — per-guest QEMU user space, guest kernel PCI device, and mmap regions backed by /dev/shm/<name> on the host, with a shared-memory fd and eventfds; (b) SR-IOV mechanism [22] — guest OSes with VF drivers, a hypervisor with the PF driver, I/O MMU, and SR-IOV hardware exposing Virtual Functions and a Physical Function over PCI Express]
Single Root I/O Virtualization (SR-IOV) is a PCI Express (PCIe) standard which specifies native I/O virtualization capabilities in PCIe adapters. As shown in Figure 2(b), SR-IOV allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs). Each virtual device can be dedicated to a single VM through PCI pass-through, which allows each VM to directly access the corresponding VF. Hence, SR-IOV is a hardware-based approach to I/O virtualization.
Broad Challenges of Building Efficient HPC Clouds

• Virtualization support with virtual machines and containers
  – KVM, Docker, Singularity, etc.
• Communication coordination among optimized communication channels on clouds
  – SR-IOV, IVShmem, IPC-Shm, CMA, etc.
• Locality-aware communication
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable collective communication
  – Offload; non-blocking; topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  – Multiple end-points per node
• NUMA-aware communication for nested virtualization
• Integrated support for GPGPUs and accelerators
• Fault-tolerance/resiliency
  – Migration support with virtual machines
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI+OpenMP, MPI+UPC, MPI+OpenSHMEM, MPI+UPC++, CAF, ...)
• Energy-awareness
• Co-design with resource management and scheduling systems on clouds
  – OpenStack, Slurm, etc.
Approaches to Build HPC Clouds

• MVAPICH2-Virt with SR-IOV and IVSHMEM
  – Standalone, OpenStack
• SR-IOV-enabled VM Migration Support in MVAPICH2
• MVAPICH2 with Containers (Docker and Singularity)
• MVAPICH2 with Nested Virtualization (Container over VM)
• MVAPICH2-Virt on SLURM
  – SLURM alone, SLURM + OpenStack
• Neuroscience Applications on HPC Clouds
• Big Data Libraries on Cloud
  – RDMA-Hadoop, OpenStack Swift
Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,900 organizations in 86 countries
  – More than 469,000 (>0.46 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '17 ranking)
    • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
    • 9th, 556,104-core (Oakforest-PACS) in Japan
    • 12th, 368,928-core (Stampede2) at TACC
    • 17th, 241,108-core (Pleiades) at NASA
    • 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

• Redesign MVAPICH2 to make it virtual-machine aware
  – SR-IOV shows near-to-native performance for inter-node point-to-point communication
  – IVSHMEM offers shared-memory-based data access across co-resident VMs
  – Locality Detector: maintains the locality information of co-resident virtual machines
  – Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively (see the sketch after this slide)
• Supports deployment with OpenStack

[Figure: host environment with hypervisor, PF driver, and an InfiniBand adapter exposing a Physical Function and two Virtual Functions; Guest 1 and Guest 2 each run an MPI process in user space with a VF driver and PCI device in kernel space; co-resident guests communicate over the IV-Shmem channel backed by /dev/shm, and inter-node traffic goes over the SR-IOV channel]

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014
J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014
J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015
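The Locality Detector / Communication Coordinator idea can be summarized with a small sketch. The function and type names below are hypothetical illustrations, not MVAPICH2 internals; the point is simply that co-resident ranks take the shared-memory path while everything else goes over the SR-IOV VF.

```c
/* Hypothetical sketch of locality-aware channel selection. */
#include <stdio.h>

typedef enum { CHANNEL_IVSHMEM, CHANNEL_SRIOV } channel_t;

/* Toy locality table: host id of the VM that runs each rank
 * (in MVAPICH2-Virt this information is maintained by the Locality Detector). */
static const int host_of_rank[] = { 0, 0, 1, 1 };
static const int my_host_id = 0;

static channel_t select_channel(int dest_rank)
{
    /* Co-resident VM: shared-memory (IVSHMEM) channel, zero-copy. */
    if (host_of_rank[dest_rank] == my_host_id)
        return CHANNEL_IVSHMEM;
    /* Different host: go through the SR-IOV virtual function. */
    return CHANNEL_SRIOV;
}

int main(void)
{
    for (int r = 0; r < 4; r++)
        printf("rank %d -> %s\n", r,
               select_channel(r) == CHANNEL_IVSHMEM ? "IVSHMEM" : "SR-IOV");
    return 0;
}
```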
Application-Level Performance on Chameleon

[Charts: execution time of SPEC MPI2007 applications (milc, leslie3d, pop2, GAPgeofem, zeusmp2, lu) and of Graph500 problem sizes (scale, edge factor) for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]

• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 procs
Execute Live Migration with SR-IOV Device
High Performance SR-IOV enabled VM Migration Support in MVAPICH2

[Figure: migration framework with an external controller, network suspend trigger, read-to-migrate detector, network reactive notifier, and migration-done detector; source and target hosts each run MPI processes in guest VMs with SR-IOV VFs, IVShmem, InfiniBand and Ethernet adapters]

J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017

• Migration with an SR-IOV device has to handle the challenges of detaching/re-attaching the virtualized IB device and the IB connections
• Consists of an SR-IOV enabled IB cluster and an external Migration Controller
• Multiple parallel libraries notify MPI applications during migration (detach/re-attach SR-IOV/IVShmem, migrate VMs, report migration status)
• Handles IB connection suspension and reactivation
• Proposes Progress Engine (PE) and Migration-Thread (MT) based designs to optimize VM migration and MPI application performance (a PE-style sketch follows below)
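A rough, hypothetical sketch of the Progress Engine (PE) based handling described above: every helper name here is a placeholder rather than an MVAPICH2 function, and the real framework coordinates with the external controller and handles pending IB operations in far more detail.

```c
/* Hypothetical PE-style migration handling; all helpers are placeholders. */
#include <stdbool.h>
#include <stdio.h>

static bool check_migration_trigger(void) { return false; }  /* set by the controller */
static void suspend_ib_channel(void)  { puts("suspend IB, detach VF/IVShmem"); }
static bool migration_done(void)      { return true; }        /* "migration done" detector */
static void resume_ib_channel(void)   { puts("re-attach VF/IVShmem, re-establish IB"); }
static void poll_network(void)        { /* normal send/recv progress */ }

/* Each pass of the MPI progress engine also polls for a migration request,
 * so communication is quiesced before the VM (with its SR-IOV VF) moves. */
void progress_engine_iteration(void)
{
    if (check_migration_trigger()) {
        suspend_ib_channel();            /* pending operations are parked */
        while (!migration_done())
            ;                            /* VM is migrated by the controller */
        resume_ib_channel();             /* connections reactivated on the target host */
    }
    poll_network();
}

int main(void) { progress_engine_iteration(); return 0; }
```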
Performance Evaluation of VM Migration Framework

[Charts: breakdown of VM migration time (Set POST_MIGRATION, Add IVSHMEM, Attach VF, Migration, Remove IVSHMEM, Detach VF, Set PRE_MIGRATION) for TCP, IPoIB, and RDMA schemes; total migration time for 2/4/8/16 VMs with the sequential and the proposed migration frameworks]

• Compared with TCP, the RDMA scheme reduces the total migration time by 20%
• Total time is dominated by the `Migration' step; times of the other steps are similar across the different schemes
• The proposed migration framework reduces migration time by up to 51%
Performance Evaluation of VM Migration Framework: Pt2Pt Latency

[Charts: point-to-point latency and Bcast latency (4 VMs * 2 procs/VM) versus message size (1 byte - 1 MB) for PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA]

• A VM is migrated from one machine to another while the benchmark is running inside
• The proposed MT-based designs perform slightly worse than the PE-based designs because of lock/unlock overhead
• No benefit from MT because no computation is involved
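For reference, point-to-point latency of the kind plotted above is usually measured with a simple ping-pong loop (in the style of the OSU micro-benchmarks). A minimal sketch, to be run with two ranks:

```c
/* Minimal ping-pong latency sketch; mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000, size = 4096;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n", size,
               (t1 - t0) * 1e6 / (2.0 * iters));   /* half of the round-trip time */
    free(buf);
    MPI_Finalize();
    return 0;
}
```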
Performance Evaluation of VM Migration Framework: NAS and Graph500

[Charts: execution time of NAS benchmarks (LU.C, EP.C, IS.C, MG.C, CG.C) and Graph500 problem sizes for PE, MT-worst, MT-typical, and NM]

• 8 VMs in total; 1 VM carries out migration while the application is running
• MT-worst and PE incur some overhead compared with NM (no migration)
• MT-typical allows migration to be completely overlapped with computation
Overview of Container-based Virtualization

• Container-based technologies (e.g., Docker) provide lightweight virtualization solutions
• Container-based virtualization: containers share the host kernel

[Figure: comparison of a VM stack and a container stack]
Container-based Design: Issues, Challenges, and Approaches

• What are the performance bottlenecks when running MPI applications with multiple containers per host in an HPC cloud?
• Can we propose a new design to overcome the bottlenecks in such a container-based HPC cloud?
• Can an optimized design deliver near-native performance for different container deployment scenarios?
• Locality-aware design to enable CMA and shared-memory channels for MPI communication across co-resident containers (a CMA sketch follows below)

J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand Clusters. ICPP, 2016
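The CMA channel mentioned in the last bullet relies on the Linux process_vm_readv()/process_vm_writev() system calls, which copy data directly between two address spaces on the same kernel. A self-contained sketch follows; for simplicity the process reads from its own address space, whereas across containers the peer process's PID would be used (and the containers must share a PID namespace or have the required ptrace capability).

```c
/* Cross Memory Attach (CMA) sketch: a single syscall moves data between
 * address spaces without an intermediate shared-memory staging copy. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    char src[64] = "payload copied via CMA";
    char dst[64] = {0};

    struct iovec local  = { .iov_base = dst, .iov_len = sizeof(dst) };
    struct iovec remote = { .iov_base = src, .iov_len = sizeof(src) };

    /* "Remote" pid is our own pid here, purely for illustration. */
    ssize_t n = process_vm_readv(getpid(), &local, 1, &remote, 1, 0);
    if (n < 0) { perror("process_vm_readv"); return 1; }

    printf("copied %zd bytes: %s\n", n, dst);
    return 0;
}
```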
Application-Level Performance on Docker with MVAPICH2

[Charts: execution time of NAS Class D benchmarks (MG, FT, EP, LU, CG) and Graph500 BFS execution time (scale 20, edge factor 16; 1 container * 16 procs, 2 containers * 8 procs, 4 containers * 4 procs) for Container-Def, Container-Opt, and Native]

• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% reduction of execution time for NAS and Graph500, respectively
• Compared to Native, less than 9% and 5% overhead for NAS and Graph500, respectively
Singularity Performance on Different Processor Architectures

[Charts: MPI point-to-point bandwidth versus message size on Haswell and KNL for Singularity-Intra-Node, Native-Intra-Node, Singularity-Inter-Node, and Native-Inter-Node]

• MPI point-to-point bandwidth
• On both Haswell and KNL, less than 7% overhead for the Singularity solution
• KNL shows worse intra-node performance than Haswell because of its lower CPU frequency, the complex cluster modes, and the cost of maintaining cache coherence
• On KNL, inter-node outperforms intra-node beyond roughly 256 Kbytes, as the Omni-Path interconnect outperforms shared-memory-based transfers for large message sizes
Singularity Performance on Haswell with InfiniBand

[Charts: Graph500 BFS execution time (scale/edge-factor 22,16 - 26,20) and NAS Class D execution time (CG, EP, FT, IS, LU, MG) for Singularity and Native]

• 512 processes across 32 Haswell nodes
• Singularity delivers near-native performance, with less than 7% overhead on Haswell with InfiniBand

J. Zhang, X. Lu, D. K. Panda. Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? UCC 2017. (Best Student Paper Award)
Nested Virtualization: Containers over Virtual Machines

• Useful for live migration, application sandboxing, legacy system integration, software deployment, etc.
• Performance issues because of the redundant call stacks (two-layer virtualization) and isolated physical resources

[Figure: hardware, host OS/hypervisor, a RedHat VM and an Ubuntu VM, each running a Docker Engine with two containers (app stack + bins/libs)]
Multiple Communication Paths in Nested Virtualization

• Different VM placements introduce multiple communication paths at the container level:
  1. Intra-VM Intra-Container (e.g., across core 4 and core 5)
  2. Intra-VM Inter-Container (e.g., across core 13 and core 14)
  3. Inter-VM Inter-Container (e.g., across core 6 and core 12)
  4. Inter-Node Inter-Container (e.g., across core 15 and a core on a remote node)

[Figure: two-socket node (NUMA 0 and NUMA 1, cores 0-15, memory controllers, QPI) hosting VM 0 and VM 1 with Containers 0-3, annotated with the four paths]
Overview of Proposed Design in MVAPICH2

[Figure: Two-Layer Locality Detector (Container Locality Detector, VM Locality Detector, Nested Locality Combiner) feeding a Two-Layer NUMA-Aware Communication Coordinator, which picks among the CMA channel, the shared memory (SHM) channel, and the network (HCA) channel on the two-socket node shown earlier]

• Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as those in co-resident VMs
• Two-Layer NUMA-Aware Communication Coordinator: leverages nested locality information, NUMA architecture information, and message information to select the appropriate communication channel (an application-level analogy is sketched below)

J. Zhang, X. Lu, D. K. Panda. Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand. VEE, 2017
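The two-layer detector is internal to the MPI library, but the general idea of discovering which ranks share memory can be illustrated at the application level with MPI-3's MPI_Comm_split_type. This is only an analogy, not the MVAPICH2 implementation:

```c
/* Group ranks that can reach each other through shared memory; co-resident
 * containers in one VM end up in the same "shared" communicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, shared_rank, shared_size;
    MPI_Comm shared;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shared);
    MPI_Comm_rank(shared, &shared_rank);
    MPI_Comm_size(shared, &shared_size);

    printf("world rank %d is local rank %d of %d co-resident ranks\n",
           world_rank, shared_rank, shared_size);

    MPI_Comm_free(&shared);
    MPI_Finalize();
    return 0;
}
```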
Inter-VM Inter-Container Pt2Pt (Intra-Socket)

[Charts: latency and bandwidth versus message size (1 byte - 1 MB) for Default, 1Layer, 2Layer, Native (w/o CMA), and Native]

• 1Layer has similar performance to the Default design
• Compared with 1Layer, 2Layer delivers up to 84% and 184% improvement for latency and bandwidth, respectively
Inter-VM Inter-Container Pt2Pt (Inter-Socket)

[Charts: latency and bandwidth versus message size (1 byte - 1 MB) for Default, 1Layer, 2Layer, Basic-Hybrid, Enhanced-Hybrid, Native (w/o CMA), and Native]

• 1Layer has similar performance to the Default design
• 2Layer has near-native performance for small messages, but clear overhead for large messages
• Compared to 2Layer, the hybrid design brings up to 42% and 25% improvement for latency and bandwidth, respectively
Application-level Evaluations

[Charts: Graph500 BFS execution time (scale/edge-factor 22,20 - 28,16) and NAS Class D execution time (IS, MG, EP, FT, CG, LU) for Default, 1Layer, and 2Layer-Enhanced-Hybrid]

• 256 processes across 64 containers on 16 nodes
• Compared with Default, the enhanced-hybrid design reduces execution time by up to 16% (28,16) and 10% (LU) for Graph500 and NAS, respectively
• Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit
SLURM SPANK Plugin based Design

[Figure: mpirun_vm launches an MPI job across VMs; Slurmctld and Slurmd load the SPANK plugin (VM Config Reader, VM Launcher, VM Reclaimer), exchange register/run job-step requests and replies, reclaim the VMs, and release the hosts; a vmhostfile describes the VMs]

• VM Configuration Reader: registers all VM configuration options and sets them in the job control environment so that they are visible to all allocated nodes
• VM Launcher: sets up VMs on each allocated node
  – File-based lock to detect occupied VFs and exclusively allocate a free VF
  – Assigns a unique ID to each IVSHMEM device and dynamically attaches it to each VM
• VM Reclaimer: tears down VMs and reclaims resources

(A skeletal SPANK plugin is sketched below.)
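For readers unfamiliar with SPANK, a plugin skeleton looks roughly as follows. The plugin name and the placeholder comments are illustrative only; the real SLURM-V logic (VF locking, IVSHMEM attachment, VM boot/teardown) is merely hinted at here.

```c
/* Skeletal SPANK plugin in the spirit of the design above. */
#include <slurm/spank.h>

SPANK_PLUGIN(mpi_vm_launcher, 1);   /* plugin name and version */

/* Called on each node when the plugin is loaded for a job step. */
int slurm_spank_init(spank_t sp, int ac, char **av)
{
    slurm_info("mpi_vm_launcher: reading VM configuration options");
    /* VM Configuration Reader: export VM options to the job environment
     * (e.g., via spank_setenv) so all allocated nodes can see them. */
    return ESPANK_SUCCESS;
}

int slurm_spank_user_init(spank_t sp, int ac, char **av)
{
    /* VM Launcher: lock a free VF, attach the IVSHMEM device, boot the VM. */
    return ESPANK_SUCCESS;
}

int slurm_spank_exit(spank_t sp, int ac, char **av)
{
    /* VM Reclaimer: tear down VMs and release VF/IVSHMEM resources. */
    return ESPANK_SUCCESS;
}
```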
SLURM SPANK Plugin with OpenStack based Design

• VM Configuration Reader: registers the VM options
• VM Launcher, VM Reclaimer: offloaded to the underlying OpenStack infrastructure
  – PCI whitelist to pass a free VF through to the VM
  – Nova extended to enable IVSHMEM when launching the VM

[Figure: mpirun_vm and Slurmctld/Slurmd load the SPANK VM Config Reader; the OpenStack daemon handles the requests to launch and reclaim VMs and returns control; hosts are released after the VMs are reclaimed]

J. Zhang, X. Lu, S. Chakraborty, D. K. Panda. SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem. Euro-Par, 2016
Application-Level Performance on Chameleon (Graph500)

[Charts: Graph500 BFS execution time for VM versus Native under three allocation policies: Exclusive Allocations - Sequential Jobs (EASJ), Shared-host Allocations - Concurrent Jobs (SACJ), and Exclusive Allocations - Concurrent Jobs (EACJ)]

• 32 VMs across 8 nodes, 6 cores/VM
• EASJ: compared to Native, less than 4% overhead with 128 procs
• SACJ, EACJ: also minor overhead when running NAS as a concurrent job with 64 procs
NeuroScience Meets HPC Cloud

• Easy and fast discovery is the key!
• Easy-to-use and high-performance technology is the key!

[Figure: The Brain Connectome meets Virtualization and Cloud Computing. Illustration of a set of fascicles (white matter bundles) obtained by using a tractography algorithm. Fascicles are grouped together conforming white matter tracts (shown with different colors here) connecting different cortical areas of the human brain. LiFE [1] (Linear Fascicle Evaluation) is an approach to predict diffusion measurements in brain connectomes.]

[1] https://github.com/francopestilli/life
MPI-based LiFE for Brain Health: Initial Design using the MVAPICH2 MPI Library

• Identified the computationally intensive tasks as the matrix-by-vector products w = M^T y and y = M w
• The computationally intensive functions have been parallelized using MPI by dividing the task among multiple MPI processes (see the sketch below)
• The implementation uses MVAPICH2 [2] from the OSU team

[1] https://github.com/francopestilli/life   [2] http://mvapich.cse.ohio-state.edu/

[Figure: computation of w = M^T y and of y = M w using 2 MPI processes]
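A hedged sketch of the parallelization strategy described above: each rank owns a block of rows of M, so y = M w needs only the local rows, while w = M^T y is formed from per-rank partial sums combined with MPI_Allreduce (the Reduce/Allreduce step visible in the time breakup on the next slide). Dimensions and the dense layout are illustrative only; LiFE's M is sparse and far larger.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int rows_loc = 100, cols = 50;      /* each rank owns 100 rows of M */
    double *Mloc  = malloc(rows_loc * cols * sizeof(double));
    double *w     = malloc(cols * sizeof(double));
    double *y_loc = malloc(rows_loc * sizeof(double));
    double *w_out = malloc(cols * sizeof(double));
    double *w_par = calloc(cols, sizeof(double));

    for (int i = 0; i < rows_loc * cols; i++) Mloc[i] = 1.0;
    for (int j = 0; j < cols; j++) w[j] = 1.0;

    /* y = M w: each rank computes only its local block of rows. */
    for (int i = 0; i < rows_loc; i++) {
        y_loc[i] = 0.0;
        for (int j = 0; j < cols; j++)
            y_loc[i] += Mloc[i * cols + j] * w[j];
    }

    /* w = M^T y: local partial products, then a global sum across ranks. */
    for (int i = 0; i < rows_loc; i++)
        for (int j = 0; j < cols; j++)
            w_par[j] += Mloc[i * cols + j] * y_loc[i];
    MPI_Allreduce(w_par, w_out, cols, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("w_out[0] = %g (computed by %d ranks)\n", w_out[0], nprocs);

    free(Mloc); free(w); free(y_loc); free(w_out); free(w_par);
    MPI_Finalize();
    return 0;
}
```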
Design and Evaluation with MVAPICH2: Single Node with MPI on Intel Knights Landing (KNL)

• Evaluation on TACC Stampede2 KNL nodes (Intel Xeon Phi KNL CPUs, 68 cores, 96 GB memory per node)
• Up to 8.7x speedup

[Charts: speedup on a single Stampede2 node versus number of MPI processes (1-64), and time breakup on Stampede2 (NNLS, Load, MM, Reduce, Gather, Bcast)]

MPI-LiFE software is available from http://neurohpc.cse.ohio-state.edu
Docker-containerized version; can run from laptops to clusters
Design and Evaluation of the LiFE Code with MVAPICH2-Virt + Docker

• Evaluation on Chameleon with Docker (Intel Haswell CPUs, 24 cores, 128 GB memory per node)
• Up to 5.5x speedup on Chameleon

[Chart: speedup on a single node with Docker on Chameleon for Native, 1 Container, and 2 Containers]
The High-Performance Big Data (HiBD) Project

• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users base: 285 organizations from 34 countries
• More than 26,400 downloads from the project site

Available for InfiniBand and RoCE; also runs on Ethernet
Available for x86 and OpenPOWER
Support for Singularity and Docker
Overview of RDMA-Hadoop-Virt Architecture

• Virtualization-aware modules in all four main Hadoop components:
  – HDFS: Virtualization-aware Block Management to improve fault-tolerance
  – YARN: extensions to the Container Allocation Policy to reduce network traffic
  – MapReduce: extensions to the Map Task Scheduling Policy to reduce network traffic
  – Hadoop Common: Topology Detection Module for automatic topology detection
• Communications in HDFS, MapReduce, and RPC go through RDMA-based designs over SR-IOV enabled InfiniBand

[Figure: Big Data applications (CloudBurst, MR-MS Polygraph, others) on HBase/MapReduce/YARN/HDFS over Hadoop Common, running on virtual machines, bare-metal nodes, and containers, with the Topology Detection Module and the policy extensions highlighted]

S. Gugnani, X. Lu, D. K. Panda. Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds. CloudCom, 2016.
Evaluation with Applications

[Charts: execution time of CloudBurst and Self-Join in Default Mode and Distributed Mode for RDMA-Hadoop versus RDMA-Hadoop-Virt]

• 14% and 24% improvement in Default Mode for CloudBurst and Self-Join, respectively
• 30% and 55% improvement in Distributed Mode for CloudBurst and Self-Join, respectively
OpenStack Swift Overview

• Distributed cloud-based object storage service
• Deployed as part of an OpenStack installation
• Can also be deployed as a standalone storage solution
• Worldwide data access via the Internet
  – HTTP-based (a minimal client GET sketch follows after this slide)
• Architecture
  – Multiple Object Servers: store the data
  – A few Proxy Servers: act as a proxy for all requests
  – Ring: handles metadata
• Usage
  – Input/output source for Big Data applications (most common use case)
  – Software/data backup
  – Storage of VM/Docker images
• Based on traditional TCP sockets communication

[Figure: Swift architecture — a client sends a PUT or GET request (PUT/GET /v1/<account>/<container>/<object>) to a Proxy Server, which consults the Ring and forwards it to Object Servers, each managing local disks]
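To make the request path concrete, here is a minimal client-side sketch of a Swift GET over HTTP using libcurl; the endpoint, account/container/object names, and the token value are placeholders (a real client first obtains the token from Keystone). Build with -lcurl.

```c
/* Minimal Swift GET sketch following the /v1/<account>/<container>/<object> path. */
#include <curl/curl.h>
#include <stdio.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    /* Proxy server endpoint + object path (placeholder values). */
    curl_easy_setopt(curl, CURLOPT_URL,
                     "http://proxy.example.org:8080/v1/AUTH_demo/mycontainer/myobject");

    /* Swift authenticates each request with an X-Auth-Token header. */
    struct curl_slist *hdrs = curl_slist_append(NULL, "X-Auth-Token: <token>");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);

    CURLcode rc = curl_easy_perform(curl);   /* response body goes to stdout by default */
    if (rc != CURLE_OK)
        fprintf(stderr, "GET failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```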
Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds

• Challenges
  – The proxy server is a bottleneck for large-scale deployments
  – Object upload/download operations are network intensive
  – Can an RDMA-based approach provide benefits?
• Design
  – Re-designed Swift architecture for improved scalability and performance; two proposed designs:
    • Client-Oblivious Design (D1): no changes required on the client side
    • Metadata Server-based Design (D2): direct communication between clients and object servers, bypassing the proxy server
  – RDMA-based communication framework for accelerating networking performance
  – High-performance I/O framework to provide maximum overlap between communication and I/O

S. Gugnani, X. Lu, and D. K. Panda. Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud. Accepted at CCGrid '17, May 2017
Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds

[Charts: time breakup of GET and PUT (communication, I/O, hashsum, other) for Swift, Swift-X (D1), and Swift-X (D2); GET latency versus object size (1 MB - 4 GB) for Swift, Swift-X (D2), and Swift-X (D1), showing up to a 66% reduction]

• Up to 66% reduction in GET latency
• Communication time reduced by up to 3.8x for PUT and up to 2.8x for GET
Available Appliances on Chameleon Cloud*

Appliance: CentOS 7 KVM SR-IOV
Description: Chameleon bare-metal image customized with the KVM hypervisor and a recompiled kernel to enable SR-IOV over InfiniBand. https://www.chameleoncloud.org/appliances/3/

Appliance: MPI bare-metal cluster complex appliance (based on Heat)
Description: Deploys an MPI cluster composed of bare-metal instances using the MVAPICH2 library over InfiniBand. https://www.chameleoncloud.org/appliances/29/

Appliance: MPI + SR-IOV KVM cluster (based on Heat)
Description: Deploys an MPI cluster of KVM virtual machines using the MVAPICH2-Virt implementation, configured with SR-IOV for high-performance communication over InfiniBand. https://www.chameleoncloud.org/appliances/28/

Appliance: CentOS 7 SR-IOV RDMA-Hadoop
Description: Built from the CentOS 7 appliance and additionally contains the RDMA-Hadoop library with SR-IOV. https://www.chameleoncloud.org/appliances/17/

• Through these available appliances, users and researchers can easily deploy HPC clouds to perform experiments and run jobs with
  – High-performance SR-IOV + InfiniBand
  – High-performance MVAPICH2 library over bare-metal InfiniBand clusters
  – High-performance MVAPICH2 library with virtualization support over SR-IOV enabled KVM clusters
  – High-performance Hadoop with RDMA-based enhancements support

[*] Only includes appliances contributed by the OSU NowLab
Conclusions

• MVAPICH2-Virt over SR-IOV-enabled InfiniBand is an efficient approach to build HPC clouds
  – Standalone, OpenStack, Slurm, and Slurm + OpenStack
  – Supports virtual machine migration with SR-IOV InfiniBand devices
  – Supports virtual machines, containers (Docker and Singularity), and nested virtualization
• Very little overhead with virtualization; near-native performance at the application level
• Much better performance than Amazon EC2
• MVAPICH2-Virt is available for building HPC clouds (http://mvapich.cse.ohio-state.edu)
  – SR-IOV, IVSHMEM, Docker support, OpenStack
• Neuroscience applications can benefit from technologies on HPC clouds
• Big Data analytics stacks such as RDMA-Hadoop can benefit from cloud-aware designs
• Appliances for MVAPICH2-Virt and RDMA-Hadoop are available for building HPC clouds
• SR-IOV/container support and appliances for other MVAPICH2 libraries (MVAPICH2-X, MVAPICH2-GDR, ...) and RDMA-Spark/Memcached
The 2nd International BoF on
Building Efficient Clouds for HPC, Big Data, and Deep Learning Middleware and Applications (HPC Cloud BoF)

HPC Cloud BoF 2018 will be held with ISC '18, Frankfurt, Germany, June 27, 2018 (Wednesday), 1:45-2:45 pm

HPC Cloud BoF 2017 was held in conjunction with SC '17
http://sc17.supercomputing.org/presentation/?id=bof165&sess=sess357
Thank You!

{panda,luxi}@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/

MVAPICH/MVAPICH2
http://mvapich.cse.ohio-state.edu/

http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~luxi