Offering Bare-Metal Performance and Scalability on Cloud: The Azure HPC Approach
MVAPICH User Group Meeting, August 2019
Performance Concerns
Software Ecosystem
Data Movement
Security, Cost
How does Azure HPC address these concerns?
HPC in the Cloud
Latest Hardware
Always available
Instant Scaling
Enormous Storage Capacity
Specialized Compute Fleet in Azure
A - Entry Level VMs
D - General Purpose VMs
F - Compute Optimized VMs
G - Large Memory VMs (>80,000 IOPS Premium Storage)
L - Storage Optimized VMs
H - Standard HPC VMs
N - GPU VMs
FPGA*
Cray Services in Azure
Offer Bare-metal Experience
Eliminate Jitter
Full Network Experience
Transparent Exposure of Hardware
MVAPICH2 Latency on HB/HC
[Figure: MPI latency (us) vs. message size (0 B to 1 KB) on HB and HC, comparing MVAPICH2 2.3.2 with MVAPICH2 2.3.2 + new driver; callouts on the plots: 1.86 us, 1.35 us, 2.69 us, 1.67 us.]
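Numbers like these come from a two-rank ping-pong measurement in the style of the OSU micro-benchmarks. The sketch below is a minimal, generic version of such a test, not the actual osu_latency source shipped with MVAPICH2; the warm-up count, iteration count, and default message size are arbitrary choices.

/* Minimal MPI ping-pong latency sketch (in the spirit of osu_latency,
 * not the OSU micro-benchmark source). Rank 0 sends `size_bytes` to
 * rank 1 and waits for the echo; half the averaged round-trip time is
 * reported as the one-way latency. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS 10000
#define SKIP       1000   /* warm-up iterations excluded from timing */

int main(int argc, char **argv)
{
    int rank, size_bytes = (argc > 1) ? atoi(argv[1]) : 8;
    char *buf;
    double t_start = 0.0, t_end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = calloc(size_bytes, 1);

    for (int i = 0; i < ITERATIONS + SKIP; i++) {
        if (i == SKIP) {                 /* start the clock after warm-up */
            MPI_Barrier(MPI_COMM_WORLD);
            t_start = MPI_Wtime();
        }
        if (rank == 0) {
            MPI_Send(buf, size_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t_end = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n", size_bytes,
               (t_end - t_start) * 1e6 / (2.0 * ITERATIONS));

    free(buf);
    MPI_Finalize();
    return 0;
}

Run it with two ranks placed on different VMs so the measurement crosses the InfiniBand fabric rather than shared memory.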
MVAPICH2 Bandwidth on HB/HC
[Figure: MPI bandwidth (MB/s) vs. message size (1 B to 4 MB) on HC and HB, comparing MVAPICH2 2.3.2 with MVAPICH2 2.3.2 + new driver; peak callouts: 12.06 GB/s, 12.22 GB/s, 12.04 GB/s, 12.38 GB/s.]
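Bandwidth curves like these are typically measured with a windowed streaming test in the style of osu_bw: the sender keeps a window of non-blocking sends in flight, the receiver posts matching receives and acknowledges each window. The sketch below is a generic version of that pattern, not the OSU benchmark code; WINDOW and ITERATIONS are arbitrary. The bidirectional result on the next slide uses the same idea with both ranks streaming a window in each direction at once.

/* Minimal MPI streaming-bandwidth sketch (in the spirit of osu_bw, not
 * the OSU micro-benchmark source). Rank 0 posts a window of
 * non-blocking sends to rank 1, which posts matching receives and
 * acknowledges each window so the sender cannot run ahead. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW     64     /* messages in flight per iteration */
#define ITERATIONS 100

int main(int argc, char **argv)
{
    int rank, size_bytes = (argc > 1) ? atoi(argv[1]) : (1 << 20);
    char *buf = malloc((size_t)size_bytes);
    MPI_Request req[WINDOW];
    double t_start, t_end;
    char ack = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t_start = MPI_Wtime();

    for (int i = 0; i < ITERATIONS; i++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, size_bytes, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, size_bytes, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    t_end = MPI_Wtime();

    if (rank == 0) {
        double bytes = (double)size_bytes * WINDOW * ITERATIONS;
        printf("%d bytes: %.2f MB/s\n", size_bytes,
               bytes / (t_end - t_start) / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}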
MVAPICH2 Bidirectional Bandwidth on HB/HC
[Figure: MPI bidirectional bandwidth (MB/s) vs. message size (1 B to 4 MB) on HC and HB, comparing MVAPICH2 2.3.2 with MVAPICH2 2.3.2 + new driver; peak callouts of 24.12 GB/s and 24.72 GB/s on both panels.]
Application Scaling: Ansys Fluent
[Figure: Fluent v19.3, F1 Racecar 140M model, MPI: HPC-X 2.4 - speedup vs. number of nodes (up to 288); series: Linear 100%, Linear 75%, HB PPN=30, HB PPN=45.]
• 88% efficiency at 128 VMs, within 1% of XC50
• 83% efficiency at 288 VMs (12,960 MPI ranks)
• Valuable lessons learned about EPYC scalability
Bare Metal Results: https://www.ansys.com/solutions/solutions-by-role/it-professionals/platform-support/benchmarks-overview/ansys-fluent-benchmarks/ansys-fluent-benchmarks-release-19/external-flow-over-a-formula-1-race-car
Application Scaling: StarCCM
[Figure: LEMANS 100M coupled scaling on HB - speedup and parallel efficiency vs. number of nodes (up to 256); series: Linear Scaling, HB PPN=30, HB PPN=45.]
HB PPN=30:
Host  Cores  PPN  Speedup  Efficiency (%)
1     30     30   30.0     100.0
2     60     30   59.9     99.9
4     120    30   119.2    99.3
16    480    30   446.6    93.0
32    960    30   895.7    93.3
64    1920   30   1664.8   86.7
128   3840   30   3138.1   81.7
256   7680   30   5473.3   71.3

HB PPN=45:
Host  Cores  PPN  Speedup  Efficiency (%)
1     45     45   45       100.0
2     90     45   89       99.1
16    720    45   697      96.7
32    1440   45   1318     91.5
64    2880   45   2505     87.0
128   5760   45   4451     77.3
256   11520  45   7449     64.7
Application Scaling: CP2K at 22,528 cores
[Figure: LiHFX: PRACE CP2K Test Case B v2.0 - time to solution in seconds (lower is better) and cases per day, full integers only (higher is better), vs. number of nodes from 54 to 512; series: Noctua [s], Seconds, Cases/d.]
• 512 HC VMs at 22,528 cores
• https://www.cp2k.org/performance
Big Data on Azure HPC
[Figure: TeraSort (320 GB) and PageRank (19 GB) execution time (s); series labels: Spark-RDMA, RDMA (100 Gbps).]
[Figure: HDFS TestDFSIO (Write) execution time (s) for 512 GB to 1 TB data sets; series: Ethernet (40 Gb), IPoIB (100 Gb), RDMA (100 Gb).]
[Figure: Memcached GET latency (us) vs. message size (1 B to 4 KB).]
[Figure: Horovod TensorFlow images/second and parallel efficiency (%) on 2, 4, 8, and 16 nodes; series: IPoIB (100 Gb), RDMA (100 Gb), IPoIB Efficiency, RDMA Efficiency; efficiency callouts: 100.00/96.78/95.58/94.93 and 100.00/98.86/98.37/96.94.]
MVAPICH2 Azure Release!
http://mvapich.cse.ohio-state.edu/downloads/
Quick-Deploy MVAPICH2 to Azure
http://mvapich.cse.ohio-state.edu/downloads/
https://github.com/Azure/azhpc-templates/tree/mvapich2/create-vmss
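Once the scale set is deployed and MVAPICH2 is installed, a quick sanity check is to compile and run a small rank-report program across the VMs. The sketch below (hypothetical file name hello_mpi.c) is a generic MPI test, not part of the MVAPICH2 release or of the azhpc-templates repository.

/* hello_mpi.c - generic sanity check for a fresh MPI installation
 * (hypothetical example). Each rank reports its host name so you can
 * confirm the job actually spans the scale-set VMs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

Compile it with mpicc and launch it with the launcher provided by the installation (e.g. mpiexec or mpirun_rsh) plus a host file listing the scale-set nodes; every VM should appear in the output.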
Upcoming HBv2 Instances
Preview Access: https://azure.microsoft.com/en-us/blog/introducing-the-new-hbv2-azure-virtual-machines-for-high-performance-computing/
Links
http://mvapich.cse.ohio-state.edu/downloads/
https://github.com/Azure/azhpc-templates/tree/mvapich2/create-vmss
http://mvapich.cse.ohio-state.edu/performance/mv2-azure-pt_to_pt/
https://techcommunity.microsoft.com/t5/Azure-Compute/CentOS-HPC-VM-Image-for-SR-IOV-enabled-Azure-HPC-VMs/ba-p/665557
https://docs.microsoft.com/azure/virtual-machine-scale-sets/overview
https://azure.microsoft.com/blog/introducing-the-new-hbv2-azure-virtual-machines-for-high-performance-computing/