Offering Bare-Metal Performance and Scalability on Cloud: The Azure HPC Approach
MVAPICH User Group Meeting, August 2019
Performance Concerns
Software Ecosystem
Data Movement
Security, Cost
How does Azure HPC address these concerns?
HPC in the Cloud
Latest Hardware
Always available
Instant Scaling
Enormous Storage Capacity
Specialized Compute Fleet in Azure
A - Entry Level VMs
D - General Purpose VMs
F - Compute Optimized VMs
G - Large Memory VMs (>80,000 IOPS Premium Storage)
L - Storage Optimized VMs
H - Standard HPC VMs
N - GPU VMs
FPGA*
Cray Services in Azure
Offer Bare-metal Experience
Eliminate Jitter
Full Network Experience
Transparent Exposure of Hardware
MVAPICH2 Latency on HB/HC
[Figure: MPI latency (us) vs. message size (0 B to 1 KB) on HB and HC, comparing MVAPICH2 2.3.2 with MVAPICH2 2.3.2 + new driver; callouts on the plots: 1.86 us, 1.35 us, 2.69 us, 1.67 us.]
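Numbers like these come from a two-rank ping-pong measurement in the style of the OSU micro-benchmarks. The sketch below is a minimal, generic version of such a test, not the actual osu_latency source shipped with MVAPICH2; the warm-up count, iteration count, and default message size are arbitrary choices.

/* Minimal MPI ping-pong latency sketch (in the spirit of osu_latency,
 * not the OSU micro-benchmark source). Rank 0 sends `size_bytes` to
 * rank 1 and waits for the echo; half the averaged round-trip time is
 * reported as the one-way latency. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS 10000
#define SKIP       1000   /* warm-up iterations excluded from timing */

int main(int argc, char **argv)
{
    int rank, size_bytes = (argc > 1) ? atoi(argv[1]) : 8;
    char *buf;
    double t_start = 0.0, t_end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = calloc(size_bytes, 1);

    for (int i = 0; i < ITERATIONS + SKIP; i++) {
        if (i == SKIP) {                 /* start the clock after warm-up */
            MPI_Barrier(MPI_COMM_WORLD);
            t_start = MPI_Wtime();
        }
        if (rank == 0) {
            MPI_Send(buf, size_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t_end = MPI_Wtime();

    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n", size_bytes,
               (t_end - t_start) * 1e6 / (2.0 * ITERATIONS));

    free(buf);
    MPI_Finalize();
    return 0;
}

Run it with two ranks placed on different VMs so the measurement crosses the InfiniBand fabric rather than shared memory.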
MVAPICH2 Bandwidth on HB/HC
[Figure: MPI bandwidth (MB/s) vs. message size (1 B to 4 MB) on HC and HB, comparing MVAPICH2 2.3.2 with MVAPICH2 2.3.2 + new driver; peak callouts: 12.06 GB/s, 12.22 GB/s, 12.04 GB/s, 12.38 GB/s.]
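Bandwidth curves like these are typically measured with a windowed streaming test in the style of osu_bw: the sender keeps a window of non-blocking sends in flight, the receiver posts matching receives and acknowledges each window. The sketch below is a generic version of that pattern, not the OSU benchmark code; WINDOW and ITERATIONS are arbitrary. The bidirectional result on the next slide uses the same idea with both ranks streaming a window in each direction at once.

/* Minimal MPI streaming-bandwidth sketch (in the spirit of osu_bw, not
 * the OSU micro-benchmark source). Rank 0 posts a window of
 * non-blocking sends to rank 1, which posts matching receives and
 * acknowledges each window so the sender cannot run ahead. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW     64     /* messages in flight per iteration */
#define ITERATIONS 100

int main(int argc, char **argv)
{
    int rank, size_bytes = (argc > 1) ? atoi(argv[1]) : (1 << 20);
    char *buf = malloc((size_t)size_bytes);
    MPI_Request req[WINDOW];
    double t_start, t_end;
    char ack = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t_start = MPI_Wtime();

    for (int i = 0; i < ITERATIONS; i++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, size_bytes, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, size_bytes, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    t_end = MPI_Wtime();

    if (rank == 0) {
        double bytes = (double)size_bytes * WINDOW * ITERATIONS;
        printf("%d bytes: %.2f MB/s\n", size_bytes,
               bytes / (t_end - t_start) / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}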
MVAPICH2 Bidirectional Bandwidth on HB/HC
[Figure: MPI bidirectional bandwidth (MB/s) vs. message size (1 B to 4 MB) on HC and HB, comparing MVAPICH2 2.3.2 with MVAPICH2 2.3.2 + new driver; peak callouts of 24.12 GB/s and 24.72 GB/s on both panels.]
Application Scaling: Ansys Fluent
[Figure: Fluent v19.3, F1 Racecar 140M model, MPI: HPC-X 2.4 - speedup vs. number of nodes (up to 288); series: Linear 100%, Linear 75%, HB PPN=30, HB PPN=45.]
• 88% efficiency at 128 VMs, within 1% of XC50
• 83% efficiency at 288 VMs (12,960 MPI ranks)
• Valuable lessons learned about EPYC scalability
Bare Metal Results: https://www.ansys.com/solutions/solutions-by-role/it-professionals/platform-support/benchmarks-overview/ansys-fluent-benchmarks/ansys-fluent-benchmarks-release-19/external-flow-over-a-formula-1-race-car
Application Scaling: StarCCM
[Figure: LEMANS 100M coupled scaling on HB - speedup and parallel efficiency vs. number of nodes (up to 256); series: Linear Scaling, HB PPN=30, HB PPN=45.]
HB PPN=30:
Host  Cores  PPN  Speedup  Efficiency (%)
1     30     30   30.0     100.0
2     60     30   59.9     99.9
4     120    30   119.2    99.3
16    480    30   446.6    93.0
32    960    30   895.7    93.3
64    1920   30   1664.8   86.7
128   3840   30   3138.1   81.7
256   7680   30   5473.3   71.3

HB PPN=45:
Host  Cores  PPN  Speedup  Efficiency (%)
1     45     45   45       100.0
2     90     45   89       99.1
16    720    45   697      96.7
32    1440   45   1318     91.5
64    2880   45   2505     87.0
128   5760   45   4451     77.3
256   11520  45   7449     64.7
Application Scaling: CP2K at 22,528 cores
[Figure: LiHFX: PRACE CP2K Test Case B v2.0 - time to solution in seconds (lower is better) and cases per day, full integers only (higher is better), vs. number of nodes from 54 to 512; series: Noctua [s], Seconds, Cases/d.]
• 512 HC VMs at 22,528 cores
• https://www.cp2k.org/performance
Big Data on Azure HPC
[Figure: TeraSort (320 GB) and PageRank (19 GB) execution time (s); series labels: Spark-RDMA, RDMA (100 Gbps).]
[Figure: HDFS TestDFSIO (Write) execution time (s) for 512 GB to 1 TB data sets; series: Ethernet (40 Gb), IPoIB (100 Gb), RDMA (100 Gb).]
[Figure: Memcached GET latency (us) vs. message size (1 B to 4 KB).]
[Figure: Horovod TensorFlow images/second and parallel efficiency (%) on 2, 4, 8, and 16 nodes; series: IPoIB (100 Gb), RDMA (100 Gb), IPoIB Efficiency, RDMA Efficiency; efficiency callouts: 100.00/96.78/95.58/94.93 and 100.00/98.86/98.37/96.94.]
MVAPICH2 Azure Release!
http://mvapich.cse.ohio-state.edu/downloads/
Quick-Deploy MVAPICH2 to Azure
http://mvapich.cse.ohio-state.edu/downloads/
https://github.com/Azure/azhpc-templates/tree/mvapich2/create-vmss
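Once the scale set is deployed and MVAPICH2 is installed, a quick sanity check is to compile and run a small rank-report program across the VMs. The sketch below (hypothetical file name hello_mpi.c) is a generic MPI test, not part of the MVAPICH2 release or of the azhpc-templates repository.

/* hello_mpi.c - generic sanity check for a fresh MPI installation
 * (hypothetical example). Each rank reports its host name so you can
 * confirm the job actually spans the scale-set VMs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

Compile it with mpicc and launch it with the launcher provided by the installation (e.g. mpiexec or mpirun_rsh) plus a host file listing the scale-set nodes; every VM should appear in the output.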
Upcoming HBv2 Instances
Preview Access: https://azure.microsoft.com/en-us/blog/introducing-the-new-hbv2-azure-virtual-machines-for-high-performance-computing/
Links
http://mvapich.cse.ohio-state.edu/downloads/
https://github.com/Azure/azhpc-templates/tree/mvapich2/create-vmss
http://mvapich.cse.ohio-state.edu/performance/mv2-azure-pt_to_pt/
https://techcommunity.microsoft.com/t5/Azure-Compute/CentOS-HPC-VM-Image-for-SR-IOV-enabled-Azure-HPC-VMs/ba-p/665557
https://docs.microsoft.com/azure/virtual-machine-scale-sets/overview
https://azure.microsoft.com/blog/introducing-the-new-hbv2-azure-virtual-machines-for-high-performance-computing/