Accelerating HPC, Big Data and Deep Learning on OpenPOWER Platforms
Talk at OpenPOWER Academic Discussion Group Workshop 2019
by
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Follow us on https://twitter.com/mvapich
High-End Computing (HEC): PetaFlop to ExaFlop
Expected to have an ExaFlop system in 2020-2021!
100 PFlops in 2017
149 PFlops in 2018
1 EFlops in 2020-2021?
• Challenges in Designing Convergent HPC, Big Data and Deep Learning Architectures
• MVAPICH Project - MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• HiBD Project – High-Performance Big Data Analytics Library
• Commercial Support from X-ScaleSolutions
• Conclusions and Q&A
Presentation Overview
Increasing Usage of HPC, Big Data and Deep Learning
HPC (MPI, RDMA, Lustre, etc.)
Big Data (Hadoop, Spark, HBase, Memcached, etc.)
Deep Learning (Caffe, TensorFlow, BigDL, etc.)
Convergence of HPC, Big Data, and Deep Learning!
Increasing Need to Run these applications on the Cloud!!
Can We Run HPC, Big Data and Deep Learning Jobs on Existing HPC Infrastructure?
Physical Compute
Spark Job
Hadoop Job
Deep Learning Job
• Challenges in Designing Convergent HPC, Big Data and Deep Learning Architectures
• MVAPICH Project – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• HiBD Project – High-Performance Big Data Analytics Library
• Commercial Support from X-ScaleSolutions
• Conclusions and Q&A
Presentation Overview
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 (SC ‘02)
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 615,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (June ‘19 ranking)
• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 5th, 448,448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 15th, 570,020 cores (Nurion) in South Korea and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
• Partner in the TACC Frontera System
MVAPICH2 Release Timeline and Downloads
[Chart: cumulative number of downloads from the OSU site, Sep 2004 - Apr 2019, growing from 0 to more than 600,000, annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3rc2, OSU INAM 0.9.3, MV2-GDR 2.3.2, MV2 2.3.2, MV2-Azure 2.3.2, and MV2-AWS 2.3]
Architecture of MVAPICH2 Software Family (MPI, PGAS and DL)
• High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms: Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis
• Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter)
  – Transport Protocols: RC, SRD, UD, DC
  – Modern Features: UMR, ODP, SR-IOV, Multi-Rail
• Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
  – Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM
  – Modern Features: Optane*, NVLink, CAPI* (* Upcoming)
MVAPICH2 Software Family
Requirements Library
MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2
Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE MVAPICH2-X
MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR
HPC Cloud with MPI & IB MVAPICH2-Virt
Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA
MPI Energy Monitoring Tool OEMT
InfiniBand Network Analysis and Monitoring OSU INAM
Microbenchmarks for Measuring MPI and PGAS Performance OMB
Convergent Software Stacks for HPC, Big Data and Deep Learning
HPC (MPI, RDMA, Lustre, etc.)
Big Data (Hadoop, Spark, HBase, Memcached, etc.)
Deep Learning (Caffe, TensorFlow, BigDL, etc.)
OpenPOWER Platform Support in MVAPICH2 Libraries
• MVAPICH2 – Basic MPI support
  – Since MVAPICH2 2.2rc1 (March 2016)
• MVAPICH2-X – PGAS (OpenSHMEM and UPC) and Hybrid MPI+PGAS support
  – Since MVAPICH2-X 2.2b (March 2016)
  – Advanced Collective Support with CMA, since MVAPICH2-X 2.3b (Oct 2017)
• MVAPICH2-GDR – NVIDIA GPGPU support with NVLink (CORAL systems like Summit and Sierra)
  – Since MVAPICH2-GDR 2.3a (Nov 2017)
MPI, PGAS and Deep Learning Support for OpenPOWER
• Message Passing Interface (MPI) Support
  – Point-to-point Inter-node and Intra-node
  – XPMEM-based collectives
• Exploiting Accelerators (NVIDIA GPGPUs)
  – CUDA-aware MPI
  – Point-to-point
  – Applications
  – Integrated Support with TAU
Intra-node Point-to-Point Performance on OpenPOWER
Platform: Two nodes of OpenPOWER (POWER9-ppc64le) CPU using Mellanox EDR (MT4121) HCA
[Charts: Intra-Socket Small Message Latency, Intra-Socket Large Message Latency, Intra-Socket Bandwidth, and Intra-Socket Bi-directional Bandwidth, comparing MVAPICH2-X 2.3.2 and SpectrumMPI-10.3.0.01; MVAPICH2-X delivers 0.25 us small-message latency]
Inter-node Point-to-Point Performance on OpenPOWER
Platform: Two nodes of OpenPOWER (POWER9-ppc64le) CPU using Mellanox EDR (MT4121) HCA
[Charts: Small Message Latency, Large Message Latency, Bandwidth, and Bi-directional Bandwidth, comparing MVAPICH2-2.3.2 and SpectrumMPI-10.3.0.01; peak values of 24,743 MB/s (bandwidth) and 49,249 MB/s (bi-directional bandwidth) are highlighted]
Optimized MVAPICH2 All-Reduce with XPMEM (Nodes=1, PPN=20 and Nodes=2, PPN=20)
• Optimized MPI All-Reduce Design in MVAPICH2
  – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node
[Charts: MPI_Allreduce latency for 16K-128K and 256K-2M message sizes, comparing MVAPICH2-2.3, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM; highlighted gains include 34%, 48%, 2X, 3X, 3.3X, and 4X]
Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch
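The latency curves above are of the kind produced by the OSU Micro-Benchmarks (osu_allreduce). Purely as an illustration of how such a number is measured, here is a minimal host-buffer sketch in C; the message size, iteration counts, and float/MPI_SUM reduction are assumptions, and the optimized XPMEM path is selected inside the library (e.g., via the runtime parameters above), not by the application code.

/* Minimal sketch of an OSU-style MPI_Allreduce latency measurement.
 * Message size and iteration counts are illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 256 * 1024 / sizeof(float);   /* 256 KB message */
    const int warmup = 10, iters = 100;
    float *sendbuf = malloc(count * sizeof(float));
    float *recvbuf = malloc(count * sizeof(float));
    for (int i = 0; i < count; i++) sendbuf[i] = 1.0f;

    for (int i = 0; i < warmup; i++)                 /* warm-up iterations */
        MPI_Allreduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("Average MPI_Allreduce latency: %.2f us\n", (t1 - t0) * 1e6 / iters);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}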
Optimized MVAPICH2 All-Reduce with XPMEM (Nodes=3, PPN=20 and Nodes=4, PPN=20)
• Optimized MPI All-Reduce Design in MVAPICH2
  – Up to 2X performance improvement over OpenMPI for inter-node (Spectrum MPI didn't run for >2 processes)
[Charts: MPI_Allreduce latency for 16K-128K and 256K-2M message sizes, comparing MVAPICH2-2.3, OpenMPI-3.0.0, and MVAPICH2-XPMEM; highlighted gains include 11%, 18.8%, 20%, 22%, 27%, 35%, 42%, and 2X]
Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch
MiniAMR Performance using Optimized XPMEM-based Collectives
• MiniAMR application execution time comparing MVAPICH2-2.3rc1 and the optimized All-Reduce design
  – Up to 45% improvement over MVAPICH2-2.3rc1 in mesh-refinement time of the MiniAMR application for a weak-scaling workload on up to four POWER8 nodes
[Chart: execution time (s) for 10, 20, 40, and 60 MPI processes (proc-per-sock=10, proc-per-node=20), MVAPICH-2.3 vs. MVAPICH2-XPMEM; improvements of 36%, 41%, 42%, and 45%]
Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=scatter
MPI, PGAS and Deep Learning Support for OpenPOWER
• Message Passing Interface (MPI) Support
  – Point-to-point Inter-node and Intra-node
  – XPMEM-based collectives
• Exploiting Accelerators (NVIDIA GPGPUs)
  – CUDA-aware MPI
  – Point-to-point
  – Applications
  – Integrated Support with TAU
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(data movement handled inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
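As an illustration of the interface sketched above, here is a minimal two-rank exchange that passes CUDA device pointers straight to MPI_Send/MPI_Recv; the buffer size and tag are arbitrary, and a CUDA-aware build is assumed (with MVAPICH2-GDR, enabled at run time via MV2_USE_CUDA=1).

/* Minimal sketch: passing CUDA device buffers directly to MPI.
 * Assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR); no explicit
 * cudaMemcpy staging through host memory is needed. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                       /* 1M floats, ~4 MB */
    float *devbuf;
    cudaMalloc((void **)&devbuf, n * sizeof(float));

    if (rank == 0) {
        /* Send directly from GPU memory; the library moves the data
         * (pipelined copies, GPUDirect RDMA, NVLink) as appropriate. */
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}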
Optimized MVAPICH2-GDR Design: GPU-to-GPU (D-D) Inter-node Performance
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth for 1-byte to 4K/8K messages, comparing MV2 (no GDR) and MV2-GDR 2.3; MV2-GDR reaches 1.85 us latency (11X better) and roughly 9x-10x higher bandwidth and bi-bandwidth]
Platform: MVAPICH2-GDR 2.3 on an Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, and Mellanox OFED 4.0 with GPUDirect RDMA
D-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)
[Charts: intra-node and inter-node latency (small and large messages) and bandwidth, measured between GPU device buffers]
Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand interconnect
Intra-node Latency: 0.76 us (with GDRCopy); Intra-node Bandwidth: 65.48 GB/sec for 4 MB (via NVLink2)
Inter-node Latency: 2.18 us (with GDRCopy 2.0); Inter-node Bandwidth: 23 GB/sec for 4 MB (via 2-port EDR)
D-to-H & H-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)
Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand interconnect
Intra-node D-H Latency: 0.49 us (with GDRCopy); Intra-node D-H Bandwidth: 16.70 GB/sec for 2 MB (via NVLink2)
Intra-node H-D Latency: 0.49 us (with GDRCopy 2.0); Intra-node H-D Bandwidth: 26.09 GB/sec for 2 MB (via NVLink2)
[Charts: D-H and H-D intra-node latency (small and large messages) and bandwidth, Spectrum MPI vs. MV2-GDR]
Managed Memory Performance (OpenPOWER Intra-node)
[Charts: intra-node latency, bandwidth, and bi-directional bandwidth with CUDA managed (MD) buffers on both sides]
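As a companion to the managed-memory (MD) results above, here is a minimal sketch of communicating from CUDA managed (unified) memory with a CUDA-aware MPI such as MVAPICH2-GDR; the buffer size, ranks, and tag are illustrative.

/* Minimal sketch: MPI transfer using CUDA managed (unified) memory.
 * Assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR) that detects
 * managed buffers and moves the data over NVLink/InfiniBand as needed. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;
    float *buf;
    cudaMallocManaged((void **)&buf, n * sizeof(float), cudaMemAttachGlobal);

    if (rank == 0) {
        for (size_t i = 0; i < n; i++) buf[i] = (float)i;   /* fill on the host side */
        MPI_Send(buf, (int)n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, (int)n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}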
MVAPICH2 with SHARP Support (Preliminary Results)
[Charts: MPI_Allreduce latency for 4-512 byte messages]
• GPU AllReduce w/ SHARP Support on 2 and 4 nodes: MVAPICH2-GDR-Next vs. MVAPICH2-GDR-Next w/ SHARP (highlighted latencies of 3.41 us and 2.34 us)
• HOST AllReduce w/ SHARP Support on 2 and 8 nodes: MVAPICH2-2.3.2 vs. MVAPICH2-2.3.2 w/ SHARP (highlighted latencies of 3.14 us and 3.97 us)
Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand interconnect
Application: HYPRE - BoomerAMG
Seconds (sum of Setup Phase & Solve Phase times) vs. # of GPUs:
  16 GPUs: Spectrum-MPI 10.3.0.1 = 0.97, MVAPICH2-GDR 2.3.2 = 0.94
  32 GPUs: Spectrum-MPI = 1.196, MVAPICH2-GDR = 1.01 (16% improvement)
  64 GPUs: Spectrum-MPI = 1.436, MVAPICH2-GDR = 1.082 (25% improvement)
  128 GPUs: Spectrum-MPI = 1.756, MVAPICH2-GDR = 1.628
RUN MVAPICH2-GDR 2.3.2:
  export MV2_USE_CUDA=1 MV2_USE_GDRCOPY=0 MV2_USE_RDMA_CM=0
  export MV2_USE_GPUDIRECT_LOOPBACK=0 MV2_HYBRID_BINDING_POLICY=spread MV2_IBA_HCA=mlx5_0:mlx5_3
  OMP_NUM_THREADS=20 lrun -n 128 -N 32 mpibind ./ij -P 8 4 4 -n 50 50 50 -pmis -Pmx 8 -keepT 1 -rlx 18
RUN Spectrum-MPI 10.3.0.1:
  OMP_NUM_THREADS=20 lrun -n 128 -N 32 --smpiargs "-gpu --disable_gdr" mpibind ./ij -P 8 4 4 -n 50 50 50 -pmis -Pmx 8 -keepT 1 -rlx 18
Application: COMB
16 GPUs on POWER9 system (test: Comm mpi, Mesh cuda, Device Buffers, mpi_type); per-phase times:
                                pre-comm  post-recv  post-send  wait-recv  wait-send  post-comm  start-up  test-comm  bench-comm
  Spectrum MPI 10.3             0.0001    0.0000     1.6021     1.7204     0.0112     0.0001     0.0004    7.7383     83.6229
  MVAPICH2-GDR 2.3.2            0.0001    0.0000     0.0862     0.0871     0.0018     0.0001     0.0009    0.3558     4.4396
  MVAPICH2-GDR 2.3.3 (Upcoming) 0.0001    0.0000     0.0030     0.0032     0.0001     0.0001     0.0009    0.0133     0.1602
(~18x faster bench-comm with MVAPICH2-GDR 2.3.2 over Spectrum MPI, and a further ~27x with the upcoming 2.3.3)
- Improvements due to enhanced support for GPU-kernel based packing/unpacking routines
Run scripts pushed to the COMB GitHub repo: https://github.com/LLNL/Comb/pull/2
Application: UMT - GPU
Cumulative Work Time (sec) vs. # of GPUs:
  1 GPU:   MVAPICH2-GDR-2.3.2 = 19.3, Spectrum-MPI (rolling release) = 18.8
  4 GPUs:  MVAPICH2-GDR-2.3.2 = 21.8, Spectrum-MPI = 22
  8 GPUs:  MVAPICH2-GDR-2.3.2 = 35.1, Spectrum-MPI = 34.8
  16 GPUs: MVAPICH2-GDR-2.3.2 = 52, Spectrum-MPI = 52.1
RUN Spectrum-MPI / MV2-GDR:
  jsrun -r 4 -p 16 mpibind tau_exec -ebs ./SuOlsonTest ../sierra-runs/2x2x4_20.cmg 16 2 16 8 4
PREPARE MV2-GDR:
  export MV2_USE_GDRCOPY=0
  export MV2_USE_CUDA=1
  export MV2_SUPPORT_TENSOR_FLOW=1
  export MV2_ENABLE_AFFINITY=0
  export OMPI_LD_PRELOAD_PREPEND=$HOME/software/mvapich232-jsrun-pgi/install/lib/libmpi.so
• Use MV2-GDR pgi/18.7 w/ jsrun rpm
• Use TAU for profiling the application
Application: SW4
[Charts: Solver Phase Time (sec) for 12, 24, and 48 GPUs, comparing Spectrum-MPI and MVAPICH2-GDR on host, device, and managed buffers]
RUN Spectrum-MPI:
  GPU_SUPPORT='-M "-gpu"' (for device & managed buffers)
  jsrun -n $NPROCS -g1 -c7 -a1 $GPU_SUPPORT ./sw4 $Input_file
  NPROCS=<#gpus per node>*<#allocated nodes>; Input_file=hayward-att-h200-ref.in
RUN MV2-GDR:
  $MPI_HOME/bin/mpirun_rsh -export-all -np $NPROCS --hostfile <hostfile> $mv2-gdr-flags LD_PRELOAD=$MPI_HOME/lib/libmpi.so ./sw4 $Input_file
  MV2-GDR flags: MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_RDMA=1 MV2_USE_GPUDIRECT_GDRCOPY=0 MV2_USE_RDMA_CM=0 MV2_DEBUG_SHOW_BACKTRACE=1 MV2_SHOW_CPU_BINDING=1 MV2_SHOW_ENV_INFO=2 MV2_USE_GPUDIRECT_LOOPBACK=0 MV2_CPU_BINDING_POLICY=HYBRID MV2_HYBRID_BINDING_POLICY=SPREAD MV2_IBA_HCA=mlx5_0:mlx5_3
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Charts: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs), comparing Default, Callback-based, and Event-based designs]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
• Challenges in Designing Convergent HPC, Big Data and Deep Learning Architectures
• MVAPICH Project - MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• HiBD Project – High-Performance Big Data Analytics Library
• Commercial Support from X-ScaleSolutions
• Conclusions and Q&A
Presentation Overview
Deep Learning: New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (order of megabytes)
  – Most communication based on GPU buffers
• Existing State-of-the-art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – NCCL2, CUDA-Aware MPI --> scale-out performance
    • For small and medium message sizes only!
• Proposed: Can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient Overlap of Computation and Communication
  – Efficient Large-Message Communication (Reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?
[Figure: scale-up vs. scale-out performance of cuDNN, MKL-DNN, gRPC, Hadoop, MPI, and NCCL2, with the desired design combining both]
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
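One ingredient of such co-designs is overlapping the large reductions with computation. Below is a minimal host-buffer sketch using the standard non-blocking MPI_Iallreduce (MPI-3); the gradient size and the dummy computation are illustrative, and the actual S-Caffe/MVAPICH2-GDR designs operate on GPU buffers and are considerably more involved.

/* Minimal sketch: overlapping a large gradient reduction with computation
 * using non-blocking MPI_Iallreduce. Buffer sizes are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const int count = 1 << 20;                      /* ~4 MB of gradients */
    float *grad = malloc(count * sizeof(float));
    float *sum  = malloc(count * sizeof(float));
    for (int i = 0; i < count; i++) grad[i] = 1.0f;

    MPI_Request req;
    /* Start the large-message reduction for this layer ... */
    MPI_Iallreduce(grad, sum, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... and keep computing (e.g., back-propagation of the next layer)
     * while the runtime progresses the communication. */
    volatile double acc = 0.0;
    for (int i = 0; i < count; i++) acc += grad[i] * 0.5;

    MPI_Wait(&req, MPI_STATUS_IGNORE);              /* reduction complete here */

    free(grad);
    free(sum);
    MPI_Finalize();
    return 0;
}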
Convergent Software Stacks for HPC, Big Data and Deep Learning
HPC (MPI, RDMA, Lustre, etc.)
Big Data (Hadoop, Spark, HBase, Memcached, etc.)
Deep Learning (Caffe, TensorFlow, BigDL, etc.)
High-Performance Deep Learning
• CPU-based Deep Learning – Using MVAPICH2-X
• GPU-based Deep Learning – Using MVAPICH2-GDR
ResNet-50 using various DL benchmarks on Frontera
• Observed 260K images per sec for ResNet-50 on 2,048 Nodes
• Scaled MVAPICH2-X on 2,048 nodes on Frontera for Distributed Training using TensorFlow
• ResNet-50 can be trained in 7 minutes on 2048 nodes (114,688 cores)
*Jain et al., “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (in conjunction with SC ’19).
High-Performance Deep Learning
• CPU-based Deep Learning – Using MVAPICH2-X
• GPU-based Deep Learning – Using MVAPICH2-GDR
• MVAPICH2-GDR offers excellent performance via advanced designs for MPI_Allreduce.
• Up to 11% better performance on the RI2 cluster (16 GPUs)
• Near-ideal – 98% scaling efficiency
Exploiting CUDA-Aware MPI for TensorFlow (Horovod)
[Chart: Images/second (higher is better) vs. number of GPUs (1-16), comparing Horovod-MPI, Horovod-NCCL2, Horovod-MPI-Opt (Proposed), and Ideal]
MVAPICH2-GDR 2.3 (MPI-Opt) is up to 11% faster than MVAPICH2 2.3 (Basic CUDA support)
A. A. Awan et al., “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, CCGrid ‘19, https://arxiv.org/abs/1810.11112
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation
• Optimized designs in MVAPICH2-GDR 2.3 offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
[Charts: Allreduce latency vs. message size, MVAPICH2-GDR vs. NCCL2; up to ~3X better for small messages (4 bytes - 64 KB) and ~1.2X better for large messages]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand interconnect
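For reference, the measured operation is simply MPI_Allreduce invoked on GPU-resident buffers. A minimal sketch under a CUDA-aware MPI such as MVAPICH2-GDR is shown below; the element count and data type are assumptions (NCCL's counterpart would be ncclAllReduce on the same device pointers).

/* Minimal sketch: MPI_Allreduce directly on a CUDA device buffer.
 * Assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const int count = 1 << 22;                     /* DL-sized message (~16 MB) */
    float *d_grad;
    cudaMalloc((void **)&d_grad, count * sizeof(float));
    cudaMemset(d_grad, 0, count * sizeof(float));

    /* Sum the buffer across all ranks, in place, without staging on the host. */
    MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(d_grad);
    MPI_Finalize();
    return 0;
}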
MVAPICH2-GDR vs. NCCL2: Allreduce Optimization (DGX-2)
• Optimized designs in upcoming MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on a DGX-2 machine
[Charts: Allreduce latency and bandwidth vs. message size (4 bytes - 64 MB), MVAPICH2-GDR 2.3.2 vs. NCCL 2.4; up to 5.8X better latency and 2.4X better bandwidth]
Platform: NVIDIA DGX-2 system @ PSC (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2
MVAPICH2-GDR: MPI_Allreduce (Device Buffers) on Summit
• Optimized designs in MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs
[Charts on 1,536 GPUs: latency for 4 bytes - 16 KB (up to 1.6X better) and bandwidth for 32M-256M messages (up to 1.7X better), MVAPICH2-GDR-2.3.2 vs. NCCL 2.4; 128 MB message bandwidth scaling from 24 to 1,536 GPUs, comparing SpectrumMPI 10.2.0.11, OpenMPI 4.0.1, NCCL 2.4, and MVAPICH2-GDR-2.3.2 (up to 1.7X better)]
Platform: Dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs, and 2-port InfiniBand EDR interconnect
Distributed Training with TensorFlow and MVAPICH2-GDR on Summit
• ResNet-50 training using the TensorFlow benchmark on Summit -- 1,536 Volta GPUs!
• 1,281,167 (1.2 million) images
• Time/epoch = 3.6 seconds
• Total time (90 epochs) ≈ 332 seconds ≈ 5.5 minutes!
[Chart: Images per second (thousands) vs. number of GPUs (1-1,536), NCCL-2.4 vs. MVAPICH2-GDR-2.3.2]
Platform: The Summit Supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2
*We observed errors for NCCL2 beyond 96 GPUs
MVAPICH2-GDR reaching ~0.35 million images per second for ImageNet-1k!
ImageNet-1k has 1.2 million images
New Benchmark for Image Segmentation on Summit
• Near-linear scaling may be achieved by tuning Horovod/MPI
  – Optimizing MPI/Horovod towards large message sizes for high-resolution images
• Develop a generic Image Segmentation benchmark
  – Tuned DeepLabV3+ model using the benchmark and Horovod, up to 1.3X better than default
*Anthony et al., "Scaling Semantic Image Segmentation using Tensorflow and MVAPICH2-GDR on HPC Systems" (submission under review)
Using HiDL Packages for Deep Learning on Existing HPC Infrastructure
• Challenges in Designing Convergent HPC, Big Data and Deep Learning Architectures
• MVAPICH Project - MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• HiBD Project – High-Performance Big Data Analytics Library
• Commercial Support from X-ScaleSolutions
• Conclusions and Q&A
Presentation Overview
Data Management and Processing on Modern Datacenters
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
  – Front-end data accessing and serving (Online): Memcached + DB (e.g., MySQL), HBase
  – Back-end data analytics (Offline): HDFS, MapReduce, Spark
Convergent Software Stacks for HPC, Big Data and Deep Learning
HPC (MPI, RDMA, Lustre, etc.)
Big Data (Hadoop, Spark, HBase, Memcached, etc.)
Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• RDMA for Apache Spark
• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Kafka
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users Base: 315 organizations from 35 countries
• More than 31,600 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE (also runs on Ethernet)
Available for x86 and OpenPOWER
Support for Singularity and Docker
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Different Modes of RDMA for Apache Hadoop 2.x
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components
– Enhanced HDFS with in-memory and heterogeneous storage
– High performance design of MapReduce over Lustre
– Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode)
– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
– Support for OpenPOWER, Singularity, and Docker
• Current release: 1.3.5
– Based on Apache Hadoop 2.8.0
– Compliant with Apache Hadoop 2.8.0, HDP 2.5.0.3 and CDH 5.8.2 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• Different file systems with disks and SSDs and Lustre
RDMA for Apache Hadoop 2.x Distribution
http://hibd.cse.ohio-state.edu
• For TestDFSIO throughput experiment, RDMA-IB design of HHH mode has an improvement of 1.57x - 2.06x compared to IPoIB (100Gbps).
• In HHH-M mode, the improvement goes up to 2.18x - 2.26x compared to IPoIB (100Gbps).
Performance of RDMA-Hadoop on OpenPOWER: TestDFSIO Throughput
[Charts: total throughput (MBps) vs. data size (10, 20, 30 GB) in HHH and HHH-M modes, IPoIB (100Gbps) vs. RDMA-IB (100Gbps); 2.06x and 2.26x gains highlighted]
• The RDMA-IB design of HHH mode reduces the job execution time of Sort by a maximum of 41% compared to IPoIB (100Gbps).
• The HHH-M design reduces the execution time by a maximum of 55%.
Performance of RDMA-Hadoop on OpenPOWER: Sort Execution Time
[Charts: execution time (sec) vs. data size (10, 20, 30 GB) in HHH and HHH-M modes, IPoIB (100Gbps) vs. RDMA-IB (100Gbps); 41% and 55% reductions highlighted]
• The RDMA-IB design of HHH mode reduces the job execution time of TeraSort by a maximum of 12% compared to IPoIB (100Gbps).
• In HHH-M mode, the execution time of TeraSort is reduced by a maximum of 21% compared to IPoIB (100Gbps).
Performance of RDMA-Hadoop on OpenPOWER: TeraSort Execution Time
[Charts: execution time (sec) vs. data size (10, 20, 30 GB) in HHH and HHH-M modes, IPoIB (100Gbps) vs. RDMA-IB (100Gbps); 12% and 21% reductions highlighted]
Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure
• High-Performance Design of Spark over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark
– RDMA-based data shuffle and SEDA-based shuffle architecture
– Non-blocking and chunk-based data transfer
– Off-JVM-heap buffer management
– Support for OpenPOWER
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.5
– Based on Apache Spark 2.1.0
– Tested with
  • Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• RAM disks, SSDs, and HDD
– http://hibd.cse.ohio-state.edu
RDMA for Apache Spark Distribution
• GroupBy: RDMA design outperforms IPoIB by a maximum of 11%
• SortBy: RDMA design outperforms IPoIB by a maximum of 18%
Performance of RDMA-Spark on OpenPOWER
[Charts: GroupBy and SortBy execution time (sec) for 32, 48, and 64 GB data sizes, IPoIB vs. RDMA; 11% and 18% gains highlighted]
• TeraSort: RDMA design outperforms IPoIB by a maximum of 35%
• Sort: RDMA design outperforms IPoIB by a maximum of 25%
Performance of RDMA-Spark on OpenPOWER
[Charts: TeraSort and Sort execution time (sec) for 60, 80, and 100 GB data sizes, IPoIB vs. RDMA; 35% and 25% gains highlighted]
Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure
• Challenges in Designing Convergent HPC, Big Data and Deep Learning Architectures
• MVAPICH Project - MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• HiBD Project – High-Performance Big Data Analytics Library
• Commercial Support from X-ScaleSolutions
• Conclusions and Q&A
Presentation Overview
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
– Help and guidance with installation of the library
– Platform-specific optimizations and tuning
– Timely support for operational issues encountered with the library
– Web portal interface to submit issues and track their progress
– Advanced debugging techniques
– Application-specific optimizations and tuning
– Obtaining guidelines on best practices
– Periodic information on major fixes and updates
– Information on major releases
– Help with upgrading to the latest release
– Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years
Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
• X-ScaleSolutions has joined the OpenPOWER Consortium as a silver ISV member
• Provides flexibility:
  – To have the MVAPICH2, HiDL and HiBD libraries integrated into the OpenPOWER software stack
– A part of the OpenPOWER ecosystem
– Can participate with different vendors for bidding, installation and deployment process
• Introduced two new integrated products with support for OpenPOWER systems (Presented at the OpenPOWER North America Summit)
– X-ScaleHPC
– X-ScaleAI
– Send an e-mail to [email protected] for free trial!!
Silver ISV Member for the OpenPOWER Consortium + Products
X-ScaleHPC Package
• Scalable solutions of communication middleware based on OSU MVAPICH2 libraries
• “out-of-the-box” fine-tuned and optimal performance on various HPC systems including OpenPOWER platforms
• Contact us for more details and a free trial!! – [email protected]
• Stop by X-ScaleSolutions booth (#2094) for a Demo!!
X-ScaleAI Package
• High-performance and scalable solutions for deep learning
  – Fully exploiting HPC resources using our X-ScaleHPC package
• "Out-of-the-box" optimal performance on OpenPOWER (POWER9) + GPU platforms such as the #1 Summit system
• What's in the X-ScaleAI package?
  – Fine-tuned CUDA-Aware MPI library
  – Google TensorFlow framework built for the OpenPOWER system
  – Distributed training using Horovod on top of TensorFlow
  – Simple installation and execution in one command!
• Contact us for more details and a free trial!! – [email protected]
• Stop by the X-ScaleSolutions booth (#2094) for a demo!!
• Upcoming Exascale systems need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud
• OpenPOWER, InfiniBand, and NVIDIA GPGPUs are emerging technologies for such systems
• Presented a set of solutions from OSU to enable HPC, Big Data and Deep Learning through a convergent software architecture for OpenPOWER platforms
• X-ScaleSolutions is an ISV provider in the OpenPOWER consortium to provide commercial support, optimizations, tuning and training for the OSU solutions
• OpenPOWER users are encouraged to take advantage of these solutions to extract the highest performance and scalability for their applications on OpenPOWER platforms
Concluding Remarks
• Presentations at the OSU and X-Scale booth (#2094)
  – Members of the MVAPICH, HiBD and HiDL teams
– External speakers
• Presentations at SC main program (Tutorials, Workshops, BoFs, Posters, and Doctoral Showcase)
• Presentation at many other booths (Mellanox, Intel, Microsoft, and AWS) and satellite events
• Complete details available at http://mvapich.cse.ohio-state.edu/conference/752/talks/
Multiple Events at SC ‘19
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– M. Bayatpour (Ph.D.)
– C.-H. Chu (Ph.D.)
– J. Hashmi (Ph.D.)
– A. Jain (Ph.D.)
– K. S. Kandadi (M.S.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– R. Biswas (M.S.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– S. Chakraborthy (Ph.D.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– R. Rajachandrasekar (Ph.D.)
– D. Shankar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
– J. Zhang (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
– X. Lu
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– Kamal Raj (M.S.)
– K. S. Khorassani (Ph.D.)
– P. Kousha (Ph.D.)
– A. Quentin (Ph.D.)
– B. Ramesh (M. S.)
– S. Xu (M.S.)
– J. Lin
– M. Luo
– E. Mancini
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– M. S. Ghazimeersaeed
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– V. Gangal (B.S.)
– N. Sarkauskas (B.S.)
Past Research Specialist
– M. Arnold
Current Research Scientist
– H. Subramoni
– Q. Zhou (Ph.D.)
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Follow us on
https://twitter.com/mvapich