Efficient and Scalable Communication Middleware for Emerging Dense-GPU Clusters
Ching-Hsiang Chu
Advisor: Dhabaleswar K. Panda
Network Based Computing Lab, Department of Computer Science and Engineering
The Ohio State University, Columbus, OH
SC 19 Doctoral Showcase, Network Based Computing Laboratory
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
• Broader Impact on the HPC Community
• Expected Contributions
Trends in Modern HPC Architecture: Heterogeneous
• Multi-core/many-core technologies
• High-performance interconnects: InfiniBand, Omni-Path, EFA (<1 µs latency, 100 Gbps+ bandwidth)
• High-performance storage and compute devices: node-local SSD, NVMe-SSD, NVRAM
• Accelerators/coprocessors: high compute density, high performance per watt
• Variety of programming models (MPI, PGAS, MPI+X)
Top500 examples of dense-GPU systems: #1 Summit (27,648 GPUs), #2 Sierra (17,280 GPUs), #8 ABCI (4,352 GPUs), #10 Lassen (2,664 GPUs), #22 DGX SuperPOD (1,536 GPUs).
Trends in Modern Large-scale Dense-GPU Systems
• Scale-up (up to 150 GB/s)
  – PCIe, NVLink/NVSwitch
  – Infinity Fabric, Gen-Z, CXL
• Scale-out (up to 25 GB/s)
  – InfiniBand, Omni-Path, Ethernet
  – Cray Slingshot
GPU-enabled HPC Applications
• Various scientific applications have been ported to GPUs
  – Reportedly significant speedups compared to the CPU versions
  – High-resolution/high-precision results
• Examples: Lattice Quantum Chromodynamics (23x faster than CPU; image: Derek Leinweber, CSSM, University of Adelaide), weather simulation (2.8x faster than CPU), wave-propagation simulation (25x faster than CPU)
Fuhrer O, Osuna C, Lapillonne X, Gysi T, Bianco M, Schulthess T. "Towards GPU-accelerated operational weather forecasting." GTC 2013.
https://geodynamics.org/cig/software/specfem3d_globe/
GPU-enabled Emerging Deep Learning Applications
• Easy-to-use and high-performance frameworks
• Wide range of applications
  – Image classification
  – Speech recognition
  – Self-driving cars
  – Healthcare
  – Climate analytics
• 999 PetaFlop/s sustained and 1.13 ExaFlop/s peak FP16 performance over 4,560 nodes (27,360 GPUs)
Kurth T, Treichler S, Romero J, Mudigonda M, Luehr N, Phillips E, Mahesh A, Matheson M, Deslippe J, Fatica M, Houston M. "Exascale deep learning for climate analytics." SC 2018 (Gordon Bell Prize).
GPU-Aware (CUDA-Aware) Communication Middleware
• MPI-based generic communication middleware (e.g., inside MVAPICH2)
  – Supports and optimizes various communication patterns
  – Overlaps data movement from the GPU with RDMA transfers
• DL-specific communication middleware
  – Ring-based collective operations
  – Optimized for DL workloads on GPU systems
Broad Challenge
Can we design a generic GPU-enabled communication middleware to fully exploit GPU resources and interconnects for traditional HPC and emerging ML/DL applications?
[Radar chart comparing Ideal, Naive-user, and Advanced-user designs along four axes: Overlap, Productivity, Performance, and Resource Utilization; farther from the center is better.]
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
• Broader Impact on the HPC Community
• Expected Contributions
Problem Statements
• What kind of hardware capabilities can be leveraged to fully exploit the modern interconnects deployed in GPU clusters?
• Which communication patterns can benefit from the proposed GPU-enabled communication middleware?
• How can GPU resources such as high-bandwidth memory (HBM) and the massive number of streaming multiprocessors (SMs) be leveraged to accelerate communication?
• What are the design considerations for a GPU-enabled communication middleware to efficiently utilize these hardware features?
• What performance benefits can be expected from the proposed GPU-enabled communication middleware?
• How can traditional HPC and DL applications take advantage of the proposed communication middleware without application-level modifications?
Research Framework
• Applications: traditional HPC applications (MILC, COSMO, AWP-ODC, HOOMD-blue), machine/deep learning (TensorFlow, CNTK), benchmarks (OMB), streaming
• Programming models: Message Passing Interface (MPI)
• GPU-enabled communication middleware:
  – Scalable streaming broadcast
  – GPU-enabled cooperative and link-efficient reduction operations
  – Efficient non-contiguous data processing: scheduling and transfer
• Compute models: CUDA, OpenACC
• Hardware capabilities: hardware multicast, inter-GPU direct load/store, GPUDirect RDMA, GDRCOPY
• Modern HPC hardware: interconnects (InfiniBand, PCIe, NVLink/NVSwitch), multi-core processors (Xeon, OpenPOWER), accelerator (GPU)
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
  – Scalable Streaming Broadcast for InfiniBand Networks
  – Efficient Scheduling of Non-contiguous Data Transfer
  – GPU-enabled Zero-copy Transfer for Non-contiguous Data
  – GPU-enabled Reduction Operations
• Broader Impact on the HPC Community
• Expected Contributions
Motivated Example #1 – Need for Scalable Broadcast
• Streaming & deep learning applications
  – Large-scale broadcast operations
  – High computation-communication overlap
  – No application-level modification
[Figure: a data source feeds online/offline streams to a data distributor / parameter server, which performs data-streaming-like broadcast operations to workers (CPU + GPUs) spread across GPU nodes 1-4.]
Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, 1 March 2019.
Proposed Efficient and Scalable Solution
• Streaming data through the host
  – Fine-tuned chunked data
  – Three-stage pipeline: 1. data preparation, 2. IB gather, 3. IB hardware multicast with IB scatter (GDR write)
• Leverages IB scatter-gather and GDR features
  – Frees up PCIe resources for applications
[Figure: the source calls MPI_Bcast(d_out,…); data is staged from GPU memory into an intermediate host buffer (im_buf), gathered by the IB HCA, multicast through the IB switch, and scattered (GDR write) directly into the destination GPU buffers (d_in) of destinations 1..N, which call MPI_Bcast(d_in,…).]
Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, 1 March 2019.
Performance Evaluation
• Evaluated on RI2 GPU nodes
[Figure 1: streaming-workload throughput (GB/s), peak and streaming, on 4, 8, and 16 GPU nodes, comparing Knomial-GDR, Ring-GDR-Pipeline, and the proposed TA-Zcpy-MCAST-GDR-Pipeline; up to 6.9X improvement.]
[Figure 2: CA-CNTK* image-classification speedup on 8 and 16 GPU nodes for AlexNet, VGG, and ResNet-50; up to 1.2X.]
Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, 1 March 2019.
*D. S. Banerjee, K. Hamidouche and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," CloudCom 2016.
Performance Model Validation and Prediction
• Based on the architecture of the RI2 cluster
[Figure 1: measured vs. model-estimated latency (s) for 2-16 broadcast sources: K-nomial-based, Ring-based, and MCAST-GDR-Opt; the model is within 10% error of the experiments.]
[Figure 2: model-based prediction of latency (s) for 2-2,048 broadcast sources for the same three schemes.]
Model parameters: M = 2 MB; C = 512 KB; U = 4 KB; B_H ≈ 100 Gbps; B_PCIe = 8 Gbps; t_o(n) ≈ (1/α)·ln(n), 15 ≤ α ≤ 20.
Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, 1 March 2019.
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
  – Scalable Streaming Broadcast for InfiniBand Networks
  – Efficient Scheduling of Non-contiguous Data Transfer
  – GPU-enabled Zero-copy Transfer for Non-contiguous Data
  – GPU-enabled Reduction Operations
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions
Motivated Example #2 – Non-contiguous Data Transfer
• Wide use of MPI derived datatypes for non-contiguous data transfer
  – Requires low-latency, high-overlap processing
• Quantum chromodynamics: MILC with QUDA
• Weather simulation: COSMO model
Mike Clark. "GPU Computing with QUDA," Developer Technology Group, https://www.olcf.ornl.gov/wp-content/uploads/2013/02/Clark_M_LQCD.pdf
M. Martinasso, G. Kwasniewski, S. R. Alam, Thomas C. Schulthess, and T. Hoefler. "A PCIe congestion-aware performance model for densely populated accelerator servers." SC 2016.
Existing GPU-enabled MPI Datatype Processing
• Common scenario: several non-blocking sends of non-contiguous MPI datatypes followed by a wait:

MPI_Isend(A, ..., datatype, ...);
MPI_Isend(B, ..., datatype, ...);
MPI_Isend(C, ..., datatype, ...);
MPI_Isend(D, ..., datatype, ...);
...
MPI_Waitall(...);

(A, B, ... contain non-contiguous MPI datatypes)
• Existing design: for each Isend, the CPU initiates a packing kernel on a GPU stream and then busy-waits for the kernel (WFK) before starting the send; the requests are processed one after another, wasting computing resources on both CPU and GPU.
• Proposed design: initiate the packing kernels for all pending requests back to back, then progress each send as soon as its kernel completes; packing and sends overlap across requests, so the whole sequence finishes earlier.
Ching-Hsiang Chu et al., "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," IEEE IPDPS 2016.
Proposed Event-based Design – Low Latency
• Each MPI_Isend() launches its packing kernel (pack_kernelN<<< >>>) on a dedicated CUDA stream and records an event on that stream with cudaEventRecord().
• MPI_Waitall() queries the events to make progress: once a kernel's event has completed, the corresponding send is posted to the HCA and the request completes.
Proposed Callback-based Design – High Overlap
• Each MPI_Isend() launches its packing kernel on a dedicated CUDA stream and enqueues a callback (addCallback) behind it.
• When a kernel finishes, the callback fires on a helper thread, which posts the send to the HCA and completes the request; the main thread stays free for CPU computations until MPI_Waitall().
Application-level (COSMO HaloExchange) Evaluation

MPI_Isend(Buf1, ..., request1);
MPI_Isend(Buf2, ..., request2);
MPI_Wait(request1, status1);
MPI_Wait(request2, status2);

[Figure: normalized execution time vs. number of GPUs for the Default, Callback-based, and Event-based designs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs); the proposed designs improve performance by up to 2X and 1.6X on the two clusters.]
Ching-Hsiang Chu et al., "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," IEEE IPDPS 2016.
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
  – Scalable Streaming Broadcast for InfiniBand Networks
  – Efficient Scheduling of Non-contiguous Data Transfer
  – GPU-enabled Zero-copy Transfer for Non-contiguous Data
  – GPU-enabled Reduction Operations
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions
Proposed Zero-copy (Packing-free) Datatype Transfer
• Exploiting the load-store capability of modern interconnects
  – Eliminates extra data copies and expensive packing/unpacking processing
• Existing packing scheme: data is packed in source GPU memory, copied over PCIe/NVLink to system/HCA memory, and copied again over PCIe/NVLink into destination GPU memory.
• Proposed packing-free scheme: the non-contiguous data is moved by direct load-store between source and destination GPU memory over PCIe/NVLink.
Ching-Hsiang Chu et al., “High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems”, HiPC 2019.
Performance Evaluation
• Zero-copy (packing-free) transfers for GPUs with peer-to-peer direct access over PCIe/NVLink
[Figure 1: GPU-based DDTBench mimicking the MILC communication kernel on an NVIDIA DGX-2: speedup of the proposed design over OpenMPI 4.0.0 and MVAPICH2-GDR 2.3.1 for problem sizes [6,8,8,8,8] through [6,16,16,16,16]; improved up to 15X.]
[Figure 2: communication kernel of the COSMO model (https://github.com/cosunae/HaloExchangeBenchmarks) on a Cray CS-Storm with 16-64 GPUs: execution time (s) vs. MVAPICH2-GDR 2.3.1; improved up to 3.4X.]
Ching-Hsiang Chu et al., “High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems”, HiPC 2019.
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
  – Scalable Streaming Broadcast for InfiniBand Networks
  – Efficient Scheduling of Non-contiguous Data Transfer
  – GPU-enabled Zero-copy Transfer for Non-contiguous Data
  – GPU-enabled Reduction Operations
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions
Motivated Example #3 – Reduction Operations for DL Training
• Can GPU resources help improve compute-intensive communication?
  – E.g., MPI_Reduce, MPI_Allreduce, MPI_Scan
• Emerging distributed deep learning training
  – Exchange and update weights
  – Requires fast and high-bandwidth solutions
https://www.oreilly.com/ideas/distributed-tensorflow
Ben-Nun T, Hoefler T. "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis." arXiv preprint arXiv:1802.09941, 2018.
How to Leverage GPUs for MPI Reduction Operations?
• Existing designs:
  1. Explicitly copy the data from GPU to host memory (expensive!)
  2. Host-to-host communication to remote processes
  3. Perform the computation on the CPU (good for small data, relatively slow for large data)
  4. Explicitly copy the data from host to GPU memory (expensive!)
• Proposed designs:
  1. GPU-to-GPU communication (fast)
     – NVIDIA GPUDirect RDMA (GDR)
     – Pipelined through the host for large messages
  2. Perform the computation on the GPU
     – Efficient CUDA kernels
Ching-Hsiang Chu et al., "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters, " IEEE/ACM CCGrid 2016
Alternative and Extended Designs

Communication         | Computation | Design            | Algorithm             | Benefit
Host<->Host           | CPU         | BR-H-HH (default) | Binomial-Reduce       | Large scale, small messages
Host<->Host           | CPU         | RD-H-HH (default) | Recursive doubling    | Large scale, small messages
Host<->Host           | CPU         | GR-H-HH           | Gather-Reduce         | Small scale, small messages
Host<->Host           | GPU         | GR-HH             | Gather-Reduce         | Small scale, small messages
Host<->Device (GDR)   | GPU         | GR-HD / GR-DH     | Gather-Reduce         | Small scale, small messages
Device<->Device (GDR) | GPU         | GR-DD             | Gather-Reduce         | Small scale, small messages
Device<->Device (GDR) | GPU         | BR-DD             | Binomial-Reduce       | Large messages at any scale
Device<->Device (GDR) | GPU         | BRB-DD            | Binomial-Reduce-Bcast | Large messages at any scale
Device<->Device (GDR) | GPU         | RD-DD             | Recursive doubling    | Large messages at any scale
Host<->Device (GDR)   | GPU         | RD-HD / RD-DH     | Recursive doubling    | Large messages at any scale
Proposed NVGroup Allreduce
• Grouping GPUs that are fully connected by NVLink
  – Contention-free communication within the group
• Cooperative reduction kernels to exploit load-compute-store primitives over NVLink
Ching-Hsiang Chu et al., “NV-Group: Cooperative and Link-Efficient Reductions for Deep Learning on NVLink-enabled Dense GPU Systems, ” (to be submitted)
Preliminary Results – Allreduce Benchmark
Platform: Summit (#1): dual-socket IBM POWER9 CPUs, 6 NVIDIA Volta V100 GPUs per node, and a 2-port InfiniBand EDR interconnect.
[Figure 1: latency (µs) on 1,536 GPUs for 4 B-16 KB messages: proposed NVGroup vs. NCCL 2.4; 1.6X better.]
[Figure 2: bandwidth (GB/s) on 1,536 GPUs for 32-256 MB messages: proposed NVGroup vs. NCCL 2.4; 1.7X better.]
[Figure 3: bandwidth (GB/s) for a 128 MB message on 24-1,536 GPUs: proposed NVGroup vs. SpectrumMPI 10.2.0.11, OpenMPI 4.0.1, and NCCL 2.4; 1.7X better.]
Ching-Hsiang Chu et al., “NV-Group: Cooperative and Link-Efficient Reductions for Deep Learning on NVLink-enabled Dense GPU Systems, ” (to be submitted)
Preliminary Results – Distributed Deep Learning Training
• ResNet-50 training using the TensorFlow benchmark on a DGX-2 machine (16 Volta GPUs)
[Figure 1: images per second on 1-16 GPUs: MVAPICH2-GDR 2.3.1 (proposed) vs. NCCL 2.4 and the ideal; 8% higher throughput.]
[Figure 2: scaling efficiency (%) on 1-16 GPUs: proposed vs. NCCL 2.4.]
Scaling efficiency = (actual throughput / ideal throughput at scale) × 100%
Ching-Hsiang Chu et al., “NV-Group: Cooperative and Link-Efficient Reductions for Deep Learning on NVLink-enabled Dense GPU Systems, ” (to be submitted)
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions
MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 3,000 organizations in 89 countries
  – More than 553,000 (> 0.5 million) downloads from the OSU site directly
  – Empowering many Top500 clusters (June '19 ranking):
    • 3rd: 10,649,640-core Sunway TaihuLight at NSC, Wuxi, China
    • 16th: 556,104-core Oakforest-PACS in Japan
    • 19th: 367,024-core Stampede2 at TACC
    • 31st: 241,108-core Pleiades at NASA, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC)
  – http://mvapich.cse.ohio-state.edu
• Partner in the 5th-ranked TACC Frontera system
• Empowering Top500 systems for over a decade
Impact on the Community
• Accelerating GPU-enabled HPC applications worldwide
  – MVAPICH2-GDR is widely used on many top-ranked GPU clusters, including Summit (#1), Sierra (#2), ABCI (#8), Lassen (#10), and more
• Enabling fast weather forecasting
  – MeteoSwiss uses the COSMO numerical weather forecasting model for the production of regional and local forecast products
  – Uses MVAPICH2-GDR to accelerate GPU communication
• Supporting scalable and reliable data dissemination
  – DoD streaming applications use MVAPICH2-GDR to accelerate GPU-based broadcast operations
  – Tutorial sessions: PETTT '17, PETTT '18
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions
Expected Contributions
• Improving scale-out and scale-up performance by exploiting features of modern interconnects
  – Enabling scalable broadcast by using InfiniBand hardware multicast and GPUDirect RDMA
  – Link-efficient schemes exploiting load-store primitives over GPU interconnects
• Efficient GPU-enabled communication middleware
  – GPU-enabled MPI derived-datatype processing
  – GPU-enabled reduction operations
• Significant impact on the community
  – An abstraction for accelerator-enabled communication middleware
  – Benefits for HPC & ML/DL workloads
  – Broader outreach through public MVAPICH2-GDR releases
Thank You!
Questions?
chu.368@osu.edu
http://web.cse.ohio-state.edu/~chu.368