In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... ·...
Transcript of In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... ·...
![Page 1: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/1.jpg)
Paving the Road to Exascale
June 2017
In-Network Computing
![Page 2: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/2.jpg)
© 2017 Mellanox Technologies 2
Exponential Data Growth – The Need for Intelligent and Faster Interconnect
CPU-Centric (Onload) Data-Centric (Offload)
Must Wait for the Data
Creates Performance BottlenecksAnalyze Data as it Moves!
Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale
![Page 3: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/3.jpg)
© 2017 Mellanox Technologies 3
Data Centric Architecture to Overcome Latency Bottlenecks
HPC / Machine Learning
Communications Latencies of 30-40us
HPC / Machine Learning
Communications Latencies of 3-4us
CPU-Centric (Onload) Data-Centric (Offload)
Network In-Network Computing
Intelligent Interconnect Paves the Road to Exascale Performance
![Page 4: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/4.jpg)
© 2017 Mellanox Technologies 4
In-Network Computing to Enable Data-Centric Data Center
GPU
GPU
CPUCPU
CPU
CPU
CPU
GPU
GPU
In-Network Computing Key for Highest Return on Investment
![Page 5: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/5.jpg)
© 2017 Mellanox Technologies 5
In-Network Computing to Enable Data-Centric Data Centers
GPU
GPU
CPUCPU
CPU
CPU
CPU
GPU
GPU
In-Network Computing Key for Highest Return on Investment
CORE-Direct
RDMA
Tag-Matching
SHARP
SHIELD
Programmable
(FPGA)
Programmable (ARM)
Security
GPUDirect
SecurityNVMe Storage
NVMe over Fabrics
![Page 6: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/6.jpg)
© 2017 Mellanox Technologies 6
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
Reliable Scalable General Purpose Primitive
• In-network Tree based aggregation mechanism
• Large number of groups
• Multiple simultaneous outstanding operations
Applicable to Multiple Use-cases
• HPC Applications using MPI / SHMEM
• Distributed Machine Learning applications
Scalable High Performance Collective Offload
• Barrier, Reduce, All-Reduce, Broadcast and more
• Sum, Min, Max, Min-loc, max-loc, OR, XOR, AND
• Integer and Floating-Point, 16/32/64/128 bits
SHArP Tree
SHARP Tree Aggregation Node
(Process running on HCA)
SHARP Tree Endnode
(Process running on HCA)
SHARP Tree Root
![Page 7: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/7.jpg)
© 2017 Mellanox Technologies 7
Allreduce Performance
Allreduce Latency (8 Bytes, 128 Bytes)
Late
ncy (
usec)
Number of Nodes (1PPN)
Number of Nodes (1PPN)
Late
ncy (
usec)
Allreduce Latency (1K Bytes, 2K Bytes)
![Page 8: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/8.jpg)
© 2017 Mellanox Technologies 8
OpenFOAM Application Performance
OpenFOAM is a popular computational fluid dynamics application
Software In-Network Computing
![Page 9: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/9.jpg)
© 2017 Mellanox Technologies 9
Tag Matching Logistics
![Page 10: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/10.jpg)
© 2017 Mellanox Technologies 10
Tag Matching – Common Implementation Protocols
![Page 11: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/11.jpg)
© 2017 Mellanox Technologies 11
Tag Matching – Hardware Implementation Overview
![Page 12: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/12.jpg)
© 2017 Mellanox Technologies 12
MPI Tag-Matching Offload Advantages
Lo
wer
is b
ett
er31%
Lo
wer
is b
ett
er97%
31% lower latency and 97% lower CPU utilization for MPI operations
Performance comparisons based on ConnectX-5
![Page 13: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/13.jpg)
© 2017 Mellanox Technologies 13
In-Network Computing to Enable Data-Centric Data Center
GPU
GPU
CPUCPU
CPU
CPU
CPU
GPU
GPU
In-Network Computing Key for Highest Return on Investment
In-Network Computing
In-Network Memory
Self-Healing
![Page 14: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/14.jpg)
© 2017 Mellanox Technologies 14
The SHIELD Self Healing Interconnect Technology
Software-based solutions for network failures create long delays: 5-30 seconds for 1K to 10K node clusters
During software-based recovery time, data can be lost, applications can fail
Adaptive Routing creates further issues (failing links may act as “black holes”)
Mellanox SHIELD technology is an innovative hardware-based solution
SHIELD technology enables the generation of Self-Healing Interconnect
The ability to overcome network failures by the network intelligent devices
Accelerates network recovery time by 5000X
Highly scalable and available for EDR and HDR solutions and beyond
Self-Healing Network Enables Unbreakable Data Centers
![Page 15: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/15.jpg)
© 2017 Mellanox Technologies 15
Consider a Flow From A to B
Data
Server A Server B
![Page 16: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/16.jpg)
© 2017 Mellanox Technologies 16
The Simple Case: Local Fix
Server A Server B
Data
![Page 17: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/17.jpg)
© 2017 Mellanox Technologies 17
The Remote Case: Using FRN’s (Fault Recovery Notifications)
Server A Server B
Data
FRN
Data
![Page 18: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/18.jpg)
© 2017 Mellanox Technologies 18
In-Network Computing Enables Deep Learning Frameworks
CUDA
Mellanox Interconnect SolutionsMellanox Accelerations for Machine Learning and Big Data
SHARP
Middleware (MPI, gRPC) - Optional
GPUDirect RDMA NVMe over FabricsrCUDA
![Page 19: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/19.jpg)
© 2017 Mellanox Technologies 19
2X Acceleration for Baidu
Machine Learning Software from Baidu
• Usage: word prediction, translation, image processing
RDMA (GPUDirect) speeds training
• Lowers latency, increases throughput
• More cores for training
• Even better results with optimized RDMA
~2X Acceleration for Paddle Training with RDMA
![Page 20: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/20.jpg)
© 2017 Mellanox Technologies 20
The Ever Growing Demand for Higher Performance
2000 202020102005
“Roadrunner”
1st
2015
Terascale Petascale Exascale
Single-Core to Many-CoreSMP to Clusters
Performance Development
Co-Design
HW SW
APP
Hardware
Software
Application
The Interconnect is the Enabling Technology
![Page 21: In-Network Computing - Ohio State Universitynowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Software-based solutions for network failures create long delays:](https://reader036.fdocuments.in/reader036/viewer/2022071419/61175420c389dd658c1e4ccb/html5/thumbnails/21.jpg)
Thank You