CLUSTAR: AI Training Platform Powered by High Performance Networking

Junxue ZHANG
EVP CLUSTAR; PhD, SING Lab, HKUST
AUGUST 1, 2018
-
Deep Learning Is Becoming Increasingly Important

Computer Vision · Natural Language Processing · Self-driving Cars
-
How does Deep Learning Work?

A single-neuron example: y = w * x + b, with weights initialized to w = 1, b = 1, learning rate 0.1, and a mini-batch of two samples: (x = 1, y = 5) and (x = 2, y = 7).

1. Forward Pass: compute y_pred = w * x + b for each sample in the mini-batch, giving y_pred = 2 and y_pred = 3.

2. Calculating Loss:
L = (1/2) * sum over the batch of (y - y_pred)^2

3. Backpropagation (chain rule):
dL/dw = (dL/dy_pred) * (dy_pred/dw) = sum of (y_pred - y) * x = -11
dL/db = (dL/dy_pred) * (dy_pred/db) = sum of (y_pred - y) = -7

Gradient descent update:
w = w - 0.1 * dL/dw = 1 - 0.1 * (-11) = 2.1
b = b - 0.1 * dL/db = 1 - 0.1 * (-7) = 1.7

4. Next Iteration: repeat with the next mini-batch, (x = 3, y = 9) and (x = 5, y = 13).
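The walkthrough above can be sketched in a few lines of Python (a minimal illustration of the slide's arithmetic, not CLUSTAR code; `sgd_step` is a name chosen here):

```python
# One-neuron model y = w*x + b trained with SGD, reproducing the
# slide's walkthrough (w = b = 1, learning rate 0.1, mini-batch
# [(1, 5), (2, 7)] drawn from the ground truth y = 2x + 3).

def sgd_step(w, b, batch, lr=0.1):
    """Run one forward pass, loss computation, and backprop update."""
    # Forward pass: y_pred = w*x + b for every sample.
    preds = [(x, y, w * x + b) for x, y in batch]
    # Loss: L = 1/2 * sum((y - y_pred)^2).
    loss = 0.5 * sum((y - y_pred) ** 2 for _, y, y_pred in preds)
    # Backpropagation via the chain rule:
    # dL/dw = sum((y_pred - y) * x), dL/db = sum(y_pred - y).
    grad_w = sum((y_pred - y) * x for x, y, y_pred in preds)
    grad_b = sum(y_pred - y for _, y, y_pred in preds)
    # Gradient descent update.
    return w - lr * grad_w, b - lr * grad_b, loss

w, b = 1.0, 1.0
w, b, loss = sgd_step(w, b, [(1, 5), (2, 7)])
print(w, b)  # close to the slide's values: w = 2.1, b = 1.7
```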
-
How does Deep Learning Work?

With a Hidden Layer between the Input Layer and the Output Layer, the same procedure applies layer by layer: a forward pass through each layer in turn, calculating the loss at the output, then backpropagation through each layer in reverse, updating every layer's weights via the chain rule.
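The per-layer forward and backward passes can be made concrete with a scalar two-layer network (an illustration only; w1 and w2 here stand in for the slide's per-layer weights):

```python
# Chain rule through a hidden layer: h = w1 * x, y_pred = w2 * h,
# with loss L = 1/2 * (y - y_pred)^2. Backpropagation walks the
# forward passes in reverse, one layer at a time.

def grads(w1, w2, x, y):
    h = w1 * x            # forward pass, layer 1
    y_pred = w2 * h       # forward pass, layer 2
    dL_dy = y_pred - y    # dL/dy_pred from the loss
    dL_dw2 = dL_dy * h    # layer-2 weight gradient
    dL_dh = dL_dy * w2    # gradient flowing back into layer 1
    dL_dw1 = dL_dh * x    # layer-1 weight gradient
    return dL_dw1, dL_dw2

g1, g2 = grads(w1=1.0, w2=2.0, x=3.0, y=12.0)
print(g1, g2)  # -36.0 -18.0
```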
-
Big Data Drives a New Paradigm for Training

1. The data is too large to fit in a single machine.
2. The training time is too long: Uber reports that training usually takes weeks or longer to complete [1].
-
Networking Plays an Important Role

Distributed training with a Parameter Server: the parameters (w1, w2, ...) live on the Parameter Server, while Worker 1 and Worker 2 each hold a data partition and communicate with the server over the network. Each iteration:

1. Pull parameters from the servers.
2. Forward pass on the local data partition.
3. Calculating loss.
4. Backpropagation, producing each worker's updated parameters.
5. Push parameters to the servers.

Every pull and push crosses the network: networking is critical to performance!
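The pull/compute/push loop above can be sketched as a single-process simulation (illustrative Python, not CLUSTAR code; `ps_iteration` and `worker_grads` are names invented here):

```python
# Data-parallel training with a parameter server: each worker pulls
# the parameters, runs forward/backward on its own data partition,
# and pushes gradients back; the server aggregates and updates.

def worker_grads(params, batch):
    """Forward + backward pass for y = w*x + b on one partition."""
    w, b = params
    gw = sum((w * x + b - y) * x for x, y in batch)
    gb = sum((w * x + b - y) for x, y in batch)
    return gw, gb

def ps_iteration(params, partitions, lr=0.1):
    # 1. Workers pull params and compute gradients on their partition.
    grads = [worker_grads(params, part) for part in partitions]
    # 2. Workers push gradients; the server sums them...
    gw = sum(g[0] for g in grads)
    gb = sum(g[1] for g in grads)
    # 3. ...and applies the update; workers pull fresh params next round.
    w, b = params
    return (w - lr * gw, b - lr * gb)

# Two workers, each holding one sample of the earlier mini-batch.
params = ps_iteration((1.0, 1.0), [[(1, 5)], [(2, 7)]])
print(params)  # same result as single-machine SGD: about (2.1, 1.7)
```

Every gradient and parameter in the aggregation step would cross the network in a real deployment, which is why the network so often becomes the bottleneck.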
-
Networking Plays an Important Role

Model   | Logistic Regression | Multi-layer Perceptron | AlexNet | VGG-16 | ResNet-50
Speedup | 2.59x               | 3.45x                  | 1.6x    | 1.33x  | 1.03x

The speedup achieved after utilizing the 40Gbps networking bandwidth with CLUSTAR.
-
CLUSTAR: AI Training Platform Powered by High Performance Networking

Key Technologies (world-leading research achievements):

• GDR: towards zero-copy data flow; utilizes RDMA and GPUDirect; integrated with TensorFlow.
• Smart Networking Scheduling: co-flow scheduling; elephant & mice flow scheduling.
• ParaExpress: resilient and adaptive parameter aggregation; tackles the disadvantages of both Parameter Server and Ring AllReduce.
• MLT: exploits the SGD nature of AI training; semi-loss tolerance; model-quality awareness.

The analogy: between 2 machines (wider roads), across multiple machines (traffic scheduling), and at the protocol level (new traffic rules for AI). Networking is as important to an AI system as the traffic system is to a city.
-
CLUSTAR Platform

Applications: computer vision, speech recognition, natural language processing, autonomous driving, intelligent anti-fraud, intelligent finance, drones; industry solutions for security, internet, manufacturing, healthcare, and government.

Platform services: data preprocessing, offline training, online training, multi-tenant management, task scheduling, operations monitoring; Spark optimization, TensorFlow optimization, container orchestration engine, interactive programming interface; private cloud platform, programmable networking.

Networking and storage: Clustar AI Fabrics, RoCE smart NICs, RDMA networking, all-flash storage.

Infrastructure and general-purpose hardware: CPU, GPU, FPGA, ASIC (Intel, Nvidia, AMD, Cambricon, Mellanox, Broadcom).
-
GDR: Towards Zero Copy Data Flow

Server 1 and Server 2 each have two sockets (CPU + memory) with multiple GPUs and an RDMA NIC, connected through the data center networking.

The unnecessary copies between the RNIC and host memory, and between GPU RAM and host memory, enlarge latency, degrade throughput, and burn CPU cycles.
-
GDR: Towards Zero Copy Data Flow

With GDR, the RDMA NIC reads and writes GPU memory directly, so data moves between the GPUs of Server 1 and Server 2 across the data center networking without staging through host memory: GDR removes the unnecessary copies to boost performance.
-
Memory Management

Problem: RDMA sending buffers must live in pinned memory, while application objects are allocated in OS-managed application memory. The unnecessary data copy from each allocated object into the pinned sending buffer degrades performance.

Solution: GDR further reduces data copies by managing objects manually over pinned memory: malloc() and free() are served directly out of a pre-pinned region, so objects are allocated inside the sending buffer and need no extra copy.
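A minimal sketch of the idea, assuming a simple first-fit free list; `PinnedPool` is a name invented for this sketch, and a real implementation would register the region with the RDMA NIC (e.g. via the verbs memory-registration API) rather than use a plain bytearray:

```python
# malloc()/free() served out of one pre-pinned region: objects are
# carved directly from the sending buffer, so no copy from
# OS-managed application memory is ever needed.

class PinnedPool:
    def __init__(self, size):
        self.buf = bytearray(size)    # stands in for pinned memory
        self.free_list = [(0, size)]  # (offset, length) holes

    def malloc(self, n):
        """First-fit allocation from the pinned region."""
        for i, (off, length) in enumerate(self.free_list):
            if length >= n:
                if length == n:
                    self.free_list.pop(i)
                else:
                    self.free_list[i] = (off + n, length - n)
                return off
        raise MemoryError("pinned pool exhausted")

    def free(self, off, n):
        # Freed holes are reused first (naive: no coalescing).
        self.free_list.insert(0, (off, n))

pool = PinnedPool(4096)
a = pool.malloc(1024)
b = pool.malloc(1024)
pool.free(a, 1024)
c = pool.malloc(512)  # reuses the freed hole at offset 0
```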
-
TensorFlow GDR

GDR has been contributed to the TensorFlow community (a commercial version is also available): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/gdr
-
Benchmark

(Benchmark charts for AlexNet, VGG16, and BERT.)
-
The Evil of Parameter Server & Ring AllReduce

Parameter Server: Workers A, B, and C all push to and pull from the Parameter Server, so the server's link becomes the bottleneck link; the Parameter Server largely degrades on congested links due to over-subscribed networking.

Ring AllReduce: Workers A, B, C, and D form a ring, and each transfer depends on the previous hop. If one hop is delayed due to congestion, the downstream workers cannot start transferring: the long dependency chain of Ring AllReduce may cause the whole job to wait once a single hop blocks.
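The ring's dependency structure can be made concrete with a small simulation of ring allreduce (reduce-scatter followed by all-gather); this is a generic sketch of the algorithm, not CLUSTAR code:

```python
# Ring AllReduce among n workers: in each step, worker i forwards one
# chunk to worker (i + 1) % n. Every transfer consumes what arrived
# on the previous hop, so one slow or blocked hop stalls the whole
# ring: the dependency chain criticized on the slide.

def ring_allreduce(worker_data):
    n = len(worker_data)               # n workers, n chunks each
    data = [list(d) for d in worker_data]
    # Reduce-scatter: after n-1 steps, worker i holds the fully
    # reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            src = (i - step) % n       # chunk worker i forwards
            data[(i + 1) % n][src] += data[i][src]
    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        for i in range(n):
            src = (i + 1 - step) % n
            data[(i + 1) % n][src] = data[i][src]
    return data

result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(result)  # every worker ends with the sums [12, 15, 18]
```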
-
ParaExpress: Networking-aware Parameter Aggregation

From the real-time networking conditions of the workers (1-8, spread across Rack1 and Rack2 behind their ToR switches), ParaExpress generates an optimal parameter aggregation topology: leaves feed intermediate aggregators up to a root aggregator. The generated topology combines the advantages of the tree structure (Parameter Server) and the ring structure (Ring AllReduce).
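The slide does not show how the topology is generated; as a flavor of what topology awareness buys, here is a generic two-level scheme (rack-local reduction first, then one flow per rack across the ToR uplinks) — purely illustrative, not ParaExpress itself:

```python
# Topology-aware aggregation, illustrated: reduce gradients inside
# each rack first (cheap intra-rack links), then aggregate one
# partial result per rack at the root, so the contended cross-rack
# links carry only one flow per rack instead of one per worker.

def hierarchical_reduce(racks):
    # Level 1: rack-local reduction at each ToR.
    per_rack = [sum(worker for worker in rack) for rack in racks]
    # Level 2: cross-rack aggregation at the root aggregator.
    return sum(per_rack)

# Workers 1-4 in Rack1 and 5-8 in Rack2, each holding one gradient value.
total = hierarchical_reduce([[1, 2, 3, 4], [5, 6, 7, 8]])
print(total)  # 36
```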
-
ParaExpress Architecture

• ParaExpress Master: embedding plan, prioritization.
• ParaExpress Agent: per-tensor execution graph of operation chains (R1 -> A1 -> S1, R2 -> A2 -> S2, ..., Rn -> An -> Sn).
• Runtime components: task queue, completion queue, execution graph, resolver, operation pool, request manager.
• Traffic prioritization module: changes the DSCP-to-priority mapping and issues MPI requests over the high-speed network interface.
-
Highlighted Results

• Compared with TensorFlow PS, Baidu Ring AllReduce, and Horovod, the software optimization of ParaExpress achieves 1.5-4.3x better performance.
• In a real-world environment, ParaExpress achieves 2.6x better results than Parameter Server and 3x better results than Ring AllReduce.
-
About CLUSTAR
-
World-leading Research Achievements

9 papers in the top-tier networking conferences (SIGCOMM/NSDI) in the recent 5 years, the first in Asia:
• "AuTO: Scaling Deep Reinforcement Learning to Enable Datacenter-Scale Automatic Traffic Optimization", ACM SIGCOMM 2018
• "PowerMan: An Out-of-Band Management Network for Datacenters using Power Line Communication", USENIX NSDI 2018
• "Resilient Datacenter Load Balancing in the Wild", ACM SIGCOMM 2017
• "Enabling Wide-spread Communications on Optical Fabric with MegaSwitch", USENIX NSDI 2017
• "Scheduling Mix-flows in Commodity Datacenters with Karuna", ACM SIGCOMM 2016
• "CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark", ACM SIGCOMM 2016
• "Enabling ECN in Multi-Service Multi-Queue Data Centers", USENIX NSDI 2016
• "Information-Agnostic Flow Scheduling for Commodity Data Centers", USENIX NSDI 2015
• "Explicit Path Control in Commodity Data Centers: Design and Applications", USENIX NSDI 2015

Statistics for universities in Greater China:

University                             | Number of accepted papers
HKUST                                  | 9 (all from CLUSTAR's teams)
Tsinghua University                    | 5 (from different professors and labs)
Chinese Academy of Sciences            | 3 (from different professors and labs)
Peking University                      | 1
Fudan University                       | 1
National Supercomputing Center in Wuxi | 1
-
Selected Clients

GDR
• WeChat and SAIC: utilized GDR to boost the performance of Moments classification for WeChat and of computer vision for SAIC. Achievements: ~3x speedup for WeChat, ~1.6x for SAIC.
• Federated learning: utilizing GDR to boost the performance of federated learning, and MLT to boost long-distance communication. Progress: developing.

ParaExpress
• Cloud AI training: utilizing ParaExpress to improve the performance of AI training in a sophisticated cloud environment. Progress: POC.
• AI Unicorn (1): high-speed networking virtualization. Progress: developing.
(1) NDA issue; name withheld.

AI Consulting
• Smart customer support system: utilizing the CLUSTAR platform to speed up AI training. Progress: developing the next-gen AI platform together.
-
CLUSTAR Team

Kai CHEN, Founder
• PhD, Northwestern University
• Associate Professor, HKUST
• 50+ top-tier networking conference papers (SIGCOMM/NSDI)
• 10+ years of research experience on data center networking
• Director of SING Lab, HKUST

Qiang YANG, Co-founder
• PhD, University of Maryland
• Chair Professor and Head of the CSE Department, HKUST
• President of IJCAI
• Director of WHAT Lab, HKUST
• Pioneer of transfer learning
• IEEE/ACM/AAAI Fellow
• Founding director of the Huawei Noah's Ark Research Lab

Shuihai HU, VP of Technology
• PhD, HKUST
• Expertise in RDMA

Pin LYU, Director of Algorithms
• 7 years of IBM software development

Junxue ZHANG, EVP
• PhD, HKUST
• Architect of the CLUSTAR platform

Yajing LYU, VP of Business
• MBA, ESSEC
• 6+ years of business experience

Junhuan SUN, VP of Engineering
• 10+ years of engineering experience

Weiyan WANG, AI Scientist
• PhD, HKUST
• AutoML systems
-
Milestones

• 2018.03: Angel funding
• 2018.05: CLUSTAR is founded!
• 2018.09: Join the Nvidia Inception Program
• 2018.11: CLUSTAR v1.0 launched! Cooperation with SAIC and Sunshine Insurance
• 2019.01: CLUSTAR v1.1 launched! Cooperation with WeChat
-
THANK YOU!
[email protected]
https://www.clustarai.com