HetPipe: Enabling Large DNN Training on (Whimpy)
Heterogeneous GPU Clusters through Integration of
Pipelined Model Parallelism and Data Parallelism
Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee,
Jaesik Choi†, Sam H. Noh, and Young-ri Choi
Contents
▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
▪ Evaluation
▪ Conclusion
Motivation
▪ DNN (Deep Neural Network) models continue to grow
• Need more powerful GPUs for training!
Motivation
▪ Short release cycle of new GPU architectures
• Use of heterogeneous GPUs is inevitable!
• What to do with whimpy GPUs?
DNN Training
▪ For each minibatch $i$ of training data, a forward pass computes the loss (e.g., "Cat?") and a backward pass computes the update $u_i$; the weight parameter $w$ is then updated as

$$w_{i+1} = w_i - \eta \cdot u_i$$

[Figure: minibatch $i$ flows through forward pass $i$ and backward pass $i$, producing a loss and an update to the weight parameter $w$]
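To make the update rule concrete, here is a minimal sketch in NumPy, with a linear model standing in for the DNN (the function names are ours for illustration, not HetPipe code):

```python
import numpy as np

def sgd_step(w, u, lr):
    """Apply the slide's update rule: w_{i+1} = w_i - eta * u_i."""
    return w - lr * u

def forward_backward(w, X, y):
    """Toy stand-in for a DNN: linear model with squared loss."""
    pred = X @ w                       # forward pass
    loss = np.mean((pred - y) ** 2)    # loss
    u = 2 * X.T @ (pred - y) / len(y)  # backward pass: gradient u_i
    return loss, u

rng = np.random.default_rng(0)
w = np.zeros(4)
for i in range(10):                    # one minibatch per step
    X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
    loss, u = forward_backward(w, X, y)
    w = sgd_step(w, u, lr=0.01)
```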
Parallelizing DNN Training
▪ Model parallelism (MP)
• Addresses the GPU memory limitation by splitting the model across GPUs
• Low GPU utilization
▪ Data parallelism (DP)
• Weights synchronized through a parameter server (PS) or AllReduce
[Figure: DP with workers 1 through n, each running forward and backward passes and synchronizing weights through the PS]
Parallelizing DNN Training
▪ Attempts to improve MP utilization
• Pipelined model parallelism (PMP): PipeDream [SOSP '19], GPipe [NeurIPS '19]
• Designed for homogeneous GPUs
• Designed for a single PMP worker
[Figure: a single PMP worker pipelining forward and backward passes]
HetPipe in a Nutshell
▪ Integrates PMP + DP
▪ Supports heterogeneous GPUs
• VW (virtual worker): a group of multiple GPUs
▪ WSP (Wave Synchronous Parallel) synchronizes parameters across VWs
[Figure: virtual workers VW 1 through VW n, each a group of four GPUs (types V, R, G, Q) running PMP internally, synchronized via DP through the parameter server]
Challenges in Integrating PMP + DP on Heterogeneous GPUs
• What weight version should be used by each VW to synchronize with other VWs?
• How do we reduce virtual worker stragglers when we consider DP?
• Many more in the paper
HetPipe Contributions
▪ Integrates PMP + DP
▪ Novel parameter synchronization model: WSP (Wave Synchronous Parallel)
▪ Enables large DNN training on heterogeneous GPUs
• Aggregates heterogeneous resources
• Reduces the straggler problem
▪ Proof of WSP convergence
HetPipe Workflow
▪ Model Partitioner: divides the DNN model into k partitions (P1, P2, P3, P4)
▪ Resource Allocator: uses the cluster configuration to assign k GPUs to each virtual worker
[Figure: the DNN model is split into partitions P1-P4 for VW 1 through P1'-P4' for VW n; each VW pipelines its partitions over time and synchronizes with the PS]
HetPipe Workflow
[Figure: the same workflow, annotated with where staleness arises: local staleness within each VW's pipeline, and global staleness across VWs synchronizing through the PS]
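The partitioning step can be sketched as a toy greedy balancer, assuming per-layer compute costs are known (HetPipe's real partitioning algorithm, covered in the paper, also accounts for heterogeneous GPU speed and memory; this code is ours):

```python
def partition_model(layer_costs, k):
    """Greedily split consecutive layers into k partitions P1..Pk of
    roughly equal total cost; each partition maps to one GPU of a VW."""
    target = sum(layer_costs) / k
    partitions, current, acc = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        acc += cost
        layers_left = len(layer_costs) - i - 1
        parts_left = k - len(partitions) - 1
        # Close a partition once it reaches the target cost, keeping
        # at least one layer for every remaining partition.
        if len(partitions) < k - 1 and acc >= target and layers_left >= parts_left:
            partitions.append(current)
            current, acc = [], 0.0
    partitions.append(current)
    return partitions

# 10 layers with uneven costs, split across the k = 4 GPUs of a VW:
print(partition_model([1, 2, 1, 3, 2, 1, 2, 1, 1, 2], k=4))
# -> [[0, 1, 2], [3, 4], [5, 6, 7], [8, 9]]
```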
Outline
▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
• Pipelined Model Parallelism Within a VW
• Data Parallelism with Multiple VWs
▪ Evaluation
▪ Conclusion
Pipelined Model Parallelism Within a VW
▪ Execution of a virtual worker
• $N_m$ minibatches are processed concurrently in a pipelined manner
• $w_{local}$ is a consistent version of the weights within a VW
[Figure: GPU1-GPU4 of a VW pipelining forward and backward passes of successive minibatches over time]
Pipelined Model Parallelism Within a VW
▪ Weight management procedure
• Initial weight version $w_0$: minibatches 1-4 all start from it, so $w_{local} = w_0 = w_1 = w_2 = w_3 = w_4$
• When minibatch 1's backward pass completes with update $u_1$: $w_{local} \leftarrow w_{local} + u_1$
• Minibatch 5 then starts from the current local version: $w_5 \leftarrow w_{local}$
[Figure: the pipeline annotated with weight versions $w_1$-$w_4$ taken at launch and the update of $w_{local}$]
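A small sketch of this bookkeeping (an illustrative class of ours, not HetPipe code; updates $u$ are assumed to already include the learning-rate scaling, matching the slide's $w_{local} \leftarrow w_{local} + u_1$):

```python
import numpy as np

class LocalWeightManager:
    """Weight versions inside one VW's pipeline (sketch).

    A minibatch snapshots w_local when it starts its forward pass;
    when its backward pass finishes, its update is folded into w_local.
    """
    def __init__(self, w0):
        self.w_local = w0.copy()

    def start_minibatch(self):
        # Forward pass uses the latest consistent local version.
        return self.w_local.copy()

    def finish_minibatch(self, u):
        # Backward pass done: w_local <- w_local + u.
        self.w_local += u

mgr = LocalWeightManager(np.zeros(4))
# Nm = 4 minibatches in flight: minibatches 1..4 all start from w0 ...
snapshots = [mgr.start_minibatch() for _ in range(4)]
# ... minibatch 1 finishes, so minibatch 5 starts from w_local = w0 + u1.
mgr.finish_minibatch(u=np.full(4, 0.1))
w5 = mgr.start_minibatch()
```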
Pipelined Model Parallelism Within a VW
▪ Local staleness ($S_{local}$): the maximum number of missing updates
• $w_5$ is missing the updates of minibatches 2 to 4, so $S_{local} = 3$
[Figure: minibatches 2-4 are still in flight when minibatch 5 starts]
Pipelined Model Parallelism Within a VW
▪ Local staleness ($S_{local}$): the maximum number of missing updates
• When minibatch 2's backward pass completes with update $u_2$: $w_{local} \leftarrow w_{local} + u_2$, i.e., $w_{local} = w_0 + u_1 + u_2$
• Minibatch 6 starts from $w_6 \leftarrow w_{local}$ and is missing the updates of minibatches 3 to 5: again $S_{local} = 3$
[Figure: the pipeline one step later, with minibatches 3-5 still in flight when minibatch 6 starts]
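The pattern generalizes: with $N_m$ minibatches in flight, a newly launched minibatch can miss at most the updates of the other $N_m - 1$ in-flight minibatches (a restatement of the two examples above):

$$S_{local} = N_m - 1, \qquad N_m = 4 \;\Rightarrow\; S_{local} = 3.$$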
Outline
▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
• Pipelined Model Parallelism Within a VW
• Data Parallelism with Multiple VWs
▪ Evaluation
▪ Conclusion
Data Parallelism with Multiple VWs
▪ Wave: a sequence of $N_m$ concurrently executing minibatches
• Each VW progresses through waves as its clock advances (clock 0, 1, 2, ...): wave 0 covers minibatches 1-4, wave 1 covers minibatches 5-8, and so on
• VWs push updates to and pull weights from the parameter server, which maintains $w_{global}$
[Figure: VW 1 through VW n progressing through minibatches grouped into waves, with push & pull to the parameter server]
Data Parallelism with Multiple VWs
▪ Push occurs every clock
• At the end of wave 0, the VW pushes its aggregated update $\tilde{u} = u_1 + u_2 + u_3 + u_4$
• The parameter server applies $w_{global} \leftarrow w_{global} + \tilde{u}$
• Meanwhile, minibatch 8 is blocked
[Figure: VW 1 pushes wave 0's aggregated update to the parameter server while minibatch 8 waits]
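A sketch of the push path at wave granularity (`ParameterServer` here is a plain in-process stand-in of ours, not HetPipe's actual PS interface):

```python
import numpy as np

class ParameterServer:
    def __init__(self, w0):
        self.w_global = w0.copy()

    def push(self, u_tilde):
        # w_global <- w_global + aggregated wave update.
        self.w_global += u_tilde

# VW 1 finishes wave 0 (minibatches 1..4) and pushes ONE aggregated
# update per clock instead of four individual ones:
ps = ParameterServer(np.zeros(4))
wave0_updates = [np.full(4, 0.1) for _ in range(4)]  # u1..u4
u_tilde = np.sum(wave0_updates, axis=0)              # ũ = u1+u2+u3+u4
ps.push(u_tilde)
```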
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance D
• If D = 0, a pull occurs every clock
• VW 1 waits before its pull until VW 2 pushes
[Figure: VW 1 has pushed wave 0 and waits; VW 2 is still executing wave 0]
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance D
• VW 2 pushes its aggregated update $\tilde{u}$, and the parameter server applies $w_{global} \leftarrow w_{global} + \tilde{u}$
[Figure: VW 2's push arrives at the parameter server while VW 1 still waits]
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance D
• The pull occurs after all VWs have pushed: $w_{local} \leftarrow w_{global}$
[Figure: both VWs pull the updated global weights]
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance D
• Minibatch 8 then starts with $w_8 = w_0 + (u_1 + u_2 + u_3 + u_4)_{vw1,vw2}$
[Figure: minibatch 8 resumes once the pull completes]
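A simplified sketch of this blocking rule at wave granularity (the names are ours; the real pipeline overlaps consecutive waves, which is why minibatch 8, rather than 5, is the first to block in the slides):

```python
class WSPClock:
    """Tracks each VW's clock; a VW may begin wave c only if every
    VW has already pushed all waves up to c - D - 1."""
    def __init__(self, num_vws, D):
        self.pushed = [0] * num_vws   # number of waves pushed per VW
        self.D = D

    def can_start_wave(self, c):
        return min(self.pushed) >= c - self.D

    def record_push(self, vw):
        self.pushed[vw] += 1

clk = WSPClock(num_vws=2, D=0)
clk.record_push(0)                 # VW 1 pushes wave 0
print(clk.can_start_wave(1))       # False: VW 2 has not pushed yet
clk.record_push(1)                 # VW 2 pushes wave 0
print(clk.can_start_wave(1))       # True: pull w_global, start wave 1
```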
Data Parallelism with Multiple VWs
▪ Local staleness ($S_{local}$) and global staleness ($S_{global}$) with WSP
• $w_{global} = w_0 + (u_1 + u_2 + u_3 + u_4)_{vw1,vw2}$
• Minibatch 11 on VW 1 starts with $w_{11} = w_0 + (u_1 + u_2 + u_3 + u_4)_{vw1,vw2} + (u_5 + u_6 + u_7)_{vw1}$
• Missing updates: $(u_8 + u_9 + u_{10})_{vw1}$, bounded by $S_{local}$, and $(u_5 + u_6 + u_7)_{vw2}$, bounded by $S_{global}$
[Figure: VW 1 at clock 2 running minibatches 9-12 while VW 2 is still in clock 1]
Data Parallelism with Multiple VWs
▪ Local staleness ($S_{local}$) and global staleness ($S_{global}$) with WSP
• Minibatch 12 on VW 1 has to wait until VW 2 catches up, which keeps the clock distance between the fastest and slowest VWs bounded
[Figure: VW 1 blocked at minibatch 12 while VW 2 is still executing its wave]
Data Parallelism with Multiple VWs
▪ Example of clock distance threshold D
• If D = 1, VW 1 can start minibatch 8 without a pull
[Figure: VW 1 proceeds into wave 1 while VW 2 is still executing wave 0]
Data Parallelism with Multiple VWs
▪ Example of clock distance threshold D
• If D = 1, minibatch 12 on VW 1 still has to wait
• Minibatch 11 starts with $w_{11} = w_0 + (u_1 + u_2 + u_3 + u_4 + u_5 + u_6 + u_7)_{vw1}$
• Missing updates: $(u_8 + u_9 + u_{10})_{vw1}$, bounded by $S_{local}$, and $(u_1 + \cdots + u_7)_{vw2}$, bounded by $S_{global}$
[Figure: VW 1 at clock 2 while VW 2 has yet to push]
Outline
▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
▪ Evaluation
• Setup
• Resource Allocation for Virtual Workers
• Results
▪ Conclusion
Evaluation Setup
▪ Cluster setup: 4 heterogeneous GPU nodes connected by InfiniBand (56 Gbps)
• Each node has four GPUs of one type: TITAN V (V), TITAN RTX (R), GeForce RTX 2060 (G), or Quadro P4000 (Q)
• Computation power: V > R > G > Q; memory size: R > V > Q > G
▪ Two DNN models, both trained on ImageNet with minibatch size 32
• ResNet-152: model parameter size 230 MB; characteristic: large activation output
• VGG-19: model parameter size 548 MB; characteristic: large parameter size
Resource Allocation for Virtual Workers: NP, ED, HD
▪ NP (Node Partition): each VW consists of the four GPUs of a single node
• Minimum communication overhead within a VW
• Performance of each virtual worker varies
• A straggler may degrade performance with DP
Resource Allocation for Virtual Workers: NP, ED, HD
▪ ED (Equal Distribution): each VW consists of one GPU of each type, taken from every node
• Performance is the same across the VWs, mitigating the straggler problem
• High communication overhead within each VW
Resource Allocation for Virtual Workers: NP, ED, HD
▪ HD (Hybrid Distribution): a compromise between NP and ED
• Mitigates the straggler problem
• Reduces communication overhead within each VW
(Recall the GPU ordering: computation power V > R > G > Q; memory size R > V > Q > G)
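To make the policies concrete, here is a small sketch enumerating GPU-to-VW assignments for this 4-node, 16-GPU cluster (the NP and ED mappings follow the figures; the paper's exact HD mapping is model-dependent, so it is only noted in a comment):

```python
# 4 nodes x 4 GPUs; letters are the GPU types from the slides.
nodes = {1: ["V"] * 4, 2: ["R"] * 4, 3: ["G"] * 4, 4: ["Q"] * 4}

def NP(nodes):
    """Node Partition: each VW = all 4 GPUs of one node."""
    return [gpus[:] for gpus in nodes.values()]

def ED(nodes):
    """Equal Distribution: each VW gets one GPU from every node."""
    return [list(col) for col in zip(*nodes.values())]

print(NP(nodes))  # [['V','V','V','V'], ['R','R','R','R'], ...]
print(ED(nodes))  # [['V','R','G','Q'], ['V','R','G','Q'], ...]
# HD (Hybrid Distribution) mixes the two, e.g. combining GPUs from a
# subset of nodes per VW to balance compute while limiting cross-node
# traffic.
```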
Parameter Placement
▪ Round-robin policy (default)
• The parameters of each layer are spread across the nodes in round-robin order
• Can be used with all three allocation policies: NP, ED, and HD
[Figure: example with ED; successive layers' parameters are placed on successive nodes]
Parameter Placement
▪ Local placement policy (ED-local)
• Places the parameters of each layer on the node whose GPU executes the partition containing that layer
• Significantly reduces communication overhead compared to ED with round-robin placement, where parameter communication crosses nodes
[Figure: with ED-local, parameter traffic stays within each node]
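A sketch contrasting the two placements (`layer_to_partition` and `partition_to_node` are hypothetical inputs of ours; the local policy assumes, as ED-local arranges, that a given partition always runs on the same node across VWs):

```python
def round_robin_placement(num_layers, node_ids):
    """Default: spread each layer's parameters across nodes in turn,
    balancing PS load but forcing cross-node parameter traffic."""
    return {l: node_ids[l % len(node_ids)] for l in range(num_layers)}

def local_placement(layer_to_partition, partition_to_node):
    """ED-local: keep a layer's parameters on the node whose GPU
    executes the partition containing that layer."""
    return {l: partition_to_node[p] for l, p in layer_to_partition.items()}

print(round_robin_placement(6, node_ids=[1, 2, 3, 4]))
# {0: 1, 1: 2, 2: 3, 3: 4, 4: 1, 5: 2}
print(local_placement({0: "P1", 1: "P1", 2: "P2"}, {"P1": 1, "P2": 2}))
# {0: 1, 1: 1, 2: 2}
```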
Compare Throughput with Horovod
▪ Baseline: Horovod, the state-of-the-art DP framework based on AllReduce
▪ HetPipe improves throughput by up to 1.4× for ResNet-152 and 1.8× for VGG-19
• ED reduces the straggler problem
• ED-local significantly reduces communication overhead
• For ResNet-152, the whole model is too large to be loaded into a single G-type GPU (batch size = 32)
Performance Improvement of Adding Whimpy GPUs
▪ Starting from the four V GPUs and successively adding the whimpy R, Q, and G GPUs, HetPipe achieves up to a 2.3× speedup
• Additional whimpy systems allow for faster training
Convergence Results
▪ ResNet-152 (target accuracy: 74%)
• HetPipe reduces the straggler problem in a heterogeneous environment, converging up to 39% faster
• Adding four more whimpy G GPUs (going from 12 to 16 GPUs) makes convergence another 7% faster
Convergence Results
▪ VGG-19 (target accuracy: 67%)
• HetPipe (D = 0) is 29% faster than Horovod, and up to 49% faster with D = 4
• Too high a global staleness can degrade convergence performance: D = 32 is 4.7% slower
Not Presented But Discussed in the Paper
▪ Convergence proof of WSP
▪ Partitioning algorithm
▪ Performance of a single virtual worker
▪ Comparison to PipeDream
Conclusion
▪ HetPipe makes it possible to efficiently train large DNN models with heterogeneous GPUs
▪ Integrates pipelined model parallelism with data parallelism
▪ Proposes a novel parameter synchronization model: WSP
▪ DNN models converge up to 49% faster with HetPipe
Thank you!