Research Faculty Summit 2018 · 2018. 8. 23. · • System innovations to bridge applications and...
-
Systems | Fueling future disruptions
Research Faculty Summit 2018
-
Wolong: A Back-end Optimizer for Deep Learning Computation
Jilong Xue
Researcher, Microsoft Research Asia
-
System Challenge in Deep Learning
• Innovations are emerging very fast in the deep learning area
  • New DNN models and workload patterns
    • RNN, CNN, GAN, reinforcement learning, graph neural network, etc.
  • Diverse and emerging hardware accelerators
    • GPU, FPGA, ASICs, edge devices, NVLink, RDMA, etc.
• The compiler stack is key to bridging framework and hardware
  • Combine information from the computation graph and the hardware
  • Optimize for both local execution and distributed scalability
  • Critical for both training and inference
[Figure: the compiler and optimizer infrastructure sits between the intermediate representation (IR), shown as a small graph over x, w and b with * and + producing y, and the execution backends.]
-
Wolong: Optimizer Stack for Deep Learning
• System innovation to bridge application and hardware
  • General computation graph optimization
  • Software and hardware co-design
  • Just-in-time compiler
• Transparent optimization
  • Communication efficiency
  • Accelerator execution efficiency
  • Memory efficiency
[Figure: the Wolong stack]
• Intermediate Representation (graph of operators)
• Global Optimizer: optimize distributed training over RDMA
  • Graph analyzer / RDMA memcpy library
• Local Optimizer: optimize execution with JIT compiler
  • Operator batching / kernel fusion
• Tensor Placement Optimizer: memory layout and placement optimization
• Execution Runtime: CPU, GPU, RDMA devices
-
Global Optimizer: Fast Distributed Deep Learning Computation over RDMA
-
Distributed Dataflow Graph Execution
• Deep learning computation is modeled as a dataflow graph
• Parallelism is achieved through graph partitioning
  • Model parallelism vs. data parallelism
• Tensor transmission across servers becomes the bottleneck
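[Figure: partition graph, then dispatch partitions. The graph over X, W1, W2, H and Y (operators * and σ) is split across two servers: Server0 holds X, W1, *, σ and H and ends in a Send; Server1 starts with a Recv and holds *, W2 and Y.]
[Figure: Send-Recv Benchmark. Latency (ms, log scale from 1 to 10000) vs. message size (MB), gRPC vs. RDMA, annotated "10x".]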
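The partition-and-dispatch step can be sketched in plain C++ as follows. This is a minimal illustration under stated assumptions, not Wolong's or TensorFlow's implementation: the `Node` struct, `insert_send_recv` and `build_example_graph` are hypothetical names, and the placement of each node on a server is hard-coded rather than computed by a partitioner.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical graph node: operator name, the server it is placed on,
// and the indices of the nodes whose outputs it consumes.
struct Node {
    std::string op;
    int server;
    std::vector<int> inputs;
};

// For every edge whose producer and consumer sit on different servers,
// splice in a Send node (on the producer's server) and a Recv node (on the
// consumer's server). Returns the number of cross-server edges rewritten.
int insert_send_recv(std::vector<Node>& graph) {
    int rewritten = 0;
    const std::size_t original_size = graph.size();
    for (std::size_t consumer = 0; consumer < original_size; ++consumer) {
        for (std::size_t slot = 0; slot < graph[consumer].inputs.size(); ++slot) {
            const int producer = graph[consumer].inputs[slot];
            if (graph[producer].server == graph[consumer].server) continue;
            const int send = static_cast<int>(graph.size());
            graph.push_back({"Send", graph[producer].server, {producer}});
            const int recv = static_cast<int>(graph.size());
            graph.push_back({"Recv", graph[consumer].server, {send}});
            graph[consumer].inputs[slot] = recv;  // consumer now reads from Recv
            ++rewritten;
        }
    }
    return rewritten;
}

// The example from the slide: H = sigma(X * W1) on server 0, Y = H * W2 on server 1.
std::vector<Node> build_example_graph() {
    return {
        {"X", 0, {}},           // node 0
        {"W1", 0, {}},          // node 1
        {"MatMul", 0, {0, 1}},  // node 2
        {"Sigmoid", 0, {2}},    // node 3, produces H
        {"W2", 1, {}},          // node 4
        {"MatMul", 1, {3, 4}},  // node 5, produces Y
    };
}
```

Calling `insert_send_recv` on the example graph rewrites the single cross-server edge (H into the second MatMul), which is the tensor transmission the slide identifies as the bottleneck.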
-
General Message Passing Library (e.g., RPC)
• Unavoidable memory copy overhead in RPC
  • Generally designed for dynamic data structures
  • Lacks knowledge of actual data placement and size
  • Extra memory copy from data serialization
• Software/hardware co-design to completely remove memory copy overhead
  • Leverage runtime application information
  • RDMA network
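[Figure: the same two-server graph (Server0: X, W1, *, σ, H, Send; Server1: Recv, *, W2, Y). With RPC, a tensor t passes from application memory through an RPC managed buffer on the sending side, and through another RPC managed buffer into application memory on the receiving side.]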
-
Combine Dataflow Graph Computation with RDMA
• Tensor abstraction in deep learning computation
  • Consists of a plain byte array of sufficiently large size (tens of KB to MB)
  • Does NOT require variant data serialization/deserialization
  • Does NOT require extra batching since the access pattern is already sequential
• RDMA makes it possible to manage local and distributed memory in a unified view
  • One-sided RDMA read/write: efficient memory copy between host memory
  • GPUDirect RDMA: efficient memory copy between host and device memory
• Global graph optimizer for distributed computation
  • Has the entire view and control of memory placement among devices and servers
  • Capable of making globally optimized strategies for tensor data placement at runtime
-
Optimized Communication Mechanism
• Transfer statically placed tensors through one-sided RDMA write
  • Phase I: graph analysis
  • Phase II: graph execution
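[Figure: RDMA-based zero-copy communication between the Send on Server0 and the Recv on Server1 of the two-server graph.]
• Tensor Manager (sending side): detect where the source tensor is placed and re-allocate it as RDMA memory
• Tensor Manager (receiving side): pre-allocate an RDMA-compatible receive tensor
• RDMA lib: conduct the remote memory copy
• The source tensor carries a flag byte set to 1 and is copied into the destination tensor by a one-sided RDMA write; the receiver polls the destination flag byte, which starts at 0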
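The flag-byte protocol for statically placed tensors can be emulated inside a single process, as in the sketch below. Everything here is an assumption made for illustration: `TensorBuffer`, `preallocate_receive_tensor`, `emulated_rdma_write` and `try_receive` are hypothetical names, the flag byte is placed after the payload, and `std::memcpy` stands in for the one-sided RDMA write. No real RDMA library is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical layout: payload bytes followed by one trailing flag byte.
struct TensorBuffer {
    std::vector<std::uint8_t> bytes;
    explicit TensorBuffer(std::size_t payload_size) : bytes(payload_size + 1, 0) {}
    std::uint8_t* payload() { return bytes.data(); }
    std::uint8_t& flag() { return bytes.back(); }
};

// Phase I (graph analysis): the receiver pre-allocates the destination
// tensor once, with its flag byte cleared to 0.
TensorBuffer preallocate_receive_tensor(std::size_t payload_size) {
    return TensorBuffer(payload_size);
}

// Phase II (graph execution): the sender sets its own flag to 1 and copies
// payload plus flag in one transfer. memcpy is only a stand-in for the
// one-sided RDMA write; the receiver's CPU takes no part in the copy.
void emulated_rdma_write(TensorBuffer& src, TensorBuffer& dst) {
    assert(src.bytes.size() == dst.bytes.size());
    src.flag() = 1;
    std::memcpy(dst.bytes.data(), src.bytes.data(), src.bytes.size());
}

// Receiver side: poll the flag byte. When it reads 1 the payload has
// arrived; clear the flag so the buffer can be reused next iteration.
bool try_receive(TensorBuffer& dst) {
    if (dst.flag() == 0) return false;
    dst.flag() = 0;
    return true;
}
```

Because the destination buffer is allocated once during graph analysis, no per-iteration allocation, serialization or intermediate copy is needed. A real receiver would poll memory written by the network card, which needs volatile or atomic accesses that this single-threaded sketch omits.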
-
Global Optimizer: Performance Evaluation
• Improve training throughput, convergence speed and scalability
[Figure: Deep Learning Benchmarks. Throughput (mini-batches/sec, axis from 0 to 700) for AlexNet, Inception, FC, LSTM, GRU and VGG16, TensorFlow(gRPC) vs. Wolong. Speedup labels on the chart: 4.2x, 2.6x, 4.0x, 2.3x, 1.8x, 8.1x.]
[Figure: Convergence of Seq2Seq Translation. Perplexity (log scale from 10 to 100000) vs. run time (0 to 2500 s), TensorFlow(gRPC) vs. Wolong(RDMA), annotated "2-3x speed up".]
More details in our paper: RPC Considered Harmful: Fast Distributed Deep Learning on RDMA
* Experiments are conducted on 8 servers with 8 Nvidia GTX 1080 GPUs; the translation model uses the WMT'15 datasets.
-
Local Optimizer: Kernel Fusion for Deep Learning on GPU
-
Motivation
• Deep learning frameworks model computation as a graph of primitive operators
  • Expressivity to represent arbitrary neural network structures
  • Flexibility to run on multiple devices and servers through graph partitioning
• Significant framework overhead to schedule thousands of operators
  • Kernel-launch overhead
  • Cross-operator communication overhead
  • Too fine-grained to leverage vendors' libraries
• Example: 80-step LSTM model
  • Contains 1686 operators in TensorFlow
[Figure: LSTM 512x512 (80 steps). Runtime (ms, axis from 0 to 40) for TensorFlow, CUDNN and FuseKernel.]
-
DL Frameworks vs. Vendor Provided Library
• Deep learning frameworks
  • E.g., TensorFlow, PyTorch, CNTK
  • Embrace flexibility and expressivity
  • Performance inefficiency
• Hardware-specific libraries
  • E.g., cuDNN, cuBLAS, MKL
  • Designed for extreme efficiency
  • Impossible to handle customized or new network structures
• DL framework + compiler
  • Generate library-like code at runtime
  • Get the best of both worlds
[Figure: the three options compared on flexibility and efficiency.]
-
Wolong Compiler Design
• Computation graph level optimization
  • Graph rewriting based on computational equivalence
  • Common subexpression elimination, constant folding, etc.
  • Operator batching: automatically batch operators of the same type to better leverage batch efficiency
• Target- and application-specific runtime compilation
  • Static shape and type inference
  • Static memory planning
  • Aggressive kernel fusion
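[Figure: the computational graph (IR), shown as a small graph over x, w and b with * and + producing y, passes through the graph level optimizer and the target specific JIT compiler to the execution runtime.]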
-
Wolong Compiler Execution Workflow
[Figure: runtime optimization before graph execution. An input graph built from operators such as MatMul, Relu, Mul, BiasAdd and Sigmoid (an LSTM) is rewritten into a graph in which a single FusedKernel node sits between Input and Output. Stages shown: detect optimizable subgraph, operator batching, shape inference, memory planning, code generation, JIT compile, rewrite graph. Later iterations hit the cache.]
-
Graph Level Optimization: Operator Batching
• Automatically conduct GEMM fusion and static memory placement optimization
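[Figure: two operators, MatMul(M0[2,4], M2[4,1]) = O0[2,1] and MatMul(M1[2,4], M2[4,1]) = O1[2,1], are batched into one: MatMul(Concat(M0, M1), M2[4,1]) = Concat(O0, O1). In memory, Tensor M0(2,4) and Tensor M1(2,4) together form the concatenated (4,4) tensor.]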
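The equivalence behind operator batching can be checked with a small host-side sketch. It uses the shapes from the slide; the helper names `matmul`, `concat_rows` and `batched_matmul` are hypothetical, and a naive triple loop stands in for a real GEMM library call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major storage

// Naive C[m,n] = A[m,k] * B[k,n]; a stand-in for a GEMM library call.
Matrix matmul(const Matrix& a, const Matrix& b,
              std::size_t m, std::size_t k, std::size_t n) {
    Matrix c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < k; ++p)
                c[i * n + j] += a[i * k + p] * b[p * n + j];
    return c;
}

// Row-wise concatenation. With row-major storage this is a plain append,
// so if the two inputs are placed back to back in memory ahead of time,
// the concatenated tensor exists without any copy.
Matrix concat_rows(const Matrix& top, const Matrix& bottom) {
    Matrix out(top);
    out.insert(out.end(), bottom.begin(), bottom.end());
    return out;
}

// Batched form from the slide: one [4,4] x [4,1] product replaces the two
// [2,4] x [4,1] products that share the right-hand operand M2.
Matrix batched_matmul(const Matrix& m0, const Matrix& m1, const Matrix& m2) {
    return matmul(concat_rows(m0, m1), m2, 4, 4, 1);
}
```

The batched result equals Concat(O0, O1) element for element, so the rewrite changes only how many GEMM calls are issued, not the values computed.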
-
JIT Compilation: Kernel Fusion
• Leverage aggressive kernel fusion to completely remove scheduling overhead
• Element-wise (i.e., point-wise) operators
  • No cross-element dependency between operators
  • Better leverage cache and register locality
h = sigmoid(x1 + x2)
[Figure: graph x1, x2 → Add → Sig → h, with elements 0 to 3 of each tensor lined up index by index.]
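// One thread per element: Add and Sigmoid run back to back in registers,
// so the intermediate sum never goes through device memory.
// sigmoidf is the device function listed on the Code Generation slide.
__global__ void kernel_0(float *x1, float *x2, float *h) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < 1024) {
    float temp0 = x1[idx] + x2[idx];
    float temp1 = sigmoidf(temp0);
    h[idx] = temp1;
  }
}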
-
JIT Compilation: Kernel Fusion
• Fuse arbitrary (non-element-wise) operators into a single kernel
  • Operator data dependencies may introduce cross-thread data dependencies in the kernel
  • Need global synchronization to guarantee correctness
  • Cross-operator communication uses device memory
• E.g., fuse two matrix multiplications: Z = A × B × C
[Figure: graph A, B → MM → MM → Z, with C as the second input of the second MM.]
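// Pseudocode: MatMul_f and Global_Sync stand for the generated matrix
// multiplication device function and a grid-wide barrier; buffer is an
// intermediate tensor in device memory (its declaration is not shown).
__global__ void kernel_0(float *A, float *B, float *C, float *Z) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < 1024) {
    buffer[idx] = MatMul_f(A, B);
  }
  // Every thread must reach the barrier before any thread reads buffer.
  Global_Sync();
  if (idx < 1024) {
    Z[idx] = MatMul_f(buffer, C);
  }
}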
-
Graph Computation in DL Frameworks
• Operators (kernels) are scheduled (launched) one by one
[Figure: a CPU drives a GPU with six SMs over PCIe. MatMul and Conv are launched as separate kernels, one with blocks block0 to block10 and one with blocks block0 to block8. Each launch incurs the overhead of kernel launching, DAG scheduling, memory copy, etc.]
-
Arbitrary Kernel Fusion Is Limited by GPU Architecture
• Hard to conduct global synchronization across all threads
[Figure: Conv and MatMul fused as a single kernel on a GPU with six SMs, with a barrier between the two operators. Conv needs blocks block0 to block10 and MatMul needs blocks block0 to block8. The blocks that are running never complete because they wait for the barrier, and the remaining blocks are never scheduled.]
-
Our Solution: Persistent Threads and Virtual Blocks
• Assign virtual block task to persistent threads
[Figure: Conv and MatMul fused as a single kernel on a GPU with six SMs. Six persistent blocks, block0 to block5, stay resident. The virtual blocks of one operator (vblock0 to vblock10) and of the other (vblock0 to vblock8) are assigned to them, with a barrier between the two operators.]
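The idea of mapping virtual blocks onto a fixed set of persistent blocks can be emulated on the host as below. This is a sketch under assumptions: the strided assignment (virtual block v goes to persistent block v mod P) is one common choice and is not confirmed by the slides, and `assign_virtual_blocks` and `run_fused` are hypothetical names. The sequential loop over operators plays the role of the global barrier.

```cpp
#include <cassert>
#include <vector>

// For one operator, returns which persistent block executes each virtual
// block. Each persistent block walks the virtual blocks with a stride equal
// to the number of persistent blocks (an assumed assignment policy).
std::vector<int> assign_virtual_blocks(int num_virtual_blocks,
                                       int num_persistent_blocks) {
    std::vector<int> owner(num_virtual_blocks, -1);
    for (int block = 0; block < num_persistent_blocks; ++block) {
        for (int v = block; v < num_virtual_blocks; v += num_persistent_blocks) {
            owner[v] = block;
        }
    }
    return owner;
}

// Emulates a fused kernel made of several operators. Returns, per operator,
// how many of its virtual blocks were executed. In a real kernel all
// persistent blocks would hit a global barrier between operators; since
// every one of them is resident, the barrier can always complete.
std::vector<int> run_fused(const std::vector<int>& vblocks_per_operator,
                           int num_persistent_blocks) {
    std::vector<int> executed;
    for (int n : vblocks_per_operator) {
        const std::vector<int> owner =
            assign_virtual_blocks(n, num_persistent_blocks);
        int done = 0;
        for (int o : owner) {
            if (o >= 0) ++done;
        }
        executed.push_back(done);
        // (global barrier between operators would go here)
    }
    return executed;
}
```

With 6 persistent blocks, operators with 11 and 9 virtual blocks both run to completion, whereas launching 11 real blocks with a barrier would leave 5 of them unscheduled.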
-
Kernel Packing
• Explore graph level parallelism in static code generation
[Figure: a GPU with six SMs and persistent blocks block0 to block5. The example graph contains σ and MatMul operators with tensor shapes [64, 512], [512, 128] and [64, 128]. Individual operators have only a few virtual blocks each (sets such as vblock0 to vblock6, vblock0 to vblock1 and vblock0 to vblock3 are shown), which leaves GPU resources underutilized.]
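A rough way to see why packing helps is to count how many rounds ("waves") of persistent blocks are needed with and without it. The sketch below is illustrative only: it assumes the packed operators are independent of each other and that every virtual block costs the same, and the names `waves`, `waves_unpacked` and `waves_packed` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Rounds needed to run n equally sized virtual blocks on p persistent blocks.
int waves(int n, int p) { return (n + p - 1) / p; }

// Operators run one after another: each pays for its own partly empty round.
int waves_unpacked(const std::vector<int>& vblocks_per_operator, int p) {
    int total = 0;
    for (int n : vblocks_per_operator) total += waves(n, p);
    return total;
}

// Independent operators are packed into one pool of virtual blocks, so idle
// persistent blocks pick up work from another operator in the same round.
int waves_packed(const std::vector<int>& vblocks_per_operator, int p) {
    int total_vblocks = 0;
    for (int n : vblocks_per_operator) total_vblocks += n;
    return waves(total_vblocks, p);
}
```

For example, two independent operators with 2 and 4 virtual blocks need two rounds on 6 persistent blocks when run separately, but only one round when packed.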
-
Code Generation
[Figure: the graph C = Relu(MatMul(σ(A), B)) goes through the Code Generator, which draws on operator kernels written as device functions: MatMul, Add, Mul, Sub, Div, Relu, Sigmoid, Tanh, Split, Max, Min, Convolution, etc.]

Operator device functions

//device function for sigmoid operator
__device__ float sigmoidf(float in) {
  return 1.f / (1.f + expf(-in));
}
…
//device function for relu operator
__device__ float reluf(float in) {
  return fmaxf(0.f, in);
}
//device function for MatMul operator
//(tix, tiy: the output element assigned to the calling thread)
__device__ void MatMul(float *a, float *b, float *c, int m, int n, int k) {
  if (tix < m && tiy < k) {
    float temp = 0.f;
    for (int i = 0; i < n; ++i) {
      temp += a[tix*n+i] * b[k*i+tiy];
    }
    c[k * tix + tiy] = temp;
  }
}

//kernel code generated by Wolong compiler
//(buffer0, buffer1, temp1 and the loop bounds are not defined on the slide)
__global__ void kernel_0(float *A, float *B, float *C) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < 1024) {
    float temp0 = sigmoidf(A[idx]);
    buffer0[idx] = temp0;
  }
  GlobalSync();
  for (int tix = bx; tix < ey; tix += offx) {
    for (int tiy = by; tiy < ey; tiy += offy) {
      MatMul(buffer0, B, buffer1, 1024, 1024, 128);
    }
  }
  if (idx < 1024) {
    float temp2 = reluf(temp1);
    C[idx] = temp2;
  }
}
-
Performance of End-to-end Kernel Fusion
• RNN inference benchmark (LSTM, 128 units, 80 steps)
• Experiments are conducted on Nvidia GTX 1080 Ti GPUs
[Figure: LSTM Benchmark. Average runtime (ms, axis from 0 to 35) for TensorFlow, XLA, and Wolong's OpBatch and Fusion, annotated "10.9x". Wolong fused 1686 operators into 1 kernel (fused_kernel).]
-
Conclusion
• A compiler infrastructure is critical for both cloud and edge AI
  • Optimize for fast distributed training in the cloud
  • Optimize for efficient inference on accelerator devices
• System innovations to bridge applications and diverse hardware
  • Common intermediate representation (IR)
  • Co-design software and hardware for extreme efficiency
• The Wolong prototype has demonstrated initial improvements
  • Up to 8x speedup on training workloads
  • Up to 10x speedup on inference benchmark
-
Thank You!
-
Distributed Graph Optimizer of Wolong
• Transfer dynamically allocated tensors through RDMA write/read
  • Phase I: graph analysis
  • Phase II: graph execution
• Supports GPUDirect RDMA as well
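[Figure: transfer between the Send on Server0 and the Recv on Server1 of the two-server graph. The sender delivers the tensor metadata, with a flag that changes from 0 to 1, through a one-sided RDMA write; the receiver then allocates the destination tensor and fetches the source tensor through a one-sided RDMA read.]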
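The two-step protocol for dynamically allocated tensors can also be emulated in one process. The sketch below is an assumption-laden illustration: `TensorMeta`, `emulated_write_meta` and `try_receive_dynamic` are hypothetical names, the metadata is assumed to hold the source address and size plus a flag, and plain assignments and `std::memcpy` stand in for the one-sided RDMA write and read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical metadata slot on the receiver: where the source tensor
// lives, how large it is, and a flag that signals "metadata is valid".
struct TensorMeta {
    const std::uint8_t* address = nullptr;
    std::size_t size = 0;
    std::uint8_t flag = 0;
};

// Sender side: publish metadata for a tensor whose size is only known at
// run time. Stands in for a one-sided RDMA write into the receiver's slot.
void emulated_write_meta(const std::vector<std::uint8_t>& source,
                         TensorMeta& slot) {
    slot.address = source.data();
    slot.size = source.size();
    slot.flag = 1;
}

// Receiver side: poll the flag; once set, allocate a destination tensor of
// the advertised size and pull the data. memcpy stands in for a one-sided
// RDMA read from the sender's memory.
bool try_receive_dynamic(TensorMeta& slot, std::vector<std::uint8_t>& dest) {
    if (slot.flag == 0) return false;
    dest.assign(slot.size, 0);  // allocation happens only now
    if (slot.size > 0) {
        std::memcpy(dest.data(), slot.address, slot.size);
    }
    slot.flag = 0;  // ready for the next transfer
    return true;
}
```

Compared with the static case, the only pre-allocated region is the small fixed-size metadata slot; the tensor itself is allocated after its size is known.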