
  • Systems | Fueling future disruptions

    Research Faculty Summit 2018

  • Wolong: A Back-end Optimizer for Deep Learning Computation

    Jilong Xue, Researcher, Microsoft Research Asia

  • System Challenge in Deep Learning

    • Innovations are emerging very fast in the deep learning area
      • New DNN models and workload patterns: RNN, CNN, GAN, reinforcement learning, graph neural networks, etc.
      • Diverse and emerging hardware accelerators: GPU, FPGA, ASICs, edge devices, NVLink, RDMA, etc.

    • The compiler stack is key to bridging frameworks and hardware
      • Combines information from the computation graph and the hardware
      • Optimizes for both local execution and distributed scalability
      • Critical for both training and inference

    [Figure: the compiler and optimizer infrastructure sits between the intermediate representation (IR) of the computation graph (e.g., y = x * w + b) and the execution backends.]

  • Wolong: Optimizer Stack for Deep Learning

    • System innovations to bridge applications and hardware
      • General computation graph optimization
      • Software and hardware co-design
      • Just-in-time compilation

    • Transparent optimization
      • Communication efficiency
      • Accelerator execution efficiency
      • Memory efficiency

    [Architecture diagram: the intermediate representation (a graph of operators) feeds three optimizers:
      • Global Optimizer: optimizes distributed training over RDMA (graph analyzer, RDMA memcpy library)
      • Local Optimizer: optimizes execution with a JIT compiler (operator batching, kernel fusion)
      • Tensor Placement Optimizer: memory layout and placement optimization
    all running on top of the execution runtime (CPU, GPU, RDMA devices).]

  • Global Optimizer: Fast Distributed Deep Learning Computation over RDMA

  • Distributed Dataflow Graph Execution

    • Deep learning computation is modeled as a dataflow graph
    • Parallelism is achieved through graph partitioning
      • Model parallelism vs. data parallelism
      • Tensor transmission across servers becomes the bottleneck

    [Figure: the graph computing Y = σ(X × W1) × W2 is partitioned and its partitions dispatched across Server0 and Server1, with Send/Recv operators inserted at the cut edge.]

    [Chart: Send-Recv benchmark, latency (ms) vs. message size (MB); RDMA is roughly 10x faster than gRPC.]

  • General Message Passing Library (e.g., RPC)

    • Unavoidable memory-copy overhead in RPC
      • Generally designed for dynamic data structures
      • Lacks knowledge of the actual data placement and size
      • Extra memory copies from data serialization

    • Software/hardware co-design to completely remove memory-copy overhead
      • Leverage runtime application information
      • RDMA network

    [Figure: with an RPC library, a tensor sent from Server0 to Server1 is copied from application memory into an RPC-managed buffer, transmitted, and copied again through the receiver's RPC-managed buffer into application memory.]

  • Combine Dataflow Graph Computation with RDMA

    • Tensor abstraction in deep learning computation
      • Consists of a plain byte array of sufficiently large size (tens of KB to MB)
      • Does NOT require variant data serialization/deserialization
      • Does NOT require extra batching, since the access pattern is already sequential

    • RDMA enables managing local and distributed memory in a unified view (see the sketch after this list)
      • One-sided RDMA read/write: efficient memory copy between host memories
      • GPUDirect RDMA: efficient memory copy between host and device memory

    • Global graph optimizer for distributed computation
      • Has the entire view and control of memory placement across devices and servers
      • Capable of making a globally optimized strategy for tensor data placement at runtime
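    A minimal sketch of the one-sided transfer described above, using the ibverbs API (an illustration only, not Wolong's actual code). It assumes the queue pair is already connected and that both tensor buffers were registered as RDMA memory regions; the destination address and rkey come from the peer's registration.

      // One-sided RDMA write of a tensor buffer (sketch; error handling omitted).
      #include <infiniband/verbs.h>
      #include <stdint.h>
      #include <string.h>

      int rdma_write_tensor(struct ibv_qp *qp, struct ibv_mr *local_mr,
                            void *local_tensor, size_t bytes,
                            uint64_t remote_addr, uint32_t rkey) {
          struct ibv_sge sge;
          memset(&sge, 0, sizeof(sge));
          sge.addr   = (uintptr_t)local_tensor;       // source tensor (RDMA-registered)
          sge.length = (uint32_t)bytes;
          sge.lkey   = local_mr->lkey;

          struct ibv_send_wr wr, *bad_wr = NULL;
          memset(&wr, 0, sizeof(wr));
          wr.opcode              = IBV_WR_RDMA_WRITE; // one-sided: no receiver CPU involvement
          wr.sg_list             = &sge;
          wr.num_sge             = 1;
          wr.send_flags          = IBV_SEND_SIGNALED; // completion event for the sender only
          wr.wr.rdma.remote_addr = remote_addr;       // pre-allocated destination tensor
          wr.wr.rdma.rkey        = rkey;

          return ibv_post_send(qp, &wr, &bad_wr);     // 0 on success
      }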

  • Optimized Communication Mechanism

    • Statically placed tensors are transferred through one-sided RDMA write
      • Phase I: graph analysis
      • Phase II: graph execution

    [Figure: RDMA-based zero-copy communication. In the graph-analysis phase, the Tensor Manager on Server0 detects the source tensor's placement and re-allocates it as RDMA memory, while the Tensor Manager on Server1 pre-allocates an RDMA-compatible receive tensor. In the execution phase, the RDMA library copies the source tensor directly into the destination tensor with a one-sided RDMA write; the receiver detects arrival by polling a flag byte that flips from 0 to 1 (sketched below).]
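    A minimal receiver-side sketch of the flag-byte polling shown in the figure above (an assumption about the details, not Wolong's actual code). The destination tensor is pre-allocated as RDMA-registered memory with a trailing flag byte; the sender's one-sided write covers the payload followed by the flag, so the receiver only spins on that byte.

      #include <stdint.h>

      // Blocks until the incoming tensor has fully arrived, then re-arms the flag.
      // Assumes the final byte of the RDMA write is placed last, a property this
      // polling pattern commonly relies on.
      static void wait_for_tensor(volatile uint8_t *flag_byte) {
          while (*flag_byte == 0) {
              /* spin: the one-sided RDMA write sets the flag to 1 after the payload */
          }
          *flag_byte = 0;   // reset so the buffer can be reused in the next iteration
      }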

  • Global Optimizer: Performance Evaluation

    • Improve training throughput, convergence speed and scalability

    [Chart: training throughput (mini-batches/sec) on deep learning benchmarks (AlexNet, Inception, FC, LSTM, GRU, VGG16), TensorFlow (gRPC) vs. Wolong; Wolong's speedups across the six benchmarks are 4.2x, 2.6x, 4.0x, 2.3x, 1.8x, and 8.1x.]

    [Chart: convergence of Seq2Seq translation, perplexity vs. run time (s); Wolong (RDMA) converges 2-3x faster than TensorFlow (gRPC).]

    More details in our paper: "RPC Considered Harmful: Fast Distributed Deep Learning on RDMA". Experiments were conducted on 8 servers with 8 Nvidia GTX 1080 GPUs; the translation model uses the WMT'15 dataset.

  • Local Optimizer: Kernel Fusion for Deep Learning on GPU

  • Motivation

    • Deep learning frameworks model computation as a graph of primitive operators
      • Expressivity to represent arbitrary neural network structures
      • Flexibility to run across multiple devices and servers through graph partitioning

    • Significant framework overhead to schedule thousands of operators
      • Kernel-launch overhead
      • Cross-operator communication overhead
      • Too fine-grained to leverage vendor libraries

    • Example: an 80-step LSTM model contains 1686 operators in TensorFlow

    [Chart: runtime (ms) of an LSTM 512x512 (80 steps) for TensorFlow, cuDNN, and a fused kernel.]

  • DL Frameworks vs. Vendor Provided Library

    • Deep learning frameworks (e.g., TensorFlow, PyTorch, CNTK)
      • Embrace flexibility and expressivity
      • Suffer from performance inefficiency

    • Hardware-specific libraries (e.g., cuDNN, cuBLAS, MKL)
      • Designed for extreme efficiency
      • Impossible to handle customized or new network structures

    • DL framework + compiler
      • Generates library-like code at runtime
      • Wins both worlds on the flexibility-efficiency spectrum

  • Wolong Compiler Design

    • Computation-graph-level optimization
      • Graph rewriting based on computational equivalence (a small constant-folding sketch follows the diagram below)
      • Common subexpression elimination, constant folding, etc.
      • Operator batching: automatically batch operators of the same type to better leverage batch efficiency

    • Target- and application-specific runtime compilation
      • Static shape and type inference
      • Static memory planning
      • Aggressive kernel fusion

    [Diagram: the computational graph IR (e.g., y = x * w + b) passes through the graph-level optimizer and the target-specific JIT compiler before being handed to the execution runtime.]
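    A minimal sketch of one equivalence-based rewrite, constant folding, over a toy operator graph (an illustration only; this is not Wolong's IR or API).

      // Fold subgraphs whose inputs are all constants into a single Const node.
      #include <memory>
      #include <vector>

      struct Node {
          enum Kind { Const, Add, Mul } kind;
          float value = 0.0f;                          // valid when kind == Const
          std::vector<std::shared_ptr<Node>> inputs;   // two inputs for Add/Mul
      };

      std::shared_ptr<Node> fold_constants(const std::shared_ptr<Node>& n) {
          if (n->kind == Node::Const) return n;
          bool all_const = true;
          std::vector<std::shared_ptr<Node>> folded;
          for (const auto& in : n->inputs) {
              auto f = fold_constants(in);             // rewrite inputs bottom-up
              all_const = all_const && (f->kind == Node::Const);
              folded.push_back(f);
          }
          n->inputs = folded;
          if (!all_const) return n;                    // cannot fold this node
          auto c = std::make_shared<Node>();
          c->kind  = Node::Const;
          c->value = (n->kind == Node::Add) ? folded[0]->value + folded[1]->value
                                            : folded[0]->value * folded[1]->value;
          return c;                                    // replaces the folded subgraph
      }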

  • Wolong Compiler Execution Workflow

    [Workflow diagram: runtime optimization before graph execution. The input graph (MatMul, BiasAdd, Sigmoid, Relu, Mul, LSTM operators) is scanned to detect optimizable subgraphs; operator batching is applied; static shape inference and memory planning drive code generation; the result is JIT compiled into a FusedKernel, and the graph is rewritten to call it. Later iterations hit the compilation cache.]

  • Graph Level Optimization: Operator Batching

    • Automatically conducts GEMM fusion and static memory-placement optimization (see the sketch after the figure)

    [Figure: two MatMuls sharing the weight M2[4,1], M0[2,4] × M2 = O0[2,1] and M1[2,4] × M2 = O1[2,1], are batched into a single MatMul: Concat(M0, M1)[4,4] × M2[4,1] = Concat(O0, O1)[4,1].]
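    A minimal CUDA sketch of the batching above (an illustration, not Wolong's generated code): instead of two separate MatMul launches, the concatenated input is multiplied by the shared weight in a single kernel. Shapes follow the slide.

      __global__ void batched_matmul(const float *m0m1,    // Concat(M0, M1): [4, 4]
                                     const float *m2,      // shared weight M2: [4, 1]
                                     float *o0o1) {        // Concat(O0, O1): [4, 1]
          int row = blockIdx.x * blockDim.x + threadIdx.x; // one thread per output row
          if (row < 4) {
              float acc = 0.0f;
              for (int k = 0; k < 4; ++k)
                  acc += m0m1[row * 4 + k] * m2[k];        // naive dot product
              o0o1[row] = acc;                             // rows 0-1 = O0, rows 2-3 = O1
          }
      }
      // Example launch: batched_matmul<<<1, 4>>>(d_concat_in, d_m2, d_concat_out);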

  • JIT Compilation: Kernel Fusion

    • Leverage aggressive kernel fusion to completely remove scheduling overhead
    • Element-wise (i.e., point-wise) operators
      • No cross-element dependency between operators
      • Better leverage cache and register locality

    [Figure: fusing Add and Sigmoid for h = sigmoid(x1 + x2); each thread (0, 1, 2, 3, ...) processes one element of x1 and x2 end to end.]

      __global__ void kernel_0(float *x1, float *x2, float *h) {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;
          if (idx < 1024) {
              float temp0 = x1[idx] + x2[idx];
              float temp1 = sigmoidf(temp0);
              h[idx] = temp1;
          }
      }
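    A hypothetical host-side launch of the fused kernel above (the slide does not show one); 1024 elements with 256 threads per block is an assumed configuration.

      // Assumes d_x1, d_x2, d_h are device buffers of 1024 floats allocated with cudaMalloc.
      void launch_fused(float *d_x1, float *d_x2, float *d_h) {
          const int n = 1024, threads = 256;
          kernel_0<<<(n + threads - 1) / threads, threads>>>(d_x1, d_x2, d_h);
          cudaDeviceSynchronize();   // one launch replaces separate Add and Sigmoid launches
      }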

  • JIT Compilation: Kernel Fusion

    • Fuse arbitrary (non-element-wise) operators into a single kernel
      • Operator data dependencies may introduce cross-thread data dependencies inside the kernel
      • Needs global synchronization to guarantee correctness
      • Cross-operator communication uses device memory

    • E.g., fuse two matrix multiplications: Z = A × B × C

    [Figure: dataflow graph of Z = (A × B) × C with two MatMul operators.]

      // Pseudocode from the slide: the intermediate A*B lives in a device-memory buffer,
      // and a global (grid-wide) synchronization separates the two fused MatMuls.
      __global__ void kernel_0(float *A, float *B, float *C, float *Z) {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;
          if (idx < 1024) {
              buffer[idx] = MatMul_f(A, B);   // each thread produces one element of A*B
          }
          Global_Sync();                       // all of A*B must be written before it is read
          if (idx < 1024) {
              Z[idx] = MatMul_f(buffer, C);    // the second MatMul consumes the intermediate
          }
      }

  • Graph Computation in DL Frameworks

    • Operators (kernels) are scheduled (launched) one by one

    [Diagram: the CPU launches the MatMul and Conv kernels one at a time over PCIe; each kernel's thread blocks (block0, block1, ...) are then scheduled across the GPU's SMs, incurring kernel-launch, DAG-scheduling, and memory-copy overhead between operators.]

  • Arbitrary Kernel Fusion Is Limited by GPU Architecture

    • Hard to conduct global synchronization across all threads

    [Diagram: when MatMul and Conv are fused into a single kernel with a barrier between them, the fused kernel may have more thread blocks (block0-block10) than the GPU's SMs can hold concurrently; the resident blocks never complete because they wait at the barrier for blocks that are never scheduled.]

  • Our Solution: Persistent Threads and Virtual Blocks

    • Assign virtual-block tasks to persistent threads (see the sketch after the diagram)

    [Diagram: the fused kernel is launched with only as many persistent blocks as the GPU can run concurrently (block0-block5); each persistent block iterates over the virtual blocks of MatMul (vblock0-vblock10), all blocks pass the barrier together, and then iterate over the virtual blocks of Conv (vblock0-vblock8).]
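    A minimal CUDA sketch of the persistent-thread / virtual-block pattern (an assumption about the details, not Wolong's generated code). Cooperative groups' grid-wide sync stands in for the barrier; the talk does not say how Wolong implements it.

      #include <cooperative_groups.h>
      namespace cg = cooperative_groups;

      __device__ void matmul_vblock(int vb) { /* work of one MatMul virtual block (omitted) */ }
      __device__ void conv_vblock(int vb)   { /* work of one Conv virtual block (omitted) */ }

      __global__ void fused_persistent(int matmul_vblocks, int conv_vblocks) {
          cg::grid_group grid = cg::this_grid();
          // Each persistent physical block strides over MatMul's virtual blocks.
          for (int vb = blockIdx.x; vb < matmul_vblocks; vb += gridDim.x)
              matmul_vblock(vb);
          grid.sync();   // grid-wide barrier: all of MatMul finishes before Conv starts
          // The same persistent blocks then process Conv's virtual blocks.
          for (int vb = blockIdx.x; vb < conv_vblocks; vb += gridDim.x)
              conv_vblock(vb);
      }
      // Launch cooperatively (e.g., with cudaLaunchCooperativeKernel) using no more blocks
      // than can be resident at once, so grid.sync() cannot deadlock.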

  • Kernel Packing

    • Explore graph-level parallelism in static code generation (see the sketch after the diagram)

    [Diagram: a small σ operator or a single MatMul ([64, 512] × [512, 128] → [64, 128]) occupies only a few virtual blocks and under-utilizes the GPU's SMs on its own; kernel packing places independent operators' virtual blocks into one kernel so they execute concurrently on the physical blocks (block0-block5).]
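    A minimal CUDA sketch of kernel packing (an illustration, not Wolong's generated code): two independent operators from the graph, a naive MatMul and an element-wise sigmoid, are packed into one kernel and mapped to disjoint ranges of physical blocks.

      __global__ void packed_kernel(const float *A, const float *B, float *AB,  // MatMul task
                                    const float *x, float *y, int n) {          // sigmoid task
          const int matmul_blocks = 4;   // assumed split of the launched blocks between tasks
          if (blockIdx.x < matmul_blocks) {
              // Blocks [0, matmul_blocks): AB = A (64x512) * B (512x128), one element per step.
              for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < 64 * 128;
                   idx += matmul_blocks * blockDim.x) {
                  int row = idx / 128, col = idx % 128;
                  float acc = 0.0f;
                  for (int k = 0; k < 512; ++k) acc += A[row * 512 + k] * B[k * 128 + col];
                  AB[idx] = acc;
              }
          } else {
              // Remaining blocks: y = sigmoid(x), element-wise.
              for (int idx = (blockIdx.x - matmul_blocks) * blockDim.x + threadIdx.x; idx < n;
                   idx += (gridDim.x - matmul_blocks) * blockDim.x) {
                  y[idx] = 1.f / (1.f + expf(-x[idx]));
              }
          }
      }
      // Example launch: packed_kernel<<<6, 256>>>(dA, dB, dAB, dx, dy, 1024);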

  • Code Generation

    [Figure: example graph C = Relu(MatMul(σ(A), B)). The code generator stitches per-operator device functions (MatMul, Add, Mul, Sub, Div, Relu, Sigmoid, Tanh, Split, Max, Min, Convolution, etc.) into one generated kernel.]

    Operator device functions:

      // device function for sigmoid operator
      __device__ float sigmoidf(float in) {
          return 1.f / (1.f + expf(-in));
      }

      // device function for relu operator
      __device__ float reluf(float in) {
          return fmaxf(0.f, in);
      }

      // device function for MatMul operator: each thread computes one output element c[tix, tiy]
      __device__ void MatMul(float *a, float *b, float *c, int m, int n, int k) {
          if (tix < m && tiy < k) {
              float temp = 0.f;
              for (int i = 0; i < n; ++i) {
                  temp += a[tix * n + i] * b[k * i + tiy];
              }
              c[k * tix + tiy] = temp;
          }
      }

    Generated kernel (buffer0 and buffer1 are intermediates placed in device memory by static memory planning; bx, by, ex, ey, offx, offy are virtual-block bounds and strides):

      // kernel code generated by Wolong compiler
      __global__ void kernel_0(float *A, float *B, float *C) {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;
          if (idx < 1024) {
              float temp0 = sigmoidf(A[idx]);
              buffer0[idx] = temp0;                    // σ(A)
          }
          GlobalSync();                                // barrier before the MatMul reads buffer0
          for (int tix = bx; tix < ex; tix += offx) {
              for (int tiy = by; tiy < ey; tiy += offy) {
                  MatMul(buffer0, B, buffer1, 1024, 1024, 128);
              }
          }
          if (idx < 1024) {
              float temp2 = reluf(buffer1[idx]);       // Relu over the MatMul output
              C[idx] = temp2;
          }
      }
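    The slides do not show how the generated source is compiled at runtime; a common approach, sketched below as an assumption, is to JIT-compile it with NVRTC and launch through the CUDA driver API, caching the module so later iterations skip compilation.

      #include <cuda.h>
      #include <nvrtc.h>
      #include <vector>

      // kernel_src is the generated source; the kernel is assumed to be declared
      // extern "C" so it can be looked up by the unmangled name "kernel_0".
      // Assumes a CUDA context already exists; error handling omitted.
      void jit_compile_and_launch(const char *kernel_src, void **kernel_args,
                                  unsigned grid, unsigned block) {
          nvrtcProgram prog;
          nvrtcCreateProgram(&prog, kernel_src, "kernel_0.cu", 0, nullptr, nullptr);
          const char *opts[] = {"--gpu-architecture=compute_61"};   // assumed target
          nvrtcCompileProgram(prog, 1, opts);
          size_t ptx_size;
          nvrtcGetPTXSize(prog, &ptx_size);
          std::vector<char> ptx(ptx_size);
          nvrtcGetPTX(prog, ptx.data());
          nvrtcDestroyProgram(&prog);

          CUmodule module;                 // in practice cached and reused across iterations
          CUfunction func;
          cuModuleLoadData(&module, ptx.data());
          cuModuleGetFunction(&func, module, "kernel_0");
          cuLaunchKernel(func, grid, 1, 1, block, 1, 1,
                         0 /* shared memory */, 0 /* default stream */,
                         kernel_args, nullptr);
      }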

  • Performance of End-to-end Kernel Fusion

    • RNN inference benchmark (LSTM, 128 units, 80 steps)

    • Experiments are conducted on Nvidia GTX 1080 Ti GPUs

    [Chart: LSTM benchmark, average runtime (ms) for TensorFlow, XLA, Wolong with operator batching (OpBatch), and Wolong with full kernel fusion (all 1686 operators fused into a single kernel); the chart annotates a 10.9x speedup.]

  • Conclusion

    • A compiler infrastructure is critical for both cloud and edge AI
      • Optimize for fast distributed training in the cloud
      • Optimize for efficient inference on accelerator devices

    • System innovations bridge applications and diverse hardware
      • Common intermediate representation (IR)
      • Co-design of software and hardware for extreme efficiency

    • The Wolong prototype has demonstrated initial improvements
      • Up to 8x speedup on training workloads
      • Up to 10x speedup on an inference benchmark

  • Systems | Fueling future disruptions

    Thank You!

  • Distributed Graph Optimizer of Wolong

    • Dynamically allocated tensors are transferred through RDMA write/read
      • Phase I: graph analysis
      • Phase II: graph execution
    • Supports GPUDirect RDMA as well

    [Figure: for a dynamically allocated tensor, the sender first pushes the tensor metadata with a one-sided RDMA write (flag byte 0 → 1); the receiver then allocates the destination tensor and pulls the payload from the source tensor with a one-sided RDMA read.]
