
  • Systems | Fueling future disruptions

    Research Faculty Summit 2018

  • Wolong: A Back-end Optimizer for Deep Learning Computation

    Jilong Xue, Researcher, Microsoft Research Asia

  • System Challenge in Deep Learning

    • Innovations are emerging very fast in the deep learning area
      • New DNN models and workload patterns: RNN, CNN, GAN, reinforcement learning, graph neural networks, etc.
      • Diverse and emerging hardware accelerators: GPU, FPGA, ASICs, edge devices, NVLink, RDMA, etc.

    • The compiler stack is key to bridging frameworks and hardware
      • Combines information from the computation graph and the hardware
      • Optimizes for both local execution and distributed scalability
      • Critical for both training and inference

    [Figure: the compiler and optimizer infrastructure sits between the intermediate representation (IR) of the computation graph (e.g., y = x * w + b) and the execution backends.]

  • Wolong: Optimizer Stack for Deep Learning

    • System innovations to bridge applications and hardware
      • General computation graph optimization
      • Software and hardware co-design
      • Just-in-time compilation

    • Transparent optimization
      • Communication efficiency
      • Accelerator execution efficiency
      • Memory efficiency

    [Architecture diagram: the intermediate representation (a graph of operators) feeds three optimizers:
      • Global Optimizer: optimizes distributed training over RDMA (graph analyzer, RDMA memcpy library)
      • Local Optimizer: optimizes execution with a JIT compiler (operator batching, kernel fusion)
      • Tensor Placement Optimizer: memory layout and placement optimization
    all running on top of the execution runtime (CPU, GPU, RDMA devices).]

  • Global Optimizer: Fast Distributed Deep Learning Computation over RDMA

  • Distributed Dataflow Graph Execution

    • Deep learning computation is modeled as a dataflow graph
    • Parallelism is achieved through graph partitioning
      • Model parallelism vs. data parallelism
      • Tensor transmission across servers becomes the bottleneck

    [Figure: the graph computing Y = σ(X × W1) × W2 is partitioned and its partitions dispatched across Server0 and Server1, with Send/Recv operators inserted at the cut edge.]

    [Chart: Send-Recv benchmark, latency (ms) vs. message size (MB); RDMA is roughly 10x faster than gRPC.]

  • General Message Passing Library (e.g., RPC)

    • Unavoidable memory-copy overhead in RPC
      • Generally designed for dynamic data structures
      • Lacks knowledge of the actual data placement and size
      • Extra memory copies from data serialization

    • Software/hardware co-design to completely remove memory-copy overhead
      • Leverage runtime application information
      • RDMA network

    [Figure: with an RPC library, a tensor sent from Server0 to Server1 is copied from application memory into an RPC-managed buffer, transmitted, and copied again through the receiver's RPC-managed buffer into application memory.]

  • Combine Dataflow Graph Computation with RDMA

    • Tensor abstraction in deep learning computation
      • Consists of a plain byte array of sufficiently large size (tens of KB to MB)
      • Does NOT require variant data serialization/deserialization
      • Does NOT require extra batching, since the access pattern is already sequential

    • RDMA enables managing local and distributed memory in a unified view (see the sketch after this list)
      • One-sided RDMA read/write: efficient memory copy between host memories
      • GPUDirect RDMA: efficient memory copy between host and device memory

    • Global graph optimizer for distributed computation
      • Has the entire view and control of memory placement across devices and servers
      • Capable of making a globally optimized strategy for tensor data placement at runtime
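    A minimal sketch of the one-sided transfer described above, using the ibverbs API (an illustration only, not Wolong's actual code). It assumes the queue pair is already connected and that both tensor buffers were registered as RDMA memory regions; the destination address and rkey come from the peer's registration.

      // One-sided RDMA write of a tensor buffer (sketch; error handling omitted).
      #include <infiniband/verbs.h>
      #include <stdint.h>
      #include <string.h>

      int rdma_write_tensor(struct ibv_qp *qp, struct ibv_mr *local_mr,
                            void *local_tensor, size_t bytes,
                            uint64_t remote_addr, uint32_t rkey) {
          struct ibv_sge sge;
          memset(&sge, 0, sizeof(sge));
          sge.addr   = (uintptr_t)local_tensor;       // source tensor (RDMA-registered)
          sge.length = (uint32_t)bytes;
          sge.lkey   = local_mr->lkey;

          struct ibv_send_wr wr, *bad_wr = NULL;
          memset(&wr, 0, sizeof(wr));
          wr.opcode              = IBV_WR_RDMA_WRITE; // one-sided: no receiver CPU involvement
          wr.sg_list             = &sge;
          wr.num_sge             = 1;
          wr.send_flags          = IBV_SEND_SIGNALED; // completion event for the sender only
          wr.wr.rdma.remote_addr = remote_addr;       // pre-allocated destination tensor
          wr.wr.rdma.rkey        = rkey;

          return ibv_post_send(qp, &wr, &bad_wr);     // 0 on success
      }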

  • Optimized Communication Mechanism

    • Statically placed tensors are transferred through one-sided RDMA write
      • Phase I: graph analysis
      • Phase II: graph execution

    [Figure: RDMA-based zero-copy communication. In the graph-analysis phase, the Tensor Manager on Server0 detects the source tensor's placement and re-allocates it as RDMA memory, while the Tensor Manager on Server1 pre-allocates an RDMA-compatible receive tensor. In the execution phase, the RDMA library copies the source tensor directly into the destination tensor with a one-sided RDMA write; the receiver detects arrival by polling a flag byte that flips from 0 to 1 (sketched below).]
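    A minimal receiver-side sketch of the flag-byte polling shown in the figure above (an assumption about the details, not Wolong's actual code). The destination tensor is pre-allocated as RDMA-registered memory with a trailing flag byte; the sender's one-sided write covers the payload followed by the flag, so the receiver only spins on that byte.

      #include <stdint.h>

      // Blocks until the incoming tensor has fully arrived, then re-arms the flag.
      // Assumes the final byte of the RDMA write is placed last, a property this
      // polling pattern commonly relies on.
      static void wait_for_tensor(volatile uint8_t *flag_byte) {
          while (*flag_byte == 0) {
              /* spin: the one-sided RDMA write sets the flag to 1 after the payload */
          }
          *flag_byte = 0;   // reset so the buffer can be reused in the next iteration
      }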

  • Global Optimizer: Performance Evaluation

    • Improve training throughput, convergence speed and scalability

    [Chart: training throughput (mini-batches/sec) on deep learning benchmarks (AlexNet, Inception, FC, LSTM, GRU, VGG16), TensorFlow (gRPC) vs. Wolong; Wolong's speedups across the six benchmarks are 4.2x, 2.6x, 4.0x, 2.3x, 1.8x, and 8.1x.]

    [Chart: convergence of Seq2Seq translation, perplexity vs. run time (s); Wolong (RDMA) converges 2-3x faster than TensorFlow (gRPC).]

    More details in our paper: "RPC Considered Harmful: Fast Distributed Deep Learning on RDMA". Experiments were conducted on 8 servers with 8 Nvidia GTX 1080 GPUs; the translation model uses the WMT'15 dataset.

  • Local Optimizer: Kernel Fusion for Deep Learning on GPU

  • Motivation

    • Deep learning frameworks model computation as a graph of primitive operators
      • Expressivity to represent arbitrary neural network structures
      • Flexibility to run across multiple devices and servers through graph partitioning

    • Significant framework overhead to schedule thousands of operators
      • Kernel-launch overhead
      • Cross-operator communication overhead
      • Too fine-grained to leverage vendor libraries

    • Example: an 80-step LSTM model contains 1686 operators in TensorFlow

    [Chart: runtime (ms) of an LSTM 512x512 (80 steps) for TensorFlow, cuDNN, and a fused kernel.]

  • DL Frameworks vs. Vendor Provided Library

    • Deep learning frameworks (e.g., TensorFlow, PyTorch, CNTK)
      • Embrace flexibility and expressivity
      • Suffer from performance inefficiency

    • Hardware-specific libraries (e.g., cuDNN, cuBLAS, MKL)
      • Designed for extreme efficiency
      • Impossible to handle customized or new network structures

    • DL framework + compiler
      • Generates library-like code at runtime
      • Wins both worlds on the flexibility-efficiency spectrum

  • Wolong Compiler Design

    • Computation-graph-level optimization
      • Graph rewriting based on computational equivalence (a small constant-folding sketch follows the diagram below)
      • Common subexpression elimination, constant folding, etc.
      • Operator batching: automatically batch operators of the same type to better leverage batch efficiency

    • Target- and application-specific runtime compilation
      • Static shape and type inference
      • Static memory planning
      • Aggressive kernel fusion

    [Diagram: the computational graph IR (e.g., y = x * w + b) passes through the graph-level optimizer and the target-specific JIT compiler before being handed to the execution runtime.]
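    A minimal sketch of one equivalence-based rewrite, constant folding, over a toy operator graph (an illustration only; this is not Wolong's IR or API).

      // Fold subgraphs whose inputs are all constants into a single Const node.
      #include <memory>
      #include <vector>

      struct Node {
          enum Kind { Const, Add, Mul } kind;
          float value = 0.0f;                          // valid when kind == Const
          std::vector<std::shared_ptr<Node>> inputs;   // two inputs for Add/Mul
      };

      std::shared_ptr<Node> fold_constants(const std::shared_ptr<Node>& n) {
          if (n->kind == Node::Const) return n;
          bool all_const = true;
          std::vector<std::shared_ptr<Node>> folded;
          for (const auto& in : n->inputs) {
              auto f = fold_constants(in);             // rewrite inputs bottom-up
              all_const = all_const && (f->kind == Node::Const);
              folded.push_back(f);
          }
          n->inputs = folded;
          if (!all_const) return n;                    // cannot fold this node
          auto c = std::make_shared<Node>();
          c->kind  = Node::Const;
          c->value = (n->kind == Node::Add) ? folded[0]->value + folded[1]->value
                                            : folded[0]->value * folded[1]->value;
          return c;                                    // replaces the folded subgraph
      }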

  • Wolong Compiler Execution Workflow

    [Workflow diagram: runtime optimization before graph execution. The input graph (MatMul, BiasAdd, Sigmoid, Relu, Mul, LSTM operators) is scanned to detect optimizable subgraphs; operator batching is applied; static shape inference and memory planning drive code generation; the result is JIT compiled into a FusedKernel, and the graph is rewritten to call it. Later iterations hit the compilation cache.]

  • Graph Level Optimization: Operator Batching

    • Automatically conducts GEMM fusion and static memory-placement optimization (see the sketch after the figure)

    [Figure: two MatMuls sharing the weight M2[4,1], M0[2,4] × M2 = O0[2,1] and M1[2,4] × M2 = O1[2,1], are batched into a single MatMul: Concat(M0, M1)[4,4] × M2[4,1] = Concat(O0, O1)[4,1].]
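    A minimal CUDA sketch of the batching above (an illustration, not Wolong's generated code): instead of two separate MatMul launches, the concatenated input is multiplied by the shared weight in a single kernel. Shapes follow the slide.

      __global__ void batched_matmul(const float *m0m1,    // Concat(M0, M1): [4, 4]
                                     const float *m2,      // shared weight M2: [4, 1]
                                     float *o0o1) {        // Concat(O0, O1): [4, 1]
          int row = blockIdx.x * blockDim.x + threadIdx.x; // one thread per output row
          if (row < 4) {
              float acc = 0.0f;
              for (int k = 0; k < 4; ++k)
                  acc += m0m1[row * 4 + k] * m2[k];        // naive dot product
              o0o1[row] = acc;                             // rows 0-1 = O0, rows 2-3 = O1
          }
      }
      // Example launch: batched_matmul<<<1, 4>>>(d_concat_in, d_m2, d_concat_out);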

  • JIT Compilation: Kernel Fusion

    • Leverage aggressive kernel fusion to completely remove scheduling overhead
    • Element-wise (i.e., point-wise) operators
      • No cross-element dependency between operators
      • Better leverage cache and register locality

    [Figure: fusing Add and Sigmoid for h = sigmoid(x1 + x2); each thread (0, 1, 2, 3, ...) processes one element of x1 and x2 end to end.]

      __global__ void kernel_0(float *x1, float *x2, float *h) {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;
          if (idx < 1024) {
              float temp0 = x1[idx] + x2[idx];
              float temp1 = sigmoidf(temp0);
              h[idx] = temp1;
          }
      }
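    A hypothetical host-side launch of the fused kernel above (the slide does not show one); 1024 elements with 256 threads per block is an assumed configuration.

      // Assumes d_x1, d_x2, d_h are device buffers of 1024 floats allocated with cudaMalloc.
      void launch_fused(float *d_x1, float *d_x2, float *d_h) {
          const int n = 1024, threads = 256;
          kernel_0<<<(n + threads - 1) / threads, threads>>>(d_x1, d_x2, d_h);
          cudaDeviceSynchronize();   // one launch replaces separate Add and Sigmoid launches
      }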

  • JIT Compilation: Kernel Fusion

    • Fuse arbitrary (non-element-wise) operators into a single kernel
      • Operator data dependencies may introduce cross-thread data dependencies inside the kernel
      • Needs global synchronization to guarantee correctness
      • Cross-operator communication uses device memory

    • E.g., fuse two matrix multiplications: Z = A × B × C

    [Figure: dataflow graph of Z = (A × B) × C with two MatMul operators.]

      // Pseudocode from the slide: the intermediate A*B lives in a device-memory buffer,
      // and a global (grid-wide) synchronization separates the two fused MatMuls.
      __global__ void kernel_0(float *A, float *B, float *C, float *Z) {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;
          if (idx < 1024) {
              buffer[idx] = MatMul_f(A, B);   // each thread produces one element of A*B
          }
          Global_Sync();                       // all of A*B must be written before it is read
          if (idx < 1024) {
              Z[idx] = MatMul_f(buffer, C);    // the second MatMul consumes the intermediate
          }
      }

  • Graph Computation in DL Frameworks

    • Operators (kernels) are scheduled (launched) one by one

    [Diagram: the CPU launches the MatMul and Conv kernels one at a time over PCIe; each kernel's thread blocks (block0, block1, ...) are then scheduled across the GPU's SMs, incurring kernel-launch, DAG-scheduling, and memory-copy overhead between operators.]

  • Arbitrary Kernel Fusion Is Limited by GPU Architecture

    • Hard to conduct global synchronization across all threads

    [Diagram: when MatMul and Conv are fused into a single kernel with a barrier between them, the fused kernel may have more thread blocks (block0-block10) than the GPU's SMs can hold concurrently; the resident blocks never complete because they wait at the barrier for blocks that are never scheduled.]

  • Our Solution: Persistent Threads and Virtual Blocks

    • Assign virtual-block tasks to persistent threads (see the sketch after the diagram)

    [Diagram: the fused kernel is launched with only as many persistent blocks as the GPU can run concurrently (block0-block5); each persistent block iterates over the virtual blocks of MatMul (vblock0-vblock10), all blocks pass the barrier together, and then iterate over the virtual blocks of Conv (vblock0-vblock8).]
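    A minimal CUDA sketch of the persistent-thread / virtual-block pattern (an assumption about the details, not Wolong's generated code). Cooperative groups' grid-wide sync stands in for the barrier; the talk does not say how Wolong implements it.

      #include <cooperative_groups.h>
      namespace cg = cooperative_groups;

      __device__ void matmul_vblock(int vb) { /* work of one MatMul virtual block (omitted) */ }
      __device__ void conv_vblock(int vb)   { /* work of one Conv virtual block (omitted) */ }

      __global__ void fused_persistent(int matmul_vblocks, int conv_vblocks) {
          cg::grid_group grid = cg::this_grid();
          // Each persistent physical block strides over MatMul's virtual blocks.
          for (int vb = blockIdx.x; vb < matmul_vblocks; vb += gridDim.x)
              matmul_vblock(vb);
          grid.sync();   // grid-wide barrier: all of MatMul finishes before Conv starts
          // The same persistent blocks then process Conv's virtual blocks.
          for (int vb = blockIdx.x; vb < conv_vblocks; vb += gridDim.x)
              conv_vblock(vb);
      }
      // Launch cooperatively (e.g., with cudaLaunchCooperativeKernel) using no more blocks
      // than can be resident at once, so grid.sync() cannot deadlock.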

  • Kernel Packing

    • Explore graph-level parallelism in static code generation (see the sketch after the diagram)

    [Diagram: a small σ operator or a single MatMul ([64, 512] × [512, 128] → [64, 128]) occupies only a few virtual blocks and under-utilizes the GPU's SMs on its own; kernel packing places independent operators' virtual blocks into one kernel so they execute concurrently on the physical blocks (block0-block5).]
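    A minimal CUDA sketch of kernel packing (an illustration, not Wolong's generated code): two independent operators from the graph, a naive MatMul and an element-wise sigmoid, are packed into one kernel and mapped to disjoint ranges of physical blocks.

      __global__ void packed_kernel(const float *A, const float *B, float *AB,  // MatMul task
                                    const float *x, float *y, int n) {          // sigmoid task
          const int matmul_blocks = 4;   // assumed split of the launched blocks between tasks
          if (blockIdx.x < matmul_blocks) {
              // Blocks [0, matmul_blocks): AB = A (64x512) * B (512x128), one element per step.
              for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < 64 * 128;
                   idx += matmul_blocks * blockDim.x) {
                  int row = idx / 128, col = idx % 128;
                  float acc = 0.0f;
                  for (int k = 0; k < 512; ++k) acc += A[row * 512 + k] * B[k * 128 + col];
                  AB[idx] = acc;
              }
          } else {
              // Remaining blocks: y = sigmoid(x), element-wise.
              for (int idx = (blockIdx.x - matmul_blocks) * blockDim.x + threadIdx.x; idx < n;
                   idx += (gridDim.x - matmul_blocks) * blockDim.x) {
                  y[idx] = 1.f / (1.f + expf(-x[idx]));
              }
          }
      }
      // Example launch: packed_kernel<<<6, 256>>>(dA, dB, dAB, dx, dy, 1024);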

  • Code Generation

    [Figure: example graph C = Relu(MatMul(σ(A), B)). The code generator stitches per-operator device functions (MatMul, Add, Mul, Sub, Div, Relu, Sigmoid, Tanh, Split, Max, Min, Convolution, etc.) into one generated kernel.]

    Operator device functions:

      // device function for sigmoid operator
      __device__ float sigmoidf(float in) {
          return 1.f / (1.f + expf(-in));
      }

      // device function for relu operator
      __device__ float reluf(float in) {
          return fmaxf(0.f, in);
      }

      // device function for MatMul operator: each thread computes one output element c[tix, tiy]
      __device__ void MatMul(float *a, float *b, float *c, int m, int n, int k) {
          if (tix < m && tiy < k) {
              float temp = 0.f;
              for (int i = 0; i < n; ++i) {
                  temp += a[tix * n + i] * b[k * i + tiy];
              }
              c[k * tix + tiy] = temp;
          }
      }

    Generated kernel (buffer0 and buffer1 are intermediates placed in device memory by static memory planning; bx, by, ex, ey, offx, offy are virtual-block bounds and strides):

      // kernel code generated by Wolong compiler
      __global__ void kernel_0(float *A, float *B, float *C) {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;
          if (idx < 1024) {
              float temp0 = sigmoidf(A[idx]);
              buffer0[idx] = temp0;                    // σ(A)
          }
          GlobalSync();                                // barrier before the MatMul reads buffer0
          for (int tix = bx; tix < ex; tix += offx) {
              for (int tiy = by; tiy < ey; tiy += offy) {
                  MatMul(buffer0, B, buffer1, 1024, 1024, 128);
              }
          }
          if (idx < 1024) {
              float temp2 = reluf(buffer1[idx]);       // Relu over the MatMul output
              C[idx] = temp2;
          }
      }
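    The slides do not show how the generated source is compiled at runtime; a common approach, sketched below as an assumption, is to JIT-compile it with NVRTC and launch through the CUDA driver API, caching the module so later iterations skip compilation.

      #include <cuda.h>
      #include <nvrtc.h>
      #include <vector>

      // kernel_src is the generated source; the kernel is assumed to be declared
      // extern "C" so it can be looked up by the unmangled name "kernel_0".
      // Assumes a CUDA context already exists; error handling omitted.
      void jit_compile_and_launch(const char *kernel_src, void **kernel_args,
                                  unsigned grid, unsigned block) {
          nvrtcProgram prog;
          nvrtcCreateProgram(&prog, kernel_src, "kernel_0.cu", 0, nullptr, nullptr);
          const char *opts[] = {"--gpu-architecture=compute_61"};   // assumed target
          nvrtcCompileProgram(prog, 1, opts);
          size_t ptx_size;
          nvrtcGetPTXSize(prog, &ptx_size);
          std::vector<char> ptx(ptx_size);
          nvrtcGetPTX(prog, ptx.data());
          nvrtcDestroyProgram(&prog);

          CUmodule module;                 // in practice cached and reused across iterations
          CUfunction func;
          cuModuleLoadData(&module, ptx.data());
          cuModuleGetFunction(&func, module, "kernel_0");
          cuLaunchKernel(func, grid, 1, 1, block, 1, 1,
                         0 /* shared memory */, 0 /* default stream */,
                         kernel_args, nullptr);
      }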

  • Performance of End-to-end Kernel Fusion

    • RNN inference benchmark (LSTM, 128 units, 80 steps)

    • Experiments are conducted on Nvidia GTX 1080 Ti GPUs

    [Chart: LSTM benchmark, average runtime (ms) for TensorFlow, XLA, Wolong with operator batching (OpBatch), and Wolong with full kernel fusion (all 1686 operators fused into a single kernel); the chart annotates a 10.9x speedup.]

  • Conclusion

    • A compiler infrastructure is critical for both cloud and edge AI
      • Optimize for fast distributed training in the cloud
      • Optimize for efficient inference on accelerator devices

    • System innovations bridge applications and diverse hardware
      • Common intermediate representation (IR)
      • Co-design of software and hardware for extreme efficiency

    • The Wolong prototype has demonstrated initial improvements
      • Up to 8x speedup on training workloads
      • Up to 10x speedup on an inference benchmark

  • Systems | Fueling future disruptions

    Thank You!

  • Distributed Graph Optimizer of Wolong

    • Dynamically allocated tensors are transferred through RDMA write/read
      • Phase I: graph analysis
      • Phase II: graph execution
    • Supports GPUDirect RDMA as well

    [Figure: for a dynamically allocated tensor, the sender first pushes the tensor metadata with a one-sided RDMA write (flag byte 0 → 1); the receiver then allocates the destination tensor and pulls the payload from the source tensor with a one-sided RDMA read.]
