TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu,
Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy
Beginning of Story
(a) Blocked convolution program with multiple thread contexts

// Pseudo-code for a convolution program for the VDLA accelerator
// Virtual Thread 0
0x00: LOAD(PARAM[ 0-71])   // LD@TID0
0x01: LOAD(ACTIV[ 0-24])   // LD@TID0
0x02: LOAD(LDBUF[ 0-31])   // LD@TID0
0x03: PUSH(LD->EX)         // LD@TID0
0x04: POP (LD->EX)         // EX@TID0
0x05: EXE (ACTIV[ 0-24], PARAM[ 0-71], LDBUF[ 0-31], STBUF[ 0- 7]) // EX@TID0
0x06: PUSH(EX->LD)         // EX@TID0
0x07: PUSH(EX->ST)         // EX@TID0
0x08: POP (EX->ST)         // ST@TID0
0x09: STOR(STBUF[ 0- 7])   // ST@TID0
0x0A: PUSH(ST->EX)         // ST@TID0
// Virtual Thread 1
0x0B: LOAD(ACTIV[25-50])   // LD@TID1
0x0C: LOAD(LDBUF[32-63])   // LD@TID1
0x0D: PUSH(LD->EX)         // LD@TID1
0x0E: POP (LD->EX)         // EX@TID1
0x0F: EXE (ACTIV[25-50], PARAM[ 0-71], LDBUF[32-63], STBUF[32-39]) // EX@TID1
0x10: PUSH(EX->LD)         // EX@TID1
0x11: PUSH(EX->ST)         // EX@TID1
0x12: POP (EX->ST)         // ST@TID1
0x13: STOR(STBUF[32-39])   // ST@TID1
0x14: PUSH(ST->EX)         // ST@TID1
// Virtual Thread 2
0x15: POP (EX->LD)         // LD@TID2
0x16: LOAD(PARAM[ 0-71])   // LD@TID2
0x17: LOAD(ACTIV[ 0-24])   // LD@TID2
0x18: LOAD(LDBUF[ 0-31])   // LD@TID2
0x19: PUSH(LD->EX)         // LD@TID2
0x1A: POP (LD->EX)         // EX@TID2
0x1B: POP (ST->EX)         // EX@TID2
0x1C: EXE (ACTIV[ 0-24], PARAM[ 0-71], LDBUF[ 0-31], STBUF[ 0- 7]) // EX@TID2
0x1D: PUSH(EX->ST)         // EX@TID2
0x1E: POP (EX->ST)         // ST@TID2
0x1F: STOR(STBUF[ 0- 7])   // ST@TID2
// Virtual Thread 3
0x20: POP (EX->LD)         // LD@TID3
0x21: LOAD(ACTIV[25-50])   // LD@TID3
0x22: LOAD(LDBUF[32-63])   // LD@TID3
0x23: PUSH(LD->EX)         // LD@TID3
0x24: POP (LD->EX)         // EX@TID3
0x25: POP (ST->EX)         // EX@TID3
0x26: EXE (ACTIV[25-50], PARAM[ 0-71], LDBUF[32-63], STBUF[32-39]) // EX@TID3
0x27: PUSH(EX->ST)         // EX@TID3
0x28: POP (EX->ST)         // ST@TID3
0x29: STOR(STBUF[32-39])   // ST@TID3
(b) Convolution micro-coded program

// Convolution access pattern dictated by the micro-coded program.
// Each register index is derived as a 2-D affine function,
// e.g. idx_rf^0 = a_rf*y + b_rf*x + c_rf^0, where c_rf^0 is specified
// by micro-op 0 fields.
for y in [0…i)
    for x in [0…j)
        rf[idx_rf^0] += GEVM(act[idx_act^0], par[idx_par^0])
        rf[idx_rf^1] += GEVM(act[idx_act^1], par[idx_par^1])
        …
        rf[idx_rf^n] += GEVM(act[idx_act^n], par[idx_par^n])
(c) Max pool, batch norm and activation micro-coded program

// Max-pool, batch normalization and activation function
// access pattern dictated by the micro-coded program.
// Each register index is derived as a 2-D affine function,
// e.g. idx_dst^0 = a_dst*y + b_dst*x + c_dst^0, where c_dst^0 is
// specified by micro-op 0 fields.
for y in [0…i)
    for x in [0…j)
        // max pooling
        rf[idx_dst^0] = MAX(rf[idx_dst^0], rf[idx_src^0])
        rf[idx_dst^1] = MAX(rf[idx_dst^1], rf[idx_src^1])
        …
        // batch norm
        rf[idx_dst^m]   = MUL(rf[idx_dst^m],   rf[idx_src^m])
        rf[idx_dst^m+1] = ADD(rf[idx_dst^m+1], rf[idx_src^m+1])
        rf[idx_dst^m+2] = MUL(rf[idx_dst^m+2], rf[idx_src^m+2])
        rf[idx_dst^m+3] = ADD(rf[idx_dst^m+3], rf[idx_src^m+3])
        …
        // activation
        rf[idx_dst^n-1] = RELU(rf[idx_dst^n-1], rf[idx_src^n-1])
        rf[idx_dst^n]   = RELU(rf[idx_dst^n],   rf[idx_src^n])
Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks
Explosion of hardware backends
Huge gap between models/frameworks and hardware backends
Existing Approach

Frameworks produce a high-level data flow graph.
The graph is broken into primitive tensor operators such as Conv2D.
Each operator is offloaded to a heavily optimized DNN operator library, e.g. cuDNN, for the target hardware.
Existing Approach: Engineer Optimized Tensor Operators

Matmul: Operator Specification

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))
Vanilla Code

for y in range(1024):
    for x in range(1024):
        C[y][x] = 0
        for k in range(1024):
            C[y][x] += A[k][y] * B[k][x]
Loop Tiling for Locality

for yo in range(128):
    for xo in range(128):
        C[yo*8:yo*8+8][xo*8:xo*8+8] = 0
        for ko in range(128):
            for yi in range(8):
                for xi in range(8):
                    for ki in range(8):
                        C[yo*8+yi][xo*8+xi] += A[ko*8+ki][yo*8+yi] * B[ko*8+ki][xo*8+xi]
Map to Accelerators

inp_buffer AL[8][8], BL[8][8]
acc_buffer CL[8][8]
for yo in range(128):
    for xo in range(128):
        vdla.fill_zero(CL)
        for ko in range(128):
            vdla.dma_copy2d(AL, A[ko*8:ko*8+8][yo*8:yo*8+8])
            vdla.dma_copy2d(BL, B[ko*8:ko*8+8][xo*8:xo*8+8])
            vdla.fused_gemm8x8_add(CL, AL, BL)
        vdla.dma_copy2d(C[yo*8:yo*8+8,xo*8:xo*8+8], CL)
Human exploration of optimized code
Limitations of Existing Approach

New operators introduced by operator-fusion optimizations have no implementation in libraries such as cuDNN (potential benefit: 1.5x speedup).
Engineering intensive: every new operator must be hand-optimized for every backend.
Learning-based Learning System

High-level data flow graph and optimizations
Hardware-aware Search Space of Optimized Tensor Programs
Machine Learning based Program Optimizer
Directly generates optimized programs for new operator workloads and hardware.
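To make the stack concrete, here is a minimal sketch of the user-facing flow with the NNVM/TVM APIs of this era; `mx_model`, `input_data`, the input shape, and the output shape are placeholder assumptions, and details vary by frontend and model.

```python
import nnvm.frontend
import nnvm.compiler
import tvm
from tvm.contrib import graph_runtime

sym, params = nnvm.frontend.from_mxnet(mx_model)   # import framework graph
shape_dict = {"data": (1, 3, 224, 224)}            # assumed input shape

# High-level graph optimizations + optimized tensor-operator generation.
graph, lib, params = nnvm.compiler.build(
    sym, target="cuda", shape=shape_dict, params=params)

# Deploy with the lightweight graph runtime.
module = graph_runtime.create(graph, lib, tvm.gpu(0))
module.set_input(**params)
module.set_input("data", input_data)
module.run()
out = module.get_output(0, tvm.nd.empty((1, 1000)))  # assumed output shape
```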
Hardware-aware Search Space

Tensor Expression Language (Specification)

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Defines a search space of hardware-aware mappings from expression to hardware program, based on Halide's compute/schedule separation.
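As a concrete illustration of compute/schedule separation, the sketch below (classic `tvm` API from the period of this talk) declares the matmul from the slide and applies one candidate schedule that reproduces the 8x8x8 tiling shown earlier; the names and tile sizes are illustrative, one point in the search space rather than a tuned result.

```python
import tvm

m = n = K = 1024
A = tvm.placeholder((K, m), name="A")
B = tvm.placeholder((K, n), name="B")
k = tvm.reduce_axis((0, K), name="k")
C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k), name="C")

# The 8x8x8 loop tiling shown earlier, expressed as schedule primitives
# instead of hand-written loops.
s = tvm.create_schedule(C.op)
yo, xo, yi, xi = s[C].tile(C.op.axis[0], C.op.axis[1], 8, 8)
ko, ki = s[C].split(s[C].op.reduce_axis[0], factor=8)
s[C].reorder(yo, xo, ko, yi, xi, ki)

print(tvm.lower(s, [A, B, C], simple_mode=True))  # inspect the loop nest
f = tvm.build(s, [A, B, C], target="llvm")        # generate CPU code
```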
Hardware-aware Search Space: CPUs

Compute Primitives: scalar, vector
Memory Subsystem: L3, L2, L1D/L1I (implicitly managed)
Loop Transformations, Cache Locality, Vectorization
Reuse primitives from prior work: Halide, Loopy
Challenge to Support Diverse Hardware Backends

CPUs, GPUs, TPU-like specialized accelerators
Hardware-aware Search Space: GPUs

Compute Primitives: scalar, vector
Memory Subsystem: L2; per-SM register files and TX/L1 (mixed management)
Shared memory among compute cores
New primitives needed: Thread Cooperation, Use of Shared Memory
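A hedged sketch of how these GPU-specific choices appear as schedule primitives: `cache_read(..., "shared", ...)` stages tiles in shared memory for the thread block to reuse, and `bind` maps loops onto the CUDA thread hierarchy. Tile sizes and attachment points here are illustrative, not a tuned schedule.

```python
import tvm

n = 1024
A = tvm.placeholder((n, n), name="A")
B = tvm.placeholder((n, n), name="B")
k = tvm.reduce_axis((0, n), name="k")
C = tvm.compute((n, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k), name="C")

s = tvm.create_schedule(C.op)
AS = s.cache_read(A, "shared", [C])   # stage A tiles in shared memory
BS = s.cache_read(B, "shared", [C])   # stage B tiles in shared memory
CL = s.cache_write(C, "local")        # per-thread accumulator in registers

# Map the output tiles onto the CUDA thread hierarchy.
yo, xo, yi, xi = s[C].tile(C.op.axis[0], C.op.axis[1], 16, 16)
s[C].bind(yo, tvm.thread_axis("blockIdx.y"))
s[C].bind(xo, tvm.thread_axis("blockIdx.x"))
s[C].bind(yi, tvm.thread_axis("threadIdx.y"))
s[C].bind(xi, tvm.thread_axis("threadIdx.x"))

# Accumulate the reduction in registers, one output element per thread.
s[CL].compute_at(s[C], xi)
ko, ki = s[CL].split(s[CL].op.reduce_axis[0], factor=16)

# Shared tiles are loaded once per ko iteration and reused by the whole
# block (a tuned schedule would also bind the fetch loops to threadIdx
# so the load itself is cooperative).
s[AS].compute_at(s[CL], ko)
s[BS].compute_at(s[CL], ko)

f = tvm.build(s, [A, B, C], target="cuda")
```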
Hardware-aware Search Space: TPU-like Specialized Accelerators

Compute Primitives: tensor
Memory Subsystem: Unified Buffer, Acc FIFO (explicitly managed)
Tensorization Challenge

Compute primitives: scalar, vector, tensor

Hardware designer: declare the tensor instruction interface with Tensor Expression.

Declare behavior:

w, x = t.placeholder((8, 8)), t.placeholder((8, 8))
k = t.reduce_axis((0, 8))
y = t.compute((8, 8), lambda i, j: t.sum(w[i, k] * x[j, k], axis=k))

Lowering rule to generate the hardware intrinsics that carry out the computation:

def gemm_intrin_lower(inputs, outputs):
    ww_ptr = inputs[0].access_ptr("r")
    xx_ptr = inputs[1].access_ptr("r")
    zz_ptr = outputs[0].access_ptr("w")
    compute = t.hardware_intrin("gemm8x8", ww_ptr, xx_ptr, zz_ptr)
    reset = t.hardware_intrin("fill_zero", zz_ptr)
    update = t.hardware_intrin("fuse_gemm8x8_add", ww_ptr, xx_ptr, zz_ptr)
    return compute, reset, update

gemm8x8 = t.decl_tensor_intrin(y.op, gemm_intrin_lower)
Tensorize: transform the program to use tensor instructions (scalar → tensor).
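A sketch of how the declared `gemm8x8` intrinsic might be applied, following the slide's `t` alias for the tvm module: tile a larger matmul down to the 8x8x8 shape and tensorize the inner tile. Sizes and names are illustrative; `tensorize` is the actual TVM schedule primitive that swaps the matched loop nest for the intrinsic.

```python
W = t.placeholder((1024, 1024), name="W")
X = t.placeholder((1024, 1024), name="X")
kk = t.reduce_axis((0, 1024), name="kk")
C = t.compute((1024, 1024), lambda i, j: t.sum(W[i, kk] * X[j, kk], axis=kk))

s = t.create_schedule(C.op)
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], 8, 8)  # 8x8 output tile
ko, ki = s[C].split(s[C].op.reduce_axis[0], factor=8)         # 8-wide reduction
s[C].reorder(io, jo, ko, ii, ji, ki)
s[C].tensorize(ii, gemm8x8)  # inner 8x8x8 tile now runs on the tensor unit
```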
Software Support for Latency Hiding

Single Module, No Task-Pipelining: load inputs, load weights, matrix multiplication, and store outputs run back-to-back, so memory transfers never overlap with compute.

Multiple-Module Task-Level Pipelining: loads, matrix multiplications, and stores from different iterations run concurrently on separate modules.

Explicit dependency tracking managed by software to hide memory latency.
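In TVM's schedule language this shows up as virtual threading: the schedule declares multiple thread contexts, and the compiler lowers them into the explicit PUSH/POP dependence tokens shown in the micro-code listing at the start of this talk. A minimal sketch following the VTA schedule convention (the `cthread` axis); the stage `C` here is a trivial illustrative compute, not an accelerator workload.

```python
import tvm

n = 1024
A = tvm.placeholder((n,), name="A")
C = tvm.compute((n,), lambda i: A[i] + 1, name="C")  # illustrative stage

s = tvm.create_schedule(C.op)
xo, vt = s[C].split(C.op.axis[0], factor=2)  # carve out two contexts
s[C].bind(vt, tvm.thread_axis("cthread"))    # virtual threads, lowered to
                                             # explicit dependence tokens
```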
Hardware-aware Search Space: Summary

Tensor Expression Language

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Primitives from prior work (Halide, Loopy): Loop Transformations, Thread Bindings, Cache Locality
New primitives for GPUs, enabling TPU-like accelerators: Thread Cooperation, Tensorization, Latency Hiding, …

Billions of possible optimization choices
Learning-based Program Optimizer

Program Optimizer → Code Generator → Runtime Measurements → back to the optimizer.
High experiment cost: each trial takes ~1 second.
Add a cost model to the loop: we need a reliable cost model per hardware target.
Learning-based Program Optimizer

Training data D from runtime measurements → learning a statistical cost model.
Adapts to the hardware type by learning.
Makes predictions at the ~1 ms level.
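In the paper, the learned model is a gradient-boosted tree (or TreeGRU) that ranks candidate schedules, and the space is explored with simulated annealing. Below is a simplified, self-contained sketch of the exploration loop in plain Python; `search_space`, `measure`, and `model` are assumed stand-ins, and random sampling replaces simulated annealing.

```python
import random

def optimize(search_space, measure, model, n_trials=800, batch=32):
    """Explore search_space, measuring only model-ranked candidates."""
    data = []  # (config, measured_cost) pairs gathered so far
    for _ in range(n_trials // batch):
        # Rank a large sample with the cheap learned cost model (~1 ms/query).
        candidates = random.sample(search_space, 1024)
        ranked = sorted(candidates, key=model.predict)
        # Measure only the most promising configs on real hardware (~1 s each).
        for cfg in ranked[:batch]:
            data.append((cfg, measure(cfg)))
        # Retrain the model on all measurements collected so far.
        model.fit(data)
    return min(data, key=lambda pair: pair[1])[0]
```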
Effectiveness of ML based Model

[Figure: relative speedup vs. number of trials for one Conv2D layer of ResNet-18 on Titan X; baseline: cuDNN. Curves: TVM with random search vs. TVM with the ML-based model; the ML-based model reaches higher speedups in far fewer trials.]
End to End Inference Performance (Nvidia Titan X)

[Figure: time (ms) on ResNet-18, MobileNet, LSTM LM, DQN, and DCGAN for Tensorflow-XLA, Tensorflow, and Apache MXNet (all backed by cuDNN), TVM without graph optimizations, and TVM with all optimizations.]

Competitive on standard models.
Bigger gap on less conventional models.
End to End Performance (ARM Cortex-A53)

[Figure: time (ms) on ResNet-18, MobileNet, and DQN for Tensorflow Lite, TVM without graph optimizations, and TVM.]

Tensorflow Lite is specially optimized for embedded (ARM) systems.
End to End Performance (ARM GPU)

[Figure: time (ms) on ResNet-18, MobileNet, and DQN, in float32 and float16, for ARMComputeLib, TVM without graph optimizations, and TVM.]
Supporting New Specialized Accelerators

The same stack (high-level graph optimizations, hardware-aware search space of optimized tensor programs, ML-based program optimizer) that targets LLVM and CUDA also targets VTA: an open, customizable deep learning accelerator for edge FPGAs, data center FPGAs, and ASICs.
TVM/VTA: Full Stack Open Source System

High-level Optimizations
Tensor Program Search Space
ML-based Optimizer
VTA Runtime & JIT Compiler
VTA Hardware/Software Interface (ISA)
VTA MicroArchitecture, VTA Simulator

• JIT-compiles accelerator micro-code
• Supports heterogeneous devices: 10x better than the CPU on the same board
• Moves hardware complexity into software

Compiler, driver, and hardware design: full stack open source.
TVM: Learning-based Learning System

Check it out! The full stack (High-level Optimizations, Tensor Program Search Space, ML-based Optimizer, LLVM, CUDA, VTA) is open source.