TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End...

110
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy 1

Transcript of TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End...

Page 1: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu,

Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

!1

Page 2: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Collaborators

!2

SAMPL Lab: https://sampl.cs.washington.edu

Page 3: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Beginning of Story

!3

Page 4: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Beginning of Story

!3

Acclerator

Page 5: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Beginning of Story

!3

Acclerator

Page 6: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Beginning of Story

// Pseudo-code for convolution program for the VIA accelerator// Virtual Thread 00x00: LOAD(PARAM[ 0-71]) // LD@TID00x01: LOAD(ACTIV[ 0-24]) // LD@TID00x02: LOAD(LDBUF[ 0-31]) // LD@TID00x03: PUSH(LD->EX) // LD@TID00x04: POP (LD->EX) // EX@TID00x05: EXE (ACTIV[ 0-24],PARAM[ 0-71],LDBUF[ 0-31],STBUF[ 0- 7]) // EX@TID00x06: PUSH(EX->LD) // EX@TID00x07: PUSH(EX->ST) // EX@TID00x08: POP (EX->ST) // ST@TID00x09: STOR(STBUF[ 0- 7]) // ST@TID00x0A: PUSH(ST->EX) // ST@TID0// Virtual Thread 10x0B: LOAD(ACTIV[25-50]) // LD@TID10x0C: LOAD(LDBUF[32-63]) // LD@TID10x0D: PUSH(LD->EX) // LD@TID10x0E: POP (LD->EX) // EX@TID10x0F: EXE (ACTIV[25-50],PARAM[ 0-71],LDBUF[32-63],STBUF[32-39]) // EX@TID10x10: PUSH(EX->LD) // EX@TID10x11: PUSH(EX->ST) // EX@TID10x12: POP (EX->ST) // ST@TID10x13: STOR(STBUF[32-39]) // ST@TID10x14: PUSH(ST->EX) // ST@TID1// Virtual Thread 20x15: POP (EX->LD) // LD@TID20x16: LOAD(PARAM[ 0-71]) // LD@TID20x17: LOAD(ACTIV[ 0-24]) // LD@TID20x18: LOAD(LDBUF[ 0-31]) // LD@TID20x19: PUSH(LD->EX) // LD@TID20x1A: POP (LD->EX) // EX@TID20x1B: POP (ST->EX) // EX@TID20x1C: EXE (ACTIV[ 0-24],PARAM[ 0-71],LDBUF[ 0-31],STBUF[ 0- 7]) // EX@TID20x1D: PUSH(EX->ST) // EX@TID20x1E: POP (EX->ST) // ST@TID20x1F: STOR(STBUF[ 0- 7]) // ST@TID2// Virtual Thread 30x20: POP (EX->LD) // LD@TID30x21: LOAD(ACTIV[25-50]) // LD@TID30x22: LOAD(LDBUF[32-63]) // LD@TID30x23: PUSH(LD->EX) // LD@TID30x24: POP (LD->EX) // EX@TID30x25: POP (ST->EX) // EX@TID20x26: EXE (ACTIV[25-50],PARAM[ 0-71],LDBUF[32-63],STBUF[32-39]) // EX@TID30x27: PUSH(EX->ST) // EX@TID30x28: POP (EX->ST) // ST@TID30x29: STOR(STBUF[32-39]) // ST@TID3

// Convolution access pattern dictated by micro-coded program.// Each register index is derived as a 2-D affine function.// e.g. idxrf = arfy+brfx+crf

0, where crf0 is specified by

// micro op 0 fields.for y in [0…i) for x in [0…j) rf[idxrf

0] += GEVM(act[idxact0], par[idxpar

0]) rf[idxrf

1] += GEVM(act[idxact1], par[idxpar

1]) … rf[idxrf

n] += GEVM(act[idxactn], par[idxpar

n])

(b) Convolution micro-coded program

// Max-pool, batch normalization and activation function// access pattern dictated by micro-coded program.// Each register index is derived as a 2D affine function.// e.g. idxdst = adsty+bdstx+cdst

0, where cdst0 is specified by

// micro op 0 fields.for y in [0…i) for x in [0…j) // max pooling rf[idxdst

0] = MAX(rf[idxdst0], rf[idxsrc

0]) rf[idxdst

1] = MAX(rf[idxdst1], rf[idxsrc

1]) … // batch norm rf[idxdst

m] = MUL(rf[idxdstm], rf[idxsrc

m]) rf[idxdst

m+1] = ADD(rf[idxdstm+1], rf[idxsrc

m+1]) rf[idxdst

m+2] = MUL(rf[idxdstm+2], rf[idxsrc

m+2]) rf[idxdst

m+3] = ADD(rf[idxdstm+3], rf[idxsrc

m+3]) … // activation rf[idxdst

n-1] = RELU(rf[idxdstn-1], rf[idxsrc

n-1]) rf[idxdst

n] = RELU(rf[idxdstn], rf[idxsrc

n])

(c) Max pool, batch norm and activationmicro-coded program

(a) Blocked convolution program with multiple thread contexts

!3

Acclerator

Page 7: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Goal: Deploy Deep Learning Everywhere

!4

Frameworks

Page 8: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

!4

Frameworks

Page 9: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends

!4

Frameworks

Page 10: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends

!4

Frameworks

Page 11: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends

!4

Frameworks

Page 12: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends

!4

Frameworks

Page 13: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends

!4

Frameworks

Page 14: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends

!4

Frameworks

Page 15: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends

!4

Frameworks

Page 16: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends

!4

Frameworks

Page 17: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends

!4

Frameworks

Acclerator

Page 18: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends

Huge gap between model/frameworks and hardware backends

!4

Frameworks

Acclerator

Page 19: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Existing Approach

Hardware!5

Frameworks

Page 20: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Existing Approach

High-level data flow graph

Hardware!5

Frameworks

Page 21: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Existing Approach

High-level data flow graph

Hardware

Primitive Tensor operators such as Conv2D

!5

Frameworks

Page 22: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Existing Approach

High-level data flow graph

Hardware

Primitive Tensor operators such as Conv2D

eg. cuDNN Offload to heavily optimized DNN operator library

!5

Frameworks

Page 23: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Existing Approach: Engineer Optimized Tensor Operators

!6!6

Page 24: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Existing Approach: Engineer Optimized Tensor Operators

!6!6

Page 25: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Existing Approach: Engineer Optimized Tensor Operators

!6!6

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Matmul: Operator Specification

Page 26: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Existing Approach: Engineer Optimized Tensor Operators

!6!6

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Matmul: Operator Specification

Page 27: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Existing Approach: Engineer Optimized Tensor Operators

!6!6

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Matmul: Operator Specification

for y in range(1024): for x in range(1024): C[y][x] = 0 for k in range(1024): C[y][x] += A[k][y] * B[k][x]

Vanilla Code

Page 28: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Existing Approach: Engineer Optimized Tensor Operators

!7!7

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Matmul: Operator Specification

for yo in range(128): for xo in range(128): C[yo*8:yo*8+8][xo*8:xo*8+8] = 0 for ko in range(128): for yi in range(8): for xi in range(8): for ki in range(8): C[yo*8+yi][xo*8+xi] += A[ko*8+ki][yo*8+yi] * B[ko*8+ki][xo*8+xi]

Loop Tiling for Locality

Page 29: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Existing Approach: Engineer Optimized Tensor Operators

!8!8

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Matmul: Operator Specification

inp_buffer AL[8][8], BL[8][8]acc_buffer CL[8][8]for yo in range(128): for xo in range(128): vdla.fill_zero(CL) for ko in range(128): vdla.dma_copy2d(AL, A[ko*8:ko*8+8][yo*8:yo*8+8]) vdla.dma_copy2d(BL, B[ko*8:ko*8+8][xo*8:xo*8+8]) vdla.fused_gemm8x8_add(CL, AL, BL) vdla.dma_copy2d(C[yo*8:yo*8+8,xo*8:xo*8+8], CL)

Map to Accelerators

Human exploration of optimized code

Page 30: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Limitations of Existing Approach

cuDNN

!9

Frameworks

Page 31: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Limitations of Existing Approach

cuDNN

!9

Frameworks

Page 32: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Limitations of Existing Approach

cuDNN

!9

Frameworks

Page 33: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Limitations of Existing Approach

cuDNN

!9

Frameworks

Page 34: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Limitations of Existing Approach

cuDNN

!9

Frameworks

Page 35: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Limitations of Existing Approach

cuDNN

!9

Frameworks

New operator introduced by operator fusion optimization potentially benefit: 1.5x speedup

Page 36: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Limitations of Existing Approach

cuDNN

!9

Frameworks

New operator introduced by operator fusion optimization potentially benefit: 1.5x speedup

Page 37: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Limitations of Existing Approach

cuDNN

!9

Frameworks

New operator introduced by operator fusion optimization potentially benefit: 1.5x speedup

Page 38: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Limitations of Existing Approach

cuDNN

!9

Frameworks

New operator introduced by operator fusion optimization potentially benefit: 1.5x speedup

Engineering intensive

Page 39: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Learning System

High-level data flow graph and optimizations

!10

Hardware

Frameworks

Page 40: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Learning System

High-level data flow graph and optimizations

!10

Hardware

Frameworks

Page 41: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware aware Search Space of Optimized Tensor Programs

Learning-based Learning System

High-level data flow graph and optimizations

!10

Hardware

Frameworks

Page 42: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

Learning-based Learning System

High-level data flow graph and optimizations

!10

Hardware

Frameworks

Page 43: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

Learning-based Learning System

High-level data flow graph and optimizations

directly generate optimized program for new operator workloads and hardware

!10

Hardware

Frameworks

Page 44: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Learning System Frameworks

High-level data flow graph and optimizations

Hardware aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

!11

Hardware

Page 45: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Learning System Frameworks

High-level data flow graph and optimizations

Hardware aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

!11

Hardware

Page 46: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

Hardware!12

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language (Specification)

Page 47: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

Hardware!12

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language (Specification)

Define search space of hardware aware mappings from expression to hardware program

Based on Halide’s compute/schedule separation

Page 48: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

L1D L1IL2

L3

L1D L1IL2

implicitly managed

Compute Primitives

scalar vector

!13

Memory Subsystem

Loop Transformations

Cache Locality Vectorization

CPUs

Reuse primitives from prior work: Halide, Loopy

Page 49: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Challenge to Support Diverse Hardware Backends

!14

CPUs GPUs TPU-like specialized Accelerators

Page 50: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search SpaceCompute Primitives

scalar vector

!15

Memory Subsystem

L2

RF RFTX/L1

SM

RF RFTX/L1

SM

mixed

GPUs

Page 51: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search SpaceCompute Primitives

scalar vector

!15

Memory Subsystem

L2

RF RFTX/L1

SM

RF RFTX/L1

SM

mixed

GPUs

Shared memory among compute cores

Page 52: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search SpaceCompute Primitives

scalar vector

!15

Memory Subsystem

L2

RF RFTX/L1

SM

RF RFTX/L1

SM

mixed

Thread Cooperation

Use of Shared Memory

GPUs

Shared memory among compute cores

Page 53: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

!16

TPU-like Specialized Accelerators

tensor

Compute Primitives

Unified Buffer Acc

FIFO

explicitly managed

Memory Subsystem

Page 54: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

!16

TPU-like Specialized Accelerators

tensor

Compute Primitives

Unified Buffer Acc

FIFO

explicitly managed

Memory Subsystem

Page 55: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Tensorization ChallengeCompute primitives

!17

Page 56: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Tensorization ChallengeCompute primitives

scalar

!17

Page 57: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Tensorization ChallengeCompute primitives

scalar vector

!17

Page 58: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Tensorization ChallengeCompute primitives

scalar vector tensor

!17

Page 59: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Tensorization ChallengeCompute primitives

scalar vector tensor

w, x = t.placeholder((8, 8)), t.placeholder((8, 8))k = t.reduce_axis((0, 8))y = t.compute((8, 8), lambda i, j: t.sum(w[i, k] * x[j, k], axis=k))

def gemm_intrin_lower(inputs, outputs): ww_ptr = inputs[0].access_ptr(“r") xx_ptr = inputs[1].access_ptr("r") zz_ptr = outputs[0].access_ptr("w") compute = t.hardware_intrin("gemm8x8", ww_ptr, xx_ptr, zz_ptr) reset = t.hardware_intrin("fill_zero", zz_ptr) update = t.hardware_intrin("fuse_gemm8x8_add", ww_ptr, xx_ptr, zz_ptr) return compute, reset, update

gemm8x8 = t.decl_tensor_intrin(y.op, gemm_intrin_lower)

declare behavior

lowering rule to generatehardware intrinsics to carry out the computation

Hardware designer: declare tensor instruction interface

with Tensor Expression

!17

Page 60: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Tensorization ChallengeCompute primitives

scalar vector tensor

w, x = t.placeholder((8, 8)), t.placeholder((8, 8))k = t.reduce_axis((0, 8))y = t.compute((8, 8), lambda i, j: t.sum(w[i, k] * x[j, k], axis=k))

def gemm_intrin_lower(inputs, outputs): ww_ptr = inputs[0].access_ptr(“r") xx_ptr = inputs[1].access_ptr("r") zz_ptr = outputs[0].access_ptr("w") compute = t.hardware_intrin("gemm8x8", ww_ptr, xx_ptr, zz_ptr) reset = t.hardware_intrin("fill_zero", zz_ptr) update = t.hardware_intrin("fuse_gemm8x8_add", ww_ptr, xx_ptr, zz_ptr) return compute, reset, update

gemm8x8 = t.decl_tensor_intrin(y.op, gemm_intrin_lower)

declare behavior

lowering rule to generatehardware intrinsics to carry out the computation

Hardware designer: declare tensor instruction interface

with Tensor Expression

!17

Tensorize: transform program

to use tensor instructions

scalar tensor

Page 61: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

!18

TPU-like Specialized Accelerators

tensor

Compute Primitives

Unified Buffer Acc

FIFO

explicitly managed

Memory Subsystem

Page 62: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

!18

TPU-like Specialized Accelerators

tensor

Compute Primitives

Unified Buffer Acc

FIFO

explicitly managed

Memory Subsystem

Page 63: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Software Support for Latency Hiding

Single ModuleNo Task-Pipelining

load inputs

load weights

matrix multiplication

store outputs

load inputs

load weights

matrix multiplication

store outputs

load inputs

load weights

matrix multiplication

store outputs

load inputs

load weights

matrix multiplication

store outputs

Page 64: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Software Support for Latency Hiding

Single ModuleNo Task-Pipelining

load inputs

load weights

matrix multiplication

store outputs

load inputs

load weights

matrix multiplication

store outputs

load inputs

load weights

load inputs

load weights

matrix multiplication

matrix multiplication

store outputs

store outputs

Multiple-ModuleTask-Level Pipelining

Page 65: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Software Support for Latency Hiding

Single ModuleNo Task-Pipelining

load inputs

load weights

matrix multiplication

store outputs

load inputs

load weights

matrix multiplication

store outputs

load inputs

load weights

load inputs

load weights

matrix multiplication

matrix multiplication

store outputs

store outputs

Multiple-ModuleTask-Level Pipelining

Explicit dependency tracking managed by software to hide memory latency

Page 66: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

Hardware!21

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language

Page 67: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

Hardware

Loop Transformations

Thread Bindings

Cache Locality

Primitives in prior work:Halide, Loopy

!21

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language

Page 68: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

Hardware

Loop Transformations

Thread Bindings

Cache Locality

Primitives in prior work:Halide, Loopy

Thread Cooperation Tensorization Latency

Hiding …New primitives for GPUs, and enable TPU-like Accelerators

!21

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language

Page 69: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

Hardware

Loop Transformations

Thread Bindings

Cache Locality

Thread Cooperation Tensorization Latency

Hiding

!22

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language

Page 70: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

Hardware

Loop Transformations

Thread Bindings

Cache Locality

Thread Cooperation Tensorization Latency

Hiding

!22

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language

Page 71: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Hardware-aware Search Space

Hardware

Loop Transformations

Thread Bindings

Cache Locality

Thread Cooperation Tensorization Latency

Hiding

Billions of possible optimization choices

!22

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Tensor Expression Language

Page 72: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Learning System Frameworks

High-level data flow graph and optimizations

Hardware aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

!23

Hardware

Page 73: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Learning System Frameworks

High-level data flow graph and optimizations

Hardware aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

!23

Hardware

Page 74: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Program Optimizer

!24

Program Optimizer

Page 75: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Program Optimizer

!24

Program Optimizer Program Code Generator

Page 76: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Program Optimizer

Runtime Measurements

!24

Program Optimizer Program Code Generator

Page 77: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Program Optimizer

Runtime Measurements

High experiment cost, each trial costs ~1second !24

Program Optimizer Program Code Generator

Page 78: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Program Optimizer

!25

Program Optimizer Program Code Generator

Page 79: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Program Optimizer

!25

Program Optimizer Program Code Generator

Cost Model

Page 80: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Program Optimizer

Need reliable cost model per hardware!25

Program Optimizer Program Code Generator

Cost Model

Page 81: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Program Optimizer

!26

Program Optimizer Program Code Generator

Page 82: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Program Optimizer

!26

Program Optimizer Program Code Generator

D<latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit>

Training data

Page 83: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Program Optimizer

!26

Program Optimizer Program Code Generator

D<latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit>

Training data

LearningStatistical Cost Model

Page 84: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Program Optimizer

!26

Program Optimizer Program Code Generator

D<latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit><latexit sha1_base64="1Z6CzjBl0OMVztfQ+m452YDkcY0=">AAAB8nicbVDLSsNAFL2pr1pfVZdugkVwVRIRdFnUhcsK9gFtKJPppB06mQkzN0IJ/Qw3LhRx69e482+ctFlo64GBwzn3MueeMBHcoOd9O6W19Y3NrfJ2ZWd3b/+genjUNirVlLWoEkp3Q2KY4JK1kKNg3UQzEoeCdcLJbe53npg2XMlHnCYsiMlI8ohTglbq9WOCY0pEdjcbVGte3ZvDXSV+QWpQoDmofvWHiqYxk0gFMabnewkGGdHIqWCzSj81LCF0QkasZ6kkMTNBNo88c8+sMnQjpe2T6M7V3xsZiY2ZxqGdzCOaZS8X//N6KUbXQcZlkiKTdPFRlAoXlZvf7w65ZhTF1BJCNbdZXTommlC0LVVsCf7yyaukfVH3vbr/cFlr3BR1lOEETuEcfLiCBtxDE1pAQcEzvMKbg86L8+58LEZLTrFzDH/gfP4AdN2RWg==</latexit>

Training data

LearningStatistical Cost Model

Adapt to hardware type by learning Make prediction in 1ms level

Page 85: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Effectiveness of ML based Model

!27

0 100 200 300 400 500 600 700 8000.00

0.25

0.50

0.75

1.00

1.25

1.50

Page 86: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Effectiveness of ML based Model

!27

0 100 200 300 400 500 600 700 8000.00

0.25

0.50

0.75

1.00

1.25

1.50

One Conv2D Layer of ResNet18 on Titan X

Page 87: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Effectiveness of ML based Model

!27

0 100 200 300 400 500 600 700 8000.00

0.25

0.50

0.75

1.00

1.25

1.50

Number of Trials

One Conv2D Layer of ResNet18 on Titan X

Page 88: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Effectiveness of ML based Model

!27

0 100 200 300 400 500 600 700 8000.00

0.25

0.50

0.75

1.00

1.25

1.50

Number of Trials

One Conv2D Layer of ResNet18 on Titan X

Rela

tive

Spee

dup

Baseline: CuDNN

Page 89: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Effectiveness of ML based Model

!28

0 100 200 300 400 500 600 700 8000.00

0.25

0.50

0.75

1.00

1.25

1.50

0 100 200 300 400 500 600 700 8000.00

0.25

0.50

0.75

1.00

1.25

1.50

Number of Trials

One Conv2D Layer of ResNet18 on Titan X

Rela

tive

Spee

dup

Baseline: CuDNN

TVM: Random Search

Page 90: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Effectiveness of ML based Model

!29

0 100 200 300 400 500 600 700 8000.00

0.25

0.50

0.75

1.00

1.25

1.50

0 100 200 300 400 500 600 700 8000.00

0.25

0.50

0.75

1.00

1.25

1.50

0 100 200 300 400 500 600 700 8000.00

0.25

0.50

0.75

1.00

1.25

1.50

One Conv2D Layer of ResNet18 on Titan X

Rela

tive

Spee

dup

TVM: Random Search

TVM: ML-based Model

Number of Trials

Baseline: CuDNN

Page 91: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Learning-based Learning System Frameworks

High-level data flow graph and optimizations

Hardware aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

!30

Hardware

Page 92: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

End to End Inference Performance (Nvidia Titan X)

ResNet-18 MobileNet0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

Tim

e(m

s)

LSTM LM DQN DCGAN0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Tensorflow-XLA

Apache MXNetTensorflow

Page 93: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

End to End Inference Performance (Nvidia Titan X)

ResNet-18 MobileNet0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

Tim

e(m

s)

LSTM LM DQN DCGAN0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Tensorflow-XLA

Apache MXNetTensorflow

Backed by cuDNN

Page 94: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

End to End Inference Performance (Nvidia Titan X)

Tensorflow-XLA

Apache MXNetTensorflow

ResNet-18 MobileNet0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

Tim

e(m

s)

LSTM LM DQN DCGAN0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

TVM: without graph optimizations

Page 95: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

End to End Inference Performance (Nvidia Titan X)

Tensorflow-XLA

Apache MXNetTensorflow TVM: without graph optimizations

ResNet-18 MobileNet0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

Tim

e(m

s)

LSTM LM DQN DCGAN0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

TVM: all optimizations

Page 96: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

End to End Inference Performance (Nvidia Titan X)

Tensorflow-XLA

Apache MXNetTensorflow TVM: without graph optimizations

ResNet-18 MobileNet0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

Tim

e(m

s)

LSTM LM DQN DCGAN0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

TVM: all optimizations

Competitive on standard models

Page 97: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

End to End Inference Performance (Nvidia Titan X)

Tensorflow-XLA

Apache MXNetTensorflow TVM: without graph optimizations

ResNet-18 MobileNet0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

Tim

e(m

s)

LSTM LM DQN DCGAN0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

TVM: all optimizations

Competitive on standard models

Bigger gap on less conventional models

Page 98: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

End to End Performance(ARM Cortex-A53)

!34ResNet-18 MobileNet

0.0

100.0

200.0

300.0

400.0

500.0

600.0

700.0

800.0

Tim

e(m

s)

DQN0.0

2.0

4.0

6.0

8.0

10.0

12.0

Tensorflow Lite TVM w/o graph opt TVM

Page 99: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

End to End Performance(ARM Cortex-A53)

!34ResNet-18 MobileNet

0.0

100.0

200.0

300.0

400.0

500.0

600.0

700.0

800.0

Tim

e(m

s)

DQN0.0

2.0

4.0

6.0

8.0

10.0

12.0

Tensorflow Lite TVM w/o graph opt TVM

Specially optimized for Embedded system(ARM)

Page 100: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

End to End Performance(ARM GPU)

!35

float32 float16ResNet-18

float32 float16MobileNet

0.0

50.0

100.0

150.0

200.0

250.0

Tim

e(m

s)

float32 float16DQN

0.0

1.0

2.0

3.0

4.0

5.0

ARMComputeLib TVM w/o graph opt TVM

Page 101: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Supporting New Specialized Accelerators Frameworks

High-level data flow graph and optimizations

Hardware aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

LLVM CUDA

!36

Page 102: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

Supporting New Specialized Accelerators Frameworks

High-level data flow graph and optimizations

Hardware aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

LLVM CUDA VTA: Open, Customizable Deep Learning Accelerator

Data Center FPGAEdge FPGA ASIC

!36

Page 103: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

TVM/VTA: Full Stack Open Source System

!37

ML-based Optimizer

Tensor Program Search Space

High-level Optimizations

Page 104: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

TVM/VTA: Full Stack Open Source System

VTA MicroArchitecture

!37

ML-based Optimizer

Tensor Program Search Space

High-level Optimizations

Page 105: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

TVM/VTA: Full Stack Open Source System

VTA Hardware/Software Interface (ISA)

VTA MicroArchitecture

!37

ML-based Optimizer

Tensor Program Search Space

High-level Optimizations

Page 106: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

TVM/VTA: Full Stack Open Source System

VTA Runtime & JIT Compiler

VTA Hardware/Software Interface (ISA)

VTA MicroArchitecture

!37

ML-based Optimizer

Tensor Program Search Space

High-level Optimizations

Page 107: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

TVM/VTA: Full Stack Open Source System

VTA Runtime & JIT Compiler

VTA Hardware/Software Interface (ISA)

VTA MicroArchitecture VTA Simulator

!37

ML-based Optimizer

Tensor Program Search Space

High-level Optimizations

Page 108: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

TVM/VTA: Full Stack Open Source System

• JIT compile accelerator micro code

• Support heterogenous devices, 10x better than CPU on the same board.

• Move hardware complexity to software

VTA Runtime & JIT Compiler

VTA Hardware/Software Interface (ISA)

VTA MicroArchitecture VTA Simulator

!37

ML-based Optimizer

Tensor Program Search Space

High-level Optimizations

Page 109: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

TVM/VTA: Full Stack Open Source System

• JIT compile accelerator micro code

• Support heterogenous devices, 10x better than CPU on the same board.

• Move hardware complexity to software

VTA Runtime & JIT Compiler

VTA Hardware/Software Interface (ISA)

VTA MicroArchitecture VTA Simulator

!37

ML-based Optimizer

Tensor Program Search Space

High-level Optimizations}compiler, driver, hardware design full stack open source

Page 110: TVM: An Automated End-to-End Optimizing Compiler for Deep ... · TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin

TVM: Learning-based Learning System

Check it out!ML-based Optimizer

Tensor Program Search Space

High-level Optimizations

VTALLVM CUDA