
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu,

Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy


Collaborators


SAMPL Lab: https://sampl.cs.washington.edu

Beginning of Story

Accelerator

// Pseudo-code for convolution program for the VIA accelerator
// Virtual Thread 0
0x00: LOAD(PARAM[ 0-71])  // LD@TID0
0x01: LOAD(ACTIV[ 0-24])  // LD@TID0
0x02: LOAD(LDBUF[ 0-31])  // LD@TID0
0x03: PUSH(LD->EX)        // LD@TID0
0x04: POP (LD->EX)        // EX@TID0
0x05: EXE (ACTIV[ 0-24],PARAM[ 0-71],LDBUF[ 0-31],STBUF[ 0- 7]) // EX@TID0
0x06: PUSH(EX->LD)        // EX@TID0
0x07: PUSH(EX->ST)        // EX@TID0
0x08: POP (EX->ST)        // ST@TID0
0x09: STOR(STBUF[ 0- 7])  // ST@TID0
0x0A: PUSH(ST->EX)        // ST@TID0
// Virtual Thread 1
0x0B: LOAD(ACTIV[25-50])  // LD@TID1
0x0C: LOAD(LDBUF[32-63])  // LD@TID1
0x0D: PUSH(LD->EX)        // LD@TID1
0x0E: POP (LD->EX)        // EX@TID1
0x0F: EXE (ACTIV[25-50],PARAM[ 0-71],LDBUF[32-63],STBUF[32-39]) // EX@TID1
0x10: PUSH(EX->LD)        // EX@TID1
0x11: PUSH(EX->ST)        // EX@TID1
0x12: POP (EX->ST)        // ST@TID1
0x13: STOR(STBUF[32-39])  // ST@TID1
0x14: PUSH(ST->EX)        // ST@TID1
// Virtual Thread 2
0x15: POP (EX->LD)        // LD@TID2
0x16: LOAD(PARAM[ 0-71])  // LD@TID2
0x17: LOAD(ACTIV[ 0-24])  // LD@TID2
0x18: LOAD(LDBUF[ 0-31])  // LD@TID2
0x19: PUSH(LD->EX)        // LD@TID2
0x1A: POP (LD->EX)        // EX@TID2
0x1B: POP (ST->EX)        // EX@TID2
0x1C: EXE (ACTIV[ 0-24],PARAM[ 0-71],LDBUF[ 0-31],STBUF[ 0- 7]) // EX@TID2
0x1D: PUSH(EX->ST)        // EX@TID2
0x1E: POP (EX->ST)        // ST@TID2
0x1F: STOR(STBUF[ 0- 7])  // ST@TID2
// Virtual Thread 3
0x20: POP (EX->LD)        // LD@TID3
0x21: LOAD(ACTIV[25-50])  // LD@TID3
0x22: LOAD(LDBUF[32-63])  // LD@TID3
0x23: PUSH(LD->EX)        // LD@TID3
0x24: POP (LD->EX)        // EX@TID3
0x25: POP (ST->EX)        // EX@TID3
0x26: EXE (ACTIV[25-50],PARAM[ 0-71],LDBUF[32-63],STBUF[32-39]) // EX@TID3
0x27: PUSH(EX->ST)        // EX@TID3
0x28: POP (EX->ST)        // ST@TID3
0x29: STOR(STBUF[32-39])  // ST@TID3

(a) Blocked convolution program with multiple thread contexts

// Convolution access pattern dictated by micro-coded program.
// Each register index is derived as a 2-D affine function,
// e.g. idxrf0 = arf*y + brf*x + crf0, where crf0 is specified by
// micro op 0 fields.
for y in [0…i)
  for x in [0…j)
    rf[idxrf0] += GEVM(act[idxact0], par[idxpar0])
    rf[idxrf1] += GEVM(act[idxact1], par[idxpar1])
    …
    rf[idxrfn] += GEVM(act[idxactn], par[idxparn])

(b) Convolution micro-coded program

// Max-pool, batch normalization and activation function
// access pattern dictated by micro-coded program.
// Each register index is derived as a 2-D affine function,
// e.g. idxdst0 = adst*y + bdst*x + cdst0, where cdst0 is specified by
// micro op 0 fields.
for y in [0…i)
  for x in [0…j)
    // max pooling
    rf[idxdst0]   = MAX(rf[idxdst0],   rf[idxsrc0])
    rf[idxdst1]   = MAX(rf[idxdst1],   rf[idxsrc1])
    …
    // batch norm
    rf[idxdstm]   = MUL(rf[idxdstm],   rf[idxsrcm])
    rf[idxdstm+1] = ADD(rf[idxdstm+1], rf[idxsrcm+1])
    rf[idxdstm+2] = MUL(rf[idxdstm+2], rf[idxsrcm+2])
    rf[idxdstm+3] = ADD(rf[idxdstm+3], rf[idxsrcm+3])
    …
    // activation
    rf[idxdstn-1] = RELU(rf[idxdstn-1], rf[idxsrcn-1])
    rf[idxdstn]   = RELU(rf[idxdstn],   rf[idxsrcn])

(c) Max pool, batch norm and activation micro-coded program


Goal: Deploy Deep Learning Everywhere

Explosion of models and frameworks

Explosion of hardware backends

Huge gap between models/frameworks and hardware backends

Existing Approach

Frameworks

High-level data flow graph

Primitive tensor operators such as Conv2D

Offload to heavily optimized DNN operator libraries, e.g. cuDNN

Hardware

Existing Approach: Engineer Optimized Tensor Operators

Matmul: Operator Specification

C = tvm.compute((m, n),
                lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Vanilla Code

for y in range(1024):
    for x in range(1024):
        C[y][x] = 0
        for k in range(1024):
            C[y][x] += A[k][y] * B[k][x]

Loop Tiling for Locality

for yo in range(128):
    for xo in range(128):
        C[yo*8:yo*8+8][xo*8:xo*8+8] = 0
        for ko in range(128):
            for yi in range(8):
                for xi in range(8):
                    for ki in range(8):
                        C[yo*8+yi][xo*8+xi] += A[ko*8+ki][yo*8+yi] * B[ko*8+ki][xo*8+xi]

Map to Accelerators

inp_buffer AL[8][8], BL[8][8]
acc_buffer CL[8][8]
for yo in range(128):
    for xo in range(128):
        vdla.fill_zero(CL)
        for ko in range(128):
            vdla.dma_copy2d(AL, A[ko*8:ko*8+8][yo*8:yo*8+8])
            vdla.dma_copy2d(BL, B[ko*8:ko*8+8][xo*8:xo*8+8])
            vdla.fused_gemm8x8_add(CL, AL, BL)
        vdla.dma_copy2d(C[yo*8:yo*8+8,xo*8:xo*8+8], CL)

Human exploration of optimized code
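In TVM these manual rewrites become schedule primitives applied to the one-line specification. A minimal sketch (not from the slides; the 8x8x8 tile factors are illustrative) of reproducing the hand-tiled loop nest above with the pre-0.7 tvm API:

import tvm

m = n = kdim = 1024
A = tvm.placeholder((kdim, m), name="A")
B = tvm.placeholder((kdim, n), name="B")
k = tvm.reduce_axis((0, kdim), name="k")
C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k), name="C")

s = tvm.create_schedule(C.op)
y, x = C.op.axis
yo, xo, yi, xi = s[C].tile(y, x, 8, 8)              # block the output for locality
ko, ki = s[C].split(C.op.reduce_axis[0], factor=8)  # block the reduction
s[C].reorder(yo, xo, ko, yi, xi, ki)                # match the hand-tiled loop order

print(tvm.lower(s, [A, B, C], simple_mode=True))    # inspect the generated loop nest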

Limitations of Existing Approach

cuDNN

Frameworks

New operators introduced by the operator fusion optimization; potential benefit: 1.5x speedup

Engineering intensive

Learning-based Learning System

Frameworks

High-level data flow graph and optimizations

Hardware-aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

Hardware

Directly generate optimized programs for new operator workloads and hardware

Hardware-aware Search Space

Tensor Expression Language (Specification)

C = tvm.compute((m, n),
                lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Define a search space of hardware-aware mappings from expression to hardware program

Based on Halide's compute/schedule separation

Hardware-aware Search Space

CPUs

Compute primitives: scalar, vector
Memory subsystem: implicitly managed (L1D/L1I, L2, L3)

Loop transformations: cache locality, vectorization

Reuse primitives from prior work: Halide, Loopy

Challenge to Support Diverse Hardware Backends

CPUs, GPUs, TPU-like specialized accelerators

Hardware-aware Search Space

GPUs

Compute primitives: scalar, vector
Memory subsystem: mixed (L2 cache; per-SM register files and TX/L1)

Shared memory among compute cores

Thread cooperation, use of shared memory
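A minimal sketch (not from the slides; the workload and 32x32 tiles are illustrative assumptions) of expressing thread cooperation in the schedule language: the threads of a block cooperatively stage a tile of the input in shared memory before computing.

import tvm

n = 1024
A = tvm.placeholder((n, n), name="A")
B = tvm.compute((n, n), lambda i, j: A[i, j] * 2.0, name="B")

s = tvm.create_schedule(B.op)
AA = s.cache_read(A, "shared", [B])             # stage A in shared memory

i, j = B.op.axis
io, ii = s[B].split(i, factor=32)
jo, ji = s[B].split(j, factor=32)
s[B].reorder(io, jo, ii, ji)
s[B].bind(io, tvm.thread_axis("blockIdx.y"))
s[B].bind(jo, tvm.thread_axis("blockIdx.x"))
s[B].bind(ii, tvm.thread_axis("threadIdx.y"))
s[B].bind(ji, tvm.thread_axis("threadIdx.x"))

s[AA].compute_at(s[B], jo)                      # one tile per thread block
ai, aj = AA.op.axis
s[AA].bind(ai, tvm.thread_axis("threadIdx.y"))  # cooperative fetch: every
s[AA].bind(aj, tvm.thread_axis("threadIdx.x"))  # thread loads one element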

Hardware-aware Search Space

TPU-like Specialized Accelerators

Compute primitives: tensor
Memory subsystem: explicitly managed (Unified Buffer, Acc, FIFO)

Tensorization Challenge

Compute primitives: scalar, vector, tensor

Hardware designer: declare the tensor instruction interface with the Tensor Expression language.

Declare behavior:

w, x = t.placeholder((8, 8)), t.placeholder((8, 8))
k = t.reduce_axis((0, 8))
y = t.compute((8, 8), lambda i, j: t.sum(w[i, k] * x[j, k], axis=k))

Lowering rule to generate hardware intrinsics that carry out the computation:

def gemm_intrin_lower(inputs, outputs):
    ww_ptr = inputs[0].access_ptr("r")
    xx_ptr = inputs[1].access_ptr("r")
    zz_ptr = outputs[0].access_ptr("w")
    compute = t.hardware_intrin("gemm8x8", ww_ptr, xx_ptr, zz_ptr)
    reset = t.hardware_intrin("fill_zero", zz_ptr)
    update = t.hardware_intrin("fuse_gemm8x8_add", ww_ptr, xx_ptr, zz_ptr)
    return compute, reset, update

gemm8x8 = t.decl_tensor_intrin(y.op, gemm_intrin_lower)

Tensorize: transform the program to use tensor instructions (scalar → tensor).
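A hedged sketch of the user side: after tiling the matmul from the earlier sketch into 8x8x8 blocks, tensorize replaces the inner block with the declared intrinsic. It assumes the schedule s and output C from that sketch and the gemm8x8 intrinsic declared above.

yo, xo, yi, xi = s[C].tile(C.op.axis[0], C.op.axis[1], 8, 8)
ko, ki = s[C].split(C.op.reduce_axis[0], factor=8)
s[C].reorder(yo, xo, ko, yi, xi, ki)
s[C].tensorize(yi, gemm8x8)   # the inner 8x8x8 block now lowers to hardware intrinsics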


Software Support for Latency Hiding

Single module, no task pipelining: load inputs → load weights → matrix multiplication → store outputs, repeated serially.

Multiple modules, task-level pipelining: the load, compute, and store stages of successive iterations overlap.

Explicit dependency tracking managed by software to hide memory latency
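A minimal sketch (not from the slides) of how a schedule author requests this in TVM: splitting a loop and binding the outer part to a virtual thread axis creates interleaved contexts like TID0-TID3 in the accelerator pseudo-code, and the compiler inserts the dependency PUSH/POP tokens. The workload, the four-way split, and the "cthread" axis tag (the one used in VTA examples) are illustrative assumptions.

import tvm

n = 1024
A = tvm.placeholder((n,), name="A")
B = tvm.compute((n,), lambda i: A[i] + 1.0, name="B")

s = tvm.create_schedule(B.op)
vt, bi = s[B].split(B.op.axis[0], nparts=4)  # four virtual thread contexts
s[B].bind(vt, tvm.thread_axis("cthread"))    # compiler interleaves the contexts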

Hardware-aware Search Space

Tensor Expression Language

C = tvm.compute((m, n),
                lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Primitives from prior work (Halide, Loopy): loop transformations, thread bindings, cache locality

New primitives for GPUs and for TPU-like accelerators: thread cooperation, tensorization, latency hiding

Billions of possible optimization choices


Learning-based Program Optimizer

Program Optimizer → Program Code Generator → Runtime Measurements

High experiment cost: each trial takes ~1 second

Need a reliable cost model per hardware

Training data D → learning → Statistical Cost Model

Adapts to the hardware type by learning; makes predictions at the 1 ms level
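A hedged sketch of this loop with autotvm, reusing the matmul template from the earlier sketch: the XGBTuner fits a gradient-boosted-tree cost model to the runtime measurements and proposes the next batch of candidates. The trial count and logging setup are illustrative.

from tvm import autotvm

# Create a tuning task from the matmul template sketched earlier.
task = autotvm.task.create(matmul, args=(1024, 1024, 1024), target="cuda")

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5))  # each candidate is timed on hardware

tuner = autotvm.tuner.XGBTuner(task)       # learned statistical cost model
tuner.tune(n_trial=800,
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file("matmul.log")])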

Effectiveness of ML-based Model

[Figure: relative speedup (baseline: cuDNN) vs. number of trials (0-800) for one Conv2D layer of ResNet-18 on Titan X, comparing TVM with random search against TVM with the ML-based model.]


End-to-End Inference Performance (Nvidia Titan X)

[Figure: inference time in ms for ResNet-18 and MobileNet (0-7 ms scale) and for LSTM LM, DQN, and DCGAN (0-0.9 ms scale), comparing Tensorflow-XLA, Tensorflow, and Apache MXNet (backed by cuDNN) against TVM without graph optimizations and TVM with all optimizations.]

Competitive on standard models

Bigger gap on less conventional models

End-to-End Performance (ARM Cortex-A53)

[Figure: inference time in ms for ResNet-18 and MobileNet (0-800 ms scale) and for DQN (0-12 ms scale), comparing Tensorflow Lite (specially optimized for embedded ARM), TVM without graph optimizations, and TVM.]

End-to-End Performance (ARM GPU)

[Figure: inference time in ms for ResNet-18 and MobileNet in float32 and float16 (0-250 ms scale) and for DQN (0-5 ms scale), comparing ARMComputeLib, TVM without graph optimizations, and TVM.]

Supporting New Specialized Accelerators

Frameworks

High-level data flow graph and optimizations

Hardware-aware Search Space of Optimized Tensor Programs

Machine Learning based Program Optimizer

Backends: LLVM, CUDA, and VTA (Open, Customizable Deep Learning Accelerator)

Edge FPGA, Data Center FPGA, ASIC

TVM/VTA: Full Stack Open Source System

High-level Optimizations
Tensor Program Search Space
ML-based Optimizer
VTA Runtime & JIT Compiler
VTA Hardware/Software Interface (ISA)
VTA MicroArchitecture, VTA Simulator

• JIT compile accelerator micro code

• Support heterogeneous devices; 10x better than CPU on the same board

• Move hardware complexity to software

Compiler, driver, and hardware design: full stack open source
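A hedged sketch (patterned on the public VTA tutorials; names such as env.acc_scope and env.dma_copy come from that API and may differ across versions) of how the explicitly managed memory shows up in a schedule: buffers are pinned to on-chip scopes and data movement becomes explicit DMA pragmas, which the VTA runtime JIT-compiles into accelerator micro code.

import tvm
import vta

env = vta.get_env()  # VTA hardware parameters (tensor shapes, scopes, dtypes)

# A tiny copy-through workload in VTA's native tensorized layout.
m = 4
A = tvm.placeholder((m, env.BATCH, env.BLOCK_OUT), dtype=env.acc_dtype, name="A")
A_buf = tvm.compute(A.shape, lambda *i: A(*i), name="A_buf")
C = tvm.compute(A.shape, lambda *i: A_buf(*i).astype(env.inp_dtype), name="C")

s = tvm.create_schedule(C.op)
s[A_buf].set_scope(env.acc_scope)                   # pin to on-chip accumulator SRAM
s[A_buf].pragma(s[A_buf].op.axis[0], env.dma_copy)  # DRAM -> SRAM load
s[C].pragma(s[C].op.axis[0], env.dma_copy)          # SRAM -> DRAM store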

TVM: Learning-based Learning System

Check it out!

ML-based Optimizer
Tensor Program Search Space
High-level Optimizations
VTA, LLVM, CUDA