L Llvm T CS 744: TVM

Post on 22-Feb-2022

1 views 0 download

Transcript of L Llvm T CS 744: TVM

CS 744: TVM

Shivaram VenkataramanFall 2020

TensorVirtual Machine

x.dk! L, Llvm

T-

ADMINISTRIVIA

- Course project titles- Project proposal aka Introduction (10/16)

IntroductionRelated WorkTimeline (with eval plan)

- Midterm: Oct 22

Assignment→

<

]→ 2 pagewriteup

MACHINE LEARNING: STACK

✓Distributed

noowed

Train efferent just lie↳ forward pass

→ Interplayinference

&

quondam \, training

makedistributedeasy ( dealing inference

groin v

Hardwareand saddle

MOTIVATION: PERFORAMNCE PORTABILITYPytoreh → model file

"intent :?÷:www.rayf/4TE-iTIyconfute primitives matrix cow

multiply ed

you want high performance I 1

across hardware backends-

Dependence onvendor specific o o • q

libraries

MLmodels evolve fast ⇒ new operators

new combination of operators Y ⇒ Notavailable

in existingvendor

libraries

AM

→ Python code describes

ML model

→Tvm

. . )=

→ Binary file thatruns on hardware

OPTIMIZATION COMPUTATION GRAPHSOperator Fusion

Data layout

[ I¥÷÷÷¥÷÷÷÷÷:* " "÷ . :÷÷÷:- -T -

→ 1-1 operators ,"

map"

-

→ Sum reduction, scaling after

↳(Spg-

Kow Major ,

column Major ,Blocked

g, Infest

teatisrepresented

2- layerNN

.in/TEHtEi/:::IS...:: :as layout

TENSOR EXPRESSION LANGUAGE

Common Arithmetic, Math operationsKnow the shape of the output and the data accessed

operator cry↳ expressed in tensor expression language

↳tensor) math operations

CODE GENERATION

Nested parallelism

Tensorization

Halide→ expression OpenMP ← gu+ of imtmhns

ead does

a.← anime fi:÷÷÷÷÷÷:÷÷i÷÷÷'

jaihe"

for i in l : "

for j . int :S

threads can use as , Bstdgmptd.im#isIL;i!dffIHdeu- poker=doopiterah#bad ,

store, add → whet is the

hardware instructionset-

= Allows you to

- - register operatorExtensible ! intrinsic

Latency HIDING

What is the goal?

Someas

Pytorchetc .

9↳ Overlap computation and

communication

Schedule thatutilizes

-

memorybandwidth &

compute units

ig:*year 'fad

AUTOMATING OPTIMIZATION

Goal: Create a specialized operator for input shape and layoutChallenge:

Choose appropriate schedule optimizationsTiling size, loop unrolling

Automate the optimizer!

---

- - - - r . lots of differentchoices&

also lots of parameters Huntsto

choose .

FimMl ? -

"" m.

what configurationsI

to Try ←

ML-Based Cost model

Machine Learning Model Design ChoicesSpeed: Faster than time it takes to evaluate a configQuality: Use a rank objective to predict the relative order of runtime

Gradient tree boosting modelmemory access countreuse ratio of each memory buffer at each loop levelone-hot encoding of loop annotations

as to

→ →n seconds

take

code generated

features

ML-BASED COST MODEL

IterationSelect a batch of candidates Collect data Use as training data to update the model

How to select candidates?Parallel Simulated Annealing

Start from a random configWalk to a nearby config à

Successful if cost decreases Else Reject

model perfwhen using ←config

,

tom >LE -y

( Cz ,20ms

→each candidate is

ccz,

8ms>41

a wyignratim lahhhh

< £ , 'fail>fief a ← :

harder:c ,

→ step a)above trashy data

- toaB7✓~,

Aa ↳ → ↳'

Task model is cj better than b

Yes → go& try d

,

on cluster

No → generateanother

oyer config

Distributed device pool

Pool of devices to speed up profilingRPC interface to run a trial on device

Share device pools for multiple graphs

SUMMARY

TVM: Compiler for ML inference modelsSupport high performance for range of models, hardware devices

Key ideasGraph-level optimizationsTensor expression language: Code-gen, Latency hiding etcML based Cost Model for automation

→→ operator fusion

-

DISCUSSIONhttps://forms.gle/WiVgJ3abGXXgfBN99

Consider that you are building an optimizer for Spark programs instead of ML inference. What would be some configuration knobs that you could similarly tune? What might be different from the TVM optimizer?

Similar logic → latency hidingoverlap comp ,

communication

7rYYdimemim'operatorfmon→mapBoperahnaccess patterns

↳laa#↳ operators are

user definedchallenging ??-Partitioning → can you automate

↳ number of partitions / co- partitioning\ performance !Had .

cache →config space

!

Persistence → manually insert

What is your takeaway from the following figure?

→ fastingqmon

f-T" bae:Enea.:c.

.

honey! week,unite:b.or

NEXT STEPS

Next class: RayCourse project: Oct 16 (introductions)Midterm: Oct 22

latency hiding in spark ?

Drddlsmaf tasks > ✓

credence tasks|D÷i;:D rddz map ← Hanffiles:

Edna > ← nocomm

D wait