AutoTM: Automatic Tensor Movement in Heterogeneous …

AutoTM: Automatic Tensor Movement in HeterogeneousMemory Systems using Integer Linear Programming

Mark Hildebrand1,Jawad Khan2, Sanjeev Trika2,

Jason Lowe-Power1, Venkatesh Akella1

1 University of California, Davis2 Intel Corporation

https://github.com/darchr/AutoTM

March 12, 2020

1/29


Executive Summary

ProblemAutomatic two-level memory management for Deep Neural Networks

Idea• Profile Guided Optimization

• Model as an Integer Linear Program (ILP)

Results• Replace 50-80% DRAM with NVDIMMs with geometric mean 27.1% performance

loss.

• 3x better performance than real hardware cache.

2/29

Outline

Background

AutoTMProfilingILP Modeling

Results

Wrap Up

3/29

Why Deep Neural Networks

to train large models on a single machine?Can we use multiple levels of memory

Image: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

4/29

https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

Heterogeneous Memory Systems

• Two types of memory.

• Same memory controller.

• Both are byte addressable.

• NVDIMMs for high capacity and low cost

Challenges

• All tensors in NVDIMMs memory is too slow.

• DRAM as a cache for NVDIMMs also too slow.

• Intelligent memory management required.

NVDIMM Style

5/29

Outline

Background


Results

Wrap Up

6/29

AutoTM

GoalMinimize execution time

• Arbitrary computation graph

• Size constraint on fast memory

How• Place tensors in fast or slow memory.

• Optimal tensor movement

Strategy

• Profile kernel performance.

• Model tensor assignment as ILP.

7/29

Kernel Profiling

Profile performance of kernels for all tensorIO locations.

Kernel IO Tensor Locations

K2

T1 T2 T3

DRAM DRAM DRAMDRAM DRAM PMMDRAM PMM DRAMDRAM PMM PMMPMM DRAM DRAMPMM DRAM PMMPMM PMM DRAMPMM PMM PMM

Table: Profile space for kernel K2.

DRAMDRAMDRAM

DRAMDRAMPMM

DRAMPMMDRAM

DRAMPMMPMM

PMMDRAMDRAM

PMMDRAMPMM

PMMPMMDRAM

PMMPMMPMM

0

1

2

Data In:Weight:

Data Out:

Perform

ance

relative

toallIO

inDRAM

8/29

Tensor Lifetime Flow Network

Path of flow through the graph describes where a tensor’s memory location throughoutits lifetime.

9/29



10/29



11/29



12/29



13/29



14/29



15/29

ILP Modeling

Objective Function

Computation time

min∑k∈K

ρk +∑t∈T

Mt

16/29

ILP Modeling

Objective Function

Computation time

min∑k∈K

ρk︸︷︷︸Kernel

ExecutionTime

+∑t∈T

Mt

K : Set of Kernels

ρk : Run time of kernel k

ExampleRun time of kernel k2

17/29

ILP Modeling

Objective Function

Computation time

min∑k∈K


ExecutionTime

+∑t∈T

Mt︸︷︷︸Tensor

MovementTime

T : Set of Tensors

Mt : Time moving tensor t

ExampleTime moving tensor t1

18/29

ILP Modeling

Objective Function

Computation time

min∑k∈K


ExecutionTime

+∑t∈T

Mt︸︷︷︸Tensor

MovementTime

ConstraintsLimit DRAM at eachkernel∑t∈L(k)

‖t‖IDRAMt,k ≤ Limit ∀k

19/29

Variations of AutoTM

Name DescriptionPMM

SystemGPU

System

Static Tensor’s can’t move 3 7

Synchronous Tensor’s move but blockcomputation

3 3

Asynchronous Tensor movement con-current with computa-tion

l 3

20/29

Outline

Background


Results

Wrap Up

21/29

Experiments!

Software

• Modified the ngraph1 compiler.

• Julia’s JuMP2 package for ILPmodeling.

• Gurobi3 as the ILP solver.

Hardware

• 1.5 TB OptaneTM DC PMM

• 384 GiB DRAM

Workloads −−−−−−−−−−−−−−→

Conventional Batchsize Memory (GB)

Inception v4 1024 111Vgg 19 2048 143

Resnet 200 512 132DenseNet 264 512 115

Large Batchsize Memory (GB)

Inception v4 6144 659Vgg 416 128 658

Resnet 200 2560 651DenseNet 264 3072 688

1https://github.com/NervanaSystems/ngraph2https://github.com/JuliaOpt/JuMP.jl3gurobi.com

22/29

Scaling Performance - Inception V4

Performance of Inception v4 - Batchsize 1024

0 20 40 60 80 100 120

1

2

3

4

5

Dram Limit (GB)Lower is Better

SlowdownLower is Better

synchronous

23/29



0 20 40 60 80 100 120

1

2

3

4

5



synchronous

◦ Just using PMM is too slow.

24/29



0 20 40 60 80 100 120

1

2

3

4

5



synchronous

◦ Just using PMM is too slow.24/29



0 20 40 60 80 100 120

1

2

3

4

5



synchronous

◦ Best performance when working-set fits in memory.

25/29



0 20 40 60 80 100 120

1

2

3

4

5



synchronous

◦ Best performance when working-set fits in memory.25/29

Comparison Against 2LM

Vgg416(320)

Inception v4 (6144)

Resnet200 (2560)

DenseNet 264 (3072)

0

1

2

3

Speedup over 2LMHigher is Better

static-AutoTM sync-AutoTM

2LM DRAM Cache

26/29


Vgg416(320)

Inception v4 (6144)

Resnet200 (2560)

DenseNet 264 (3072)

0

1

2

3



2LM DRAM Cache • Avoid Dirty Writebacks

• Lower Memory Contention

26/29


Vgg416(320)

Inception v4 (6144)

Resnet200 (2560)

DenseNet 264 (3072)

0

1

2

3



2LM DRAM CacheSoftware managementoutperforms hardwaremanagement by up to 3x.

26/29

Outline

Background


Results

Wrap Up

27/29

Limitations

• Static computation graphs.

• Kernel profiling overhead.

• ILP solution times.

• ILP solution may be hard to interpret.

Vgg19 Inception v4 Resnet200 DenseNet 264

101

102

103

ILP Solution Time (s)Logarithmic


28/29

Conclusion

AutoTM: A technique for managing tensors in heterogeneous memory systems.

• Profiling for Kernel Performance.

• Use ILP to optimally assign tensor location and movement.

• Three formulations: Static, Synchronous, Asynchronous.

We show

• Reduce DRAM requirement.

• Significant performance improvement over hardware solutions.

Code Available: https://github.com/darchr/AutoTM

29/29


Common Questions

30/29

Asynchronous Movement on PMMs

• Interference between DRAM and PMM.

• Low bandwidth and difficulty of DMA.

• Performance of kernels greatly impacted due to copy kernels.

31/29

GPU

12345678910

64 128 256 32 64 128 32 64 128 64 128Inception v4 Resnet200 DenseNet 264 Vgg19

Speedupover

CudaM

allocM

anag

edsynchronous asynchronous oracle

32/29

RNNs - more complex models

• AutoTM is limited to static computation graphs.• RNNs have dynamic behavior (i.e. unrolling based on sequence length).• RNNs can be implemented statically.• Key ideas from AutoTM can be used for dynamic workloads.

0 20 40 60 80 100 120

0

0.5

1

DRAM Limit (GB)

Percentof

Kernel

IOin

DRAM

static: input tensorsstatic: output tensors

synchronous: input tensorssynchronous: output tensors

33/29

Concluding Conclusion

AutoTM: A technique for managing tensors in heterogeneous memory systems.

• Profiling for Kernel Performance.

• Use ILP to optimally assign tensor location and movement.

• Three formulations: Static, Synchronous, Asynchronous.

We show

• Reduce DRAM requirement.

• Significant performance improvement over hardware solutions.

Code Available: https://github.com/darchr/AutoTM

34/29


AutoTM: Automatic Tensor Movement in Heterogeneous …

Documents

Transcript of AutoTM: Automatic Tensor Movement in Heterogeneous …