Transcript of MIXED PRECISION TRAINING FOR CONVOLUTIONAL TENSOR-TRAIN LSTM

Page 1

Wonmin Byeon, August 23, 2020

MIXED PRECISION TRAINING FOR CONVOLUTIONAL TENSOR-TRAIN LSTM

Page 2

APPLICATIONS

Early Activity Recognition

Video Prediction

Spatio-Temporal Learning

Page 3

CONVOLUTIONAL LSTM
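The slide's figure and equations are not preserved in this transcript. For reference, a common (peephole-free) formulation of the convolutional LSTM cell (Shi et al., 2015), which Conv-TT-LSTM builds on; here * denotes convolution, \circ the element-wise product, and \sigma the sigmoid:

i_t = \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i)
f_t = \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f)
o_t = \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o)
C_t = f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)
H_t = o_t \circ \tanh(C_t)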

Page 4

CONVOLUTIONAL TENSOR-TRAIN LSTM
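The slide's figure is likewise not preserved. As background (a sketch, not the slide's exact notation): Conv-TT-LSTM is a higher-order ConvLSTM that convolves over a window of previous hidden states, and it keeps this tractable by factorizing the resulting large kernel with a convolutional tensor-train decomposition. The standard tensor-train format writes an order-d tensor as a chain of low-rank cores:

\mathcal{T}(i_1, i_2, \ldots, i_d) = \sum_{r_1, \ldots, r_{d-1}} G_1(i_1, r_1)\, G_2(r_1, i_2, r_2) \cdots G_d(r_{d-1}, i_d)

so storage and computation scale with the TT ranks r_k instead of growing exponentially in d.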

Page 5

OPTIMIZATION TRICKS

• Speed-up tricks to train Convolutional LSTM/Convolutional Tensor-Train LSTM

#1. Enabling mixed precision training using NVIDIA AMP

#2. GPU Optimization: Fused Adam, Fused kernels in LSTM cell, Thread affinity binding

#3. LSTM Activation Checkpointing

• Fast convergence without performance change

• Model parallelism using multi-streams

Page 6

#1. ENABLING AMP

# save the checkpoint

# load the checkpoint
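The code listing from this slide is not preserved in the transcript. Below is a minimal sketch of enabling mixed precision with NVIDIA Apex AMP, including the checkpoint save/load shown above; the model, optimizer, opt_level, and file names are placeholder assumptions, not the Conv-TT-LSTM training code.

import torch
from apex import amp  # assumes NVIDIA Apex is installed

# Placeholder model/optimizer for illustration (not the Conv-TT-LSTM code).
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Enable mixed precision: wrap model and optimizer once, before the training loop.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(4, 3, 128, 128, device="cuda")
loss = model(x).mean()

# Scale the loss so FP16 gradients do not underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()

# save the checkpoint (include the AMP state so loss scaling resumes correctly)
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "amp": amp.state_dict()}, "checkpoint.pt")

# load the checkpoint
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
amp.load_state_dict(ckpt["amp"])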

Page 7

#2. GPU OPTIMIZATION

● Fused Adam: Adam optimizer
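A minimal sketch of the fused optimizer, assuming NVIDIA Apex's FusedAdam (the model below is a placeholder, not the Conv-TT-LSTM code). FusedAdam batches Adam's element-wise update math into far fewer CUDA kernel launches than the stock torch.optim.Adam.

import torch
from apex.optimizers import FusedAdam  # assumes NVIDIA Apex is installed

model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1).cuda()

# Drop-in replacement for torch.optim.Adam, with fused CUDA update kernels.
optimizer = FusedAdam(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 3, 128, 128, device="cuda")).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()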

Page 8

#2. GPU OPTIMIZATION

● Fused optimizer: Adam optimizer

● Fused kernel with JIT: LSTM cell
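The slide's listing is not preserved; below is a minimal sketch of the idea, scripting the point-wise gate math of an LSTM cell with TorchScript so the JIT can fuse the element-wise operations. The shapes and 4-way gate layout are assumptions; a ConvLSTM would feed its convolution outputs into the same fused function.

import torch

# Point-wise LSTM gate math as one TorchScript function; the JIT can fuse
# these element-wise ops into far fewer CUDA kernels than eager mode.
@torch.jit.script
def fused_lstm_gates(gates: torch.Tensor, c_prev: torch.Tensor):
    i, f, g, o = gates.chunk(4, dim=1)
    i = torch.sigmoid(i)
    f = torch.sigmoid(f)
    o = torch.sigmoid(o)
    g = torch.tanh(g)
    c_next = f * c_prev + i * g
    h_next = o * torch.tanh(c_next)
    return h_next, c_next

# Example: gates produced by a convolution over [input, hidden] (placeholder shapes).
gates = torch.randn(2, 4 * 8, 32, 32)
c_prev = torch.zeros(2, 8, 32, 32)
h, c = fused_lstm_gates(gates, c_prev)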

Page 9

#2. GPU OPTIMIZATION

● Fused optimizer: Adam optimizer

● Fused kernel with JIT: LSTM cell

● Thread affinity binding: distributed computing
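The slide gives no code; one common way to implement thread-affinity binding for multi-GPU distributed training is to pin each worker process (one per GPU) to its own block of CPU cores, for example with Linux's sched_setaffinity or an external tool such as numactl. The core layout and LOCAL_RANK environment variable below are assumptions for illustration.

import os

# Hypothetical helper: pin the current worker (one process per GPU) to a
# contiguous block of CPU cores so its data-loading and kernel-launch threads
# do not migrate across sockets. Core counts are assumptions; adjust per node.
def bind_thread_affinity(local_rank: int, cores_per_rank: int = 8) -> None:
    start = local_rank * cores_per_rank
    os.sched_setaffinity(0, range(start, start + cores_per_rank))  # 0 = this process (Linux only)

if __name__ == "__main__":
    # torch.distributed launchers typically export LOCAL_RANK (assumption).
    bind_thread_affinity(int(os.environ.get("LOCAL_RANK", "0")))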

Page 10

TRAINING: Convolutional LSTM

Application: video prediction
Machine: V100 x 8, 16GB
Batch size: 16 videos
12 Conv. LSTM layers
Input/output image resolution: 128x128

[Charts: convergence time: 7.8x faster / 6.6x faster; GPU memory: 1.5x less. No performance change!]

Page 11

TRAINING: Convolutional Tensor-Train LSTM

Application: video prediction
Machine: V100 x 8, 16GB
Batch size: 16 videos
12 Conv. LSTM layers
Input/output image resolution: 128x128

[Charts: convergence time: 5.5x faster; GPU memory: 1.5x less. No performance change!]

Page 12

#3. ACTIVATION CHECKPOINTING
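A minimal sketch of activation checkpointing with torch.utils.checkpoint (the ConvLSTM-style cell below is a placeholder, not the Conv-TT-LSTM implementation): the activations inside the wrapped cell are not stored for the backward pass; they are recomputed, trading extra compute for lower GPU memory.

import torch
from torch.utils.checkpoint import checkpoint

class ConvLSTMCell(torch.nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = torch.nn.Conv2d(2 * channels, 4 * channels,
                                    kernel_size, padding=kernel_size // 2)

    def forward(self, x, h, c):
        i, f, g, o = self.conv(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c_next = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h_next = torch.sigmoid(o) * torch.tanh(c_next)
        return h_next, c_next

cell = ConvLSTMCell(channels=8)
x = torch.randn(2, 8, 32, 32, requires_grad=True)
h = torch.zeros(2, 8, 32, 32)
c = torch.zeros(2, 8, 32, 32)

# Wrap each cell call; only its inputs and outputs stay in memory, and the
# intermediate activations are recomputed when gradients are needed.
h, c = checkpoint(cell, x, h, c)
h.sum().backward()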

Page 13

TRAINING: Convolutional LSTM

Application: video prediction
Machine: V100 x 8, 16GB
Batch size: 16 videos
12 Conv. LSTM layers
Input/output image resolution: 128x128

[Charts: convergence time: 7.8x faster; GPU memory: 1.8x less memory usage. No performance change!]

Page 14

TRAINING: Conv. Tensor-Train LSTM

Application: video prediction
Machine: V100 x 8, 16GB
Batch size: 16 videos
12 Conv. LSTM layers
Input/output image resolution: 128x128

[Charts: convergence time: 5.5x faster; GPU memory: 2.3x less memory usage. No performance change!]

Page 15

MODEL PARALLEL: MULTI-STREAMS

[Diagram: example model with parallel branches]

• Use multiple CUDA streams
• Execute multiple kernels that do not have data dependency in parallel

Page 16

MODEL PARALLEL: MULTI-STREAMS

• Use multiple CUDA streams
• Execute multiple kernels that do not have data dependency in parallel

Operations of layers at different branches are overlapped (see the sketch below).

[Diagram: example model with parallel branches]
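A minimal sketch of the multi-stream idea in PyTorch, assuming two branch layers with no data dependency (placeholder modules, not the authors' model): each branch is launched on its own CUDA stream, and the default stream waits for both before combining the results.

import torch

assert torch.cuda.is_available()

branch_a = torch.nn.Conv2d(16, 16, 3, padding=1).cuda()
branch_b = torch.nn.Conv2d(16, 16, 3, padding=1).cuda()
x = torch.randn(8, 16, 64, 64, device="cuda")

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

# Make sure both streams see the already-computed input.
s1.wait_stream(torch.cuda.current_stream())
s2.wait_stream(torch.cuda.current_stream())

# Launch the independent branches on separate streams so their kernels can overlap.
with torch.cuda.stream(s1):
    out_a = branch_a(x)
with torch.cuda.stream(s2):
    out_b = branch_b(x)

# The default stream must wait for both branches before combining them.
torch.cuda.current_stream().wait_stream(s1)
torch.cuda.current_stream().wait_stream(s2)
out = out_a + out_b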

Page 17

MODEL PARALLEL: MULTI-STREAMS

[Chart: computation time comparison, single-stream vs. multi-stream: multi-stream is 2.8x faster]

Page 18

SUMMARY

Summary of speed-up tricks to train Convolutional Tensor-Train LSTM

• Enabling AMP

• GPU Optimization

• Activation Checkpointing

• ConvLSTM: 7.8x speed-up, 1.8x less memory usage

• Conv. Tensor-Train LSTM: 5.5x speed-up, 2.3x less memory usage

• Fast convergence without performance change

• Model parallelism using multi-streams

Page 19

CONCLUSION

• Paper: Jiahao Su* (UMD), Wonmin Byeon* (NVIDIA), Jean Kossaifi (NVIDIA), Furong Huang (UMD), Jan Kautz (NVIDIA), Animashree Anandkumar (NVIDIA), 'Convolutional Tensor-Train LSTM for Spatio-Temporal Learning', under submission, 2020.

• Project page: https://sites.google.com/nvidia.com/conv-tt-lstm/home

• Code Optimization: Sangkug Lym (NVIDIA)

• The code with the optimization tricks will be available soon: https://github.com/NVlabs/conv-tt-lstm

Page 20