Nathan Luehr, Sr. Developer Technology Engineer, NVIDIA

Nov 5, 2019

Automatic Mixed-Precision Training in TensorFlow

Outline

Mixed Precision Background

Automatic Loss Scaling

Mixed Precision Graph Optimizer

Using Automatic Mixed Precision in TensorFlow

Mixed Precision Results

Deep Learning Profiler

Questions

Mixed Precision Background

Motivation

Reduced precision (16-bit floating point) for speed or scale

Full precision (32-bit floating point) to maintain task-specific accuracy

By using multiple precisions, we avoid a pure tradeoff of speed and accuracy

Goals:

Maximize use of reduced precision while matching accuracy of full precision training with no changes to hyperparameters

Provide automated and general purpose tools so that mixed precision can be easily enabled

Automatic Mixed Precision: In a nutshell

Add one line to your training script.

Achieve a 1.5x to 3x speedup compared to FP32 training with no loss in accuracy.

Examples: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/

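import tensorflow as tf  # TF 1.14+

opt = tf.train.MomentumOptimizer(learning_rate, momentum)
# The one added line: wrap the optimizer to enable loss scaling and the graph rewrite
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
update_op = opt.minimize(loss)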

Tensor Cores: Hardware support for accelerated 16-bit FP math

Peak throughput of 125 TFLOPS (8x FP32) on V100

Inherently mixed precision: internal accumulation occurs in FP32 for accuracy

Used by cuDNN and cuBLAS libraries to accelerate matrix multiply and convolution
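The benefit of accumulating in a wider type can be illustrated without a GPU. The sketch below is an illustration only, not a model of how Tensor Cores are implemented: it emulates FP16 rounding with Python's struct module (format code 'e'), and Python floats (FP64) stand in for the FP32 accumulator. The helper names to_fp16, dot_fp16_accum, and dot_wide_accum are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_fp16_accum(a, b):
    """Dot product whose running sum is rounded to FP16 after every add."""
    total = 0.0
    for x, y in zip(a, b):
        product = to_fp16(to_fp16(x) * to_fp16(y))
        total = to_fp16(total + product)
    return total


def dot_wide_accum(a, b):
    """Dot product of FP16 inputs with a wider accumulator.

    Python floats are FP64; they stand in for the FP32 accumulator here.
    """
    total = 0.0
    for x, y in zip(a, b):
        total += to_fp16(x) * to_fp16(y)
    return total


ones = [1.0] * 4096
fp16_result = dot_fp16_accum(ones, ones)  # stalls at 2048.0: 2049 is not representable in FP16
wide_result = dot_wide_accum(ones, ones)  # 4096.0
```

With an FP16 running sum, every add past 2048 is rounded away, so half of the terms are lost; the wider accumulator returns the exact result.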

Mixed Precision Strategy: For training

► Use Tensor Cores to accelerate convolutions and matrix multiplications

► Store most activations in FP16
○ Enables larger models and/or larger batch sizes
○ Doubles effective bandwidth compared to FP32

► Use FP32 for ops that are likely to overflow (e.g., sums, reductions, log, exp)

► Use loss scaling to keep gradients in the FP16 representable range

► Update model parameters in FP32 to avoid truncation
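The last point can be made concrete with a small emulation. This is a minimal sketch under stated assumptions: FP16 rounding is emulated with Python's struct module, Python floats (FP64) stand in for the FP32 master copy, and the update size of 1e-4 is a hypothetical value of learning rate times gradient.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


step = 1e-4      # hypothetical learning_rate * gradient
w_fp16 = 1.0     # weight stored and updated in FP16
w_master = 1.0   # master weight kept in higher precision (FP64 stands in for FP32)

for _ in range(1000):
    # Near 1.0 the FP16 grid spacing is about 5e-4, so this update rounds away.
    w_fp16 = to_fp16(w_fp16 - step)
    # The master copy keeps every small update.
    w_master = w_master - step

# FP16 view of the master weight, as used by the forward and backward passes.
w_fp16_view = to_fp16(w_master)
```

After 1000 updates the FP16-only weight is still exactly 1.0, while the master weight has moved to about 0.9 and its FP16 view follows it.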

Automatic Loss Scaling

Loss Scaling: The idea

Range of numbers representable in FP16 is ~40 powers of 2

Gradients are small:
○ Some underflow to zero
○ While ~15 powers of 2 are unused

Scaling the loss, L(x), by s uniformly increases gradient values by a factor of s

Unscale weight gradients (in FP32) for weight update
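A minimal numeric sketch of the idea, assuming FP16 rounding emulated with Python's struct module. The gradient value 1e-8 is hypothetical, and the scale is applied to the gradient directly here, whereas in training it reaches the gradients through the scaled loss via the chain rule.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8          # hypothetical small gradient value
scale = 2.0 ** 15    # loss scale s

unscaled_fp16 = to_fp16(grad)        # 0.0: below the smallest FP16 subnormal
scaled_fp16 = to_fp16(grad * scale)  # about 3.28e-4: representable in FP16
recovered = scaled_fp16 / scale      # unscale in higher precision before the update
```

Without scaling the gradient is lost entirely; with scaling it survives the FP16 round trip with a relative error below 0.1%.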

Automatic Loss Scaling: Tuning s on the fly

Start with a very large scale factor

If any gradient value results in Inf or NaN, decrease s AND skip the weight update (including any optimizer state)

If gradients have remained finite for some number of batches, increase s.
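The three rules above can be sketched in plain Python. This is an illustration, not TensorFlow's implementation: the class name DynamicLossScaler and its method step are hypothetical, and the defaults (initial scale 2^15, growth after 2000 finite steps, factor 2) follow the values quoted later in this deck.

```python
import math


class DynamicLossScaler:
    """Hypothetical sketch of automatic loss scaling."""

    def __init__(self, initial_scale=2.0 ** 15, growth_interval=2000, factor=2.0):
        self.scale = initial_scale
        self.growth_interval = growth_interval
        self.factor = factor
        self.finite_steps = 0

    def step(self, scaled_grads):
        """Return unscaled gradients, or None if the update must be skipped."""
        if not all(math.isfinite(g) for g in scaled_grads):
            # Inf or NaN found: decrease s and skip the weight update.
            self.scale /= self.factor
            self.finite_steps = 0
            return None
        # Unscale with the scale that produced these gradients.
        grads = [g / self.scale for g in scaled_grads]
        self.finite_steps += 1
        if self.finite_steps >= self.growth_interval:
            # Gradients stayed finite long enough: increase s.
            self.scale *= self.factor
            self.finite_steps = 0
        return grads


scaler = DynamicLossScaler(initial_scale=8.0, growth_interval=2)
first = scaler.step([float("inf"), 1.0])  # skipped; scale drops to 4.0
second = scaler.step([4.0])               # [1.0]
third = scaler.step([8.0])                # [2.0]; scale grows back to 8.0
```

Returning None for a skipped step lets the caller leave both the weights and the optimizer state untouched.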

Mixed Precision Graph Optimizer

TensorFlow Graphs

import tensorflow as tf  # TF 1.x graph mode

x = tf.placeholder(tf.float32, shape=(1024, 1024))
w = tf.get_variable('w', shape=(1024, 1024))
z = tf.add(x, tf.matmul(x, w))

[Figure: graph for the code above, with all nodes in FP32: Placeholder, VariableV2, Identity, MatMul, Add.]

Transformed Graphs: Basic idea

[Figure: the same graph after conversion. Placeholder, VariableV2, and Identity remain FP32; two Cast (FP32 to FP16) nodes are inserted; MatMul and Add run in FP16.]

Initial Training Graph: All operations performed in FP32

[Figure: training graph with Placeholder, VariableV2, Conv2d, Relu, MatMul, Add, Loss, LossGrad, ReluGrad, GradInput, GradFilter, Reciprocal, and Mul nodes.]

Graph Conversion: Choosing what to cast

Always Cast (whitelist): Ops highly accelerated by float16. These always justify the performance cost of casting inputs. Examples: MatMul and Conv2D.

Maybe Cast (graylist): Ops available for float16 execution but not accelerated sufficiently to justify casting overhead on their own. Examples: Add and Relu.

Never Cast (blacklist): Ops requiring float32 evaluation in order to maintain numerical stability. Examples: Exp and Sum.

Everything Else: Ops lacking float16 implementations or operating on non-floating-point inputs.

Graph Coloring Example, Step 1: Initialize Op Colors

[Figure: the training graph with each op colored by its initial category.]

Graph Coloring Example, Step 2: Propagate ‘Never’ Tags Forward

[Figure: the training graph after this step.]

Graph Coloring Example, Step 3: Paint ‘Maybe’ Ops Bounded by ‘Always’

[Figure: the training graph after this step.]

Graph Coloring Example, Step 4: Find boundaries of ‘always’ sections

[Figure: the training graph after this step.]

Graph Coloring Example, Step 5: Insert casts (with reuse)

[Figure: the training graph with four FP16 Cast nodes and three FP32 Cast nodes inserted at the boundaries of the FP16 sections.]
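The five steps can be sketched on a toy graph. This is a simplified illustration and not the Grappler optimizer's code: the function names, the toy graph, and the exact propagation rules (a 'never' tag spreads forward through 'maybe' ops; a 'maybe' op turns 'always' when 'always' ops lie both upstream and downstream of it through 'maybe' ops) are assumptions made for this example.

```python
from collections import deque

ALWAYS = {"MatMul", "Conv2D"}  # example ops from the "Always Cast" list
MAYBE = {"Add", "Relu"}        # example ops from the "Maybe Cast" list
NEVER = {"Exp", "Sum"}         # example ops from the "Never Cast" list


def color_graph(ops, edges):
    """ops: {node: op_type}; edges: [(src, dst)]. Returns {node: color}."""
    succ = {n: [] for n in ops}
    pred = {n: [] for n in ops}
    for s, d in edges:
        succ[s].append(d)
        pred[d].append(s)

    # Step 1: initialize colors from the op lists.
    color = {}
    for n, t in ops.items():
        color[n] = ("always" if t in ALWAYS else "maybe" if t in MAYBE
                    else "never" if t in NEVER else "other")

    # Step 2: propagate 'never' forward through 'maybe' ops.
    queue = deque(n for n in ops if color[n] == "never")
    while queue:
        for d in succ[queue.popleft()]:
            if color[d] == "maybe":
                color[d] = "never"
                queue.append(d)

    # Step 3: paint 'maybe' ops that are bounded by 'always' ops on both sides.
    def maybe_reachable(starts, neighbors):
        seen, work = set(), deque(starts)
        while work:
            for m in neighbors[work.popleft()]:
                if color[m] == "maybe" and m not in seen:
                    seen.add(m)
                    work.append(m)
        return seen

    always = [n for n in ops if color[n] == "always"]
    for n in maybe_reachable(always, succ) & maybe_reachable(always, pred):
        color[n] = "always"
    return color


def plan_casts(color, edges):
    """Steps 4 and 5: one cast per (producer, target dtype), reused by all consumers."""
    casts = set()
    for s, d in edges:
        src_fp16, dst_fp16 = color[s] == "always", color[d] == "always"
        if src_fp16 != dst_fp16:  # edge crosses an FP16 boundary
            casts.add((s, "float16" if dst_fp16 else "float32"))
    return sorted(casts)


ops = {"x": "Placeholder", "w": "VariableV2", "mm1": "MatMul", "relu": "Relu",
       "mm2": "MatMul", "exp": "Exp", "add": "Add"}
edges = [("x", "mm1"), ("w", "mm1"), ("mm1", "relu"), ("relu", "mm2"),
         ("w", "mm2"), ("mm2", "exp"), ("exp", "add"), ("x", "add")]
colors = color_graph(ops, edges)
casts = plan_casts(colors, edges)
```

In this toy graph, relu ends up 'always' because it sits between two MatMuls, add ends up 'never' because it consumes the output of Exp, and w needs only one float16 cast even though two MatMuls read it.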

Using Automatic Mixed Precision in TensorFlow

Enabling TF-AMP: TF 1.14+ & NGC 19.06+ TensorFlow Containers

Enable both loss scaling and mixed precision graph conversion in one line of code.

Designed to work with existing float32 models with minimal changes.

Supports Distribution Strategies

Supports either tf.train.Optimizer or tf.keras.optimizers.Optimizer

Environment variables continue to work in NGC releases

opt = tf.train.MomentumOptimizer(learning_rate, momentum)

opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)

update_op = opt.minimize(loss)

Enabling TF-AMP: TF 1.14+ & NGC 19.06+ TensorFlow Containers

Caveat: enable_mixed_precision_graph_rewrite() must be called before the TF session is created.

If that is not feasible, then the following can be added to the session config instead.

(The optimizer still needs to be wrapped as before.)

config = tf.ConfigProto()

config.graph_options.rewrite_options.auto_mixed_precision = 1

sess = tf.Session(config=config)

AMP Optimizer Logging

How do I verify AMP is working?

What EXACTLY did AMP do to my model?

Running auto_mixed_precision graph optimizer
No whitelist ops found, nothing to do
Running auto_mixed_precision graph optimizer
Converted 824/3507 nodes to float16 precision using 1 cast(s) to float16 (excluding Const and Variable casts)

# Save before/after snapshots of optimized graphs
export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_LOG_PATH="/my/log/path"

# Enable VERY verbose logging of all decisions made by AMP optimizer
export TF_CPP_VMODULE="auto_mixed_precision=2"
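One way to answer "is AMP working?" in a script is to scan the captured log for the summary line. The sketch below assumes the log text has already been captured into a string and that the line has the wording shown above; the function name parse_amp_summary is hypothetical.

```python
import re

# Matches the summary line printed by the auto_mixed_precision optimizer.
SUMMARY = re.compile(
    r"Converted (\d+)/(\d+) nodes to float16 precision using (\d+) cast\(s\)"
)


def parse_amp_summary(log_text):
    """Return conversion counts from an AMP log, or None if nothing was converted."""
    match = SUMMARY.search(log_text)
    if match is None:
        return None
    converted, total, casts = (int(g) for g in match.groups())
    return {
        "converted": converted,
        "total": total,
        "casts": casts,
        "fraction": converted / total,
    }


log_text = (
    "Running auto_mixed_precision graph optimizer\n"
    "Converted 824/3507 nodes to float16 precision using 1 cast(s) to float16 "
    "(excluding Const and Variable casts)\n"
)
summary = parse_amp_summary(log_text)  # 824 of 3507 nodes, about 23%
```

A None result, or a log containing only "No whitelist ops found, nothing to do", means no nodes were converted in that graph.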

Tweaking AMP

Specific lists of ops on the whitelist, graylist, and blacklist:

tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h

You can modify the lists at runtime as follows.

# Comma-separated list of removals/additions to the lists
export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_BLACKLIST_REMOVE=Sum
export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_GRAYLIST_ADD=Sum,YourCustomOp

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h

Customizing Loss Scaling

If loss_scale is 'dynamic', an initial loss scale of 2^15 is halved whenever nonfinite gradients are detected and doubled if 2000 iterations elapse without NaNs.

These parameters can be changed by passing a DynamicLossScale object.

If loss_scale is an int/float value, fixed-value loss scaling is used. (No finite check is performed.)

# API signature
def enable_mixed_precision_graph_rewrite(opt, loss_scale='dynamic'): ...

loss_scale = tf.train.experimental.DynamicLossScale(
    initial_loss_scale=2**10, increment_period=1000, multiplier=4.)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(
    opt, loss_scale=loss_scale)

Keras Mixed Precision Policies: Coming in TensorFlow 2.1

Enabled by calling set_policy() before constructing model.

Works with Eager execution

All model layers should inherit from tf.keras.layers.Layer

Data type changes are user visible

Data types can be explicitly controlled with tf.cast

policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')

tf.keras.mixed_precision.experimental.set_policy(policy)

Results

ResNet50 v1.5: Training on a single V100

tf.identity(learning_rate, name='learning_rate_ref')
tf.summary.scalar('learning_rate', learning_rate)

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=params["momentum"])

optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)

Training script resnet_v1_5.py from github.com/NVIDIA/DeepLearningExamples

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/Classification/RN50v1.5/model/resnet_v1_5.py

[Figure: results chart with speedup labels 2.1x, 2.8x, and 3.3x.]

Additional TensorFlow Results: V100 Training Speedups

https://devblogs.nvidia.com/nvidia-automatic-mixed-precision-tensorflow/

Mixed Precision is General Purpose: Sampling of models trained to match FP32 results as of May 2019

Image Classification: AlexNet, DenseNet, Inception, MobileNet, NASNet, ResNet, ResNeXt, VGG, XCeption

Detection / Segmentation: DeepLab, Faster R-CNN, Mask R-CNN, Multibox SSD, NVIDIA Automotive, RetinaNet, UNET

Recommendation: DeepRecommender, NCF

Generative Models (Images): DLSS, Partial Image Inpainting, Progress GAN, Pix2Pix

Speech: Deep Speech 2, Tacotron, WaveNet, WaveGlow

Language Modeling: BERT, BigLSTM, 8k mLSTM (NVIDIA)

Translation: FairSeq (convolution), GNMT (RNN), Transformer (self-attention)

What to do if AMP lowers accuracy: Make sure loss scaling is enabled

Make sure the model uses either optimizer.minimize() or both optimizer.compute_gradients() and optimizer.apply_gradients().

Specifically, loss scaling is not applied to models that call tf.gradients() directly.

Make sure the optimizer is from tf.keras.optimizers or tf.train.

Still having problems?
○ Let us know on the DevTalk Forum
○ or file a bug on developer.nvidia.com

https://devtalk.nvidia.com/default/board/364/mixed-precision-and-tensor-cores/

https://developer.nvidia.com

Why doesn’t AMP speed up my model? The model may be IO or CPU bound

Try replacing your model with a trivial one-layer network.

If the trivial model isn’t significantly faster than the real model, the IO and data pre-processing steps are likely limiting your performance.

If you are using image data, check out DALI to accelerate your input pipeline.

https://github.com/NVIDIA/dali

Why doesn’t AMP speed up my model? Tensor Core shape constraints

The following dimensions should be chosen or padded to multiples of 8:

○ Linear Layers: inputs, outputs, batch size
○ Convolution Layers: input/output channels
○ RNN Layers: hidden, embedding, batch, vocabulary
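Padding a dimension up to the next multiple of 8 is a one-line computation. The helper name pad_to_multiple and the example sizes are hypothetical.

```python
def pad_to_multiple(size, multiple=8):
    """Round size up to the nearest multiple (ceiling division, then scale back)."""
    return -(-size // multiple) * multiple


padded_vocab = pad_to_multiple(30522)  # 30528
padded_hidden = pad_to_multiple(1000)  # 1000, already a multiple of 8
padded_batch = pad_to_multiple(1001)   # 1008
```

For an embedding or output layer, the extra rows introduced by padding are simply never indexed by real data.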

To verify that Tensor Cores are being used … the hard way:

○ Run your model under Nsight Systems
○ and look for kernels with [i|s|h][some numbers] in the name. For example:

volta_h884gemm_...
Turing_fp16_s1688cudnn_fp16_…

Reference: Tensor Core Performance: The Ultimate Guide

https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9926-tensor+core+performance%3a+the+ultimate+guide

NVIDIA DLProf: a Deep Learning Profiler

Deep Learning Profiler

Intended for data scientists and DL researchers

Maps performance metrics back to the TensorFlow model

Adds GPU time and usage to TensorBoard

Reports Tensor Core usage

Helps locate opportunities to speed up models using reduced precision

Some operations miss Tensor Cores for reasons as simple as matrix dimensions that are not divisible by 8

©2018 VMware, Inc.

NVIDIA Profiling Tools for Deep Learning

Deep Learning Profiler (DLProf)

Nsight Systems, Nsight Compute

NVTX for TensorFlow

NVTX for PyTorch

NVTX for MXNet

* Nsight Systems and Nsight Compute are built on the CUDA Profiling Tools Interface (CUPTI). They rely on NVTX markers to focus on sections of code.

* NVTX (NVIDIA Tools Extension Library) is a way to annotate source code with markers.

* DLProf calls Nsight Systems to collect the profile data and correlate it with the graph.

DLProf User Workflow

Generate graph definition

Prefix training script with dlprof

Visualize with TensorBoard or text reports

Use NVIDIA Optimized TensorFlow Framework container

Docs: https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide

# Write the graph definition in binary and text form
graph_def = session.graph.as_graph_def()
with open('graphdef.pb', 'wb') as f:
    f.write(graph_def.SerializeToString())
with open('graphdef.pbtxt', 'w') as f:
    f.write(str(graph_def))

$ dlprof --in_graphdef=graphdef.pbtxt \
    python resnet.py --layers=50 --num_iter=100 --batch=128 \
    --iter_unit=batch --data_dir=/data/train-images \

TensorBoard and Reports: GPU Summary Tab

Provides model summary focused on Tensor Core usage

15 “TC eligible” TensorFlow ops did not use Tensor Cores

Those ops required a total of 33.6 ms to compute

Runtime of ALL ops in the TF graph took 2.62 sec

Further TC Optimization Opportunity < 3%
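The bound follows from the two numbers on this slide. A quick check, with variable names chosen for this example and the assumption that the remaining ops are unaffected:

```python
tc_eligible_unused_ms = 33.6     # time spent in TC-eligible ops that missed Tensor Cores
total_runtime_ms = 2.62 * 1000   # runtime of all ops in the TF graph

fraction = tc_eligible_unused_ms / total_runtime_ms
percent = 100 * fraction          # about 1.28%, inside the "< 3%" bound
max_speedup = 1 / (1 - fraction)  # about 1.013x even if those ops became free
```

Even in the best case, where the 15 ops cost nothing after moving to Tensor Cores, the whole graph would run only about 1.3% faster.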

TensorBoard and Reports

10 most important ops/kernels by execution time

Shows for each whether Tensor Cores were used

Iterations report: Kernels executed and time used correlated with op names

GPU Summary: Top 10

TensorBoard and Reports: Visualising the interesting node

Can use the TensorBoard search box to find nodes in the graph that are eligible for, but not using, Tensor Cores

Easily find input and output types, dimensions, etc.

User can identify potential performance improvements at specific points in their model

Overall conclusion: This model is mostly optimized to use mixed precision tensor cores

TensorBoard and Reports

[Figure: side-by-side TensorBoard views, "Running with AMP" and "Running in FP32".]

TensorBoard and Reports

DLProf can generate reports in csv and json formats if specified on the command line

Option: --reports=detail,iteration --file_formats=csv

DLProf Roadmap

○ Generate graphdef automatically

○ TensorBoard 1.15 support

○ Support for running profiler with XLA

○ Running with TensorFlow 2.0

○ Adding your own user-defined NVTX markers to generate profiles with DLProf

○ PyTorch support using same technology

○ Comparing consecutive runs in TensorBoard

○ More friendly recommendation steps for actions that improve performance (Expert Systems)

Note: Subject to change

Resources

NVIDIA NGC TensorFlow Containers: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow

DLProf user guide: https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide

Example Mixed Precision Models: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow

Mixed Precision DevTalk Forums: https://devtalk.nvidia.com/default/board/364/mixed-precision-and-tensor-cores

Mixed Precision Guide: https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training
