Nathan Luehr, Sr. Developer Technology Engineer, NVIDIA Nov 5, 2019
Automatic Mixed-Precision Training in TensorFlow
2
Outline
Mixed Precision Background
Automatic Loss Scaling
Mixed Precision Graph Optimizer
Using Automatic Mixed Precision in TensorFlow
Mixed Precision Results
Deep Learning Profiler
Questions
3
Mixed Precision Background
4
Motivation
Reduced precision (16-bit floating point) for speed or scale
Full precision (32-bit floating point) to maintain task-specific accuracy
By using multiple precisions, we avoid a pure tradeoff of speed and accuracy
Goals:
Maximize use of reduced precision while matching accuracy of full precision training with no changes to hyperparameters
Provide automated and general purpose tools so that mixed precision can be easily enabled
5
Automatic Mixed Precision: In a nutshell
Add one line to your training script.
Achieve a 1.5 to 3x speedup compared to FP32 training
with no loss in accuracy
Examples: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/
opt = tf.train.MomentumOptimizer(learning_rate, momentum)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
update_op = opt.minimize(loss)
6
Tensor Cores: Hardware support for accelerated 16-bit FP math
Peak throughput of 125 TFLOPS (8x FP32) on V100
Inherently mixed precision: internal accumulation occurs in FP32 for accuracy
Used by cuDNN and cuBLAS libraries to accelerate matrix multiply and convolution
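To illustrate why internal FP32 accumulation matters, here is a minimal sketch in plain Python (not Tensor Core code). It assumes the standard library's `struct` formats `"e"` and `"f"` as stand-ins for FP16 and FP32 rounding; the helper names `to_fp16`, `to_fp32`, and `accumulate` are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, round_fn):
    """Sum values, rounding the running total after every addition."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total


# 4096 products that are each exactly 1.0 in FP16.
products = [to_fp16(1.0)] * 4096

fp16_total = accumulate(products, to_fp16)  # accumulator kept in FP16
fp32_total = accumulate(products, to_fp32)  # accumulator kept in FP32

print(fp16_total, fp32_total)
```

With an FP16 accumulator the sum stalls at 2048, because above that value FP16 can only represent even integers and 2048 + 1 rounds back to 2048. The FP32 accumulator reaches 4096.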
7
Mixed Precision Strategy: For training
► Use Tensor Cores to accelerate convolutions and matrix multiplications
► Store most activations in FP16
○ Enables larger models and/or larger batch sizes
○ Doubles effective bandwidth compared to FP32
► Use FP32 for ops that are likely to overflow (e.g., sums, reductions, log, exp)
► Use loss scaling to maintain gradients in the FP16 representable range
► Update model parameters in FP32 to avoid truncation
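The last point can be sketched in plain Python. This is an illustration under assumptions, not TensorFlow code: `struct` formats `"e"` and `"f"` stand in for FP16 and FP32 storage, and the weight value, update size, and helper names are made up for the example.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


update = 1e-4            # hypothetical learning_rate * gradient
w_fp16 = to_fp16(1.0)    # weight stored and updated in FP16
w_fp32 = to_fp32(1.0)    # FP32 copy of the same weight

for _ in range(100):
    # Near 1.0 the FP16 spacing is about 0.001, so each small update is
    # rounded away when the result is stored back in FP16.
    w_fp16 = to_fp16(w_fp16 + update)
    w_fp32 = to_fp32(w_fp32 + update)

print(w_fp16, w_fp32)
```

After 100 updates the FP16 weight is still exactly 1.0, while the FP32 weight has moved to roughly 1.01.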
8
Automatic Loss Scaling
9
Loss Scaling: The idea
Range of numbers representable in FP16 is ~40 powers of 2
Gradients are small:
○ Some underflow to zero
○ While ~15 powers of 2 are unused
Scaling the loss, L(x), by s uniformly increases gradient values by a factor of s
Unscale weight gradients (in FP32) for weight update
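A small numeric sketch of the idea, assuming `struct` format `"e"` as a stand-in for FP16 storage. The gradient value 1e-8 and the scale 2**15 are chosen for illustration only, and the scale is applied to the gradient directly, whereas in training it reaches the gradients through the scaled loss.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


grad = 1e-8          # hypothetical small gradient value
scale = 2.0 ** 15    # hypothetical loss scale s

# Without scaling, the gradient is below the smallest FP16 subnormal
# (about 6e-8) and is flushed to zero.
unscaled_fp16 = to_fp16(grad)

# With scaling, the gradient lands inside the FP16 representable range.
scaled_fp16 = to_fp16(grad * scale)

# Unscaling happens in higher precision, so the small value survives.
recovered = scaled_fp16 / scale

print(unscaled_fp16, scaled_fp16, recovered)
```

The unscaled gradient becomes 0.0 in FP16, while the scaled and then unscaled value stays within FP16 rounding error of the original.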
10
Automatic Loss Scaling: Tuning s on the fly
Start with a very large scale factor
If any gradient value results in Inf or NaN, decrease s AND skip the weight update (including any optimizer state)
If gradients have remained finite for some number of batches, increase s.
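The three rules above can be sketched as a small class. This is a simplified illustration, not TensorFlow's implementation: the class name `DynamicLossScaler` and its method are hypothetical, and the default values follow the dynamic defaults quoted on the Customizing Loss Scaling slide.

```python
import math


class DynamicLossScaler:
    """Toy loss scaler: shrink on Inf/NaN, grow after a run of finite steps."""

    def __init__(self, initial_scale=2.0 ** 15, increment_period=2000,
                 multiplier=2.0):
        self.scale = initial_scale
        self.increment_period = increment_period
        self.multiplier = multiplier
        self.finite_steps = 0

    def update(self, scaled_grads):
        """Return unscaled gradients, or None if the step must be skipped."""
        if not all(math.isfinite(g) for g in scaled_grads):
            # Overflow detected: decrease s and skip the weight update.
            self.scale /= self.multiplier
            self.finite_steps = 0
            return None
        # Unscale with the scale that was used for this step.
        unscaled = [g / self.scale for g in scaled_grads]
        self.finite_steps += 1
        if self.finite_steps >= self.increment_period:
            # Gradients stayed finite long enough: increase s.
            self.scale *= self.multiplier
            self.finite_steps = 0
        return unscaled


scaler = DynamicLossScaler(initial_scale=8.0, increment_period=2)
skipped = scaler.update([float("inf"), 1.0])   # overflow: scale 8 -> 4
first = scaler.update([4.0])                   # finite: unscaled with 4
second = scaler.update([8.0])                  # finite: then scale 4 -> 8
```

A caller would apply the optimizer step only when `update` returns a list, which mirrors the "skip the weight update" rule.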
11
Mixed Precision Graph Optimizer
12
TensorFlow Graphs
x = tf.placeholder(tf.float32, shape=(1024, 1024))
w = tf.get_variable('w', shape=(1024, 1024))
z = tf.add(x, tf.matmul(x, w))
[Figure: graph for the code above, with every node in FP32: VariableV2, Identity, Placeholder, MatMul, Add]
13
Transformed Graphs: Basic idea
[Figure: transformed graph. VariableV2, Placeholder, and Identity stay in FP32; two FP32-to-FP16 Cast nodes are added; MatMul and Add run in FP16]
14
Initial Training Graph: All operations performed in FP32
[Figure: training graph with nodes VariableV2, Placeholder, Conv2d, Relu, MatMul, Add, Loss, LossGrad, ReluGrad, GradInput, GradFilter, Reciprocal, Mul]
15
Graph Conversion: Choosing what to cast
Always Cast (whitelist):
Ops highly accelerated by float16. These always justify the performance cost of casting inputs. Examples: MatMul and Conv2d.
Maybe Cast (graylist):
Ops available for float16 execution but not accelerated sufficiently to justify casting overhead on their own. Examples: Add and Relu.
Never Cast (blacklist):
Ops requiring float32 evaluation in order to maintain numerical stability. Examples: Exp and Sum.
Everything Else:
Ops lacking float16 implementations or operating on non-floating point inputs.
16
Graph Coloring Example: Step 1: Initialize Op Colors
[Figure: the training graph (VariableV2, Placeholder, Conv2d, Relu, MatMul, Add, Loss, LossGrad, ReluGrad, GradInput, GradFilter, Reciprocal, Mul) at this step]
17
Graph Coloring Example: Step 2: Propagate ‘Never’ Tags Forward
[Figure: the training graph (VariableV2, Placeholder, Conv2d, Relu, MatMul, Add, Loss, LossGrad, ReluGrad, GradInput, GradFilter, Reciprocal, Mul) at this step]
18
Graph Coloring Example: Step 3: Paint ‘Maybe’ Ops Bounded by ‘Always’
[Figure: the training graph (VariableV2, Placeholder, Conv2d, Relu, MatMul, Add, Loss, LossGrad, ReluGrad, GradInput, GradFilter, Reciprocal, Mul) at this step]
19
Graph Coloring Example: Step 4: Find boundaries of ‘always’ sections
[Figure: the training graph (VariableV2, Placeholder, Conv2d, Relu, MatMul, Add, Loss, LossGrad, ReluGrad, GradInput, GradFilter, Reciprocal, Mul) at this step]
20
Graph Coloring Example: Step 5: Insert casts (with reuse)
[Figure: the training graph (VariableV2, Placeholder, Conv2d, Relu, MatMul, Add, Loss, LossGrad, ReluGrad, BackInput, BackFilter, Reciprocal, Mul) with four FP16 Cast nodes and three FP32 Cast nodes inserted]
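The five steps can be sketched on a toy graph. This is a simplified illustration of the idea on these slides, not the actual TensorFlow graph optimizer: the function names, the dictionary-based graph format, and the tiny example graph are all assumptions, and the op lists contain only the examples named in the deck.

```python
WHITELIST = {"MatMul", "Conv2d"}   # "always cast" examples
GRAYLIST = {"Add", "Relu"}         # "maybe cast" examples
BLACKLIST = {"Exp", "Sum"}         # "never cast" examples


def color_ops(graph):
    """graph maps node name -> (op type, input names), in topological order."""
    # Step 1: initialize op colors from the lists.
    color = {}
    for name, (op, _) in graph.items():
        if op in WHITELIST:
            color[name] = "always"
        elif op in BLACKLIST:
            color[name] = "never"
        elif op in GRAYLIST:
            color[name] = "maybe"
        else:
            color[name] = "other"

    # Step 2: propagate "never" forward through "maybe" ops.
    for name, (_, inputs) in graph.items():
        if color[name] == "maybe" and any(color[i] == "never" for i in inputs):
            color[name] = "never"

    # Step 3: paint "maybe" ops that sit between "always" ops.
    consumers = {name: [] for name in graph}
    for name, (_, inputs) in graph.items():
        for i in inputs:
            consumers[i].append(name)
    below, above = set(), set()
    for name, (_, inputs) in graph.items():
        if color[name] == "maybe" and any(
                color[i] == "always" or i in below for i in inputs):
            below.add(name)
    for name in reversed(list(graph)):
        if color[name] == "maybe" and any(
                color[c] == "always" or c in above for c in consumers[name]):
            above.add(name)
    for name in below & above:
        color[name] = "always"
    return color


def plan_casts(graph, color):
    """Steps 4-5: one cast per (source node, target dtype) at each boundary."""
    casts = set()
    for name, (_, inputs) in graph.items():
        dst_fp16 = color[name] == "always"
        for i in inputs:
            if (color[i] == "always") != dst_fp16:
                casts.add((i, "float16" if dst_fp16 else "float32"))
    return casts


toy_graph = {
    "x": ("Placeholder", []),
    "w1": ("VariableV2", []),
    "mm1": ("MatMul", ["x", "w1"]),
    "relu": ("Relu", ["mm1"]),
    "w2": ("VariableV2", []),
    "mm2": ("MatMul", ["relu", "w2"]),
    "exp": ("Exp", ["mm2"]),
    "add": ("Add", ["exp", "mm2"]),
}
colors = color_ops(toy_graph)
casts = plan_casts(toy_graph, colors)
```

In this toy graph, `relu` is painted "always" because it sits between two MatMul ops, `add` becomes "never" because it consumes the output of `exp`, and the FP32 cast of `mm2` is shared by both of its FP32 consumers.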
21
Using Automatic Mixed Precision in TensorFlow
22
Enabling TF-AMP: TF 1.14+ & NGC 19.06+ TensorFlow Containers
Enable both loss scaling and mixed precision graph conversion in one line of code.
Designed to work with existing float32 models with minimal changes.
Supports Distribution Strategies
Supports either tf.train.Optimizer or tf.keras.optimizers.Optimizer
Environment variables continue to work in NGC releases
opt = tf.train.MomentumOptimizer(learning_rate, momentum)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
update_op = opt.minimize(loss)
23
Enabling TF-AMP: TF 1.14+ & NGC 19.06+ TensorFlow Containers
Caveat: enable_mixed_precision_graph_rewrite() must be called before the TF session is created.
If that is not feasible, then the following can be added to the session config instead.
(The optimizer still needs to be wrapped as before.)
config = tf.ConfigProto()
config.graph_options.rewrite_options.auto_mixed_precision = 1
sess = tf.Session(config=config)
24
AMP Optimizer Logging
How do I verify AMP is working?
What EXACTLY did AMP do to my model?
Running auto_mixed_precision graph optimizer
No whitelist ops found, nothing to do
Running auto_mixed_precision graph optimizer
Converted 824/3507 nodes to float16 precision using 1 cast(s) to float16 (excluding Const and Variable casts)
# Save before/after snapshots of optimized graphs
export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_LOG_PATH="/my/log/path"
# Enable VERY verbose logging of all decisions made by AMP optimizer
export TF_CPP_VMODULE="auto_mixed_precision=2"
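To check the conversion summary programmatically, the log line can be parsed with a regular expression. A minimal sketch, assuming the message format matches the sample on this slide exactly (it may differ between TensorFlow versions); the function name `parse_amp_summary` is hypothetical.

```python
import re

# Pattern based on the sample log line shown on this slide.
AMP_SUMMARY = re.compile(
    r"Converted (\d+)/(\d+) nodes to float16 precision "
    r"using (\d+) cast\(s\) to float16"
)


def parse_amp_summary(log_text):
    """Return conversion counts from an AMP log, or None if nothing converted."""
    match = AMP_SUMMARY.search(log_text)
    if match is None:
        return None
    converted, total, casts = (int(g) for g in match.groups())
    return {
        "converted": converted,
        "total": total,
        "casts": casts,
        "fraction": converted / total,
    }


sample = (
    "Running auto_mixed_precision graph optimizer\n"
    "Converted 824/3507 nodes to float16 precision using 1 cast(s) "
    "to float16 (excluding Const and Variable casts)\n"
)
summary = parse_amp_summary(sample)
nothing = parse_amp_summary("No whitelist ops found, nothing to do")
```

For the sample line, about 23% of the graph nodes were converted to float16.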
25
Tweaking AMP
Specific lists of ops on the whitelist, graylist, and blacklist:
tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h
You can modify the lists at runtime as follows.
# Comma-separated lists of ops to remove from or add to the lists
export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_BLACKLIST_REMOVE=Sum
export TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_GRAYLIST_ADD=Sum,YourCustomOp
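The same overrides can be set from inside a Python training script. This sketch only writes the two environment variables named on this slide; the helper name `set_amp_list_overrides` is hypothetical, and it assumes the variables must be set before the graph rewrite runs (for example, before the session is created).

```python
import os

PREFIX = "TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_"


def set_amp_list_overrides(blacklist_remove=(), graylist_add=()):
    """Export comma-separated op lists for the AMP graph rewrite."""
    if blacklist_remove:
        os.environ[PREFIX + "BLACKLIST_REMOVE"] = ",".join(blacklist_remove)
    if graylist_add:
        os.environ[PREFIX + "GRAYLIST_ADD"] = ",".join(graylist_add)


# Move Sum from the blacklist to the graylist and add a custom op.
set_amp_list_overrides(
    blacklist_remove=["Sum"],
    graylist_add=["Sum", "YourCustomOp"],
)
```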
26
Customizing Loss Scaling
If loss_scale is 'dynamic', an initial loss scale of 2^15 is halved whenever nonfinite
gradients are detected and doubled if 2000 iterations elapse without NaNs.
These parameters can be changed by passing a DynamicLossScale object.
If loss_scale is an int/float value, fixed-value loss scaling is used. (No finite check is
performed.)
# Signature:
def enable_mixed_precision_graph_rewrite(opt, loss_scale='dynamic'):
# Example with custom dynamic loss scaling parameters:
loss_scale = tf.train.experimental.DynamicLossScale(initial_loss_scale=2**10, increment_period=1000, multiplier=4.)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt, loss_scale=loss_scale)
27
Keras Mixed Precision Policies: Coming in TensorFlow 2.1
Enabled by calling set_policy() before constructing model.
Works with Eager execution
All model layers should inherit from tf.keras.layers.Layer
Data type changes are user visible
Data types can be explicitly controlled with tf.cast
policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
tf.keras.mixed_precision.experimental.set_policy(policy)
28
Results
29
ResNet50 v1.5: Training on a single V100
tf.identity(learning_rate, name='learning_rate_ref')
tf.summary.scalar('learning_rate', learning_rate)

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=params["momentum"])

optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)
Training script resnet_v1_5.py from github.com/NVIDIA/DeepLearningExamples
30
ResNet50 v1.5: Training on a single V100
[Figure: training speedup chart with bars labeled 2.1x, 2.8x, 3.3x]
33
Additional TensorFlow Results: V100 Training Speedups
https://devblogs.nvidia.com/nvidia-automatic-mixed-precision-tensorflow/
34
Mixed Precision is General Purpose: Sampling of models trained to match FP32 results as of May 2019
Image Classification: AlexNet, DenseNet, Inception, MobileNet, NASNet, ResNet, ResNeXt, VGG, XCeption
Detection / Segmentation: DeepLab, Faster R-CNN, Mask R-CNN, Multibox SSD, NVIDIA Automotive, RetinaNet, UNET
Recommendation: DeepRecommender, NCF
Generative Models (Images): DLSS, Partial Image Inpainting, Progress GAN, Pix2Pix
Speech: Deep Speech 2, Tacotron, WaveNet, WaveGlow
Language Modeling: BERT, BigLSTM, 8k mLSTM (NVIDIA)
Translation: FairSeq (convolution), GNMT (RNN), Transformer (self-attention)
35
What to do if AMP lowers accuracy: Make sure loss scaling is enabled
Make sure the model uses either optimizer.minimize() or both of optimizer.compute_gradients() and optimizer.apply_gradients()
Specifically, models calling tf.gradients() directly will not enable loss scaling.
Make sure the optimizer is from tf.keras.optimizers or tf.train.
Still having problems?
○ Let us know on the DevTalk Forum
○ or file a bug on developer.nvidia.com
36
Why doesn’t AMP speed up my model? The model may be IO or CPU bound
Try replacing your model with a trivial one-layer network.
If the trivial model isn’t significantly faster than the real model,
the IO and data pre-processing steps are likely limiting your performance.
If you are using image data, check out DALI to accelerate your input pipeline.
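The comparison described above can be sketched as a timing helper. This is a framework-agnostic illustration: `time_steps`, `input_bound_ratio`, and the step callables are hypothetical names, and a real measurement would pass the actual input pipeline and training step.

```python
import time


def time_steps(batches, step_fn):
    """Average wall-clock seconds per batch for step_fn over batches."""
    start = time.perf_counter()
    count = 0
    for batch in batches:
        step_fn(batch)
        count += 1
    return (time.perf_counter() - start) / max(count, 1)


def input_bound_ratio(make_batches, real_step, trivial_step):
    """Ratio of trivial-model step time to real-model step time.

    A ratio close to 1 suggests the input pipeline, not the model,
    limits performance.
    """
    trivial = time_steps(make_batches(), trivial_step)
    real = time_steps(make_batches(), real_step)
    return trivial / real if real > 0 else float("nan")


# Stand-ins: the "real" step sleeps to mimic GPU work, the trivial one does not.
ratio = input_bound_ratio(
    make_batches=lambda: range(3),
    real_step=lambda batch: time.sleep(0.01),
    trivial_step=lambda batch: None,
)
```

Here the stand-in input pipeline is nearly free, so the ratio is far below 1 and the "model" is the bottleneck.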
37
Why doesn’t AMP speed up my model? Tensor Cores shape constraints
The following dimensions should be chosen or padded to multiples of 8
○ Linear Layers: inputs, outputs, batch size
○ Convolution Layers: input/output channels
○ RNN Layers: hidden, embedding, batch, vocabulary
To verify that Tensor Cores are being used … the hard way
○ Run your model under Nsight Systems
○ and look for kernels with [i|s|h][some numbers] in the name. For example:
volta_h884gemm_...
Turing_fp16_s1688cudnn_fp16_…
Reference: Tensor Core Performance: The Ultimate Guide
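Both checks can be sketched with two small helpers. The names `pad_to_multiple` and `looks_like_tensor_core_kernel` are hypothetical, and the regular expression is only a heuristic reading of the "[i|s|h][some numbers]" hint above, not an official naming rule.

```python
import re


def pad_to_multiple(size, multiple=8):
    """Round size up to the next multiple (8 for Tensor Core shapes)."""
    return ((size + multiple - 1) // multiple) * multiple


# Heuristic: an i, s, or h followed by at least three digits at the start
# of an underscore-separated name component.
TENSOR_CORE_HINT = re.compile(r"(?:^|_)[ish]\d{3,}")


def looks_like_tensor_core_kernel(kernel_name):
    """Guess from a kernel name whether it is a Tensor Core kernel."""
    return TENSOR_CORE_HINT.search(kernel_name.lower()) is not None


# Example: pad a vocabulary size that is not a multiple of 8.
padded_vocab = pad_to_multiple(33278)

names = [
    "volta_h884gemm_...",             # from the slide
    "Turing_fp16_s1688cudnn_fp16_…",  # from the slide
    "volta_sgemm_128x64_nn",          # illustrative non-Tensor-Core name
]
flags = [looks_like_tensor_core_kernel(n) for n in names]
```

The vocabulary of 33278 is padded to 33280, and only the first two kernel names match the heuristic.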
38
NVIDIA DLProf: a Deep Learning Profiler
39
Deep Learning Profiler
Intended for data scientists and DL researchers
Maps performance metrics back to the TensorFlow model
Adds GPU time and usage to TensorBoard
Reports Tensor Core usage
Helps locate opportunities to speed up models using reduced precision
Some operations may fail to use Tensor Cores for a reason as simple as matrix dimensions that are not divisible by 8
40 ©2018 VMware, Inc.
NVIDIA Profiling Tools for Deep Learning
Deep Learning Profiler (DLProf)
Nsight Systems, Nsight Compute
NVTX for TensorFlow
NVTX for PyTorch
NVTX for MXNet
* Nsight Systems and Nsight Compute are built on the CUDA Profiling Tools Interface (CUPTI). They rely on NVTX markers to focus on sections of code.
* NVTX (NVIDIA Tools Extension Library) is a way to annotate source code with markers.
* DLProf calls Nsight Systems to collect the profile data and correlate it with the graph.
41
DLProf User Workflow
Generate graph definition
Prefix training script with dlprof
Visualize with TensorBoard or text reports
Use NVIDIA Optimized TensorFlow Framework container
Docs: https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide
graph_def = session.graph.as_graph_def()
with open('graphdef.pb', 'wb') as f:
f.write(graph_def.SerializeToString())
with open('graphdef.pbtxt', 'w') as f:
f.write(str(graph_def))
$ dlprof --in_graphdef=graphdef.pbtxt \
python resnet.py --layers=50 --num_iter=100 --batch=128 \
--iter_unit=batch --data_dir=/data/train-images \
42
TensorBoard and Reports: GPU Summary Tab
Provides model summary focused on TensorCore usage
15 “TC eligible” TensorFlow ops did not use Tensor Cores
Those ops required a total of 33.6 ms to compute
Runtime of ALL ops in the TF graph took 2.62 sec
Further TC Optimization Opportunity < 3%
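The bound on the remaining opportunity follows from the two timings above. A short worked calculation, using only the numbers on this slide (the variable names are illustrative):

```python
# Time spent in "TC eligible" ops that did not use Tensor Cores.
non_tc_eligible_ms = 33.6
# Total runtime of all ops in the TF graph.
total_runtime_s = 2.62

# Share of total runtime that could still move to Tensor Cores.
fraction = (non_tc_eligible_ms / 1000.0) / total_runtime_s
print(f"{fraction:.2%}")  # prints 1.28%
```

Even if those 15 ops became free, total runtime would drop by only about 1.3%, which is within the slide's "< 3%" bound.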
43
TensorBoard and Reports
10 most important ops/kernels by execution time
Shows for each whether Tensor Cores were used
Iterations report: Kernels executed and time used correlated with op names
GPU Summary: Top 10
44
TensorBoard and Reports: Visualising the interesting node
Can use TB search box to find nodes eligible but not using Tensor cores in the graph
Easily find input and output types, dimensions, etc.
User can identify potential performance improvements at specific points in their model
Overall conclusion: This model is mostly optimized to use mixed precision tensor cores
45
TensorBoard and Reports
[Figure: side-by-side comparison. Left: Running with AMP. Right: Running in FP32]
46
TensorBoard and Reports
DLProf can generate reports in CSV and JSON formats if specified on the command line
Option: --reports=detail,iteration --file_formats=csv
47
DLProf Roadmap
○ Generate graphdef automatically
○ TensorBoard 1.15 support
○ Support for running profiler with XLA
○ Running with TensorFlow 2.0
○ Adding your own user defined NVTX markers to generate profiles with DLProf
○ PyTorch support using same technology
○ Comparing consecutive runs in TensorBoard
○ More friendly recommendation steps for actions that improve performance (Expert Systems)
Note: Subject to change
48
Resources
NVIDIA NGC TensorFlow Containers: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow
DLProf user guide: https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide
Example Mixed Precision Models: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow
Mixed Precision DevTalk Forums: https://devtalk.nvidia.com/default/board/364/mixed-precision-and-tensor-cores
Mixed Precision Guide: https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training