Taking Advantage of Low Precision to Accelerate Training and Inference Using PyTorch
Presented by: Myle Ott and Sergey Edunov, Facebook AI Research (FAIR)
Talk ID: S9832

Transcript of "Taking Advantage of Low Precision to Accelerate Training and Inference Using PyTorch"

Page 1

Taking Advantage of Low Precision to Accelerate Training and Inference Using PyTorch
Presented by: Myle Ott and Sergey Edunov, Facebook AI Research (FAIR)

Talk ID: S9832

Page 2

Overview

Mixed precision training in PyTorch: • 3-4x speedups in training wall time • Reduced memory usage ==> bigger batch sizes • No architecture changes required

Case study: Neural Machine Translation • Train models in 30 minutes instead of 1 day+ • Semi-supervised training over much larger datasets

Page 3

What are Tensor Cores?

• Optimized hardware units for mixed precision matrix-multiply-and-accumulate: D = A * B + C

Slide credit: Nvidia
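As a small illustration of the matrix-multiply-and-accumulate above (our own sketch, not from the talk; it assumes a CUDA GPU with Tensor Cores, i.e., Volta or newer), an FP16 torch.addmm whose dimensions are multiples of 8 is eligible to run on Tensor Cores:

    import torch

    # D = A * B + C in FP16; all dimensions are multiples of 8
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    c = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    d = torch.addmm(c, a, b)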

Page 4

Slide credit: Nvidia

Page 5

If only it were this easy…

model.half()

Page 6

Why not pure FP16?

FP16 has insufficient range/precision for some ops

Better to leave some ops in FP32: • Large reductions, e.g., norms, softmax, etc. • Pointwise ops where |f(x)| >> |x|, e.g., exp, pow, log, etc.

Page 7

Why not pure FP16?

In practice, pure FP16 hurts optimization.

According to Nvidia: • Sum of FP16 values whose ratio is > 2^11 is just the larger value

• Weight update: if w >> lr*dw then update doesn’t change w
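Both effects are easy to reproduce (a small illustration of our own, not from the slides; FP16 carries roughly 11 bits of precision):

    import torch

    # Adding a value more than ~2^11 times smaller than the other operand is a no-op
    big = torch.tensor(4096.0, dtype=torch.float16)
    small = torch.tensor(1.0, dtype=torch.float16)
    print(big + small)   # tensor(4096., dtype=torch.float16)

    # Same effect on a weight update: if w >> lr*dw, the update does not change w
    w = torch.tensor(1.0, dtype=torch.float16)
    print(w + 1e-4 * w)  # tensor(1., dtype=torch.float16)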


Page 8

Why not pure FP16?

Solution: mixed precision training

Optimize in FP32 and use FP16 for almost* everything else

* Some operations should still happen in FP32: • Large reductions, e.g., norms, softmax, etc. • Pointwise ops where |f(x)| >> |x|, e.g., exp, pow, log, etc.


Page 9

Optimizing in FP32

[Diagram: FP16 Weights → Forward Pass → FP16 Loss → Backprop → FP16 Gradients]

Page 10

Optimizing in FP32

[Diagram: FP16 Weights → Forward Pass → FP16 Loss → Backprop → FP16 Gradients → Copy → FP32 Master Gradients]

Page 11

Optimizing in FP32

[Diagram: FP16 Weights → Forward Pass → FP16 Loss → Backprop → FP16 Gradients → Copy → FP32 Master Gradients → Apply → FP32 Master Weights]

Page 12

Optimizing in FP32

[Diagram: FP16 Weights → Forward Pass → FP16 Loss → Backprop → FP16 Gradients → Copy → FP32 Master Gradients → Apply → FP32 Master Weights → Copy back to FP16 Weights]

Page 13

Optimizing in FP32

[Diagram: FP16 Weights → Forward Pass → FP16 Loss → Backprop → FP16 Gradients → Copy → FP32 Master Gradients → Apply → FP32 Master Weights → Copy back to FP16 Weights]

This adds overhead! It's only worth it because of the Tensor Cores. Don't use mixed precision without Tensor Cores!
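In PyTorch code, the recipe in the diagram above looks roughly like the following (a minimal sketch of our own, not the talk's code; it assumes a CUDA GPU and uses a toy Linear model and SGD for brevity):

    import torch

    model = torch.nn.Linear(1024, 1024).cuda()
    # Keep an FP32 master copy of the weights, then switch the model to FP16
    master_params = [p.detach().clone().float() for p in model.parameters()]
    model.half()
    optimizer = torch.optim.SGD(master_params, lr=0.1)   # optimize in FP32

    def train_step(x, y):
        loss = torch.nn.functional.mse_loss(model(x), y) # FP16 forward, FP16 loss
        model.zero_grad()
        loss.backward()                                  # FP16 gradients
        for master, p in zip(master_params, model.parameters()):
            master.grad = p.grad.detach().float()        # copy: FP16 grads -> FP32 master grads
        optimizer.step()                                 # apply: update FP32 master weights
        for master, p in zip(master_params, model.parameters()):
            p.data.copy_(master)                         # copy: FP32 master -> FP16 weights

    x = torch.randn(32, 1024, device="cuda", dtype=torch.float16)
    y = torch.randn(32, 1024, device="cuda", dtype=torch.float16)
    train_step(x, y)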

Page 14

Gradient underflow

• FP16 has a smaller representable range than FP32 (shown in blue in the slide's figure)

• In practice gradients are quite small, so there's a risk of underflow

Page 15

Gradient underflow

[Diagram: gradient values on the FP16 number line; gradients that underflow cannot be detected]

But if we scale the loss up…

If we scale the loss up by K, then by the chain rule of derivatives, the gradients will be K times bigger.

Page 16

Gradient overflow

[Diagram: scaled gradients that exceed the FP16 range overflow to Inf; if overflow is detected, scale the loss down]

Page 17

Avoiding under/overflow by loss scaling

[Diagram: FP16 Weights → Forward Pass → FP16 Loss → Loss Scaling → Scaled FP16 Loss → Backprop → Scaled FP16 Gradients]

Page 18

Avoiding under/overflow by loss scaling

[Diagram: FP16 Weights → Forward Pass → FP16 Loss → Loss Scaling → Scaled FP16 Loss → Backprop → Scaled FP16 Gradients]

If gradients overflow (inf), throw away the batch

Page 19

Avoiding under/overflow by loss scaling

[Diagram: FP16 Weights → Forward Pass → FP16 Loss → Loss Scaling → Scaled FP16 Loss → Backprop → Scaled FP16 Gradients → Copy → Scaled FP32 Gradients]

If gradients overflow (inf), throw away the batch

Page 20

Avoiding under/overflow by loss scaling

[Diagram: FP16 Weights → Forward Pass → FP16 Loss → Loss Scaling → Scaled FP16 Loss → Backprop → Scaled FP16 Gradients → Copy → Scaled FP32 Gradients → Remove scale → FP32 Gradients]

If gradients overflow (inf), throw away the batch

Page 21

Avoiding under/overflow by loss scaling

[Diagram: FP16 Weights → Forward Pass → FP16 Loss → Loss Scaling → Scaled FP16 Loss → Backprop → Scaled FP16 Gradients → Copy → Scaled FP32 Gradients → Remove scale → FP32 Gradients → Apply → FP32 Master Weights → Copy back to FP16 Weights]

If gradients overflow (inf), throw away the batch
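Extending the earlier master-weights sketch with loss scaling (again our own illustration, not the talk's code; K = 128 is an arbitrary static scale, and model, master_params, and optimizer are as defined in the earlier sketch):

    import torch

    K = 128.0   # static loss scale (dynamic adjustment comes next)

    def scaled_train_step(x, y):
        loss = torch.nn.functional.mse_loss(model(x), y)                # FP16 loss
        model.zero_grad()
        (loss * K).backward()                                           # scaled FP16 gradients
        grads = [p.grad.detach().float() for p in model.parameters()]   # copy to FP32
        if any(not torch.isfinite(g).all() for g in grads):
            return                                                      # overflow (inf): throw away the batch
        for master, g in zip(master_params, grads):
            master.grad = g.div_(K)                                     # remove the scale
        optimizer.step()                                                # apply to FP32 master weights
        for master, p in zip(master_params, model.parameters()):
            p.data.copy_(master)                                        # copy back to FP16 weights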

Page 22

How to pick the scaling constant (K)

• Too small and gradients will underflow

• Too big and we'll waste compute due to overflow

• In practice the optimal scaling constant changes during training

• We can adjust it dynamically!

Page 23

Dynamic loss scaling

• Every time the gradient overflows (inf), reduce the scaling constant by a factor of 2

• If the gradients haven’t overflowed in the last N updates (~1000), then increase the scaling constant by a factor of 2
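A dynamic loss scaler following exactly this rule fits in a few lines (a hypothetical sketch; the class and attribute names are ours, not apex's):

    class DynamicLossScaler:
        def __init__(self, init_scale=2.0**15, factor=2.0, window=1000):
            self.scale = init_scale
            self.factor = factor
            self.window = window               # N updates without overflow
            self._good_steps = 0

        def update(self, overflow):
            if overflow:
                self.scale /= self.factor      # gradient overflowed: halve the scale
                self._good_steps = 0
            else:
                self._good_steps += 1
                if self._good_steps % self.window == 0:
                    self.scale *= self.factor  # N clean updates in a row: double the scale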


Page 24

Dynamic loss scaling

Page 25

So far…

Tensor Cores make FP16 ops 4-9x faster

Mixed precision training: • Forward/backward in FP16

• Optimize in FP32

• Requires maintaining two copies of the model weights

• Dynamically scale the loss to avoid gradient under/overflow


Page 26

One more thing about FP16…

For maximal safety, perform ops that sum many values in FP32 • e.g., normalization layers, softmax, L1 or L2 norm, etc. • This includes most Loss layers, e.g., CrossEntropyLoss

General advice: compute your loss in FP32 too
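For example, one way to keep the loss in FP32 when the rest of the network runs in FP16 (a sketch of our own; model, x, and target are assumed to already exist):

    import torch.nn.functional as F

    logits = model(x)                               # FP16 activations
    loss = F.cross_entropy(logits.float(), target)  # softmax + NLL computed in FP32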


Page 27

The full picture

[Diagram: FP16 Weights → Forward Pass → FP32 Loss]

Page 28

The full picture

[Diagram: FP16 Weights → Forward Pass → FP32 Loss → Loss Scaling → Scaled FP32 Loss → Backprop → Scaled FP16 Gradients]

Page 29

The full picture

[Diagram: FP16 Weights → Forward Pass → FP32 Loss → Loss Scaling → Scaled FP32 Loss → Backprop → Scaled FP16 Gradients]

If gradients overflow (inf), throw away the batch

Page 30

The full picture

[Diagram: FP16 Weights → Forward Pass → FP32 Loss → Loss Scaling → Scaled FP32 Loss → Backprop → Scaled FP16 Gradients → Copy → Scaled FP32 Gradients]

If gradients overflow (inf), throw away the batch

Page 31

The full picture

[Diagram: FP16 Weights → Forward Pass → FP32 Loss → Loss Scaling → Scaled FP32 Loss → Backprop → Scaled FP16 Gradients → Copy → Scaled FP32 Gradients → Remove scale → FP32 Gradients]

If gradients overflow (inf), throw away the batch

Page 32

The full picture

[Diagram: FP16 Weights → Forward Pass → FP32 Loss → Loss Scaling → Scaled FP32 Loss → Backprop → Scaled FP16 Gradients → Copy → Scaled FP32 Gradients → Remove scale → FP32 Gradients → Apply → FP32 Master Weights → Copy back to FP16 Weights]

If gradients overflow (inf), throw away the batch

Page 33

The full picture

[Diagram: FP16 Weights → Forward Pass → FP32 Loss → Loss Scaling → Scaled FP32 Loss → Backprop → Scaled FP16 Gradients → Copy → Scaled FP32 Gradients → Remove scale → FP32 Gradients → Apply → FP32 Master Weights → Copy back to FP16 Weights]

If gradients overflow (inf), throw away the batch

Distributed gradient accumulation / all-reduce: the diagram marks two places in the pipeline where it can happen, option 1 (slower) and option 2 (faster)
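For example, the all-reduce can be done directly on the FP16 gradients, which moves half as many bytes over the network as an FP32 all-reduce. A hypothetical sketch (our own, assuming torch.distributed is already initialized and the model's .grad buffers are populated):

    import torch
    import torch.distributed as dist

    def allreduce_fp16_grads(model):
        flat = torch.cat([p.grad.view(-1) for p in model.parameters()])
        dist.all_reduce(flat)                    # sum FP16 gradients across workers
        flat.div_(dist.get_world_size())         # average
        offset = 0
        for p in model.parameters():             # scatter the result back into .grad
            n = p.grad.numel()
            p.grad.copy_(flat[offset:offset + n].view_as(p.grad))
            offset += n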

Page 34

In PyTorch

To automate the recipe, start with Nvidia's apex.amp library:

    from apex import amp

    optim = torch.optim.Adam(…)

    model, optim = amp.initialize(model, optim, opt_level="O1")

    (…)

    with amp.scale_loss(loss, optim) as scaled_loss:
        scaled_loss.backward()
    optim.step()

Page 35

Making it even faster

apex.amp supports different optimization levels

opt_level="O1" is conservative and keeps many ops in FP32

opt_level="O2" is faster, but may require manually converting some ops to FP32 to achieve good results

More details at: https://nvidia.github.io/apex/


Page 36

Making it even faster

A useful pattern:

    x = torch.nn.functional.softmax(x, dtype=torch.float32).type_as(x)

When x is FP16 (i.e., a torch.HalfTensor): • Computes the softmax in FP32 and casts back to FP16

When x is FP32 (i.e., a torch.FloatTensor): • No impact on speed or memory
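The same pattern can be applied to other large reductions, e.g., layer normalization (a hypothetical wrapper of our own, not from the slides):

    import torch
    import torch.nn.functional as F

    class FP32LayerNorm(torch.nn.LayerNorm):
        def forward(self, x):
            out = F.layer_norm(
                x.float(),                 # upcast activations to FP32
                self.normalized_shape,
                self.weight.float() if self.weight is not None else None,
                self.bias.float() if self.bias is not None else None,
                self.eps,
            )
            return out.type_as(x)          # cast back (no-op if x is already FP32)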


Page 37

One more thing…

Must have a GPU with Tensor Cores (Volta+) and CUDA 9.1 or newer

Additionally: • Batch size should be a multiple of 8 • M, N and K for matmul should be multiples of 8 • Dictionaries/embed layers should be padded to be a multiple of 8
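For the last point, a hypothetical helper that rounds a vocabulary size up to a multiple of 8 (the sizes used here are made-up examples):

    import torch

    def pad_to_multiple(n, multiple=8):
        return ((n + multiple - 1) // multiple) * multiple

    raw_vocab_size = 33709                          # assumed example value
    vocab_size = pad_to_multiple(raw_vocab_size)    # -> 33712, a multiple of 8
    embed = torch.nn.Embedding(vocab_size, 1024)    # embedding dim is also a multiple of 8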


Page 38

Summary

Mixed precision training gives:

• Tensor Cores make FP16 ops 4-9x faster • No architecture changes required • Use Nvidia's apex library

Tradeoffs: • Some extra bookkeeping required (mostly handled by apex) • Best perf requires manual fixes for softmax, layernorm, etc.


Page 39

Scaling Machine Translation

Teng Li, Ailing Zhang, Shubho Sengupta, Myle Ott, Sergey Edunov, David Grangier, Michael Auli

Page 40

Sequence to Sequence Learning

Bonjour à tous ! → Hello everybody!

• Sequence to sequence mapping • Input = sequence, output = sequence • Structured prediction problem

Page 41

Sequence to Sequence Learning

• machine translation • text summarization • writing stories • question generation • dialogue, chatbots • paraphrasing • ...

Page 42

Why do we need to scale?

• Large benchmark ~2.4 billion words + much more unlabeled data

• Training time: CNNs up to 38 days on 8 M40 GPUs (Gehring et al., 2017)

• Train many models

• Support Multilingual training


Page 43

Reducing training time

[Bar chart: Train Time (Minutes) for configurations Original, +16-bit, +cumul, +2x lr, 16 nodes, +overlap; Original = 1,429]

Time in minutes to train "Transformer" translation model on Volta V100 GPUs (WMT En-De)

Page 44

Reducing training time

[Bar chart: Train Time (Minutes); Original = 1,429, +16-bit = 495]

Time in minutes to train "Transformer" translation model on Volta V100 GPUs (WMT En-De)

3x faster (wall time) using the same hardware, model architecture and batch size!

Page 45

Reducing training time

[Bar chart: Train Time (Minutes); Original = 1,429, +16-bit = 495, +cumul = 447]

[Diagram: forward/backward, gradient sync, and idle time on GPUs 1-4, comparing syncing after every batch ("Sync After 1") with syncing after 2 batches ("Sync After 2")]

Time in minutes to train "Transformer" translation model on Volta V100 GPUs (WMT En-De)

Page 46

Reducing training time

[Bar chart: Train Time (Minutes); Original = 1,429, +16-bit = 495, +cumul = 447, +2x lr = 311]

Time in minutes to train "Transformer" translation model on Volta V100 GPUs (WMT En-De)

Page 47

Reducing training time

[Bar chart: Train Time (Minutes); Original = 1,429, +16-bit = 495, +cumul = 447, +2x lr = 311, 16 nodes = 37]

Time in minutes to train "Transformer" translation model on Volta V100 GPUs (WMT En-De)

Page 48

Reducing training time

[Bar chart: Train Time (Minutes); Original = 1,429, +16-bit = 495, +cumul = 447, +2x lr = 311, 16 nodes = 37, +overlap = 32]

[Diagram: forward, backward, gradient sync, and idle time on GPUs 1-4, comparing "Sync After Backward" with "Overlap Sync with Backward"]

Implemented in PyTorch's DistributedDataParallel

Time in minutes to train "Transformer" translation model on Volta V100 GPUs (WMT En-De)
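A minimal sketch of using DistributedDataParallel for this (our own example, assuming the script is launched with one process per GPU, e.g., via torch.distributed.launch, so the process-group environment variables are already set):

    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Linear(1024, 1024).cuda()
    model = torch.nn.parallel.DistributedDataParallel(model)

    out = model(torch.randn(32, 1024, device="cuda"))
    out.sum().backward()   # gradient buckets are all-reduced while backward is still running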

Page 49

Semi-supervised machine translation

Myle Ott, Sergey Edunov, David Grangier, Michael Auli

Page 50

Data augmentation for Translation: Back-translation (Bojar & Tamchyna, 2011; Sennrich et al., 2016)

[Diagram: bilingual German = English sentence pairs are used to train an Intermediate Model]

Page 51

Data augmentation for Translation: Back-translation (Bojar & Tamchyna, 2011; Sennrich et al., 2016)

[Diagram: bilingual German = English sentence pairs are used to train an Intermediate Model]

Page 52

[Diagram: bilingual German = English pairs train an Intermediate Model; monolingual German source sentences are added]

Page 53

[Diagram: the Intermediate Model translates the monolingual German source into generated English]

Page 54

[Diagram: the generated English is paired with its monolingual German source to form synthetic German = English training pairs for a new Model]

Page 55

[Diagram: the original bilingual pairs and the synthetic back-translated pairs are combined to train the Final Model]

Page 56

[Diagram: the Final Model is trained on the combined bilingual, monolingual source, and monolingual generated data]

Page 57

Scaling from 100M to 5.8B words

[Bar chart: BLEU (Accuracy) on WMT'14 English-German: Phrase-based (2014) 20.7, GNMT (RNN, 2016) 24.6, ConvS2S (2017) 25.2, Transformer (Google, 2017) 28.4, WTransformer (Salesforce, 2017) 28.9, SAtt + RPR (Google, 2018) 29.2, DeepL (2017) 33.3, fairseq & sampled BT 35. DeepL uses high quality, non-benchmark data; fairseq & sampled BT uses only benchmark bilingual + monolingual data]

Model trains in 22.5h on 128 V100

Page 58

WMT'18 Human evaluations

Ranked #1 in the human evaluation of the WMT'18 English-German translation task

Page 59

Conclusion

Mixed precision training in PyTorch: • 3-4x speedups in training wall time • No architecture changes required • Use Nvidia's apex library

Case study: Neural Machine Translation • Train models in 30 minutes instead of 1 day+ • State-of-the-art translation quality using semi-supervised learning

Page 60

Thank you! Questions?

Contact Us: Myle Ott ([email protected]), Sergey Edunov ([email protected])

References
• Scaling Neural Machine Translation: arxiv.org/abs/1806.00187
• Understanding Back-translation at Scale: arxiv.org/abs/1808.09381
• apex: nvidia.github.io/apex

Acknowledgements: Nvidia and PyTorch teams for helping us implement and optimize mixed precision training.