Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016

Applying Deep Learning at Facebook Scale
Hussein Mehanna, Director of Engineering, Applied Machine Learning

Transcript of Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016

Page 1

Applying Deep Learning at Facebook Scale

Director of Engineering, Applied Machine Learning

Hussein Mehanna

Page 2
Page 3

Applications of deep learning
Event prediction
Machine translation
Large scale computer vision
Natural language processing

Page 4

Page 5

Why should I like this story?

Page 6
Page 7

1B+ new stories every day
+ Billions of stories from this same day in past years

Page 8

A billion people, thousands of stories
In milliseconds

Page 9

Deep learning for ranking
Deep learning over user signals such as "I like soccer", "I am from Australia", "I am 26", "I traveled to Argentina"
Massive sparse logistic regression + Deep neural networks
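
A minimal sketch of what combining a massive sparse logistic regression with a deep neural network can look like; the feature sizes, the ReLU layer, and the simple addition of the two logits are illustrative assumptions, not the production ranking model.

```python
# A minimal sketch (illustrative assumptions, not the production model):
# a "wide" massive sparse logistic regression over hashed user/story
# feature crosses, plus a small "deep" network over dense features,
# combined by summing their logits.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_sparse = 1_000_000   # hypothetical hashed feature space ("likes soccer" x story id, ...)
n_dense = 64           # hypothetical dense feature vector

w_sparse = np.zeros(n_sparse)             # massive sparse LR weights
W1 = 0.1 * np.random.randn(n_dense, 32)   # small deep network
W2 = 0.1 * np.random.randn(32, 1)

def predict(active_sparse_ids, dense_features):
    wide_logit = w_sparse[active_sparse_ids].sum()   # sparse dot product
    hidden = np.maximum(dense_features @ W1, 0.0)    # ReLU layer
    deep_logit = float(hidden @ W2)
    return sigmoid(wide_logit + deep_logit)          # P(user engages with story)

p = predict([12, 987_654], np.random.randn(n_dense))
```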

Page 10

Page 11

Applications of deep learning
Event prediction
Machine translation
Large scale computer vision
Natural language processing

Page 12

Page 13
Page 14

Machine translation with neural networks
Recurrent neural networks with an attention decoder
Encoder input "Vamos a divertirnos hoy" → encoder → encoded states → attention model → decoder → "Gonna have some fun today"
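
A minimal sketch of the attention step this slide refers to: the decoder scores every encoded source state against its current state and takes the softmax-weighted sum as its context vector. Dot-product scoring and the toy sizes below are assumptions for illustration.

```python
# A minimal sketch of attention in a seq2seq translator: score each
# encoder state against the decoder state, softmax the scores, and
# return the weighted sum as the context vector.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(encoder_states, decoder_state):
    # encoder_states: (source_len, hidden), decoder_state: (hidden,)
    scores = encoder_states @ decoder_state   # one score per source word
    weights = softmax(scores)                 # attention weights
    return weights @ encoder_states           # context vector, shape (hidden,)

# e.g. 4 source words ("Vamos", "a", "divertirnos", "hoy"), hidden size 8
encoder_states = np.random.randn(4, 8)
context = attention_context(encoder_states, np.random.randn(8))
```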

Page 15

Applications of deep learning
Event prediction
Machine translation
Large scale computer vision
Natural language processing

Page 16

Page 17
Page 18

Applications of deep learning
Event prediction
Machine translation
Large scale computer vision
Natural language processing

Page 19

Page 20

Hundreds of convolutional neural networks run on photos uploaded to Facebook

Page 21

Classification, detection, segmentation
person, plate, drink

Page 22
Page 23
Page 24

Improving inference for deep learning
Compute faster
Memory usage in deep networks
Compress models

Page 25

Page 26

Convolution implementation strategies
90%+ of runtime for modern vision models

Page 27

Compute faster: faster convolution algorithms for deep learning
2013: im2col + sgemm
2014: FFT
2015: Tiled FFT, Winograd
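
For reference, a toy version of the oldest strategy on this timeline, im2col + sgemm: unroll each receptive field into a column, then do the whole convolution as one matrix multiply. The stride-1, no-padding, single-image setup is a simplifying assumption.

```python
# A toy im2col + sgemm convolution: build the column buffer, then hand
# the work to one big matrix multiply (the operation BLAS GEMM is
# heavily optimized for).
import numpy as np

def conv2d_im2col(x, w):
    # x: (C, H, W) input, w: (K, C, R, S) filters
    C, H, W = x.shape
    K, _, R, S = w.shape
    out_h, out_w = H - R + 1, W - S + 1
    cols = np.empty((C * R * S, out_h * out_w))   # im2col buffer
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, col] = x[:, i:i + R, j:j + S].ravel()
            col += 1
    out = w.reshape(K, -1) @ cols                 # the "sgemm" step
    return out.reshape(K, out_h, out_w)

y = conv2d_im2col(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))
```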

Page 28

NNPACK: CuDNN for CPUs
Easy integration: CuDNN-style C interface, easy to integrate
Supports the computationally-intensive layers:
• Convolutions (tiled FFT, Winograd)
• Pooling
• Fully connected layers (GEMM/GEMV)
Implementation: via an x86-64 meta-assembler (PeachPy)
Excellent performance: 2x-6x vs baseline CPU

Page 29

NNPACK: open source, integrated into frameworks
github.com/Maratyszcza/NNPACK
Integrated into several deep learning frameworks:
Caffe/Caffe2: github.com/ajtulloch/caffe/tree/nnpack-pr
Torch: github.com/szagoruyko/nnpack.torch

Page 30

Improving inference for deep learning
Compute faster
Memory usage in deep networks
Compress models

Page 31

Page 32

The Memory Andy-Bill Theorem
Trend:
• ResNets in vision
• Deep LSTMs in language modeling
Scale constraints:
• GPU memory relatively stable (12GB on Titan X / M40, 16GB on P100)
• CPU memory has many constraints, especially in applied settings

Page 33

Memory savings for modern ConvNets
The spend is in activations: the bulk of memory is in the activations, so they must be reused
Ideas from compilers: view activations as virtual registers and run a register allocator (graph coloring on the interference graph)
50%-90% memory savings
Run inference on an O(N) ResNet in O(1) memory!
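
A minimal sketch of the register-allocation idea, assuming a simple greedy coloring over activation live ranges (the real allocator operates on the network's interference graph):

```python
# Treat each activation as a "virtual register" with a live range
# [first_use, last_use] and greedily color the interference graph so
# activations whose ranges do not overlap share one physical buffer.
def assign_buffers(live_ranges):
    # live_ranges: {activation: (first_use_step, last_use_step)}
    def overlaps(a, b):
        return not (a[1] < b[0] or b[1] < a[0])
    buffers = {}   # activation -> buffer id ("color")
    for name, rng in sorted(live_ranges.items(), key=lambda kv: kv[1][0]):
        taken = {buffers[o] for o in buffers if overlaps(live_ranges[o], rng)}
        color = 0
        while color in taken:
            color += 1
        buffers[name] = color
    return buffers

# A toy linear chain: non-overlapping activations share a buffer,
# so a chain of any length needs only two physical buffers.
print(assign_buffers({"act0": (0, 1), "act1": (1, 2), "act2": (2, 3)}))
# {'act0': 0, 'act1': 1, 'act2': 0}
```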

Page 34

AlexNet

Page 35

AlexNet

Page 36

Inception Network

Page 37

Some implementations
MXNet: github.com/dmlc/mxnet-memonger
Caffe/Caffe2: github.com/facebook/fb-caffe-exts
Torch: github.com/fmassa/optimize-net
Can go further and explicitly trade off compute and memory:
ResNet-1000 from 48GB to 7GB for 30% slower timings
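
A toy sketch of that explicit compute/memory trade-off (the recomputation idea behind tools like mxnet-memonger): keep only every k-th activation as a checkpoint and recompute the rest from the nearest checkpoint when they are needed again. The layer functions and checkpoint spacing below are made up for illustration.

```python
# Store O(N/k) activations instead of O(N); pay for it by redoing a
# little forward computation whenever an unstored activation is needed.
def forward_with_checkpoints(layers, x, k=4):
    checkpoints = {0: x}                    # activation index -> value
    for i, layer in enumerate(layers):
        x = layer(x)
        if (i + 1) % k == 0:
            checkpoints[i + 1] = x          # keep only every k-th activation
    return x, checkpoints

def recompute_activation(layers, checkpoints, i, k=4):
    start = (i // k) * k                    # nearest earlier checkpoint
    x = checkpoints[start]
    for layer in layers[start:i]:           # redo the forward work
        x = layer(x)
    return x

layers = [lambda v, a=a: v + a for a in range(8)]   # toy "network"
out, ckpts = forward_with_checkpoints(layers, 0)
assert recompute_activation(layers, ckpts, 5) == sum(range(5))
```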

Page 38

Improving inference for deep learning
Compute faster
Memory usage in deep networks
Compress models

Page 39

Page 40

Deep compression pipeline (Han et al.)
Original network: original size
Pruning (fewer weights): train connectivity → prune connections → train weights
10x reduction, same accuracy
Quantization (fewer bits per weight): cluster the weights → generate code book → quantize the weights with the code book → retrain code book
27x-31x reduction, same accuracy
Huffman encoding: encode weights → encode index
35x-50x reduction, same accuracy
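
A rough sketch of the first two stages of such a pipeline: magnitude pruning, then weight sharing via a small k-means code book so each surviving weight is stored as a short index. The sparsity level, 16-entry code book, and plain-numpy k-means are illustrative choices, not Han et al.'s implementation.

```python
# Prune the smallest-magnitude weights, then cluster the survivors so
# the layer can be stored as a tiny code book plus 4-bit indices.
import numpy as np

def prune(weights, sparsity=0.9):
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def quantize(weights, n_clusters=16):
    nz = weights[weights != 0]
    centers = np.linspace(nz.min(), nz.max(), n_clusters)
    for _ in range(10):                      # a few k-means iterations
        assign = np.argmin(np.abs(nz[:, None] - centers[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centers[c] = nz[assign == c].mean()
    indices = np.argmin(np.abs(weights[:, None] - centers[None, :]), axis=1)
    return indices, centers                  # 4-bit indices + tiny code book

w = np.random.randn(1024)
pruned = prune(w)
indices, codebook = quantize(pruned)
reconstructed = np.where(pruned != 0, codebook[indices], 0.0)
```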

Page 41

All together: Pruning + Quantization + Huffman coding

Top-5 error: 11.32% → 10.91%
Top-1 error: 31.50% → 31.17%
552 MB → 11.3 MB: 49x reduction

Page 42

Event prediction
Machine translation
Large scale computer vision
Natural language processing
Compute faster
Memory usage in deep networks
Compress models