Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016
Applying Deep Learning at Facebook Scale
Director of Engineering, Applied Machine Learning
Hussein Mehanna
Applications of deep learning
• Event prediction
• Machine translation
• Large scale computer vision
• Natural language processing
Why should I like this story?
1B+ new stories every day
+ billions of stories from this day years ago
A billion people, thousands of stories each, ranked in milliseconds
Deep learning for ranking
Massive sparse logistic regression + deep neural networks
Inputs: the story title plus sparse user signals such as "I like soccer", "I am from Australia", "I am 26", "I traveled to Argentina"
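The slide pairs a massive sparse logistic regression with a deep neural network whose logits are summed. A minimal sketch of that combination, with invented vocabulary size, feature ids, and layer sizes (none of these values come from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Wide" side: massive sparse logistic regression. Each user/story
# signal ("I like soccer", "I am from Australia", ...) is a one-hot
# feature id in a large vocabulary; VOCAB and the ids are invented.
VOCAB = 10_000
wide_w = rng.normal(scale=0.01, size=VOCAB)  # sparse LR weights
active_ids = np.array([17, 502, 9001])       # hypothetical active features

# "Deep" side: dense features through a small MLP (sizes invented).
W1 = rng.normal(scale=0.1, size=(16, 8))
b1 = np.zeros(16)
w2 = rng.normal(scale=0.1, size=16)

def predict(active_ids, dense):
    wide_logit = wide_w[active_ids].sum()      # sparse LR contribution
    hidden = np.maximum(0.0, W1 @ dense + b1)  # one ReLU layer
    deep_logit = w2 @ hidden                   # DNN contribution
    return 1.0 / (1.0 + np.exp(-(wide_logit + deep_logit)))

p = predict(active_ids, rng.normal(size=8))  # P(user likes this story)
```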
Machine translation with neural networks
Recurrent neural networks with attention decoder
Encoder: encoder input → encoded states
Decoder: decoder input → decoder output
"Gonna have some fun today" → "Vamos a divertirnos hoy"
Attention model
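The attention model scores each encoded state against the current decoder state and mixes the states by those scores. A minimal dot-product attention sketch (sentence length, hidden size, and values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical encoded states for a 5-token source sentence
# (e.g. "Gonna have some fun today"), hidden size 4.
enc_states = rng.normal(size=(5, 4))
dec_state = rng.normal(size=4)  # current decoder hidden state

# Dot-product attention: score every encoder state against the
# decoder state, normalize with a softmax, take the weighted sum.
scores = enc_states @ dec_state
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ enc_states  # context vector fed to the decoder
```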
[Video demos]
Hundreds of convolutional neural networks run on photos uploaded to Facebook
Classification, detection, segmentation (e.g. person, plate, drink)
Improving inference for deep learning
• Compute faster
• Memory usage in deep networks
• Compress models
Compute faster: convolution implementation strategies
Convolutions account for 90%+ of runtime in modern vision models
Faster convolution algorithms for deep learning:
• 2013: im2col + sgemm
• 2014: FFT
• 2015: Tiled FFT, Winograd
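The im2col + sgemm baseline reduces convolution to a dense matrix multiply by unrolling receptive fields. A small single-channel sketch (input and filter sizes are arbitrary; like most deep learning frameworks, it computes cross-correlation):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(6, 6))  # single-channel input (size invented)
k = rng.normal(size=(3, 3))  # 3x3 filter

def im2col_conv(x, k):
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    # im2col: copy every receptive field into a row of a matrix ...
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    # ... so the whole convolution is one dense matrix product (sgemm).
    return (cols @ k.ravel()).reshape(oh, ow)

y = im2col_conv(x, k)  # valid cross-correlation, shape (4, 4)
```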
NNPACK: CuDNN for CPUs
• Easy integration: CuDNN-style C interface, easy to integrate
• Supports the computationally-intensive layers:
  • Convolutions (tiled FFT, Winograd)
  • Pooling
  • Fully connected layers (GEMM/GEMV)
• Computationally-intensive implementation via an x86-64 meta-assembler (PeachPy)
• Excellent performance: 2x-6x vs baseline CPU
• Open source: github.com/Maratyszcza/NNPACK
• Integrated into several deep learning frameworks:
  • Caffe/Caffe2: github.com/ajtulloch/caffe/tree/nnpack-pr
  • Torch: github.com/szagoruyko/nnpack.torch
Improving inference for deep learning: memory usage in deep networks
The memory Andy-Bill theorem
Trend: ever-deeper models
• ResNets in vision
• deep LSTMs in language modeling
Scale constraints:
• GPU memory is relatively stable (12GB on Titan X/M40, 16GB on P100)
• CPU memory has many constraints, especially in applied settings
The bulk of memory spend is in the activations, so they must be reused
Ideas from compilers: memory savings for modern ConvNets
View activations as virtual registers and run a register allocator (graph coloring on the interference graph)
50%-90% memory savings
Run inference in an O(N)-layer ResNet in O(1) memory!
(Charts: per-layer memory usage for AlexNet and the Inception network)
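The register-allocation idea can be illustrated with a simplified greedy linear-scan allocator over activation lifetimes, a stand-in for the graph-coloring formulation on the slide; the lifetimes and sizes below are invented:

```python
# Each activation is live from the step that produces it through its
# last use; two activations can share a buffer if their live ranges
# do not overlap. Lifetimes below are (produced, last_used, size_mb),
# all invented for illustration.
lifetimes = [(0, 1, 100), (1, 2, 50), (2, 3, 50), (3, 4, 100)]

def assign_buffers(lifetimes):
    """Greedy linear scan over lifetimes sorted by production step."""
    buffers = []     # per physical buffer: (last step it is live, size)
    assignment = []  # buffer index chosen for each activation
    for start, end, size in lifetimes:
        for i, (live_until, bsize) in enumerate(buffers):
            if live_until < start and bsize >= size:  # free and big enough
                buffers[i] = (end, bsize)
                assignment.append(i)
                break
        else:  # no reusable buffer: allocate a new one
            buffers.append((end, size))
            assignment.append(len(buffers) - 1)
    return assignment, sum(size for _, size in buffers)

assignment, total_mb = assign_buffers(lifetimes)
# 4 activations (300 MB allocated naively) fit in 3 buffers, 250 MB.
```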
Some implementations
• MXNet: github.com/dmlc/mxnet-memonger
• Caffe/Caffe2: github.com/facebook/fb-caffe-exts/
• Torch: github.com/fmassa/optimize-net
Can go further and explicitly trade off compute and memory:
ResNet-1000 from 48GB to 7GB for 30% slower timings
Improving inference for deep learning: compress models
Deep compression pipeline (Han et al)
Original network: original size
Pruning (fewer weights):
• Train connectivity
• Prune connections
• Train weights
→ 10x reduction, same accuracy
Quantization (fewer bits per weight):
• Cluster the weights
• Generate code book
• Quantize the weights with the code book
• Retrain code book
→ 27x-31x reduction, same accuracy
Huffman encoding:
• Encode weights
• Encode index
→ 35x-50x reduction, same accuracy
All together: pruning + quantization + Huffman coding
49x: 552MB → 11.3MB
Error rates essentially unchanged: 31.50% → 31.17% (top-1), 11.32% → 10.91% (top-5)
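The first two stages can be sketched in a few lines: magnitude pruning followed by a small shared code book. Here quantile-based centroids stand in for the paper's k-means clustering, and the weight count, pruning ratio, and code-book size are all invented:

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.normal(size=1000)  # hypothetical layer weights

# 1) Pruning: drop the 80% of weights with the smallest magnitude
#    (the ratio is invented; the paper retrains after pruning).
threshold = np.quantile(np.abs(w), 0.8)
mask = np.abs(w) >= threshold
survivors = w[mask]

# 2) Quantization: share a tiny code book among surviving weights.
#    Quantile-based centroids stand in for the paper's k-means.
codebook = np.quantile(survivors, [0.125, 0.375, 0.625, 0.875])
idx = np.abs(survivors[:, None] - codebook[None, :]).argmin(axis=1)
quantized = codebook[idx]  # each weight now needs only a 2-bit index

# 3) Huffman coding of the indices (omitted here) compresses further.
```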
Recap
Applications of deep learning: event prediction, machine translation, large scale computer vision, natural language processing
Improving inference: compute faster, memory usage in deep networks, compress models