FROM DESKTOP TO CLOUD TO EMBEDDED GPUS - GTC On...
Transcript of FROM DESKTOP TO CLOUD TO EMBEDDED GPUS - GTC On...
1© 2017 The MathWorks, Inc.
FROM DESKTOP TO CLOUD TO EMBEDDED GPUS DESIGNING, TRAINING, AND COMPILING VISION AND DEEP LEARNING ALGORITHMS USING MATLAB
Avinash Nehemiah Joss Knight Girish Venkataramani
2
Talk Outline
Design Deep Learning & Vision
Algorithms
High Performance Embedded
Implementation
Highlights • Manage large image sets• Automate image labeling• Easy access to models• Pre-built training
frameworks
Highlights Automate compilation of
MATLAB to CUDA 14x speedup over Caffe
& 4x speedup over TensorFlow
Accelerate and Scale Training
Highlights • Acceleration with GPU’s• Scale to clusters
3
Let’s Use Object Detection as an Example
TRUCK
SUVCAR
In our example we’ll use deep learning for object detection.
4
Two Approaches for Deep Learning
2. Fine-tune a pre-trained model (transfer learning)
1. Train a Deep Neural Network from Scratch
5
Transfer Learning Workflow
Transfer LearningImages
New Classifier
Learn New Weights
Modify Network Structure
Load Reference NetworkLabels
Training DataLabels: Car, Truck, Large Truck, SUV, Van
Alexnet, VGG-16, VGG-19, GoogLeNet
6
Manage Large Sets of Images
Transfer LearningImages
New Classifier
Learn New Weights
Modify Network Structure
Load Reference NetworkLabels
Handle Large Sets of Images
Easily manage large sets of images- Single line of code to access images- Operates on disk, database, big-data file system
imageData = imageDataStore(‘vehicles’)
Easily manage large sets of images- Single line of code to access images- Operates on disk, database, big-data file system
Organize Images in Folders( ~ 10,000 images , 5 folders)
7
Automate Ground Truth Labeling
Transfer LearningImages
New Classifier
Learn New Weights
Modify Network Structure
Load Reference NetworkLabels
Ground Truth Labeling
8
Automate Ground Truth Labeling
9
Access Reference Models in MATLAB
Transfer LearningImages
New Classifier
Learn New Weights
Modify Network Structure
Load Reference NetworkLabels
Easily Load Reference NetworksAccess Models with 1-line of MATLAB CodeNet1 = alexnetNet2 = vgg16Net3 = vgg19
10
Access Reference Models in MATLAB
Easily manage large sets of images- Single line of code to access images- Operates on disk, database, big-data file system
1. Reference Models
2. Model Importer
3. Tutorials
11
Modify Network Structure
Transfer LearningImages
New Classifier
Learn New Weights
Modify Network Structure
Load Reference NetworkLabels
Simple MATLAB API to modify layers:layers(23) = fullyConnectedLayer(5, 'Name','fc8');layers(25) = classificationLayer('Name',‘VehicleClassifier')
12
Training Object Detectors
Transfer LearningImages
New Classifier
Learn New Weights
Modify Network Structure
Load Reference NetworkLabels
Train Any Network trainNetwork(datastore, layers, options)
Frameworks for Computer Vision • Deep Learning: R-CNN, Fast R-CNN, Faster R-CNN• Machine Learning: ACF, Cascade Object Detectors
13
Visualizing and Debugging Intermediate Results
Filters…
Activations
Deep Dream
Training Accuracy Visualization Deep Dream
Layer Activations Feature Visualization
• Many options for visualizations and debugging• Examples to get started
14
Real World Systems Use More Than Deep Learning
Deep learning vehicle detector performance degraded with environmental effects ( fog etc. )
Fog Removal
Challenge: Deep learning frameworks do not include “classical” computer vision
Solution: Convert MATLAB code with deep learning and computer vision to embedded implementation
15
Talk Outline
Design Deep Learning & Vision
Algorithms
High Performance Embedded
ImplementationAccelerate and Scale
Training
Can you solve “real” problems for production systems with MATLAB ? Doesn’t it take hours or days to train ?
16
Problems of acceleration and scale
How can I make my code run faster? How can I scale up to bigger problems?
Will I have to learn new tools? Will I have to learn new concepts?
17
MATLAB and Parallel Computing
Accelerate your code on the GPU
– DEMO: Preprocess your image dataset
Scale to multi-GPU and clusters
– DEMO: Parallel training with an Amazon P2 cluster
18
Transfer Learning with MATLAB
Transfer LearningImages
New Classifier
Learn New Weights
Modify Network Structure
Load Reference NetworkLabels
Images
Learn New Weights
?
19
20
21
22
23
24
25
26
29
Statistics and Machine LearningDistributions, hypothesis testing, k-
means clustering, nearest neighbour
Neural Networks
Deep Learning, Neural Network training and simulation
Image Processing and Computer Vision
Feature detection, transformations, filtering, object analysis
Signal Processing and Communications
FFT filtering, cross correlation, BER simulations
Parallel Computing
Over 300 core MATLAB functions optimized for GPU
• Elementary math• Linear algebra• FFTs and IFFTs• Convolution and filtering• Fitting and interpolation• Reductions and sorting• Sparse matrix support• double, single and integer
support
Built-in function support
30
Programming with GPUs
GPU-optimized functions
Simple programming constructs
– gpuArray, gather
Writing kernels in the MATLAB language
– arrayfun
Interface with your own CUDA C and C++ code
– CUDAKernel, mexcuda
Ease
of U
se
Greater C
ontrol
Ease
of U
se
Greater C
ontrol
Prototyping Test Framework
31
MATLAB and Parallel Computing
Accelerate your code on the GPU
– DEMO: Preprocess your image dataset
Scale to multi-GPU and clusters
– DEMO: Parallel training with an Amazon P2 cluster
32
Deep learning on CPU, GPU, multi-GPU and clusters
More GPUs
33
Deep learning on CPU, GPU, multi-GPU and clusters
More GPUs
Mor
e C
PUs
34
Deep learning on CPU, GPU, multi-GPU and clusters
More GPUs
Mor
e C
PUs
36
MATLAB and Parallel Computing
Accelerate your code on the GPU
– DEMO: Preprocess your image dataset
Scale to multi-GPU and clusters
– DEMO: Parallel training with an Amazon P2 cluster
37
Transfer learning in 11 lines of MATLAB code
38
39
Performance
30-40% reduction in training time each time you double your GPUs
Communicating with the Cloud costs nothing extra
40
Talk Outline
Design Deep Learning & Vision
Algorithms
High Performance Embedded
ImplementationAccelerate and Scale
Training
Can you create high performance implementation from MATLAB code ?
41
Alexnet inference using MATLAB solution is~14x faster than pyCaffe and 60% faster than C++ Caffe~ 4x faster and ~3x less memory-use than TensorFlow
Presenting the MATLAB to CUDA parallelizing compiler
Why?
MATLAB
GPU Parallelization
Kernel creation
Memory allocation
Data transfer minimization
42
Sample Generated CUDA Code
MATLAB source code Auto-generated CUDA code
GPU Parallelization
Kernel creation
Memory allocation
Data transfer minimization
43
MATLAB to CUDA compiler flow
Control-flow graphIntermediate representation
(CFG – IR)
….
….
Identify loop-nests that will become CUDA kernels
….
Convert loop to CUDA kernelThread/blocks inferred from loop dims
Perform Use-def analysis.cudaMalloc GPU vars, insert memcpy
Infer data locality. Map to shared memory. Synthesize shared memory access
CUDA kernel optimizations
(×) cuBlas calls
(\) cuSolver calls
fft cuFFT calls
nnet cuDNN calls
Front – end
Traditional compiler optimizations
MATLAB Library function mapping
Parallel loop creation
CUDA kernel creation
cudaMemcpy minimization
Shared memory synthesis
CUDA code emission
Library function mapping
Parallel loop creation
CUDA kernel creation
cudaMemcpy minimization
Shared memory synthesis
44
MATLAB to CUDA compiler: It’s all about big parallel loops!
Control-flow graphIntermediate representation
(CFG – IR)
….
….
CUDA kernel optimizations
Front – end
Traditional compiler optimizations
MATLAB Library function mapping
Parallel loop creation
CUDA kernel creation
cudaMemcpy minimization
Shared memory synthesis
CUDA code emission
Scalarization
Loop perfectization
Loop interchange
Loop fusion
Scalar replacement
Loopoptimizations
Scalarization
Loop fusion
Scalar replacement
45
MATLAB to CUDA compiler: It’s all about big parallel loops!
Control-flow graphIntermediate representation
(CFG – IR)
….
….
CUDA kernel optimizations
Front – end
Traditional compiler optimizations
MATLAB Library function mapping
Parallel loop creation
CUDA kernel creation
cudaMemcpy minimization
Shared memory synthesis
CUDA code emission
Scalarization
Loop perfectization
Loop interchange
Loop fusion
Scalar replacement
Loopoptimizations
2 kernels (size N), 20*N bytes
1 kernel (size N), 16*N bytes
46
cudaMemcpy minimizationA(:) = ….C(:) = ….
for i = 1:N….gB = kernel1(gA);gA = kernel2(gB);if (some_condition)
gC = kernel3(gA, gB);end….
end
…. = C;
cudaMemcpy*definitely* needed
cudaMemcpy*not* needed
cudaMemcpy*may be* needed
Observations• Equivalent to Partial redundancy elimination (PRE) • Dynamic strategy – track memory location with a
status flag per variable• Use-Def to determine where to insert memcpy
A(:) = …A_isDirtyOnCpu = true;…for i = 1:N
if (A_isDirtyOnCpu)cudaMemcpy(gA, A);A_isDirtyOnCpu = false;
endgB = kernel1(gA);gA = kernel2(gB);if (somecondition)
gC = kernel3(gA, gB);C_isDirtyOnGpu = true;
end…
end…if (C_isDirtyOnGpu)
cudaMemcpy(C, gC);C_isDirtyOnGpu = false;
end… = C;
Assume gA, gB and gC are mapped to GPU memory
Generated (pseudo) code
47
Dotprod
Input image Conv. kernel Output image
rows
cols
kw
kh
Shared memory synthesis for stencil operations
For stencil operations, the MATLAB to CUDA compiler automatically• Infers GPU shared memory• Automates the collaborative loading in to shared memory block• Automatically translates access from global variable to shared-mem variable
48
Example: Compiling fog-rectification algorithm
49
MATLAB to CUDA Compilation in Computer Vision Applications
Distance transform
Fog removal
SURF feature extraction
Ray tracing
Stereo disparity
50
Deep learning prediction performance: Alexnet
51
Deep learning prediction performance: Alexnet
0
1
2
3
4
5
6
7
8
9CPU resident memory GPU peak memory (nvidia-smi)
Mem
ory
usag
e (G
B)
Batch Size1 16 32 64
CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50 GHzGPU Tesla K40c
Py-C
affe
MAT
LAB
to C
UD
A co
mpi
ler
Tens
orFl
ow
MAT
LAB
on
CPU
+GPU
C++
-Caf
fe
52
Deep learning prediction performance: Alexnet
0
50
100
150
200
250
1 16 32 64 128
Fram
e ra
te (F
ps)
Batch Size
C++-Caffe
MATLAB to CUDA compiler
Jetson (Tegra) TX1
53
Deep Learning Prediction Performance: VGG-16
0
10
20
30
40
50
60
70
80
90
1 16 32 64
CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHzGPU Tesla K40c
MATLAB to CUDA compiler
C++ Caffe
MATLAB running on CPU+GPU
TensorFlow
py Caffe
Fram
e ra
te (F
ps)
Batch Size
54
Create CNNs with MATLAB, Deploy with MATLAB to CUDA compiler
Alexnet Vehicle Detection
People detection Lane detection
~30 Fps(Tegra X1)
~66 Fps(Tegra X1)
~20 Fps(K40c)
~130 Fps(K40c)
55
Conclusions
Design Deep Learning & Vision
AlgorithmAccelerate and Scale
Training
Deep learning design is easy in MATLAB
Parallel Computing Toolbox
7x faster than pyCaffe2x faster than TensorFlow
MATLAB to CUDA compiler14x faster than pyCaffe
4x faster than TensorFlow1.6x faster than C++ Caffe
High Performance Embedded
Implementation
56
What Next?
www.mathworks.com/matlab-cuda-beta
MATLAB to CUDA compiler:Sign up for our beta program
Try Deep Learning with MATLAB
Visit our booth, we love to chat: Booth # 804