FROM DESKTOP TO CLOUD TO EMBEDDED GPUS - GTC On...

1© 2017 The MathWorks, Inc.

FROM DESKTOP TO CLOUD TO EMBEDDED GPUS DESIGNING, TRAINING, AND COMPILING VISION AND DEEP LEARNING ALGORITHMS USING MATLAB

Avinash Nehemiah Joss Knight Girish Venkataramani

2

Talk Outline

Design Deep Learning & Vision

Algorithms

High Performance Embedded

Implementation

Highlights • Manage large image sets• Automate image labeling• Easy access to models• Pre-built training

frameworks

Highlights Automate compilation of

MATLAB to CUDA 14x speedup over Caffe

& 4x speedup over TensorFlow

Accelerate and Scale Training

Highlights • Acceleration with GPU’s• Scale to clusters

3

Let’s Use Object Detection as an Example

TRUCK

SUVCAR

In our example we’ll use deep learning for object detection.

4

Two Approaches for Deep Learning

2. Fine-tune a pre-trained model (transfer learning)

1. Train a Deep Neural Network from Scratch

5

Transfer Learning Workflow

Transfer LearningImages

New Classifier

Learn New Weights

Modify Network Structure

Load Reference NetworkLabels

Training DataLabels: Car, Truck, Large Truck, SUV, Van

Alexnet, VGG-16, VGG-19, GoogLeNet

6

Manage Large Sets of Images


New Classifier

Learn New Weights



Handle Large Sets of Images

Easily manage large sets of images- Single line of code to access images- Operates on disk, database, big-data file system

imageData = imageDataStore(‘vehicles’)


Organize Images in Folders( ~ 10,000 images , 5 folders)

7

Automate Ground Truth Labeling


New Classifier

Learn New Weights



Ground Truth Labeling

8

Automate Ground Truth Labeling

9

Access Reference Models in MATLAB


New Classifier

Learn New Weights



Easily Load Reference NetworksAccess Models with 1-line of MATLAB CodeNet1 = alexnetNet2 = vgg16Net3 = vgg19

10

Access Reference Models in MATLAB


1. Reference Models

2. Model Importer

3. Tutorials

11



New Classifier

Learn New Weights



Simple MATLAB API to modify layers:layers(23) = fullyConnectedLayer(5, 'Name','fc8');layers(25) = classificationLayer('Name',‘VehicleClassifier')

12

Training Object Detectors


New Classifier

Learn New Weights



Train Any Network trainNetwork(datastore, layers, options)

Frameworks for Computer Vision • Deep Learning: R-CNN, Fast R-CNN, Faster R-CNN• Machine Learning: ACF, Cascade Object Detectors

13

Visualizing and Debugging Intermediate Results

Filters…

Activations

Deep Dream

Training Accuracy Visualization Deep Dream

Layer Activations Feature Visualization

• Many options for visualizations and debugging• Examples to get started

14

Real World Systems Use More Than Deep Learning

Deep learning vehicle detector performance degraded with environmental effects ( fog etc. )

Fog Removal

Challenge: Deep learning frameworks do not include “classical” computer vision

Solution: Convert MATLAB code with deep learning and computer vision to embedded implementation

15

Talk Outline


Algorithms


ImplementationAccelerate and Scale

Training

Can you solve “real” problems for production systems with MATLAB ? Doesn’t it take hours or days to train ?

16

Problems of acceleration and scale

How can I make my code run faster? How can I scale up to bigger problems?

Will I have to learn new tools? Will I have to learn new concepts?

17

MATLAB and Parallel Computing

Accelerate your code on the GPU

– DEMO: Preprocess your image dataset

Scale to multi-GPU and clusters

– DEMO: Parallel training with an Amazon P2 cluster

18

Transfer Learning with MATLAB


New Classifier

Learn New Weights



Images

Learn New Weights

?

29

Statistics and Machine LearningDistributions, hypothesis testing, k-

means clustering, nearest neighbour

Neural Networks

Deep Learning, Neural Network training and simulation

Image Processing and Computer Vision

Feature detection, transformations, filtering, object analysis

Signal Processing and Communications

FFT filtering, cross correlation, BER simulations

Parallel Computing

Over 300 core MATLAB functions optimized for GPU

• Elementary math• Linear algebra• FFTs and IFFTs• Convolution and filtering• Fitting and interpolation• Reductions and sorting• Sparse matrix support• double, single and integer

support

Built-in function support

30

Programming with GPUs

GPU-optimized functions

Simple programming constructs

– gpuArray, gather

Writing kernels in the MATLAB language

– arrayfun

Interface with your own CUDA C and C++ code

– CUDAKernel, mexcuda

Ease

of U

se

Greater C

ontrol

Ease

of U

se

Greater C

ontrol

Prototyping Test Framework

31






32

Deep learning on CPU, GPU, multi-GPU and clusters

More GPUs

33


More GPUs

Mor

e C

PUs

34


More GPUs

Mor

e C

PUs

36






37

Transfer learning in 11 lines of MATLAB code

39

Performance

30-40% reduction in training time each time you double your GPUs

Communicating with the Cloud costs nothing extra

40

Talk Outline


Algorithms


ImplementationAccelerate and Scale

Training

Can you create high performance implementation from MATLAB code ?

41

Alexnet inference using MATLAB solution is~14x faster than pyCaffe and 60% faster than C++ Caffe~ 4x faster and ~3x less memory-use than TensorFlow

Presenting the MATLAB to CUDA parallelizing compiler

Why?

MATLAB

GPU Parallelization

Kernel creation

Memory allocation

Data transfer minimization

42

Sample Generated CUDA Code

MATLAB source code Auto-generated CUDA code

GPU Parallelization

Kernel creation

Memory allocation

Data transfer minimization

43

MATLAB to CUDA compiler flow

Control-flow graphIntermediate representation

(CFG – IR)

….

….

Identify loop-nests that will become CUDA kernels

….

Convert loop to CUDA kernelThread/blocks inferred from loop dims

Perform Use-def analysis.cudaMalloc GPU vars, insert memcpy

Infer data locality. Map to shared memory. Synthesize shared memory access

CUDA kernel optimizations

(×) cuBlas calls

(\) cuSolver calls

fft cuFFT calls

nnet cuDNN calls

Front – end

Traditional compiler optimizations

MATLAB Library function mapping

Parallel loop creation

CUDA kernel creation

cudaMemcpy minimization

Shared memory synthesis

CUDA code emission

Library function mapping





44

MATLAB to CUDA compiler: It’s all about big parallel loops!


(CFG – IR)

….

….


Front – end







CUDA code emission

Scalarization

Loop perfectization

Loop interchange

Loop fusion

Scalar replacement

Loopoptimizations

Scalarization

Loop fusion

Scalar replacement

45

MATLAB to CUDA compiler: It’s all about big parallel loops!


(CFG – IR)

….

….


Front – end







CUDA code emission

Scalarization

Loop perfectization

Loop interchange

Loop fusion

Scalar replacement

Loopoptimizations

2 kernels (size N), 20*N bytes

1 kernel (size N), 16*N bytes

46

cudaMemcpy minimizationA(:) = ….C(:) = ….

for i = 1:N….gB = kernel1(gA);gA = kernel2(gB);if (some_condition)

gC = kernel3(gA, gB);end….

end

…. = C;

cudaMemcpy*definitely* needed

cudaMemcpy*not* needed

cudaMemcpy*may be* needed

Observations• Equivalent to Partial redundancy elimination (PRE) • Dynamic strategy – track memory location with a

status flag per variable• Use-Def to determine where to insert memcpy

A(:) = …A_isDirtyOnCpu = true;…for i = 1:N

if (A_isDirtyOnCpu)cudaMemcpy(gA, A);A_isDirtyOnCpu = false;

endgB = kernel1(gA);gA = kernel2(gB);if (somecondition)

gC = kernel3(gA, gB);C_isDirtyOnGpu = true;

end…

end…if (C_isDirtyOnGpu)

cudaMemcpy(C, gC);C_isDirtyOnGpu = false;

end… = C;

Assume gA, gB and gC are mapped to GPU memory

Generated (pseudo) code

47

Dotprod

Input image Conv. kernel Output image

rows

cols

kw

kh

Shared memory synthesis for stencil operations

For stencil operations, the MATLAB to CUDA compiler automatically• Infers GPU shared memory• Automates the collaborative loading in to shared memory block• Automatically translates access from global variable to shared-mem variable

48

Example: Compiling fog-rectification algorithm

49

MATLAB to CUDA Compilation in Computer Vision Applications

Distance transform

Fog removal

SURF feature extraction

Ray tracing

Stereo disparity

50

Deep learning prediction performance: Alexnet

51


0

1

2

3

4

5

6

7

8

9CPU resident memory GPU peak memory (nvidia-smi)

Mem

ory

usag

e (G

B)

Batch Size1 16 32 64

CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50 GHzGPU Tesla K40c

Py-C

affe

MAT

LAB

to C

UD

A co

mpi

ler

Tens

orFl

ow

MAT

LAB

on

CPU

+GPU

C++

-Caf

fe

52


0

50

100

150

200

250

1 16 32 64 128

Fram

e ra

te (F

ps)

Batch Size

C++-Caffe

MATLAB to CUDA compiler

Jetson (Tegra) TX1

53

Deep Learning Prediction Performance: VGG-16

0

10

20

30

40

50

60

70

80

90

1 16 32 64

CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHzGPU Tesla K40c

MATLAB to CUDA compiler

C++ Caffe

MATLAB running on CPU+GPU

TensorFlow

py Caffe

Fram

e ra

te (F

ps)

Batch Size

54

Create CNNs with MATLAB, Deploy with MATLAB to CUDA compiler

Alexnet Vehicle Detection

People detection Lane detection

~30 Fps(Tegra X1)

~66 Fps(Tegra X1)

~20 Fps(K40c)

~130 Fps(K40c)

55

Conclusions


AlgorithmAccelerate and Scale

Training

Deep learning design is easy in MATLAB

Parallel Computing Toolbox

7x faster than pyCaffe2x faster than TensorFlow

MATLAB to CUDA compiler14x faster than pyCaffe

4x faster than TensorFlow1.6x faster than C++ Caffe


Implementation

56

What Next?

www.mathworks.com/matlab-cuda-beta

MATLAB to CUDA compiler:Sign up for our beta program

Try Deep Learning with MATLAB

Visit our booth, we love to chat: Booth # 804

http://www.mathworks.com/matlab-cuda-beta

http://www.mathworks.com/matlab-cuda-beta

https://www.mathworks.com/products/deep-learning-whats-new.html

FROM DESKTOP TO CLOUD TO EMBEDDED GPUS - GTC On...

Documents

Transcript of FROM DESKTOP TO CLOUD TO EMBEDDED GPUS - GTC On...