Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples...

24
Deep Learning and Computer Vision on Zynq using reVISION/SDSoC

Transcript of Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples...

Page 1: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

Deep Learning and Computer Visionon Zynq using reVISION/SDSoC

Page 2: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Applications: Wide Range of Rapidly Changing Vision Guided Systems

Embedded Vision Systems Vision Guided Autonomous Systems

Vision Guided ‘Cobots’

‘Sense and Avoid’ & Autonomous Drones

Augmented Reality and HUDs

Autonomous Vehicles

Automated Surveillance

Automated Medical Diagnostics

Factory Robotics

Camera Equipped Aircraft

Physical Displays and HMI

Forward Auto Camera

Video Security Cams

Medical Imaging and Human Eye

Target Markets: Prosumer - Automotive - Industrial - Medical - A&D

Page 2

Page 3: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Application Mandates: From Embedded Vision to Autonomous Systems

Mandates: From Embedded Vision to Autonomous Systems

Intelligent and Immediate Response with Efficiency

Flexibility to Upgrade to Latest Algorithms & Sensors

Always Connected to Other Machines and the Cloud

Page 3

Page 4: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Mandates: From Embedded Vision to Autonomous Systems

Xilinx Unique Application Advantages

Responsive

Reconfigurable

Connected

Optimized from Sensors to <8-bit Inference & Control

Reconfigurable for Latest Networks & Sensors

Any-to-Any Connectivity

Page 4

Page 5: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Application Development

Algorithm Development

Platform Development

DNN

CNNGoogLeNet

SSD

FCN …

Page 5

Page 6: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Page 6

Zynq SoCs Offer Superior Throughput, Latency and Efficiency

6xImages/sec/watt

1/5Latency (ms)

Machine

Learning

Inference

Xilinx

Benchmark

Xilinx

Benchmark

• nVidia GoogLeNet performance: https://devblogs.nvidia.com/parallelforall/jetpack-doubles-jetson-tx1-deep-learning-inference/

• ML based on Xilinx GoogleNet performance roadmap

• All benchmarks utilize as much resources as possible on GPU (~99%) and programmable logic (~70%)

GoogLeNet

@ batch = 1

Xilinx ZU9 Xilinx ZU5 Nvidia TX1

Images/s 370.0 155.0 70

Power (W) 7.0 4.5 7.9

Images/s/watt 53.0 34.5 8.9

Real Time

Applications

LatencyGoogLeNet

@ batch = 1

Xilinx ZU9 Xilinx ZU5 Nvidia TX1

Images/s 370.0 155.0 70

Latency (ms) 2.7 6.4 14.2

GoogLeNet

@ batch = 8

Xilinx ZU9 Xilinx ZU5 Nvidia TX1

Images/s 370.0 155.0 163

Latency (ms) 2.7 6.4 49.0

For large batch,

CPU/GPU/DSPs latency

increases significantly

Page 7: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Training: Process for machine to

“learn” and optimize model from data

Inference: Using trained models to

predict/estimate outcomes from new

observations in efficient deployments

INFERENCE

Fewer

“dog”

“dog”

Input

”cat”

TRAINING

= ?

Many

labels

Error

Input

FP-32FIXED-16

(INT16)

FIXED-8

(INT8)

Difference

vs FP32

VGG-16 86.6% 86.6% 86.4% (0.2%)

GoogLeNet 88.6% 88.5% 85.7% (2.9%)

SqueezeNet 81.4% 81.4% 80.3% (1.1%)

Inference now 8 bit and below

for maximum efficiency

The Divergence of Training and Inference in Machine Learning

Top-5

Accuracy

https://arxiv.org/pdf/1510.00149v5.pdf

Page 8: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Page 8

Inference Precisions Moving to Lower and Variable Precision

Citation: https://arxiv.org/pdf/1510.00149.pdf

# of weight bits # of weight bits

Convolution Layers Fully Connected Layers

Page 9: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

TOPS

8bReducing precision from 8b to 1b shrinks LUT cost by 40x

Potential to scale CNN performance to above 23TOPS (ZU9)

BNN: Unparalleled Performance

8b

4b

1b1b1b1b1b1b1b1b

Q3TX2 ZU9

1.50.3 2.7

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

• Assuming 300 MHz with 90%/70% DSP/LUT utilizations

• Resource consumption assumption: 2.5 LUTs/op (INT1), 16 LUTs/op (INT4), 0.25 DSP/op (INT8)

Page 10: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Reducing precision from 8b to 1b shrinks LUT cost by 40x

Potential to scale CNN performance to above 23TOPS (ZU9)

BNN: Unparalleled Performance

8b

4b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

TOPS

1b

23

ZU9Q3TX2

1.50.3

• Assuming 300 MHz with 90%/70% DSP/LUT utilizations

• Resource consumption assumption: 2.5 LUTs/op (INT1), 16 LUTs/op (INT4), 0.25 DSP/op (INT8)

Page 11: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Reducing precision from 8b to 1b shrinks LUT cost by 40x

Potential to scale CNN performance to above 23TOPS (ZU9)

BNN: Unparalleled Performance

8b

4b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

1b1b1b1b1b1b1b1b

• Assuming 300 MHz with 90%/70% DSP/LUT utilizations

• Resource consumption assumption: 2.5 LUTs/op (INT1), 16 LUTs/op (INT4), 0.25 DSP/op (INT8)

• 10W power assumption on ZU9

GOPS/Watt

177

(8b)

J-TX2

(*)

Q3

2300

(1b)

ZU9

Embedded

102

(8b)

Page 12: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Small expense in accuracy but fast improving

–Agreement less precision is sufficient

8b Op # 1b Op - What is the Challenge?

0.00

10.00

20.00

30.00

40.00

50.00

60.00

7/6/2009 11/18/2010 4/1/2012 8/14/2013 12/27/2014 5/10/2016 9/22/2017

Top-5 Error (ImageNet)

BNN CNN Reduced Precision

Page 13: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Page 13

Future Proof Architecture for Any Precisions

2012 2013 201820152014 2017 20192016 2020

FP32 FP/INT16 INT8 INT6 INT4 INT2 INT1

GPU

CPU

Xilinx

Limited to 32 bit operations

SIMD operations at INT8/16

New devices required to

support change in precision efficiently

Beyond 8 bit

Reconfigurable to scale and

optimize for different precisions

32

162

x

8

4 2 1

16 8

Page 14: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Page 14

Low Latency Inference by Layer to Layer Dataflow On Chip

GPU/CPU: Dataflow using Off chip memory

Xilinx: Maximum local data reuse “layer merging” between layers

*Nvidia TX1 spec: http://wccftech.com/nvidia-tegra-x1-super-chip-announced-ces-2015-features-maxwell-core-architecture-256-cuda-cores/

On-chip MemoryNvidia Tegra X1 (GPU Regfile + L2) Xilinx ZU7 (BRAM + URAM)

6 Mb 38 Mb

Up to 6x More On-chip Memory than SoCs and eGPUs

Page 15: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Page 15

xFdnn: Direct Deep Learning Inference from Caffe

Compiles only ARM software code in minutes. No hardware compilation

1 2 3Import .prototxt and

trained weights

Call prototxt runtime

API in your application

Cross-compile for

Cortex-A53 and run on

a board

Page 16: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Page 16

Deep Learning Design Examples

May 2017 Roadmap

GoogLeNet

@ batch = 1

3.2 Gops/img

Images/s 121 370

Power (W) 6.0 7.0

Images/s/watt 20.2 52.9

SSD

@ batch = 1

62.4

Gops/img

Images/s 6.3

Power (W) 6.0

Images/s/watt 1.1

FCN-AlexNet

@ batch = 142.0 Gops/img

Images/s 7.0

Power (W) 6.0

Images/s/watt 1.2

Programmable Logic running at 300 MHz, Input size: GoogLeNet, AlexNet, VGG-16 = 224x224, SSD = 300x300,

FCN=480x480

VGG-16

@ batch = 1

30.9

Gops/img

Images/s 14.5

Power (W) 6.0

Images/s/watt 2.4

AlexNet

@ batch = 11.4 Gops/img

Images/s 92

Power (W) 6.0

Images/s/watt 15.3

Page 17: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Page 17

Performance & Area Scalability

0

20

40

60

80

100

120

128 256 448 704 1152 1344 1664

Pe

rf (

ima

ge

s/s

ec)

#DSPs

Performance Scaling w/ DSPs (GoogleNet)

Perf (fps)

0

20

40

60

80

100

120

128 256 448 704 1152 1344 1664

#DSPs

Resource Util w/ DSP Scaling

BRAM(in tens) LUT (in thousands) FF (in thousands)

Roadmap to 370 img/s

Page 18: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Page 18

Caffe Prototxt to hardware operators

ARM Host

Scheduler

PS PL

Convolutions

Pooling

De-

convolutions

Fully-

connected

Page 19: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Parameterized design.

Scalable

Rich Instruction Set with

30+ opcodes

Mixed Precision support

(16b, 8b)

Page 19

DeepX: Deep Learning Inference Processor

Page 20: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Deep Learning IP Export Flow (Roadmap)

main(){

imread(A);

imread(B);

roi_crop(A,img)

xFdnn <DSP,BRAM,BUF,…>(img,prototxt,weights,out)

imshow(out);}

MIPI

AXIPS

PL

Linux

Libraries

Application

Drivers

ROI CropHDMI

SDSoC

Generated

Platform

DMA

AXI-S

CNN

Export DNN IP and ARM scheduler to integrate into real system

Compile-time configuration of DNN IP (e.g. DSP, BRAM, buffer size …)

Page 21: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Page 21

Putting It All Together: CV and CNN with Multiple Sensors

USB-3Stereo Vision

MIPI CSI ISP

SD Card File Read

VPSSScaler

Video Mixer

Frame Buffer

Optical Flow

CNN

500x500x10RGB

1280x720p60YUYV 4:2:2

2x720p60YUYV 4:2:2

Frame Buffer

Frame Buffer

Frame Buffer

HDMI TX

3840x2160p60RGB

1280x720 @ 60 FPS

Dual 1280x720 @ 30 FPS

Available as Black-box Demo in 2017.1 and Design Example in 2017.3

Page 22: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

ZCU102 (ZU9)

Page 22

Embedded Vision Platform 2017.1

HDMI

MIPI

USB3

ISP/

VPSS*

D

D

R

D

D

RHDMI

DP

Stereo

Depth Map

Optical

Flow

CNN

ARM Cortex-A53

V4L2

Video Lib

DRM

Video Lib

Linux

DM* Driver

App Stub

SDSoC Application

SDSoC Generated

SDSoC Platform

Need to select one sink

Need to select one source

Page 23: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

© Copyright 2017 Xilinx.

Embedded Vision Development KitsBase Zynq Board ZCU102 ZC702 ZC706

Device ZU9 (16nm) Z7020 (28nm) Z7045 (28nm)

CPU Quad Cortex A53 up to

1.5GHz

Dual Cortex A9 up to 1.0GHz

Peak GOPS @ INT8 7857 571 2331

On-chip RAM (Mbits) 32.1 4.9 19.1

Inputs USB3, MIPI, HDMI HDMI* HDMI*

Outputs HDMI, DisplayPort HDMI HDMI

Video Codec Units No No No

reVISION Support xFopencv, xFdnn xFopencv, xFdnn xFopencv, xFdnn

Sensor Inputs Sony IMX274 StereoLab Zed Stereo eCon camera

Spec 3840x2160 @ 60 FPS 3840x1080 @ 30 FPS 1920x1080 @ 60 FPS

Interface MIPI via FMC USB3 USB3

* Requires an HDMI IO FMC card

Page 23

Page 24: Deep Learning and Computer Vision on Zynq using ......2017/07/11  · Deep Learning Design Examples May 2017 Roadmap GoogLeNet @ batch = 1 3.2 Gops/img Images/s 121 370 Power (W) 6.0

Thank You