Deep Learning and Computer Vision on Zynq using reVISION/SDSoC
2017/07/11
© Copyright 2017 Xilinx.
Applications: Wide Range of Rapidly Changing Vision Guided Systems
Embedded Vision Systems and Vision Guided Autonomous Systems:
Vision Guided ‘Cobots’
‘Sense and Avoid’ & Autonomous Drones
Augmented Reality and HUDs
Autonomous Vehicles
Automated Surveillance
Automated Medical Diagnostics
Factory Robotics
Camera Equipped Aircraft
Physical Displays and HMI
Forward Auto Camera
Video Security Cams
Medical Imaging and Human Eye
Target Markets: Prosumer - Automotive - Industrial - Medical - A&D
Application Mandates: From Embedded Vision to Autonomous Systems
Intelligent and Immediate Response with Efficiency
Flexibility to Upgrade to Latest Algorithms & Sensors
Always Connected to Other Machines and the Cloud
Xilinx Unique Application Advantages
Responsive: Optimized from Sensors to <8-bit Inference & Control
Reconfigurable: For the Latest Networks & Sensors
Connected: Any-to-Any Connectivity
Application Development
Algorithm Development
Platform Development
DNN, CNN, GoogLeNet, SSD, FCN …
Zynq SoCs Offer Superior Throughput, Latency and Efficiency

Machine learning inference (Xilinx benchmarks): 6x images/sec/watt, 1/5 the latency (ms).

GoogLeNet @ batch = 1   Xilinx ZU9   Xilinx ZU5   Nvidia TX1
Images/s                370.0        155.0        70
Power (W)               7.0          4.5          7.9
Images/s/watt           53.0         34.5         8.9

GoogLeNet @ batch = 1   Xilinx ZU9   Xilinx ZU5   Nvidia TX1
Images/s                370.0        155.0        70
Latency (ms)            2.7          6.4          14.2

GoogLeNet @ batch = 8   Xilinx ZU9   Xilinx ZU5   Nvidia TX1
Images/s                370.0        155.0        163
Latency (ms)            2.7          6.4          49.0

Real-time applications need low latency at batch = 1; for large batches, CPU/GPU/DSP latency increases significantly.

• Nvidia GoogLeNet performance: https://devblogs.nvidia.com/parallelforall/jetpack-doubles-jetson-tx1-deep-learning-inference/
• ML based on Xilinx GoogLeNet performance roadmap
• All benchmarks utilize as much of the resources as possible on the GPU (~99%) and programmable logic (~70%)
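The headline "6x images/sec/watt" and "1/5 latency" figures can be re-derived from the batch = 1 table; a minimal sketch, with the values transcribed from above:

```python
# Benchmark numbers from the GoogLeNet @ batch = 1 tables (ZU9/ZU5 vs. Jetson TX1).
bench = {
    "ZU9": {"img_s": 370.0, "power_w": 7.0, "latency_ms": 2.7},
    "ZU5": {"img_s": 155.0, "power_w": 4.5, "latency_ms": 6.4},
    "TX1": {"img_s": 70.0,  "power_w": 7.9, "latency_ms": 14.2},
}
for d in bench.values():
    d["img_s_per_w"] = d["img_s"] / d["power_w"]

eff_ratio = bench["ZU9"]["img_s_per_w"] / bench["TX1"]["img_s_per_w"]
lat_ratio = bench["ZU9"]["latency_ms"] / bench["TX1"]["latency_ms"]
print(f"efficiency advantage: {eff_ratio:.1f}x")  # ~6x
print(f"latency fraction: {lat_ratio:.2f}")       # ~1/5
```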
The Divergence of Training and Inference in Machine Learning

Training: the process by which a machine "learns" and optimizes a model from data.
Inference: using trained models to predict/estimate outcomes from new observations in efficient deployments.

[Figure: training iterates over many labeled inputs ("dog", "cat") against an error signal; inference maps a new input to a prediction with fewer resources]

Top-5 Accuracy   FP32    FIXED-16 (INT16)   FIXED-8 (INT8)   Difference vs FP32
VGG-16           86.6%   86.6%              86.4%            (0.2%)
GoogLeNet        88.6%   88.5%              85.7%            (2.9%)
SqueezeNet       81.4%   81.4%              80.3%            (1.1%)

Inference is now 8 bit and below for maximum efficiency.

https://arxiv.org/pdf/1510.00149v5.pdf
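The FP32 to INT8 conversion behind the accuracy table can be illustrated with symmetric per-tensor quantization; a minimal sketch (the cited paper and production tools use more refined schemes):

```python
import numpy as np

def quantize_int8(w):
    """Map FP32 weights onto [-127, 127] with a single shared scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([0.5, -1.27, 0.003, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale      # dequantize
print(q.tolist())                          # [50, -127, 0, 100]
print(float(np.abs(w - w_hat).max()))      # max error is within half a step
```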
Inference Precisions Moving to Lower and Variable Precision
Citation: https://arxiv.org/pdf/1510.00149.pdf
[Figure: accuracy vs. number of weight bits, shown separately for convolution layers and fully connected layers]
BNN: Unparalleled Performance

Reducing precision from 8b to 1b shrinks LUT cost by 40x.
Potential to scale CNN performance to above 23 TOPS (ZU9).

[Chart: peak TOPS at 8b precision: Q3 1.5, TX2 0.3, ZU9 2.7; at 1b, the ZU9 scales to 23 TOPS]
[Chart: embedded GOPS/Watt: J-TX2 (*) 102 (8b), Q3 177 (8b), ZU9 2300 (1b)]

• Assuming 300 MHz with 90%/70% DSP/LUT utilization
• Resource consumption assumptions: 2.5 LUTs/op (INT1), 16 LUTs/op (INT4), 0.25 DSP/op (INT8)
• 10 W power assumption on ZU9
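The 23 TOPS figure follows from the stated assumptions plus the ZU9 LUT count; the 274,080-LUT figure below is our assumption (from the ZU9EG device tables), not stated on the slide:

```python
# Slide assumptions: 300 MHz, 70% LUT utilization, 2.5 LUTs per INT1 op.
LUTS_ZU9 = 274_080        # assumed ZU9EG LUT count (not from the slide)
LUT_UTIL = 0.70
LUTS_PER_INT1_OP = 2.5
FREQ_HZ = 300e6

ops_per_cycle = LUTS_ZU9 * LUT_UTIL / LUTS_PER_INT1_OP
tops = ops_per_cycle * FREQ_HZ / 1e12
print(f"{tops:.1f} TOPS")  # ~23 TOPS
```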
8b Ops vs. 1b Ops: What is the Challenge?

A small expense in accuracy, but fast improving; there is growing agreement that less precision is sufficient.

[Chart: Top-5 error (ImageNet) over time, mid-2009 through 2017, for CNN, reduced-precision, and BNN models]
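The reason a 1b op is so cheap: with weights and activations constrained to {-1, +1}, a multiply-accumulate reduces to XNOR plus popcount. A minimal sketch (the bit-packing convention here is ours):

```python
def bnn_dot(a_bits, w_bits, n):
    """Dot product of two n-element {-1, +1} vectors, bit i = 1 meaning +1."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ w_bits) & mask      # bit set where the signs match
    matches = bin(xnor).count("1")        # popcount
    return 2 * matches - n                # matches minus mismatches

# a = [+1, -1, +1, -1] and w = [+1, +1, -1, -1]  ->  1 - 1 - 1 + 1 = 0
print(bnn_dot(0b1010, 0b1100, 4))  # 0
```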
Future-Proof Architecture for Any Precision

2012  2013  2014  2015  2016  2017  2018  2019  2020
FP32  FP/INT16  INT8  INT6  INT4  INT2  INT1

CPU: limited to 32-bit operations
GPU: SIMD operations at INT8/16; new devices required to support changes in precision efficiently
Xilinx: beyond 8 bit; reconfigurable to scale and optimize for different precisions
Low-Latency Inference by Layer-to-Layer Dataflow On Chip

GPU/CPU: dataflow through off-chip memory.
Xilinx: maximum local data reuse, with "layer merging" between layers.

On-chip memory: Nvidia Tegra X1 (GPU regfile + L2): 6 Mb; Xilinx ZU7 (BRAM + URAM): 38 Mb.
Up to 6x more on-chip memory than SoCs and eGPUs.

*Nvidia TX1 spec: http://wccftech.com/nvidia-tegra-x1-super-chip-announced-ces-2015-features-maxwell-core-architecture-256-cuda-cores/
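The layer-merging idea can be sketched as follows: evaluate adjacent layers as one pipeline so the intermediate feature map never spills to external memory (here "off-chip" traffic is just a counter, and the layers are 1-D stand-ins):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def unfused(x, k1, k2, spilled):
    a = np.convolve(x, k1, mode="same")   # layer 1: full output written off chip
    spilled.append(a.size)
    b = relu(a)                            # layer 2: read back, written again
    spilled.append(b.size)
    return np.convolve(b, k2, mode="same")

def fused(x, k1, k2, spilled):
    # Layers evaluated as one dataflow pipeline: nothing leaves the chip.
    return np.convolve(relu(np.convolve(x, k1, mode="same")), k2, mode="same")

x, k1, k2 = np.random.rand(1024), np.ones(3), np.ones(3)
s_unfused, s_fused = [], []
assert np.allclose(unfused(x, k1, k2, s_unfused), fused(x, k1, k2, s_fused))
print(sum(s_unfused), sum(s_fused))  # 2048 0
```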
xFdnn: Direct Deep Learning Inference from Caffe

Compiles only the ARM software code, in minutes; no hardware compilation.

1. Import the .prototxt and trained weights
2. Call the prototxt runtime API in your application
3. Cross-compile for Cortex-A53 and run on a board
Deep Learning Design Examples (batch = 1, May 2017, with roadmap)

Network       Gops/img   Images/s     Power (W)    Images/s/watt
GoogLeNet     3.2        121 (370*)   6.0 (7.0*)   20.2 (52.9*)
AlexNet       1.4        92           6.0          15.3
VGG-16        30.9       14.5         6.0          2.4
SSD           62.4       6.3          6.0          1.1
FCN-AlexNet   42.0       7.0          6.0          1.2
* roadmap figures

Programmable logic running at 300 MHz. Input sizes: GoogLeNet, AlexNet, VGG-16 = 224x224; SSD = 300x300; FCN = 480x480.
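A useful cross-check on the table: per-image workload times throughput gives the sustained Gops/s each design achieves (May 2017 figures from above):

```python
designs = {  # name: (Gops/img, images/s)
    "GoogLeNet":   (3.2, 121),
    "AlexNet":     (1.4, 92),
    "VGG-16":      (30.9, 14.5),
    "SSD":         (62.4, 6.3),
    "FCN-AlexNet": (42.0, 7.0),
}
for name, (gops_per_img, img_s) in designs.items():
    print(f"{name:12s} {gops_per_img * img_s:6.1f} Gops/s")
```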
Performance & Area Scalability

[Chart: GoogLeNet performance (images/sec) scaling with DSP count: 128, 256, 448, 704, 1152, 1344, 1664 DSPs]
[Chart: resource utilization (BRAM in tens, LUTs in thousands, FFs in thousands) as DSPs scale]

Roadmap to 370 img/s.
DeepX: Deep Learning Inference Processor

Caffe prototxt is mapped to hardware operators: an ARM host scheduler runs in the PS, while convolution, pooling, deconvolution, and fully-connected operators run in the PL.

• Parameterized, scalable design
• Rich instruction set with 30+ opcodes
• Mixed-precision support (16b, 8b)
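A host-side scheduler of the kind described here can be sketched as follows (the operator names and the PS/PL split below are illustrative, not the actual DeepX implementation):

```python
# Layer types the PL operators cover, per the slide; everything else falls
# back to the ARM host.
HW_OPERATORS = {"Convolution", "Pooling", "Deconvolution", "InnerProduct"}

def schedule(layers):
    """layers: (name, type) pairs in topological order -> dispatch plan."""
    return [(name, ltype, "PL" if ltype in HW_OPERATORS else "ARM")
            for name, ltype in layers]

net = [("conv1", "Convolution"), ("pool1", "Pooling"),
       ("fc8", "InnerProduct"), ("prob", "Softmax")]
for name, ltype, target in schedule(net):
    print(f"{name:6s} {ltype:12s} -> {target}")
```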
Deep Learning IP Export Flow (Roadmap)

main(){
  imread(A);
  imread(B);
  roi_crop(A, img);
  xFdnn<DSP,BRAM,BUF,…>(img, prototxt, weights, out);
  imshow(out);
}

[Diagram: the application, libraries, and drivers run on Linux in the PS; the SDSoC-generated platform in the PL connects MIPI and HDMI through ROI crop, DMA, and AXI-Stream to the CNN block]

Export the DNN IP and ARM scheduler to integrate into a real system.
Compile-time configuration of the DNN IP (e.g. DSP, BRAM, buffer size …).
Putting It All Together: CV and CNN with Multiple Sensors (ZCU102, ZU9)

[Diagram: USB-3 stereo vision (2x 720p60 YUYV 4:2:2), a MIPI CSI camera through an ISP (1280x720p60 YUYV 4:2:2), and SD card file read feed frame buffers; optical flow and a CNN (500x500x10 RGB) process the streams, and a VPSS scaler plus video mixer composite the results to HDMI TX at 3840x2160p60 RGB. Single stream: 1280x720 @ 60 FPS; dual: 2x 1280x720 @ 30 FPS]

Available as a black-box demo in 2017.1 and a design example in 2017.3.
Embedded Vision Platform 2017.1

[Diagram: HDMI, MIPI, and USB3 sources pass through an ISP/VPSS* and DDR into the SDSoC-generated accelerators (stereo depth map, optical flow, CNN), then out through DDR to an HDMI or DP sink; one source and one sink must be selected. On the ARM Cortex-A53, the SDSoC application's stub calls the V4L2 and DRM video libraries over the Linux DM* driver; the SDSoC platform, the SDSoC-generated logic, and the application are separate layers]
Embedded Vision Development Kits

Base Zynq Board       ZCU102                ZC702                  ZC706
Device                ZU9 (16nm)            Z7020 (28nm)           Z7045 (28nm)
CPU                   Quad Cortex-A53       Dual Cortex-A9         Dual Cortex-A9
                      up to 1.5 GHz         up to 1.0 GHz          up to 1.0 GHz
Peak GOPS @ INT8      7857                  571                    2331
On-chip RAM (Mbits)   32.1                  4.9                    19.1
Inputs                USB3, MIPI, HDMI      HDMI*                  HDMI*
Outputs               HDMI, DisplayPort     HDMI                   HDMI
Video Codec Units     No                    No                     No
reVISION Support      xFopencv, xFdnn       xFopencv, xFdnn        xFopencv, xFdnn
Sensor Inputs         Sony IMX274           StereoLab Zed Stereo   eCon camera
Spec                  3840x2160 @ 60 FPS    3840x1080 @ 30 FPS     1920x1080 @ 60 FPS
Interface             MIPI via FMC          USB3                   USB3

* Requires an HDMI IO FMC card
Thank You