"Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from...
1 © Copyright 2014 Xilinx
.
Kees Vissers
Distinguished Engineer, Xilinx
May 29, 2014
Programming Novel Recognition Algorithms on Heterogeneous Architectures
A Typical Embedded Vision Pipeline: From Pixels to Information
Typical total compute load: ~10-100 billion operations/second
Loads can vary dramatically with pixel rate and algorithm complexity
Figure: An Embedded Vision Pipeline
Pipeline stages: Image Acquisition -> Image Pre-processing -> Feature Detection -> Segmentation -> Object Analysis -> Heuristics or Expert System
Data rate and algorithm complexity, from the early stages to the late stages:
• Ultra-high data rates; low to medium algorithm complexity
• High to medium data rates; medium algorithm complexity
• Low data rates; high algorithm complexity
Required Pixel Rate Processing vs. Capabilities
(Chart; the only surviving label is "FPGA")
Processors and Pipelines
clock:sample                1000:1       100:1              10:1              1:1                  1:10
Data rate (200 MHz clock)   200 Ks/s     2 Ms/s             20 Ms/s           200 Ms/s             2 Gs/s
Design approach             RISC Proc.   Proc. w/ accels.   Folded datapath   Pipelined datapath   Replicated datapath
Applications (in order of increasing data rate): Expert analysis, Object analysis, Pixel processing
Chart labels: Embedded System Processors, HLS tools
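The data rates on this axis follow from dividing the 200 MHz clock by the clock:sample ratio. The short sketch below reproduces the slide's numbers; the function names are illustrative only and are not part of any Xilinx tool.

```python
from fractions import Fraction

CLOCK_HZ = 200_000_000  # 200 MHz clock, as stated on the slide's axis

# clock:sample ratios from the slide, written as (clock cycles, samples)
RATIOS = [(1000, 1), (100, 1), (10, 1), (1, 1), (1, 10)]


def data_rate(clock_hz, cycles, samples):
    """Samples per second when `samples` samples are handled every `cycles` cycles."""
    return Fraction(clock_hz * samples, cycles)


def format_rate(rate):
    """Render a rate with the K/M/G suffixes used on the slide."""
    for scale, suffix in ((10**9, "Gs/s"), (10**6, "Ms/s"), (10**3, "Ks/s")):
        if rate >= scale:
            return f"{float(rate / scale):g} {suffix}"
    return f"{float(rate):g} s/s"


rates = [data_rate(CLOCK_HZ, c, s) for c, s in RATIOS]
for (c, s), r in zip(RATIOS, rates):
    print(f"{c}:{s} -> {format_rate(r)}")
```

A ratio of 1:10 means ten samples per clock cycle, which is only reachable by replicating the datapath.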
Zynq Products in Context for Video
clock:sample                1000:1       100:1              10:1              1:1                  1:10
Data rate (200 MHz clock)   200 Ks/s     2 Ms/s             20 Ms/s           200 Ms/s             2 Gs/s
Design approach             RISC Proc.   Proc. w/ accels.   Folded datapath   Pipelined datapath   Replicated datapath
Applications (in order of increasing data rate): Expert analysis, Object analysis, Pixel processing
Zynq resources placed on this scale:
• ARM® A9 processors: 1-2 Gops
• Fabric: 10-500 Gops
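One way to read the fabric figure is as operations per clock cycle multiplied by clock frequency. The sketch below assumes the 200 MHz clock from the axis and uses the "100 to 1000 operations every clock cycle" range stated later in the deck; the helper names are hypothetical.

```python
CLOCK_HZ = 200e6  # assumption: the 200 MHz clock used on the slide's data-rate axis


def gops(ops_per_clock, clock_hz=CLOCK_HZ):
    """Sustained throughput in Gops for a given number of operations per cycle."""
    return ops_per_clock * clock_hz / 1e9


def ops_per_clock_needed(target_gops, clock_hz=CLOCK_HZ):
    """Parallel operations per cycle required to reach a target Gops figure."""
    return target_gops * 1e9 / clock_hz


# 100 to 1000 operations per clock at 200 MHz
print(gops(100), gops(1000))

# Parallelism implied by the slide's 10-500 Gops fabric range at 200 MHz
print(ops_per_clock_needed(10), ops_per_clock_needed(500))
```

Under these assumptions, 100 to 1000 operations per clock correspond to 20 to 200 Gops, which sits inside the 10-500 Gops range quoted for the fabric.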
Zynq® All Programmable SoC Platform
(Block diagram)
Processing System:
• 2x Cortex™-A9 MPCore™, each with NEON™/FPU Engine and 32/32 KB I/D Caches
• Snoop Control Unit (SCU), 512 KB L2 Cache, 256 KB On-Chip Memory
• ARM® CoreSight™ Multi-core & Trace Debug
• Timer Counters, General Interrupt Controller, DMA, Configuration
• Static Memory Controller: Quad-SPI, NAND, NOR
• Dynamic Memory Controller: DDR3, DDR2, LPDDR2
• Peripherals through the I/O MUX (MIO): 2x GigE with DMA, 2x USB with DMA, 2x SDIO with DMA, 2x SPI, 2x I2C, 2x CAN, 2x UART, GPIO
• AMBA® Switches
Programmable Logic:
• System Gates, DSP, RAM
• XADC, PCIe
• Multi-Standards I/Os (3.3V & High Speed 1.8V)
• Multi Gigabit Transceivers
ARM Processors and FPGA Fabric
• ARM Processor Limitations for Pixel Processing
  • Poor access locality: small caches perform poorly
  • Generic processors are limited in parallel operations and number of cores
• ARM Processor Benefits for Low Rate, Complex Code
  • Can execute large programs by time-sharing the ALU
  • Caches take care of memory abstraction with reasonable performance
• FPGA Limitations for Complex Code
  • Large programs are labor intensive to code; the memory model is explicit
• FPGA Benefits for Pixel Processing
  • Can do 100 to 1000 operations every clock cycle, without resource sharing
  • Can stream data, and separate between on-chip and off-chip memory
  • High Level Synthesis and OpenCV libraries for C/C++ programming
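The streaming point can be illustrated with a small software model. This is a minimal sketch in Python, not HLS code and not the Xilinx HLS video library: it assumes a 3x3 erosion (minimum filter) and models on-chip memory as two line buffers plus a 3x3 window, so each pixel is read from the input stream exactly once. The function name `erode3x3_stream` is hypothetical.

```python
from collections import deque


def erode3x3_stream(pixels, width):
    """Consume pixels in raster order and yield the 3x3 minimum per full window.

    Only two previous rows and a three-column window are stored, which mirrors
    keeping line buffers on chip while the frame itself stays off chip.
    """
    line_prev2 = [0] * width  # row y-2
    line_prev1 = [0] * width  # row y-1
    window = deque(maxlen=3)  # last three columns, each (top, middle, bottom)
    for i, p in enumerate(pixels):
        y, x = divmod(i, width)
        if x == 0:
            window.clear()  # a new row starts a new window
        window.append((line_prev2[x], line_prev1[x], p))
        # Shift the line buffers for this column
        line_prev2[x] = line_prev1[x]
        line_prev1[x] = p
        # A full 3x3 window exists once three rows and three columns have arrived
        if y >= 2 and x >= 2:
            yield min(min(col) for col in window)


image = [
    [9, 9, 9, 9],
    [9, 1, 9, 9],
    [9, 9, 9, 9],
    [9, 9, 9, 9],
]
flat = (p for row in image for p in row)  # a generator stands in for a pixel stream
print(list(erode3x3_stream(flat, width=4)))
```

The model produces one result per input pixel once the buffers are primed, which is the behavior a pipelined datapath with one pixel per clock would have. Border handling is left out to keep the sketch short.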
How a Software Programmer Wants to Use an FPGA
• Familiar tools using standard programming languages
• Ability to target multiple systems
• Scalable performance
Using High-Level Synthesis
Accelerates Algorithmic C/C++ to Co-Processing Accelerator Integration
(Flow diagram; surviving labels: Requirements, SW Spec, HW Spec, and two Verify/Iterate loops)
Using OpenCV Libraries on FPGAs
Diagram labels: Pure OpenCV Application, Integrated OpenCV Application, OpenCV Reference, Accelerated OpenCV Application
Flows shown:
• Image File Read (OpenCV) -> OpenCV function chain -> Image File Write (OpenCV)
• Image File Read (OpenCV) -> OpenCV2AXIvideo -> [Synthesizable Block: AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo] -> AXIvideo2OpenCV -> Image File Write (OpenCV)
• Live Video Input -> OpenCV function chain -> Live Video Output
• Live Video Input -> [Synthesized Block: AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo] -> Live Video Output
Embedded Vision SW + HW
(Block diagram)
• Inputs and outputs: SD Card, Image Sensor, HDMI Video Input, HDMI Video Output
• Programmable logic: Video Processing Pipeline (IP Core), AXI Interconnect
• Processing System: Dual Core Cortex-A9, Hardened Peripherals, DDR Memory Controller (DDR3)
• DDR3 External Memory
• Interface legend: S_AXI_GP 32 bit, S_AXI_HP 64 bit, AXI4 Stream
1080P60 Corner Detection
ARM Processor and FPGA Performance
• Efficiently implement real-time video processing
• Throughput stays constant with more complex programs
Performance of Xilinx HLS video library compared to single core 1GHz Cortex-A9
Implementation                            Throughput (Megapixels per sec)   Acceleration * (vs. OpenCV on ARM)
FAST9 Corner Detection, FPGA              124                               10x
FAST9 Corner Detection, Neon optimized    24                                2x
FAST9 Corner Detection, OpenCV on ARM     12                                1x
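The acceleration column can be checked against the throughput column by dividing each throughput by the OpenCV-on-ARM baseline. A minimal sketch using only the numbers on this slide:

```python
# Throughput in megapixels per second, copied from the slide
throughput_mpx = {
    "FPGA": 124,
    "Neon optimized": 24,
    "OpenCV on ARM": 12,
}

baseline = throughput_mpx["OpenCV on ARM"]
speedup = {name: mpx / baseline for name, mpx in throughput_mpx.items()}

for name, factor in speedup.items():
    print(f"FAST9 Corner Detection, {name}: {factor:.1f}x")
```

The FPGA ratio comes out slightly above 10, which the slide rounds to 10x.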
FPGA Implementations with OpenCV Libraries
• Efficiently implement real-time video processing
• Throughput stays constant with more complex programs
Performance of Xilinx HLS video library compared to single core 1GHz Cortex-A9
Function                    Throughput (Megapixels per sec, 1080p)   Acceleration * (vs. OpenCV on ARM)
FAST9 Corner Detection      124                                      10x
Canny Edge Detection        124                                      14x
Harris Corner Detection     124                                      50x
Erosion/Dilation            124                                      5x
5x5 convolution             124                                      27x
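The constant 124 megapixels per second matches the pixel rate of a 1080p60 stream. The sketch below assumes "1080p" means 1920x1080 active pixels at 60 frames per second (blanking ignored), and then derives the ARM throughput implied by each acceleration factor. The implied ARM figures are back-calculated here and are not measurements from the presentation.

```python
WIDTH, HEIGHT, FPS = 1920, 1080, 60  # assumption: 1080p60, active pixels only

pixel_rate = WIDTH * HEIGHT * FPS    # pixels per second
print(f"1080p60 pixel rate: {pixel_rate / 1e6:.1f} Mpixels/s")

FPGA_MPX = 124  # throughput from the slide, identical for every function

# Acceleration factors from the slide (vs. OpenCV on ARM)
acceleration = {
    "FAST9 Corner Detection": 10,
    "Canny Edge Detection": 14,
    "Harris Corner Detection": 50,
    "Erosion/Dilation": 5,
    "5x5 convolution": 27,
}

# Derived, not measured: ARM throughput implied by each factor
implied_arm_mpx = {name: FPGA_MPX / factor for name, factor in acceleration.items()}

for name, mpx in implied_arm_mpx.items():
    fps = mpx * 1e6 / (WIDTH * HEIGHT)
    print(f"{name}: about {mpx:.1f} Mpixels/s on ARM, about {fps:.1f} frames/s at 1080p")
```

Read this way, the FPGA keeps up with the full 1080p60 stream for every function, while the implied ARM frame rate drops as the function gets more complex.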
VanGogh Imaging Starry Night
VanGogh Imaging system level performance
• Select Functions to Be Implemented in FPGA and ARM
  • FPGA: Matrix operations
  • ARM: Data management
• Entire implementation done in C++ (Xilinx Vivado HLS)
• Performance: Amount of time it takes to find one object
  • Before: ARM only (single-thread): 4 seconds
  • Now: Xilinx Zynq FPGA
    • Zynq 7020: 1.5 seconds
    • Zynq 7040 (est.): 500 msec
• Total current Speedup: 8x
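The speedup follows from dividing the ARM-only time by the accelerated time. A small sketch using the times on this slide:

```python
ARM_ONLY_S = 4.0  # single-thread ARM baseline, from the slide

# Time to find one object, in seconds, from the slide
zynq_times_s = {
    "Zynq 7020": 1.5,
    "Zynq 7040 (est.)": 0.5,
}

speedups = {device: ARM_ONLY_S / t for device, t in zynq_times_s.items()}

for device, factor in speedups.items():
    print(f"{device}: {factor:.1f}x faster than ARM only")
```

By this arithmetic the 8x figure corresponds to the estimated Zynq 7040 time, while the Zynq 7020 time corresponds to roughly 2.7x.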
Next: Single Program for Processors + FPGA
(Flow diagram)
Application in C/C++/OpenCL -> Combined Compilation and High Level Synthesis -> Binary for CPU + Bitstream for PL fabric
Zynq AP SoC Device: ARM CPUs, Memory, Interconnect, FPGA Fabric (Kernel1, Kernel2, Kernel3)
Power Measurement Zones on Zynq AP SoC Devices
(The Zynq All Programmable SoC block diagram, annotated with power measurement zones)
Zone labels: INT, BRAM, AUX, ADJ, PINT, MIO 1V5, PAUX
Typical Performance and Power Measurements
• Processing speedup shown ranges from 8x and 10x up to 50x
• A9 processors: 500 mW to 800 mW
• FPGA fabric fully running: 500 mW to 1 W
• On-chip I/O: a few hundred mW; on-board DRAM: 800 mW
• Energy efficiency is in the 10 to 100 Gops/W range for the FPGA in the complete system
• Total Zynq AP SoC power consumption is in the range of 2 W for typical applications on small Zynq AP SoC devices
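The figures on this slide can be combined into a rough power budget. The sketch below is an illustration under stated assumptions, not a measurement: "a few hundred mW" of on-chip I/O is taken as 0.2 W to 0.4 W, the on-board DRAM is left out of the SoC total because it is off chip, and the Gops values reuse 100 to 1000 operations per clock at an assumed 200 MHz.

```python
# (low, high) power per zone in watts
zones_w = {
    "A9 processors": (0.5, 0.8),  # from the slide
    "FPGA fabric": (0.5, 1.0),    # from the slide
    "on-chip I/O": (0.2, 0.4),    # ASSUMPTION for "a few hundred mW"
}

total_low = sum(low for low, _ in zones_w.values())
total_high = sum(high for _, high in zones_w.values())
print(f"SoC total: {total_low:.1f} W to {total_high:.1f} W")


def gops_per_watt(gops, watts):
    """Energy efficiency of a workload running at `gops` on a `watts` budget."""
    return gops / watts


# ASSUMPTION: 20 and 200 Gops (100 and 1000 ops per clock at 200 MHz), 2 W system
SYSTEM_W = 2.0
print(gops_per_watt(20, SYSTEM_W), gops_per_watt(200, SYSTEM_W))
```

Under these assumptions the budget lands near the 2 W quoted for small devices, and 20 to 200 Gops at 2 W gives 10 to 100 Gops/W, which is one reading consistent with the efficiency range on the slide.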
Conclusion
• The combination of processors and FPGA is well suited for image processing and object recognition applications
• Programming FPGAs with High Level Synthesis and OpenCV libraries raises the level of abstraction for embedded programmers
• New tools will raise the level of abstraction to C/C++/OpenCL programming for the combination of ARM processors and FPGA
• The power consumption of these systems is in the range of a few watts
• Novel customer applications, e.g. ‘Starry Night’, confirm the novel programming approach and its efficient implementation on Zynq AP SoC devices
References
• OpenCV and HLS Video
http://www.xilinx.com/csi/training/vivado/leveraging-opencv-and-high-level-synthesis-with-vivado.htm
• OpenCV and HLS Application Note
http://www.xilinx.com/support/documentation/application_notes/xapp1167.pdf
• Xilinx Zynq AP SoC 702 Board
http://www.xilinx.com/products/boards-and-kits/EK-Z7-ZC702-G.htm