"Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from...

© Copyright 2014 Xilinx

Kees Vissers

Distinguished Engineer, Xilinx

May 29, 2014

Programming Novel Recognition Algorithms on Heterogeneous Architectures



An Embedded Vision Pipeline

A Typical Embedded Vision Pipeline: From Pixels to Information

Pipeline stages, in order:
Image Acquisition -> Image Pre-processing -> Feature Detection -> Segmentation -> Object Analysis -> Heuristics or Expert System

Data rate and algorithm complexity, from the front of the pipeline to the back:
• Ultra-high data rates; low to medium algorithm complexity
• High to medium data rates; medium algorithm complexity
• Low data rates; high algorithm complexity

Typical total compute load: ~10-100 billion operations/second

Loads can vary dramatically with pixel rate and algorithm complexity
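The compute load quoted on this slide can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the 1080p60 input format and the figure of 100 to 1000 operations per pixel are assumptions chosen for the example, not numbers taken from the slide.

```python
# Rough compute-load estimate for a video pipeline.
# Assumptions (not from the slide): 1080p60 input, 100-1000 operations per pixel.

def pixel_rate(width, height, fps):
    """Pixels per second for a given frame size and frame rate."""
    return width * height * fps

def compute_load_gops(pixels_per_second, ops_per_pixel):
    """Total load in billions of operations per second (Gops)."""
    return pixels_per_second * ops_per_pixel / 1e9

rate = pixel_rate(1920, 1080, 60)        # 124,416,000 pixels per second
low = compute_load_gops(rate, 100)       # lighter algorithm
high = compute_load_gops(rate, 1000)     # heavier algorithm
print(f"{rate / 1e6:.1f} Mpixel/s -> {low:.1f} to {high:.1f} Gops")
```

Under these assumptions the result lands in the same order of magnitude as the 10-100 billion operations/second on the slide, and it scales linearly with both pixel rate and per-pixel work.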



Required Pixel Rate Processing vs. Capabilities

[Chart; the only label that survives in the transcript is "FPGA"]



Processors and Pipelines

clock:sample               1000:1       100:1              10:1              1:1                  1:10
Data Rate (200MHz clock)   200Ks/s      2Ms/s              20Ms/s            200Ms/s              2Gs/s
Design approach            RISC Proc.   Proc. w/ accels.   Folded datapath   Pipelined datapath   Replicated datapath

Applications (from low to high data rate): Expert analysis, Object analysis, Pixel processing

Chart labels: Embedded System Processors; HLS tools
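The data rates on this slide follow directly from the clock:sample ratio at a 200 MHz clock. A minimal sketch of that relationship, with function and variable names that are purely illustrative:

```python
# Sample rate achievable for a given clock:sample ratio,
# using the 200 MHz clock stated on the slide.

CLOCK_HZ = 200e6

def sample_rate(clocks, samples, clock_hz=CLOCK_HZ):
    """Samples per second when `clocks` cycles are spent on `samples` samples."""
    return clock_hz * samples / clocks

# (design approach, clocks, samples) as listed on the slide
approaches = [
    ("RISC Proc.", 1000, 1),
    ("Proc. w/ accels.", 100, 1),
    ("Folded datapath", 10, 1),
    ("Pipelined datapath", 1, 1),
    ("Replicated datapath", 1, 10),
]

for name, clocks, samples in approaches:
    rate_msps = sample_rate(clocks, samples) / 1e6
    print(f"{name:<20} {clocks}:{samples:<3} {rate_msps:g} Ms/s")
```

Read this way, a 1000:1 ratio leaves 1000 clock cycles of work per sample, while a 1:10 ratio requires ten samples to be handled in every cycle, which is only possible with replicated hardware.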



Zynq Products in Context for Video

clock:sample               1000:1       100:1              10:1              1:1                  1:10
Data Rate (200MHz clock)   200Ks/s      2Ms/s              20Ms/s            200Ms/s              2Gs/s
Design approach            RISC Proc.   Proc. w/ accels.   Folded datapath   Pipelined datapath   Replicated datapath

Applications (from low to high data rate): Expert analysis, Object analysis, Pixel processing

ARM® A9 processors: 1-2 Gops
Fabric: 10-500 Gops



Zynq® All Programmable SoC Platform

Processing System
• Cortex™-A9 MPCore™: two cores, each with a NEON™/FPU Engine and 32/32 KB I/D Caches
• Snoop Control Unit (SCU)
• 512 KB L2 Cache
• 256 KB On-Chip Memory
• Timer Counters
• General Interrupt Controller
• DMA
• Configuration
• ARM® CoreSight™ Multi-core & Trace Debug
• Static Memory Controller: Quad-SPI, NAND, NOR
• Dynamic Memory Controller: DDR3, DDR2, LPDDR2
• Peripherals (through I/O MUX, MIO): 2x GigE with DMA, 2x USB with DMA, 2x SDIO with DMA, 2x SPI, 2x I2C, 2x CAN, 2x UART, GPIO
• AMBA® Switches

Programmable Logic: System Gates, DSP, RAM
• XADC
• PCIe
• Multi-Standards I/Os (3.3V & High Speed 1.8V)
• Multi Gigabit Transceivers



ARM Processors and FPGA Fabric

• ARM Processor Limitations for Pixel Processing
    • Poor access locality: small caches perform poorly
    • Generic processors are limited in parallel operations and number of cores
• ARM Processor Benefits for Low Rate, Complex Code
    • Can execute large programs; time-shares the ALU
    • Caches take care of memory abstraction with reasonable performance
• FPGA Limitations for Complex Code
    • Large programs are labor intensive to code; explicit memory model
• FPGA Benefits for Pixel Processing
    • Can do 100 to 1000 operations every clock cycle, without resource sharing
    • Can stream data, and separate between on-chip and off-chip memory
    • High Level Synthesis and OpenCV libraries for C/C++ programming
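The contrast between a time-shared ALU and 100 to 1000 operations per clock cycle can be put into numbers. This is a back-of-the-envelope sketch: the 200 MHz fabric clock and the 1 GHz processor clock appear elsewhere in the deck, but treating the processor as one operation per cycle is a simplifying assumption made here.

```python
# Peak throughput = operations per cycle x clock frequency.
# Assumptions: 200 MHz fabric clock; a 1 GHz processor modeled as
# a single time-shared ALU doing one operation per cycle.

def peak_gops(ops_per_cycle, clock_hz):
    """Peak billions of operations per second."""
    return ops_per_cycle * clock_hz / 1e9

fabric_low = peak_gops(100, 200e6)     # 100 parallel operations per cycle
fabric_high = peak_gops(1000, 200e6)   # 1000 parallel operations per cycle
cpu = peak_gops(1, 1e9)                # one time-shared ALU

print(f"Fabric: {fabric_low:.0f} to {fabric_high:.0f} Gops, processor: {cpu:.0f} Gops")
```

Both results fall inside the ranges quoted on the "Zynq Products in Context for Video" slide (10-500 Gops for the fabric, 1-2 Gops for the A9 processors).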



How a Software Programmer Wants to Use an FPGA

• Familiar tools using standard programming languages
• Ability to target multiple systems
• Scalable performance



Using High-Level Synthesis

Accelerates Algorithmic C/C++ to Co-Processing Accelerator Integration

[Flow diagram labels: Requirements; SW Spec; HW Spec; Verify, Iterate (shown twice)]



Using OpenCV Libraries on FPGAs

Pure OpenCV Application:
    Image File Read (OpenCV) -> OpenCV function chain -> Image File Write (OpenCV)

Integrated OpenCV Application:
    Image File Read (OpenCV) -> OpenCV2AXIvideo -> [Synthesizable Block: AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo] -> AXIvideo2OpenCV -> Image File Write (OpenCV)

OpenCV Reference:
    Live Video Input -> OpenCV function chain -> Live Video Output

Accelerated OpenCV Application:
    Live Video Input -> [Synthesized Block: AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo] -> Live Video Output
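The file-based flows on this slide suggest a simple verification pattern: run a reference function chain on a whole image, run a streaming version on the same pixels, and compare the results. The sketch below illustrates that pattern in plain Python. It does not use OpenCV or the Xilinx HLS video library; `to_stream` and `from_stream` are hypothetical stand-ins for the image-to-stream conversions, and a 3x3 erosion with copied borders was picked arbitrarily as the function under test.

```python
from collections import deque

def erode3x3_reference(img):
    """Reference: 3x3 erosion (minimum filter) on a 2D list; border pixels are copied."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = min(img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return out

def to_stream(img):
    """Hypothetical stand-in for an image-to-stream converter: raster-order pixels."""
    for row in img:
        yield from row

def from_stream(pixels, width):
    """Hypothetical stand-in for a stream-to-image converter."""
    flat = list(pixels)
    return [flat[i:i + width] for i in range(0, len(flat), width)]

def erode3x3_stream(pixels, width):
    """Streaming version: consumes pixels in raster order and never holds
    more than three rows (plus the row being filled) at a time."""
    lines = deque(maxlen=3)          # line buffers
    current = []
    for p in pixels:
        current.append(p)
        if len(current) < width:
            continue
        lines.append(current)        # a full row has arrived
        current = []
        if len(lines) == 1:
            yield from lines[0]      # first row is a border row: copied
        elif len(lines) == 3:
            top, mid, bot = lines
            for x in range(width):
                if x == 0 or x == width - 1:
                    yield mid[x]     # left/right border: copied
                else:
                    yield min(min(r[x - 1:x + 2]) for r in (top, mid, bot))
    if len(lines) >= 2:
        yield from lines[-1]         # last row is a border row: copied

# Deterministic 5x6 test image
img = [[(7 * x + 13 * y) % 32 for x in range(6)] for y in range(5)]
width = len(img[0])

reference = erode3x3_reference(img)
result = from_stream(erode3x3_stream(to_stream(img), width), width)
print("streaming result matches reference:", result == reference)
```

The streaming function touches each pixel once and only needs a few rows of on-chip storage, which is the property that makes this style of code a good fit for a synthesizable block.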



Embedded Vision SW + HW

[Block diagram labels]
• Inputs and outputs: Image Sensor; SD Card; HDMI Video Input; HDMI Video Output
• Video Processing Pipeline
• AXI Interconnect
• Processing System: Dual Core Cortex-A9; DDR Memory Controller (DDR3); Hardened Peripherals
• DDR3 External Memory
• Legend: S_AXI_GP 32 bit; S_AXI_HP 64 bit; AXI4 Stream; IP Core



1080P60 Corner Detection



ARM Processor and FPGA Performance

• Efficiently implement real-time video processing
• Throughput stays constant with more complex programs

Implementation                            Throughput             Acceleration *
                                          (Megapixels per sec)   (vs. OpenCV on ARM)
FAST9 Corner Detection, FPGA              124                    10x
FAST9 Corner Detection, Neon optimized    24                     2x
FAST9 Corner Detection, OpenCV on ARM     12                     1x

* Performance of Xilinx HLS video library compared to single core 1GHz Cortex-A9



FPGA Implementations with OpenCV Libraries

• Efficiently implement real-time video processing
• Throughput stays constant with more complex programs

Function                    Throughput                      Acceleration *
                            (Megapixels per sec) (1080p)    (vs. OpenCV on ARM)
FAST9 Corner Detection      124                             10x
Canny Edge Detection        124                             14x
Harris Corner Detection     124                             50x
Erosion/Dilation            124                             5x
5x5 convolution             124                             27x

* Performance of Xilinx HLS video library compared to single core 1GHz Cortex-A9
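Two quantities can be derived from the table: the frame rate that 124 Megapixels per second represents at 1080p, and the ARM throughput implied by each acceleration factor. The sketch below does that arithmetic. The implied ARM figures are derived values, not measurements from the slides, and the function names are illustrative.

```python
# Figures copied from the table: FPGA throughput and speedup vs. OpenCV on ARM.
FPGA_MPIX_PER_S = 124
SPEEDUP = {
    "FAST9 Corner Detection": 10,
    "Canny Edge Detection": 14,
    "Harris Corner Detection": 50,
    "Erosion/Dilation": 5,
    "5x5 convolution": 27,
}

def frames_per_second(mpix_per_s, width=1920, height=1080):
    """Frame rate corresponding to a pixel throughput at a given resolution."""
    return mpix_per_s * 1e6 / (width * height)

def implied_arm_mpix_per_s(fpga_mpix_per_s, speedup):
    """ARM throughput implied by the FPGA throughput and the quoted speedup."""
    return fpga_mpix_per_s / speedup

fpga_fps = frames_per_second(FPGA_MPIX_PER_S)
print(f"FPGA: {fpga_fps:.1f} frames/s at 1080p")

for name, factor in SPEEDUP.items():
    arm = implied_arm_mpix_per_s(FPGA_MPIX_PER_S, factor)
    print(f"{name:<26} ARM ~{arm:5.1f} Mpixel/s, ~{frames_per_second(arm):4.1f} frames/s")
```

The FPGA throughput is the same for every function because it is set by the 1080p60 pixel rate, so the acceleration factor mostly reflects how slow each function is on the processor.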



VanGogh Imaging Starry Night



VanGogh Imaging system level performance

• Select Functions to Be Implemented in FPGA and ARM
    • FPGA: matrix operations
    • ARM: data management
• Entire implementation done in C++ (Xilinx Vivado HLS)
• Performance: amount of time it takes to find one object
    • Before: ARM only (single-thread): 4 seconds
    • Now: Xilinx Zynq FPGA
        • Zynq 7020: 1.5 seconds
        • Zynq 7040 (est.): 500 msec
• Total current speedup: 8x
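The speedup figures follow from the ratio of the before and after times. A minimal sketch using only the times listed on the slide:

```python
# Time to find one object, in seconds, as listed on the slide.
ARM_ONLY_S = 4.0
ZYNQ_S = {
    "Zynq 7020": 1.5,
    "Zynq 7040 (est.)": 0.5,
}

def speedup(before_s, after_s):
    """Speedup as the ratio of the old time to the new time."""
    return before_s / after_s

for device, seconds in ZYNQ_S.items():
    print(f"{device}: {speedup(ARM_ONLY_S, seconds):.1f}x")
```

On these numbers, the 8x figure corresponds to the ratio of 4 seconds to the estimated 500 msec on the Zynq 7040; the ratio of 4 seconds to the 1.5 seconds on the Zynq 7020 is about 2.7x.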



Next: Single Program for Processors + FPGA

Application in C/C++/OpenCL -> Combined Compilation and High Level Synthesis -> Binary for CPU + Bitstream for PL fabric

Zynq AP SoC Device:
• ARM CPUs
• Memory
• Interconnect
• FPGA Fabric: Kernel1, Kernel2, Kernel3



Power Measurement Zones on Zynq AP SoC Devices

[Block diagram: the same Processing System and Programmable Logic blocks as on the "Zynq® All Programmable SoC Platform" slide, annotated with power measurement zones]

Power zone labels: INT, BRAM, AUX, ADJ, PINT, MIO 1V5, PAUX



Typical Performance and Power Measurements

• Processing speedups shown range from 8x and 10x up to 50x
• A9 processors: 500 mW to 800 mW
• FPGA fabric fully running: 500 mW to 1 W
• On-chip I/O: a few hundred mW; on-board DRAM: 800 mW
• Energy efficiency is in the 10 to 100 Gops/W range for the FPGA in the complete system
• Total Zynq AP SoC power consumption is in the range of 2 W for typical applications on small Zynq AP SoC devices
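Energy efficiency in Gops/W is throughput divided by power. The sketch below shows the calculation for a system drawing about 2 W, as stated on the slide. The 20 and 200 Gops workloads and the 60 frames per second rate are assumptions picked for illustration, not measurements from the talk.

```python
SYSTEM_POWER_W = 2.0   # "in the range of 2 W", from the slide

def gops_per_watt(gops, watts):
    """Energy efficiency: billions of operations per second per watt."""
    return gops / watts

def energy_per_frame_mj(watts, frames_per_second):
    """Energy spent on one frame, in millijoules."""
    return watts / frames_per_second * 1000.0

# Assumed fabric workloads (hypothetical values)
for workload_gops in (20, 200):
    eff = gops_per_watt(workload_gops, SYSTEM_POWER_W)
    print(f"{workload_gops} Gops at {SYSTEM_POWER_W} W -> {eff:.0f} Gops/W")

# Assumed 60 frames per second video
print(f"Energy per frame at 60 frames/s: {energy_per_frame_mj(SYSTEM_POWER_W, 60):.1f} mJ")
```

With these assumed workloads the result spans the 10 to 100 Gops/W range quoted on the slide.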



Conclusion

• The combination of Processors and FPGA is well suited for image processing and object recognition applications
• Programming FPGAs with High Level Synthesis and OpenCV libraries raises the abstraction for embedded programmers
• New tools will raise the level of abstraction to C/C++/OpenCL programming for the combination of ARM processors and FPGA
• The power consumption of these systems is in the range of a few Watts
• Novel customer applications, e.g. ‘Starry Night’, confirm the novel programming and efficient implementation on Zynq AP SoC Devices



References

• OpenCV and HLS Video
  http://www.xilinx.com/csi/training/vivado/leveraging-opencv-and-high-level-synthesis-with-vivado.htm
• OpenCV and HLS Application Note
  http://www.xilinx.com/support/documentation/application_notes/xapp1167.pdf
• Xilinx Zynq AP SoC 702 Board
  http://www.xilinx.com/products/boards-and-kits/EK-Z7-ZC702-G.htm