"Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from...

Kees Vissers, Distinguished Engineer, Xilinx. May 29, 2014. © Copyright 2014 Xilinx.

Transcript of "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from...

Page 1: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

Programming Novel Recognition Algorithms on Heterogeneous Architectures

Kees Vissers

Distinguished Engineer, Xilinx

May 29, 2014

© Copyright 2014 Xilinx.

Page 2: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

An Embedded Vision Pipeline

A Typical Embedded Vision Pipeline: From Pixels to Information

Pipeline stages: Image Acquisition -> Image Pre-processing -> Feature Detection -> Segmentation -> Object Analysis -> Heuristics or Expert System

Data rate falls and algorithm complexity rises along the pipeline:

• Ultra-high data rates; low to medium algorithm complexity
• High to medium data rates; medium algorithm complexity
• Low data rates; high algorithm complexity

Typical total compute load: ~10-100 billion operations/second

Loads can vary dramatically with pixel rate and algorithm complexity
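The compute-load figure can be related to pixel rate with simple arithmetic. The sketch below is an illustration only: the 1080p60 format is borrowed from later slides, while the per-pixel operation counts (100 and 800) and the function names are assumptions chosen to fall inside the quoted 10-100 billion operations/second range.

```python
def pixel_rate(width, height, fps):
    """Pixels per second for a given video format."""
    return width * height * fps


def compute_load_gops(width, height, fps, ops_per_pixel):
    """Billions of operations per second for an assumed per-pixel cost."""
    return pixel_rate(width, height, fps) * ops_per_pixel / 1e9


if __name__ == "__main__":
    # 1080p60 input; the ops-per-pixel values are illustrative assumptions.
    for ops in (100, 800):
        load = compute_load_gops(1920, 1080, 60, ops)
        print(f"{ops} ops/pixel -> {load:.1f} Gops")
```

Doubling either the pixel rate or the per-pixel cost doubles the load, which is why the slide notes that loads vary dramatically with both.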

Page 3: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

Required Pixel Rate Processing vs. Capabilities

[Chart not captured in the transcript; the only surviving label is "FPGA".]

Page 4: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

Processors and Pipelines

Clock:sample   Data rate (200 MHz clock)   Design approach
1000:1         200 Ks/s                    RISC processor
100:1          2 Ms/s                      Processor with accelerators
10:1           20 Ms/s                     Folded datapath
1:1            200 Ms/s                    Pipelined datapath
1:10           2 Gs/s                      Replicated datapath

Applications, from low to high data rate: Expert analysis, Object analysis, Pixel processing

Tool regions marked on the chart: Embedded System Processors, HLS tools
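The data rates on this slide follow directly from the clock:sample ratio at the stated 200 MHz clock. A minimal sketch of that arithmetic, with function and variable names that are illustrative rather than taken from the presentation:

```python
CLOCK_HZ = 200_000_000  # 200 MHz clock, as stated on the slide


def sample_rate(clocks, samples, clock_hz=CLOCK_HZ):
    """Samples per second when `samples` samples are handled every `clocks` cycles."""
    return clock_hz * samples // clocks


# (clock cycles, samples) pairs for the ratios listed on the slide
RATIOS = [(1000, 1), (100, 1), (10, 1), (1, 1), (1, 10)]

if __name__ == "__main__":
    for clocks, samples in RATIOS:
        rate = sample_rate(clocks, samples)
        print(f"{clocks}:{samples} -> {rate:,} samples/s")
```

A ratio such as 1:10 means ten samples are consumed every clock cycle, which is only possible when the datapath is replicated.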

Page 5: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

Zynq Products in Context for Video

Clock:sample   Data rate (200 MHz clock)   Design approach
1000:1         200 Ks/s                    RISC processor
100:1          2 Ms/s                      Processor with accelerators
10:1           20 Ms/s                     Folded datapath
1:1            200 Ms/s                    Pipelined datapath
1:10           2 Gs/s                      Replicated datapath

Applications, from low to high data rate: Expert analysis, Object analysis, Pixel processing

ARM® A9 processors: 1-2 Gops

Fabric: 10 to 500 Gops

Page 6: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

Zynq® All Programmable SoC Platform

Processing System

• 2x Cortex™-A9 MPCore™, each with a NEON™/FPU Engine and 32/32 KB I/D Caches
• Snoop Control Unit (SCU)
• 512 KB L2 Cache
• 256 KB On-Chip Memory
• Timer Counters
• General Interrupt Controller
• DMA
• Configuration
• ARM® CoreSight™ Multi-core & Trace Debug
• AMBA® Switches
• Static Memory Controller: Quad-SPI, NAND, NOR
• Dynamic Memory Controller: DDR3, DDR2, LPDDR2
• Peripherals: 2x GigE with DMA, 2x USB with DMA, 2x SDIO with DMA, 2x SPI, 2x I2C, 2x CAN, 2x UART, GPIO
• I/O MUX (MIO)

Programmable Logic

• System Gates, DSP, RAM
• XADC
• PCIe
• Multi-Standards I/Os (3.3V & High Speed 1.8V)
• Multi Gigabit Transceivers

Page 7: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

ARM Processors and FPGA Fabric

• ARM Processor Limitations for Pixel Processing
  • Poor access locality: small caches perform poorly
  • Generic processors are limited in parallel operations and number of cores
• ARM Processor Benefits for Low Rate, Complex Code
  • Can execute large programs by time-sharing the ALU
  • Caches take care of memory abstraction with reasonable performance
• FPGA Limitations for Complex Code
  • Large programs are labor intensive to code; the memory model is explicit
• FPGA Benefits for Pixel Processing
  • Can do 100 to 1000 operations every clock cycle, without resource sharing
  • Can stream data, and separate on-chip from off-chip memory
  • High Level Synthesis and OpenCV libraries for C/C++ programming
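The streaming point can be made concrete with a small software model. The sketch below is plain Python, not HLS code and not the Xilinx video library API: it computes a 3x3 erosion (window minimum) over a row-major pixel stream while holding only the last two rows plus three pixels, which is roughly what a line buffer in on-chip memory would hold. The function name and the choice of erosion are assumptions made for illustration.

```python
from collections import deque


def stream_erode3x3(pixels, width):
    """Yield the 3x3 window minimum for a row-major pixel stream.

    Only the most recent 2 * width + 3 pixels are kept, mimicking a
    line buffer; the full frame is never stored.
    """
    buf = deque(maxlen=2 * width + 3)
    for i, p in enumerate(pixels):
        buf.append(p)
        row, col = divmod(i, width)
        # A full window exists once two earlier rows and two earlier
        # columns have been seen; (row, col) is its bottom-right corner.
        if row >= 2 and col >= 2:
            yield min(
                buf[-1 - (k * width + j)]
                for k in range(3)
                for j in range(3)
            )


if __name__ == "__main__":
    frame = [1, 9, 9, 9,
             9, 9, 9, 9,
             9, 9, 9, 9,
             9, 9, 9, 5]
    print(list(stream_erode3x3(frame, width=4)))
```

Each input pixel is read once and each output depends only on the buffered window, so in hardware the nine comparisons could all happen in the same clock cycle instead of time-sharing one ALU.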

Page 8: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

How a Software Programmer Wants to Use an FPGA

• Familiar tools using standard programming languages
• Ability to target multiple systems
• Scalable performance

Page 9: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

Using High-Level Synthesis

Accelerates Algorithmic C/C++ to Co-Processing Accelerator Integration

Flow diagram labels: Requirements; SW Spec; HW Spec; Verify, Iterate (shown twice)

Page 10: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

Using OpenCV Libraries on FPGAs

Flows shown in the diagram:

• Image File Read (OpenCV) -> OpenCV function chain -> Image File Write (OpenCV)
• Image File Read (OpenCV) -> OpenCV2AXIvideo -> AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo -> AXIvideo2OpenCV -> Image File Write (OpenCV), containing a Synthesizable Block
• Live Video Input -> OpenCV function chain -> Live Video Output
• Live Video Input -> AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo -> Live Video Output, containing a Synthesized Block

Diagram labels: Pure OpenCV Application, Integrated OpenCV Application, OpenCV Reference, Accelerated OpenCV Application

Page 11: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

Embedded Vision SW + HW

Block diagram components:

• Image Sensor
• HDMI Video Input
• HDMI Video Output
• Video Processing Pipeline
• AXI Interconnect
• Processing System: Dual Core Cortex-A9, DDR Memory Controller (DDR3), Hardened Peripherals
• SD Card
• DDR3 External Memory

Legend: S_AXI_GP (32 bit), S_AXI_HP (64 bit), AXI4 Stream, IP Core

Page 12: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx


1080P60 Corner Detection

Page 13: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

ARM Processor and FPGA Performance

• Efficiently implement real-time video processing
• Throughput stays constant with more complex programs

Performance of Xilinx HLS video library compared to single core 1GHz Cortex-A9

Implementation                             Throughput (Megapixels per sec)   Acceleration (vs. OpenCV on ARM)
FAST9 Corner Detection, FPGA               124                               10x
FAST9 Corner Detection, Neon optimized     24                                2x
FAST9 Corner Detection, OpenCV on ARM      12                                1x

Page 14: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

FPGA Implementations with OpenCV Libraries

• Efficiently implement real-time video processing
• Throughput stays constant with more complex programs

Performance of Xilinx HLS video library compared to single core 1GHz Cortex-A9

Function                    Throughput at 1080p (Megapixels per sec)   Acceleration (vs. OpenCV on ARM)
FAST9 Corner Detection      124                                        10x
Canny Edge Detection        124                                        14x
Harris Corner Detection     124                                        50x
Erosion/Dilation            124                                        5x
5x5 convolution             124                                        27x
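Because the FPGA throughput is the same for every function, the acceleration column is effectively a statement about how the ARM baseline slows down as the algorithm grows. The sketch below derives the implied ARM throughput by dividing the quoted FPGA throughput by the quoted acceleration. The derived numbers are not stated on the slide; they are a back-of-the-envelope reading of it, and the names are illustrative.

```python
FPGA_MPIX_PER_S = 124  # quoted for every function; 1920 * 1080 * 60 is about 124.4 M

# Acceleration factors (vs. OpenCV on ARM) as listed on the slide
ACCELERATION = {
    "FAST9 Corner Detection": 10,
    "Canny Edge Detection": 14,
    "Harris Corner Detection": 50,
    "Erosion/Dilation": 5,
    "5x5 convolution": 27,
}


def implied_arm_throughput(fpga_mpix_per_s, acceleration):
    """ARM throughput in megapixels/s implied by an FPGA figure and a speedup."""
    return fpga_mpix_per_s / acceleration


if __name__ == "__main__":
    for name, factor in ACCELERATION.items():
        arm = implied_arm_throughput(FPGA_MPIX_PER_S, factor)
        print(f"{name}: about {arm:.1f} Mpixels/s on ARM")
```

For FAST9 this gives about 12.4 Mpixels/s, in line with the 12 Mpixels/s quoted for OpenCV on ARM on the previous slide.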

Page 15: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx


VanGogh Imaging Starry Night

Page 16: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

VanGogh Imaging system level performance

• Select functions to be implemented in FPGA and ARM
  • FPGA: matrix operations
  • ARM: data management
• Entire implementation done in C++ (Xilinx Vivado HLS)
• Performance: amount of time it takes to find one object
  • Before: ARM only (single-thread): 4 seconds
  • Now: Xilinx Zynq FPGA
    • Zynq 7020: 1.5 seconds
    • Zynq 7040 (est.): 500 msec
• Total current speedup: 8x
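The speedups implied by the quoted times can be checked by dividing the ARM-only time by each Zynq time. This is a minimal sketch that uses only the figures on the slide; it does not explain how the slide's total was derived, and the names are illustrative.

```python
ARM_ONLY_S = 4.0  # single-thread ARM time to find one object

ZYNQ_TIMES_S = {
    "Zynq 7020": 1.5,
    "Zynq 7040 (est.)": 0.5,
}


def speedup(baseline_s, accelerated_s):
    """Ratio of baseline time to accelerated time."""
    return baseline_s / accelerated_s


if __name__ == "__main__":
    for device, seconds in ZYNQ_TIMES_S.items():
        print(f"{device}: {speedup(ARM_ONLY_S, seconds):.1f}x")
```

The 8x figure on the slide matches the ratio for the estimated Zynq 7040 time; the Zynq 7020 time corresponds to a ratio of roughly 2.7x.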

Page 17: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

Next: Single Program for Processors + FPGA

Application in C/C++/OpenCL -> Combined Compilation and High Level Synthesis -> Binary for CPU + Bitstream for PL fabric

Zynq AP SoC Device: ARM CPUs, Memory, Interconnect, and FPGA Fabric holding Kernel1, Kernel2, Kernel3

Page 18: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

Power Measurement Zones on Zynq AP SoC Devices

[Block diagram: the same Zynq AP SoC platform diagram as on Page 6, annotated with power measurement zones.]

Power zones labeled on the diagram: INT, BRAM, AUX, ADJ, PINT, MIO 1V5, PAUX

Page 19: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

Typical Performance and Power Measurements

• Processing speedups shown: 8x, 10x, and up to 50x
• A9 processors: 500 mW to 800 mW
• FPGA fabric fully running: 500 mW to 1 W
• On-chip I/O: a few hundred mW; on-board DRAM: 800 mW
• Energy efficiency is in the 10 to 100 Gops/W range for the FPGA in the complete system
• Total Zynq AP SoC power consumption is in the range of 2 W for typical applications on small Zynq AP SoC devices
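The per-zone figures can be added up to see how they relate to the roughly 2 W total. The sketch below is an illustration under stated assumptions: "a few hundred mW" of on-chip I/O is taken as 200 to 300 mW, the on-board DRAM is left out of the SoC total because it is a separate component, and the 100 Gops workload in the usage lines is a made-up example value.

```python
# (low, high) power per zone in milliwatts, from the slide unless noted
A9_MW = (500, 800)
FABRIC_MW = (500, 1000)
IO_MW = (200, 300)    # assumption standing in for "a few hundred mW"
DRAM_MW = (800, 800)  # on-board DRAM, outside the SoC


def total_range_mw(*zones):
    """Sum the low and high ends of several (low, high) power ranges."""
    return (sum(z[0] for z in zones), sum(z[1] for z in zones))


def gops_per_watt(gops, milliwatts):
    """Energy efficiency in Gops/W."""
    return gops / (milliwatts / 1000)


if __name__ == "__main__":
    soc_low, soc_high = total_range_mw(A9_MW, FABRIC_MW, IO_MW)
    print(f"SoC total: {soc_low} to {soc_high} mW")
    # Hypothetical workload: 100 Gops sustained at the high end of the range
    print(f"{gops_per_watt(100, soc_high):.0f} Gops/W")
```

Under these assumptions the SoC total spans 1.2 W to 2.1 W, which is compatible with the "range of 2 W" stated on the slide.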

Page 20: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

Conclusion

• The combination of processors and FPGA is well suited for image processing and object recognition applications
• Programming FPGAs with High Level Synthesis and OpenCV libraries raises the level of abstraction for embedded programmers
• New tools will raise the level of abstraction to C/C++/OpenCL programming for the combination of ARM processors and FPGA
• The power consumption of these systems is in the range of a few watts
• Novel customer applications, e.g. ‘Starry Night’, confirm the new programming approach and its efficient implementation on Zynq AP SoC devices

Page 21: "Programming Novel Recognition Algorithms on Heterogeneous Architectures," a Presentation from Xilinx

References

• OpenCV and HLS Video
  http://www.xilinx.com/csi/training/vivado/leveraging-opencv-and-high-level-synthesis-with-vivado.htm
• OpenCV and HLS Application Note
  http://www.xilinx.com/support/documentation/application_notes/xapp1167.pdf
• Xilinx Zynq AP SoC 702 Board
  http://www.xilinx.com/products/boards-and-kits/EK-Z7-ZC702-G.htm