Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI...

19
Implementation of a Target Recognition Application Using Pipelined Reconfigurable Hardware Benjamin Levine* and Herman Schmit Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh, PA [email protected], [email protected] *Benjamin Levine is now an Assistant Professor in the Electrical Engineering Department of the University of Pittsburgh

Transcript of Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI...

Page 1: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

Implementation of a Target Recognition Application Using

Pipelined Reconfigurable Hardware

Benjamin Levine* and Herman SchmitDepartment of Electrical and Computer Engineering

Carnegie Mellon UniversityPittsburgh, PA

[email protected], [email protected]

*Benjamin Levine is now an Assistant Professor in the Electrical Engineering Department of the University of Pittsburgh

Page 2: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

2Benjamin Levine Paper F2

OutlinePipeRenchSAR ATR applicationResults from 2000 projectPipeRench prototype devicesHardware test platformApplication implementationResultsFuture Work

Page 3: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

3Benjamin Levine Paper F2

Reconfigurable Computing• Design “hardware” for each application.

– Custom logic design gives high performance, like an ASIC.

• Map it onto programmable hardware. – We can change the hardware, like software on a CPU.

• Advantages:– Performance approaching that of an ASIC.– Flexibility approaching a general purpose processor.– Quick design cycles and low design costs for each

application, like software design.

Page 4: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

4Benjamin Levine Paper F2

PipeRench• PipeRench is a reconfigurable computing chip developed

at Carnegie Mellon University.• Has complete tool set, including high-level entry language.• The PipeRench architecture, chip design, simulation tools,

compiler, and assembler were all produced by faculty and students in the ECE Department and the School of Computer Science.

• Part of the DARPA ACS program

More information on PipeRench at:

http://www.ece.cmu.edu/research/piperench

Page 5: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

5Benjamin Levine Paper F2

PipeRench ArchitectureReconfigurable fabric of 8-bit processing elements.• Compiler targets unbounded virtual architecture.• On-chip, run-time reconfiguration allows virtual

application to run on bounded physical hardware.

PE1 PE2 PEN

In terconnec t Ne t w o rk

...

In terconnec t Ne t w o rk

...G

loba

l BusStripe 1Stripe 2Stripe 3Stripe 4

Stripe N...

PE1 PE2 PEN

In terconnec t Ne t w o rk

...

In terconnec t Ne t w o rk

...G

loba

l BusStripe 1Stripe 2Stripe 3Stripe 4

Stripe N...

Page 6: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

6Benjamin Levine Paper F2

SAR ATR Application

Screener Discriminator Classifier Identifier

• Designates Regions of Interest (ROI)

• Processes entire image

• Detect areas of high contrast

• Pixel Clustering

• Eliminates Natural Clutter

• Processes only ROI’s

• Computes several features including pose estimate

• Uses linear classifier

• Finds target class• Processes only ROI

outputs from discriminator

• Generate correlation planes using composite filters

• Computes scalar features from correlation plane

• Identifies specific target type from target class

• Processes with refined filters

• Computes detail features from correlation planes

Page 7: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

7Benjamin Levine Paper F2

2000 Project• Project only worked with first two stages, the

screener and the discriminator.– Widest variety of algorithms– Largest amounts of data

• Analyzed program to identify computational kernels.

• Implemented kernels on PipeRench simulator.• Compared to workstation and signal processor

platform implementations.

Page 8: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

8Benjamin Levine Paper F2

I&IR SAR ATR Front End – Speedup of Computational Kernels

KernelSun Time*

(ms)Mercury

Time** (ms)PipeRench Time (ms)

PipeRench vs. Sun

PipeRench vs. Mercury

A Decimate/Correlate 38852 31393 971 40.01 32.33B STD Image 34055 54978 430 79.13 127.74C Image Stats 2362 144 107 22.03 1.34D Image Threshold 1703 586 107 15.88 5.46E Clump Binary Image 1524 5450 189 8.08 28.88F Clump Maxima 15510 35257 - - -G Inv. Dis. Filter 56307 26142 1179 47.75 22.17H Cue Stats 27849 2125 315 88.55 6.76I Cue Thresh 4995 8986 315 15.88 28.57J Cue Density/PBC 5935 545 19 311.50 28.61K Bright Cue Stats 13844 668 152 90.83 4.38L Comp Fractal 13126 25940 152 86.12 170.19M Comp Neighbor 4740 6656 610 7.77 10.92

Aggregate Speedup 45.16 35.99

< 10 X Faster * Sun Ultra 5 Workstation w/ 360 MHz UltraSPARC II 10-100 X Faster ** Mercury Computer PPC-G4 AltiVec with 133MHz ASIC > 100 X faster

2000 Program

Page 9: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

9Benjamin Levine Paper F2

CUSTOM:PipeRench Fabric

STANDARD CELLS:Virtualization & Interface Logic

Configuration CacheData Store Memory

STRIPE

PE

PipeRench prototypes• 3.6M transistors• Implemented in acommercial 0.18 µ, 6 metal layer technology• 125 MHz core speed(limited by control logic)•66 MHz I/O Speed•1.5V core, 3.3V I/O

Page 10: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

10Benjamin Levine Paper F2

USC ISI Osiris Board

Xilinx VirtexXC2V1000

PMC Connectors5 MB Static RAM

512 MB Dynamic RAM

Xilinx VirtexXC2V600066 MHz

64-bit PCI

Page 11: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

11Benjamin Levine Paper F2

PipeRench Mezzanine Card

Xilinx XC2V1000 FPGA for Interface Logic

1 MB Static RAM

2 x PipeRench Chips

Page 12: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

12Benjamin Levine Paper F2

PipeRench Mezzanine Card

XPRXC2V1000 XPR MEM 0

512K x 36

XPR MEM 1512K x 36

PRAPipeRench

PRBPipeRench

PMC

Mezzanine Card

32 @60 MHz

64 @66MHz

36 @200MHz

Page 13: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

13Benjamin Levine Paper F2

Host Driven Operation

SDRAM

XP

IF

PCI

PM

CP

MC

XPR

PRA PRB

PCI

CPU

SDRAM

Host PC

Page 14: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

8 bits/pixel, ~70 MB

Decimation &Correlation

LocalStatistics

8 bits/pixel, ~8 MB

8 bits/pixel, ~8 MB

Imag

e Stat

istics

Image Mean and STD

Input Image D/C Image

STD Image

Screener, Stage IBenjamin Levine Paper F214

Page 15: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

8 bits/pixel, ~8 MB

Image Mean and STD

Threshold

1 bit/pixel, ~1 MB

1 bit/pixel, ~1 MB

STD Image Thresholded Image

Clustering

Clustered Image

Screener, Stage II

Benjamin Levine Paper F215

Page 16: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

16Benjamin Levine Paper F2

Performance

• Hardware Processing Time = 353 ms71x speedup vs. Sun Workstation15x speedup vs. 1.8 GHz P4 Xeon

270300

3030

534

Send Full ImageStage I ProcessingReturn STD ImageSend STD ImageStage II ProcessingReturn Clustered Image

Page 17: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

17Benjamin Levine Paper F2

Lessons Learned

• Many, if not most, applications map well to PipeRench, even when not specifically designed for it.

• Entering designs in high-level language much easier and preferable to HDL.

• Interfacing reconfigurable computer to host time consuming and error-prone.

• I/O limits PipeRench performance.

Page 18: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

18Benjamin Levine Paper F2

Performance ImprovementsPipeRench I/O limits performance

(32 bits in / 32 bits out @ 66 MHz)• 128 bit wide I/O @ 150 MHz I/O speed gives

9x speedup for total ~135X vs 1.8 GHz CPU

Fabric speed and size limit performance:• 500 MHz fabric clock gives another ~3x, for total

~400x speedup • Doubling fabric to 32 stripes gives another 70%

speedup, for total ~680x speedup.

Page 19: Implementation of a Target Recognition Application Using ...blevine/documents/mapld03...USC ISI Osiris Board Xilinx Virtex XC2V1000 PMC Connectors 5 MB Static RAM 512 MB Dynamic RAM

19Benjamin Levine Paper F2

Future Work

• Expand PipeRench programming model• Integrate pipelined fabrics with

processors :HASTE – Hybrid Architectures with A Single, Transformable Executable