
EURASIP Journal on Embedded Systems

FPGA Supercomputing Platforms, Architectures, and Techniques for Accelerating Computationally Complex Algorithms

Guest Editors: Vinay Sriram and Miriam Leeser


Copyright © 2009 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2009 of “EURASIP Journal on Embedded Systems.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Editor-in-Chief: Zoran Salcic, University of Auckland, New Zealand

Associate Editors

Sandro Bartolini, Italy; Neil Bergmann, Australia; Shuvra Bhattacharyya, USA; Ed Brinksma, The Netherlands; Paul Caspi, France; Liang-Gee Chen, Taiwan; Dietmar Dietrich, Austria; Stephen A. Edwards, USA; Alain Girault, France; Rajesh K. Gupta, USA; Thomas Kaiser, Germany; Bart Kienhuis, The Netherlands; Chong-Min Kyung, Korea; Miriam Leeser, USA; John McAllister, UK; Koji Nakano, Japan; Antonio Nunez, Spain; Sri Parameswaran, Australia; Zebo Peng, Sweden; Marco Platzner, Germany; Marc Pouzet, France; S. Ramesh, India; Partha S. Roop, New Zealand; Markus Rupp, Austria; Asim Smailagic, USA; Leonel Sousa, Portugal; Jarmo Henrik Takala, Finland; Jean-Pierre Talpin, France; Jurgen Teich, Germany; Dongsheng Wang, China


Contents

FPGA Supercomputing Platforms, Architectures, and Techniques for Accelerating Computationally Complex Algorithms, Vinay Sriram and Miriam Leeser
Volume 2009, Article ID 218456, 2 pages

Prototyping Advanced Control Systems on FPGA, Stephane Simard, Jean-Gabriel Mailloux, and Rachid Beguenane
Volume 2009, Article ID 897023, 12 pages

Parallel Backprojection: A Case Study in High-Performance Reconfigurable Computing, Ben Cordes and Miriam Leeser
Volume 2009, Article ID 727965, 14 pages

Performance Analysis of Bit-Width Reduced Floating-Point Arithmetic Units in FPGAs: A Case Study of Neural Network-Based Face Detector, Y. Lee, Y. Choi, M. Lee, and S. Ko
Volume 2009, Article ID 258921, 11 pages

Accelerating Seismic Computations Using Customized Number Representations on FPGAs, Haohuan Fu, William Osborne, Robert G. Clapp, Oskar Mencer, and Wayne Luk
Volume 2009, Article ID 382983, 13 pages

An FPGA Implementation of a Parallelized MT19937 Uniform Random Number Generator, Vinay Sriram and David Kearney
Volume 2009, Article ID 507426, 6 pages


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 218456, 2 pages
doi:10.1155/2009/218456

Editorial

FPGA Supercomputing Platforms, Architectures, and Techniques for Accelerating Computationally Complex Algorithms

Vinay Sriram1 and Miriam Leeser2

1 Defence and Systems Institute, University of South Australia, Adelaide, South Australia 5001, Australia
2 Department of Electrical and Computer Engineering, College of Engineering, Northeastern University, Boston, MA 02115, USA

Correspondence should be addressed to Miriam Leeser, [email protected]

Received 6 May 2009; Accepted 6 May 2009

Copyright © 2009 V. Sriram and M. Leeser. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This is a special issue on FPGA supercomputing platforms, architectures, and techniques for accelerating computationally complex algorithms. This issue covers a broad range of applications in which field programmable gate arrays (FPGAs) are successfully used to accelerate processing. It also provides researchers' insights into the challenges in successfully using FPGAs. The applications discussed include motor control, radar processing, face recognition, processing seismic data, and accelerating random number generation. Techniques discussed by the authors include partitioning between a CPU and FPGA hardware, reducing bitwidth to improve performance, interfacing to analog signals, and using high level tools to develop applications.

Two challenges that face many users of reconfigurable hardware are interfacing to the analog domain and easing the job of developing applications. In the paper entitled “Prototyping Advanced Control Systems on FPGA,” the authors present a rapid prototyping platform and design flow suitable for the design of on-chip motion controllers and other SoCs with a need for analog interfacing. The target hardware platform consists of a customized FPGA design for the Amirix AP1000 PCI FPGA board coupled with a multichannel analog I/O daughter card. The design flow uses Xilinx System Generator in MATLAB/Simulink for system design and test, and Xilinx Platform Studio for SoC integration. This approach has been applied to the analysis, design, and hardware implementation of a vector controller for 3-phase AC induction motors.

Image processing is an application area that exhibits a great deal of parallelism. In the work entitled “Parallel Backprojection: A Case Study in High-Performance Reconfigurable Computing,” the authors investigate the use of a high-performance reconfigurable supercomputer built from both general-purpose processors and FPGAs. These architectures allow a designer to exploit both fine-grained and coarse-grained parallelism, achieving high degrees of speedup. The authors describe how backprojection, used to reconstruct Synthetic Aperture Radar (SAR) images, is implemented on a high-performance reconfigurable computer system. The results show an overall application speedup of 50 times.

Neural networks have successfully been used to detect faces in video images. In the paper entitled “Performance Analysis of Bit-Width Reduced Floating-Point Arithmetic Units in FPGAs: A Case Study of Neural Network-Based Face Detector,” the authors describe the implementation of an FPGA-based face detector using a neural network and bit-width reduced floating-point arithmetic units (FPUs). The FPUs and neural network are designed using MATLAB and VHDL, and the two implementations are compared. The authors demonstrate that reducing the number of bits used in arithmetic computation can produce significant savings in area and power and gains in speed, with only a small sacrifice in accuracy.

The oil and gas industry has a huge demand for high-performance computing on extremely large volumes of data. FPGAs are exceedingly well matched to this task. Reduced precision arithmetic operations can greatly decrease the area cost and I/O bandwidth of an FPGA-based design, supporting increased parallelism and achieving high performance. In the work entitled “Accelerating Seismic Computations Using Customized Number Representations on FPGAs,” the authors present a tool to determine the minimum number of bits of precision that still provides acceptable accuracy for seismic applications. By using the minimized number format, the authors are able to demonstrate speedups ranging from 5 to 7 times, including overhead costs such as the transfer time to and from the general-purpose processors. With improved bandwidth between CPU and FPGA, the authors show that a speedup of 48 times is possible.

A large number of applications require large quantities of uncorrelated random numbers. In the paper entitled “An FPGA Implementation of a Parallelized MT19937 Uniform Random Number Generator,” Vinay Sriram and David Kearney present a fast uniform random number generator implemented in reconfigurable hardware that achieves both higher throughput and better area efficiency than previous implementations. The design presented, which generates up to 624 random numbers in parallel, has a throughput more than 15 times better than previously published results.

This collection of papers represents an overview of active research in the field of reconfigurable hardware applications and techniques.

Vinay Sriram
Miriam Leeser


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 897023, 12 pages
doi:10.1155/2009/897023

Research Article

Prototyping Advanced Control Systems on FPGA

Stephane Simard, Jean-Gabriel Mailloux, and Rachid Beguenane

Department of Applied Sciences, University of Quebec at Chicoutimi, 555 boul. de l’Universite, Chicoutimi, QC, Canada G7H 2B1

Correspondence should be addressed to Rachid Beguenane, [email protected]

Received 19 June 2008; Accepted 3 March 2009

Recommended by Miriam Leeser

In advanced digital control and mechatronics, FPGA-based systems on a chip (SoCs) promise to supplant older technologies, such as microcontrollers and DSPs. However, the adoption of FPGA technology by control specialists is complicated by the need for skilled hardware/software partitioning and design in order to match the performance requirements of more and more complex algorithms while minimizing cost. Currently, without adequate software support to provide a straightforward design flow, the amount of time and effort required is prohibitive. In this paper, we discuss our choice, adaptation, and use of a rapid prototyping platform and design flow suitable for the design of on-chip motion controllers and other SoCs with a need for analog interfacing. The platform consists of a customized FPGA design for the Amirix AP1000 PCI FPGA board coupled with a multichannel analog I/O daughter card. The design flow uses Xilinx System Generator in Matlab/Simulink for system design and test, and Xilinx Platform Studio for SoC integration. This approach has been applied to the analysis, design, and hardware implementation of a vector controller for 3-phase AC induction motors. It has also contributed to the development of CMC's MEMS prototyping platform, now used by several Canadian laboratories.

Copyright © 2009 Stephane Simard et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

The use of advanced control algorithms depends upon being able to perform complex calculations within demanding timing constraints, where system dynamics can require feedback response in as little as a couple of tens of microseconds. Developing and implementing feedback controllers this capable is currently a hard goal to achieve, and there is much technological challenge in making it more affordable. Thanks to major technological breakthroughs in recent years, and to sustained rapid progress in the fields of very large scale integration (VLSI) and electronic design automation (EDA), electronic systems are increasingly powerful [1, 2]. In the latter paper, it is rightly stated that FPGA devices have reached a level of development that puts them on the edge of microelectronics fabrication technology advancements. They provide many advantages with respect to their nonreconfigurable counterparts, such as general-purpose microprocessors and DSP processors. In fact, FPGA-based digital processing systems achieve a better performance-cost compromise, and with a moderate design effort they afford the implementation of powerful and flexible embedded SoCs. Exploiting the benefits of FPGA technology for industrial electrical control systems has been the source of intensive research investigations during the last decade, in order to boost their performance at lower cost [3, 4]. There is still, however, much work to be done to bring such power into the hands of control specialists. In [5], it is stated that the potential of implementing an FPGA chip-based controller has not been fully exploited in complicated motor control or complex converter control applications. Until now, most related research works using FPGA devices have focused on designing specific parts, mainly to control power electronic devices such as space vector pulse width modulation (SVPWM) and power factor correction [6, 7]. Usually these are implemented on small FPGAs while the main control tasks are realised sequentially by the supervising processor system, typically a DSP. Important and constant improvement in FPGA devices, synthesis, place-and-route tools, and debug capabilities has made FPGA prototyping more available and practical to ASIC/SoC designers than ever before. The validation of their hardware and software on a common platform can be accomplished using FPGA-based prototypes. Thanks to the existing and mature tools


that provide automation while maintaining flexibility, FPGA prototypes now make it possible for ASIC/SoC designs to be delivered on time at minimal budget. Consequently, FPGA-based prototypes can be efficiently exploited for motion control applications, permitting easy modification of advanced control algorithms through short design cycles, simple simulation, and rapid verification. Still, the implementation of FPGA-based SoCs for motion control remains a very complex task involving skilled software and hardware developers. Efficient IP integration constitutes the main difficulty from the hardware perspective, while on the software side the issue is the complexity of debugging software that runs under a real-time operating system (RTOS) on real hardware. This paper discusses the choice, adaptation, and use of a rapid prototyping platform and design flow suitable for the design of on-chip motion controllers and other SoCs with a need for analog interfacing. Section 2 describes the chosen prototyping platform and the methodology that supports embedded application software coupled with custom FPGA logic and analog interfacing. Section 3 presents the strategy for simulating and prototyping any control algorithm using Xilinx System Generator (XSG) along with Matlab/Simulink. Vector control of an induction motor is taken as a running example to explain some features related to the cosimulation. Section 4 describes the process of integrating the designed controller, once completely debugged, within an SoC architecture using Xilinx Platform Studio (XPS) and targeting the chosen FPGA-based platform. Section 5 discusses the complex task of PCI initialization of the analog I/O card and controller setup by software under the embedded Linux operating system. Section 6 takes the induction motor vector control algorithm as an application basis to demonstrate the usefulness of the chosen FPGA-based SoC platform to design/verify on-chip motion controllers. The last section concludes the paper.

2. The FPGA-Based Prototyping Platform for On-Chip Motion Controllers

With the advent of a new generation of high-performance, high-density FPGAs offering speeds in the hundreds of MHz and complexities of up to 2 megagates, FPGA-based prototyping has become appropriate for the verification of SoC and ASIC designs. Consequently, increasing design complexities and the availability of high-capacity FPGAs in high-pin-count packages are motivating the need for sophisticated boards. Board development has become a task that demands unique expertise. That is one reason why commercial off-the-shelf (COTS) boards are quickly becoming the solution of choice: they are closely tied to the implementation and debugging tools. For many years, under its System-on-Chip Research Network (SOCRN) program, CMC Microsystems has provided Canadian universities with development tools, various DSP/embedded systems/multimedia boards, and SoC prototyping boards such as the Amirix AP1000 PCI FPGA development platform. In order to support our research on on-chip motion controllers, we have equipped a host PC (3.4 GHz Xeon CPU with 2.75 GB of RAM) with the Amirix AP1000 PCI FPGA development board and a multichannel analog I/O PMC daughter card (Figure 1) to communicate with the exterior world.

Figure 1: Rapid prototyping station equipped with FPGA board and multichannel analog I/O daughter card.

The AP1000 has many features to support complex system prototyping, including test access and expansion capabilities. The PCB is a 64-bit PCI card that can be inserted in a standard expansion slot on a PC motherboard or PCI backplane. Use of the PMC site requires a second chassis slot on the backside of the board and an optional extender card to provide access to the board I/O. The AP1000 platform includes a Xilinx Virtex-II Pro XC2VP100 FPGA connected to dual banks of DDR SDRAM (64 MB) and SRAM (2 MB), Flash memory (16 MB), Ethernet, and other interfaces. It is configured as a single-board computer based on two embedded IBM PowerPC processors, providing an advanced design starting point for the designer to improve time-to-market and reduce development costs.

The analog electronics are considered modular and can either be external or included on the same chip (e.g., when fabricated into an ASIC). On the prototyping platform, of course, they are supplied by the PMC daughter card. It is a General Standards PMC66-16AISS8A04 analog I/O board featuring twelve 16-bit channels: eight simultaneously sampled analog inputs and four analog outputs, with input sampling rates up to 2.0 MSPS per channel. It acts as a two-way analog interface between the FPGA and lab equipment, connected through an 80-pin ribbon cable and a breakout board to the appropriate ports of the power module.

The application software is compiled with the free Embedded Linux Development Kit (ELDK) from DENX Software Engineering. Since it runs under an operating system as complete as Linux, it can perform elaborate functions, including user interface management (via a serial link or through networking), and real-time supervision and adaptation of a process, such as adaptive control.

The overall platform is very well suited to FPGA-in-the-loop control and SoC controller prototyping (Figure 2). The controller can either be implemented completely in digital hardware, or executed on an application-specific instruction set processor (ASIP). The hardware approach has a familiar design flow, using the Xilinx System Generator (XSG) blockset and hardware/software cosimulation features in Matlab/Simulink. An ASIP specially devised for advanced control applications is currently under development within our laboratory.

Figure 2: Architecture of the embedded platform driving a power system (schematic not to scale). On the AP1000 board, the Virtex-II Pro XC2VP100 hosts a PowerPC 405, SDRAM controller, interrupt controller, UART, and the user logic, connected by the PLB and OPB buses and bridges; a PCI bridge links the FPGA to the General Standards PMC analog I/O card (twelve 16-bit channels: four outputs and eight simultaneously sampled inputs), which interfaces to the power module and AC induction motor.

3. Matlab/Simulink/XSG Controller Design

It is well known that simulation of large systems within system analysis and modelling software environments takes a prohibitive amount of time. The main advantage of a rapid prototyping design flow with hardware/software cosimulation is that it provides the best of a system analysis and modelling environment while offering adequate hardware acceleration.

Hardware/software cosimulation was introduced by major EDA vendors around the year 2000, combining Matlab/Simulink, the computing and Model-Based Design software, with synthesizable blocksets and automated hardware synthesis software such as DSP Builder from Altera and System Generator from Xilinx (XSG). Such a design flow reduces the learning time and development risk for DSP developers, shortens the path from design concept to working hardware, and enables engineers to rapidly create and implement innovative, high-performance DSP designs.

The XSG cosimulation feature allows the user to run a design on the FPGA found on a given platform. An important advantage of XSG is that it allows for quick evaluation of system response when making changes (e.g., changing coefficient and data widths). As the AP1000 is not among the preprogrammed cosimulation targets supported by XSG, we use the Virtex-4 ML402 SX XtremeDSP Evaluation Platform instead (Figure 3). The AP1000 is only targeted at the SoC integration step (see Section 4).

Figure 3: Virtex-4 ML402 SX XtremeDSP evaluation platform.

We begin with a conventional, floating-point, simulated control system model; a corresponding fixed-point hardware representation is then constructed using the XSG blockset, leading to a bit-accurate FPGA hardware model (Figure 4), and XSG generates synthesizable HDL targeting Xilinx FPGAs. The XSG design, simulation, and test procedure is briefly outlined below. Power systems including motor drives can be simulated using the SimPowerSystems (SPS) blockset in Simulink.

(1) Start by coding each system module individually with the XSG blockset.

(2) Import any user-designed HDL cores.

(3) Adjust the fixed-point bit precisions (including bit widths and binary point position) for each XSG block of the system.

(4) Use the Xilinx Gateway blocks to interface a floating-point Simulink model with a fixed-point XSG design. The Gateway-in and Gateway-out blocks, respectively, convert inputs from Simulink to XSG and outputs from XSG to Simulink.

(5) Test the system response using the same input stimuli for an equivalent XSG design and Simulink model, with automatic comparison of their respective outputs.

Commonly, software simulation of a complete drive model, for a few seconds of results, could take a couple of days of computer time. Hardware/software cosimulation can be used to accelerate the process of controller simulation, reducing the computing time to about a couple of hours. It also ensures that the design will respond correctly once implemented in hardware.

4. System-on-Chip Integration in Xilinx Platform Studio

FPGA design and the SoC architecture are managed with Xilinx Platform Studio (XPS), targeting the AP1000. We have customized the CMC-modified Amirix baseline design to support analog interfacing, user logic on the Processor Local Bus (PLB), and communication with application software under embedded Linux. XPS generates the corresponding .bin file, which is then transferred to the Flash configuration memory on the AP1000. The contents of this memory are used to reconfigure the FPGA. We have found, as an undocumented fact, that on the AP1000 this approach is the only practicable way to program the FPGA: JTAG programming proved inconvenient, because it suppresses the embedded Linux, which is essential to us for PCI initialization. Once programmed, the user logic awaits a start signal from our application software following analog I/O card initialization.

Figure 4: Controller-on-chip design flow (hardware design flow: Matlab/Simulink modelling, controller synthesis in XSG, functional simulation, and integration of hardware library and application-specific components into the baseline SoC architecture; software design flow: software libraries and drivers, source-level integration, low-level software simulation, and application-specific code under the embedded Linux operating system; the two sides are linked by a hardware/software cosimulation data link to the FPGA prototype).

To accelerate the logic synthesis process, the mapper and place-and-route options are set to STD (standard) in the implementation options file (etc/fast_runtime.opt), found in the Project Files menu. If the user wants a more aggressive effort, these options should be changed to HIGH, which requires much more time; our experiments have shown that it typically amounts to several hours.

4.1. Bus Interfacing. The buses implemented in FPGA logic follow the IBM CoreConnect standard. It provides master and slave operation modes for any instantiated hardware module. The most important system buses are the Processor Local Bus (PLB) and the On-chip Peripheral Bus (OPB).

The implementation of the vector control scheme requires much less generality, and drops some communication stages that might be used in other applications. It is easier to start from such a generic design and remove unneeded features than to start from scratch. This way, one can quickly progress from the SoC architecture in XPS down to a working controller on the AP1000.

4.1.1. Slave Model Register Read Mux. The baseline XPS design provides the developer with a slave model register read multiplexer. This allows the developer to decide which data are provided when a read request is sent to the user logic peripheral by another peripheral in the system. While a greater number may be used, our pilot application, the vector control, only uses four slave registers. The user logic peripheral has a specific base address (C_BASEADDR), and the four 32-bit registers are accessed through C_BASEADDR + register offset. In this example, C_BASEADDR + 0x0 corresponds to the control and status register, which is composed of the following bits:

0–7: the DIP switches on the AP1000, for debugging purposes,

8: used by user software to reset, start, or stop the controller,

9–31: reserved.

As for the other three registers, they correspond to:

C_BASEADDR + 0x4: output to analog channel 1,

C_BASEADDR + 0x8: output to analog channel 2,

C_BASEADDR + 0xC: reserved (often used for debugging purposes).

A software-side sketch of this register map is given below.
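As an illustration only (not code from the paper), the embedded application software could drive this register map through a pointer obtained with ioremap_nocache(), in the same style as the driver code of Section 5. The symbol names, the use of the PowerPC big-endian accessors in_be32()/out_be32(), and the bit-numbering convention are assumptions.

#include <asm/io.h>

/* Hypothetical helpers for the slave register map described above. */
#define REG_CTRL_STATUS 0x0 /* C_BASEADDR + 0x0: control and status          */
#define REG_ANALOG_CH1  0x4 /* C_BASEADDR + 0x4: output to analog channel 1  */
#define REG_ANALOG_CH2  0x8 /* C_BASEADDR + 0x8: output to analog channel 2  */
#define CTRL_START_BIT  8   /* bit 8: reset/start/stop the controller        */

static volatile u8 *ul_base; /* mapped at C_BASEADDR with ioremap_nocache() */

static void userlogic_start(void)
{
    /* Read-modify-write the control/status register; LSB-first bit
       numbering is an assumption here. */
    u32 csr = in_be32((volatile u32 *)(ul_base + REG_CTRL_STATUS));
    out_be32((volatile u32 *)(ul_base + REG_CTRL_STATUS),
             csr | (1u << CTRL_START_BIT));
}

static void userlogic_write_ch1(u32 value)
{
    out_be32((volatile u32 *)(ul_base + REG_ANALOG_CH1), value);
}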

4.1.2. Master Model Control. The master model control state machine is used to control the requests and responses between the user logic peripheral and the analog I/O card. The latter is used to read the input currents and voltages for vector control operation. The start signal previously mentioned in slave register 0 is what gets the state machine out of IDLE mode, and thus starts the data acquisition process. In this specific example, the I/O card is previously initialized by the embedded application software, relieving the state machine of any initialization code. Analog I/O initialization sets many parameters, including how many active channels are to be read.

The state machine operates in the following way (Figure 5).

(1) The user logic waits for a start signal from the user through slave register 0.

(2) The different addresses to access the right AIO card fields are set up, namely, the BCR and read buffer.

(3) A trigger is sent to the AIO card to buffer the values of all desired analog channels.

(4) A read cycle is repeated for the number of active channels previously defined.

(5) Once all channels have been read, the state machine falls back to the trigger state, unless the user chooses to stop the process using slave register 0 (a C model of this state machine is sketched below).
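The following C model of the state machine above is an illustrative sketch only; the actual implementation is FPGA user logic (typically VHDL/Verilog), and the probe functions declared extern here are hypothetical stand-ins for hardware signals.

extern int start_requested(void);       /* slave register 0, start bit    */
extern int stop_requested(void);        /* slave register 0, stop request */
extern void setup_aio_addresses(void);
extern void trigger_aio(void);
extern void read_channel(int chan);

enum aio_state { IDLE, ADDR_SETUP, AIO_TRIGGER, READ_CYCLE, PAUSE };

void master_fsm_step(enum aio_state *st, int *chan, int n_active)
{
    switch (*st) {
    case IDLE:        /* (1) wait for the start signal */
        if (start_requested())
            *st = ADDR_SETUP;
        break;
    case ADDR_SETUP:  /* (2) set up BCR and read-buffer addresses */
        setup_aio_addresses();
        *st = AIO_TRIGGER;
        break;
    case AIO_TRIGGER: /* (3) trigger the AIO card to buffer all channels */
        trigger_aio();
        *chan = 0;
        *st = READ_CYCLE;
        break;
    case READ_CYCLE:  /* (4) read one active channel per pass */
        read_channel((*chan)++);
        if (*chan == n_active) /* (5) all channels read */
            *st = stop_requested() ? IDLE : AIO_TRIGGER;
        break;
    case PAUSE:       /* pause state shown in Figure 5; no action here */
        break;
    }
}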

4.2. Creating or Importing User Cores. User-designed logic and other IPs can be created or imported into the XPS design following this procedure.

(1) Select Create or Import Peripheral from the Hardware menu, and follow the wizard (unless otherwise stated below, the default options should be accepted).

(2) Choose the preferred bus. In the case of our vector controller, it is connected to the PLB.

(3) For PLB interfacing, select the following IPIF services:

(a) burst and cacheline transaction support,

(b) master support,

(c) S/W register support.

(4) The User S/W Register data width should be 32.

(5) Accept the other wizard options as default, then click Finish.

(6) You should find your newly created/imported core in the Project Repository of the IP Catalog; right click on it, and select Add IP.

(7) Finally, go to the Assembly tab in the main System Assembly View, and set the base address (e.g., 0x2a001000), the memory size (e.g., 512), and the bus connection (e.g., plb_bus).

4.3. Instantiating a Netlist Core. Using HDL generated by System Generator may be inconvenient for large control systems described with the XSG blockset, as it can require a couple of days of synthesis time. System Generator can instead be asked to produce a corresponding NGC binary netlist file, which is then treated as a black box to be imported and integrated into an XPS project. This considerably reduces the synthesis time needed. The process of instantiating a netlist core in a custom peripheral (e.g., user_logic.vhd) is performed following the steps documented in the XPS user guide.

Figure 5: Master model state machine (states: IDLE, addresses setup, BCR and status, AIO trigger, read cycle, and PAUSE; transitions on the start signal, setup completed, trigger ACK, one active channel read, read another active channel, all channels read, and stop signal).

Table 1: The two Intel StrataFlash memory devices.

Bank  Address     Size               Mode  Description
1     0x20000000  0x1000000 (16 MB)  16    Program Flash
2     0x24000000  0x1000000 (16 MB)  8     Config. Flash

Table 2: AP1000 flash configurations.

Region  Bank  Sectors  Description
0       2     0–39     Configuration 0
1       2     40–79    Configuration 1
2       2     80–127   Configuration 2 (default config.)

4.4. BIN File Generation and FPGA Configuration. To configure the FPGA, a BIN file must be generated from the XPS project. Since JTAG programming disables the embedded Linux, the BIN file must be downloaded directly to onboard Flash memory. There are two Intel StrataFlash memory devices on the AP1000: one for the configuration, and one for the U-boot bootstrap code (which must not be overwritten) (Table 1).

The configuration memory (Table 2) is divided into three regions. Region 2 is the default Amirix configuration and should not be overwritten. Downloading the BIN file to memory is done over a network cable using the TFTP protocol. For this purpose, a TFTP server must be set up on the host PC. The remote side of the protocol is managed by U-boot on the AP1000. Commands to U-boot to initiate the transfer and to trigger FPGA reconfiguration from a designated region are entered by the user through a serial link terminal program. Here is the complete U-boot command sequence:

setenv serverip 132.212.202.166
setenv ipaddr 132.212.201.223
erase 2:0-39
tftp 00100000 download.bin
cp.b 00100000 24000000 00500000
swrecon


5. Application Software and Drivers

One of the main advantages of using an embedded Linux system is the ability to perform the complex task of PCI initialization. In addition, it allows application software to provide elaborate interfacing and user monitoring through appropriate software drivers. Initialization of the analog I/O card on the PMC site and controller setup are among the tasks that are best performed by software.

5.1. Linux Device Drivers Essentials. Appropriate device drivers have to be written in order to use daughter cards (such as an analog I/O board) or custom hardware components on a bus internal to the SoC, and to be able to communicate with them from the embedded Linux. Drivers and application software for the AP1000 can be developed with the free Embedded Linux Development Kit (ELDK) from DENX Software Engineering, Germany. The ELDK includes the GNU cross development tools, along with prebuilt target tools and libraries to support the target system. It comes with full source code, including all patches, extensions, programs, and scripts used to build the tools. A complete discussion of writing Linux device drivers is beyond the scope of this paper; this information may be found elsewhere, such as in [8]. Here, we only mention a few important issues relevant to the pilot application.

To support all the required functions when creating a Linux device driver, the following includes are needed:

#include <linux/config.h>
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/ioport.h>
#include <linux/ioctl.h>
#include <linux/byteorder/big_endian.h>
#include <asm/io.h>
#include <asm/system.h>
#include <asm/uaccess.h>

5.2. PCI Access to the Analog I/O Board. The pci_find_device() function begins or continues searching for a PCI device by vendor/device ID. It iterates through the list of known PCI devices, and if a PCI device is found with a matching vendor and device, a pointer to its device structure is returned. Otherwise, NULL is returned. For the PMC66-16AISS8A04, the vendor ID is 0x10e3, and the device ID is 0x8260. The device must then be enabled with pci_enable_device() before it can be used by the driver. The start address of the base address registers (BARs) can be obtained using pci_resource_start(). In the example, we get BAR 2, which gives access to the main control registers of the PMC66-16AISS8A04.

volatile u32 *base_addr;
struct pci_dev *dev;
struct resource *ctrl_res;

dev = pci_find_device(VENDORID, DEVICEID, NULL);
...
pci_enable_device(dev);
get_revision(dev);
base_addr = (volatile u32 *) pci_resource_start(dev, 2);
ctrl_res = request_mem_region((unsigned long) base_addr, 0x80L, "control");
bcr = (u32 *) ioremap_nocache((unsigned long) base_addr, 0x80L);

The readl() and writel() functions are defined to access PCI memory space in units of 32 bits. Since the PowerPC is big-endian while the PCI bus is by definition little-endian, a byte swap occurs when reading and writing PCI data. To ensure correct byte order, the le32_to_cpu() and cpu_to_le32() functions are used on incoming and outgoing data. The following code example defines some macros to read and write the Board Control Register, to read data from the analog input buffer, and to write to one of the four analog output channels.

volatile u32 *bcr;

#define GET_BCR()       (le32_to_cpu(readl(bcr)))
#define SET_BCR(x)      writel(cpu_to_le32(x), bcr)
#define ANALOG_IN()     le32_to_cpu(readl(&bcr[ANALOG_INPUT_BUF]))
#define ANALOG_OUT(x,c) writel(cpu_to_le32(x), &bcr[ANALOG_OUTPUT_CHAN_00 + (c)])
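As a usage sketch only, here is a plausible acquisition sequence built on the macros above. The trigger-bit mask and the busy-wait protocol are assumptions for illustration, not the documented programming sequence of the PMC66-16AISS8A04.

#define BCR_INPUT_TRIGGER 0x00001000 /* hypothetical trigger bit in the BCR */

static u32 sample_one_value(void)
{
    SET_BCR(GET_BCR() | BCR_INPUT_TRIGGER); /* start a conversion            */
    while (GET_BCR() & BCR_INPUT_TRIGGER)   /* assumed: bit clears when done */
        ;
    return ANALOG_IN();                     /* pop the analog input buffer   */
}

static void drive_channel0(u32 code)
{
    ANALOG_OUT(code, 0);                    /* write analog output channel 0 */
}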

5.3. Cross-Compilation with the ELDK. To properly compile with the ELDK, a makefile is required. Kernel source code should be available in KERNELDIR to provide for essential includes. The version of the kernel preinstalled on the AP1000 is Linux 2.4. Example of a minimal makefile:

TARGET = thetarget
OBJS   = myobj.o

# EDIT THE FOLLOWING TO POINT TO
# THE TOP OF THE KERNEL SOURCE TREE
KERNELDIR = ~/kernel-sw-003996-01

CC = ppc_4xx-gcc
LD = ppc_4xx-ld

DEFINES  = -D__KERNEL__ -DMODULE -DEXPORT_SYMTAB

INCLUDES = -I$(KERNELDIR)/include \
           -I$(KERNELDIR)/include/linux \
           -I$(KERNELDIR)/include/asm

FLAGS    = -fno-strict-aliasing \
           -fno-common \
           -fomit-frame-pointer \
           -fsigned-char

CFLAGS   = $(DEFINES) $(WARNINGS) $(INCLUDES) $(SWITCHES) $(FLAGS)

all: $(TARGET).o Makefile

$(TARGET).o: $(OBJS)
	$(LD) -r -o $@ $^

5.4. Software and Driver Installation on the AP1000. For ease of manipulation, user software and drivers are best carried on a CompactFlash card, which is then inserted in the back slot of the AP1000 and mounted into the Linux file system. The drivers are then installed, and the application software started, as follows:

mount /dev/discs/disc0/part1 /mnt
insmod /mnt/logic2/hwlogic.o
insmod /mnt/aio.o
cd /dev
mknod hwlogic c 254 0
mknod aio c 253 0
/mnt/cvecapp
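For completeness, here is a minimal sketch of the driver side that matches the mknod commands above (major number 254 for hwlogic), in the kernel-2.4 style used with the ELDK; the file operations are left as placeholders.

#include <linux/module.h>
#include <linux/init.h>
#include <linux/fs.h>

#define HWLOGIC_MAJOR 254 /* must agree with "mknod hwlogic c 254 0" */

static struct file_operations hwlogic_fops = {
    /* .open, .read, .write, .ioctl, .release go here */
};

static int __init hwlogic_init(void)
{
    /* register the character device under the fixed major number */
    return register_chrdev(HWLOGIC_MAJOR, "hwlogic", &hwlogic_fops);
}

static void __exit hwlogic_exit(void)
{
    unregister_chrdev(HWLOGIC_MAJOR, "hwlogic");
}

module_init(hwlogic_init);
module_exit(hwlogic_exit);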

6. Application: AC Induction Motor Control

Given their well-known qualities of being cheap, highly robust, efficient, and reliable, AC induction motors currently constitute the bulk of the installed base in the motion industry. From the control point of view, however, these motors have highly nonlinear behavior.

6.1. FPGA-Based Induction Motor Vector Control. The selected control algorithm for our pilot application is the rotor-flux-oriented vector control (RFOC) of a three-phase AC induction motor of the squirrel-cage type. It is the first method which made it possible to artificially give some linearity to the torque control of induction motors [9].

The RFOC algorithm consists of a partial linearization of the physical model of the induction motor by breaking up the stator current $i_s$ into its components in a suitable reference frame $(d, q)$. This frame revolves synchronously with the rotor flux space vector in order to obtain separate control of the torque and the rotor flux. The overall strategy then consists in regulating the speed while maintaining the rotor flux constant (e.g., 1 Wb). The RFOC algorithm is directly derived from the electromechanical model of a three-phase, Y-connected, squirrel-cage induction motor, described by the following equations in the synchronously rotating reference frame $(d, q)$:

\[
\begin{aligned}
u_{sd} &= R_s i_{sd} + \sigma L_s \frac{d}{dt} i_{sd} \underbrace{{}-{} \sigma L_s \omega\, i_{sq} + \frac{M}{L_r}\frac{d}{dt}\Psi_r}_{D_d},\\
u_{sq} &= R_s i_{sq} + \sigma L_s \frac{d}{dt} i_{sq} \underbrace{{}+{} \sigma L_s \omega\, i_{sd} + \frac{M}{L_r}\,\omega\,\Psi_r}_{D_q},\\
\frac{d}{dt}\Psi_r &= \frac{R_r}{L_r}\left(M i_{sd} - \Psi_r\right),\\
\omega &= P_p\,\omega_r + \frac{M R_r}{\Psi_r L_r}\, i_{sq},\\
\frac{d\omega_r}{dt} &= \frac{3}{2}\,P_p\,\frac{M}{J L_r}\,\Psi_r\, i_{sq} - \frac{D}{J}\,\omega_r - \frac{T_l}{J},
\end{aligned}
\tag{1}
\]

where $u_{sd}$ and $u_{sq}$ are the $d$ and $q$ components of the stator voltage $u_s$; $i_{sd}$ and $i_{sq}$ are the $d$ and $q$ components of the stator current $i_s$; $\Psi_r$ is the modulus of the rotor flux and $\theta$ the angular position of the rotor flux; $\omega$ is the synchronous angular speed of the $(d, q)$ reference frame ($\omega = d\theta/dt$); $L_s$, $L_r$, and $M$ are the stator, rotor, and mutual inductances; $R_s$ and $R_r$ are the stator and rotor resistances; $\sigma$ is the leakage coefficient of the motor; $P_p$ is the number of pole pairs; $\omega_r$ is the mechanical rotor speed; $D$ is the damping coefficient; $J$ is the moment of inertia; and $T_l$ is the load torque.

6.2. RFOC Algorithm. The derived expressions for each block composing the induction motor RFOC scheme, as shown in Figure 6, are given as follows:

Speed PI Controller:

\[ i^*_{sq} = k_{pv}\,\varepsilon_v + k_{iv}\int \varepsilon_v\,dt, \qquad \varepsilon_v = \omega^*_r - \omega_r. \tag{2} \]

Rotor Flux PI Controller:

\[ i^*_{sd} = k_{pf}\,\varepsilon_f + k_{if}\int \varepsilon_f\,dt, \qquad \varepsilon_f = \Psi^*_r - \Psi_r. \tag{3} \]

Rotor Flux Estimator:

\[ \Psi_r = \sqrt{\Psi_{r\alpha}^2 + \Psi_{r\beta}^2}, \tag{4} \]
\[ \cos\theta = \frac{\Psi_{r\alpha}}{\Psi_r}, \qquad \sin\theta = \frac{\Psi_{r\beta}}{\Psi_r}, \tag{5} \]

with

\[ \Psi_{r\alpha} = \frac{L_r}{M}\left(\Psi_{s\alpha} - \sigma L_s i_{s\alpha}\right), \qquad \Psi_{r\beta} = \frac{L_r}{M}\left(\Psi_{s\beta} - \sigma L_s i_{s\beta}\right), \tag{6} \]
\[ \Psi_{s\alpha} = \int\left(u_{s\alpha} - R_s i_{s\alpha}\right), \qquad \Psi_{s\beta} = \int\left(u_{s\beta} - R_s i_{s\beta}\right), \tag{7} \]

Figure 6: Conceptual block diagram of the system (speed PI controller, rotor flux PI controller, d- and q-current PI controllers, decoupling, rotor flux and ω estimators, Park/inverse Park and Clarke transforms, and the SVPWM module gating the three-phase voltage PWM inverter that drives the induction motor).

and using the Clarke transformation

\[ i_{s\alpha} = i_{sa}, \qquad i_{s\beta} = \frac{1}{\sqrt{3}}\, i_{sa} + \frac{2}{\sqrt{3}}\, i_{sb}, \tag{8} \]
\[ u_{s\alpha} = u_{sa}, \qquad u_{s\beta} = \frac{1}{\sqrt{3}}\, u_{sa} + \frac{2}{\sqrt{3}}\, u_{sb}. \tag{9} \]

Note that the sine and cosine of (5) reduce to a division, and therefore do not have to be calculated directly.

Current PI Controllers:

\[ v_{sd} = k_{pi}\,\varepsilon_{i_{sd}} + k_{ii}\int \varepsilon_{i_{sd}}\,dt, \qquad \varepsilon_{i_{sd}} = i^*_{sd} - i_{sd}, \tag{10} \]
\[ v_{sq} = k_{pi}\,\varepsilon_{i_{sq}} + k_{ii}\int \varepsilon_{i_{sq}}\,dt, \qquad \varepsilon_{i_{sq}} = i^*_{sq} - i_{sq}. \tag{11} \]

Decoupling:

\[ u_{sd} = \sigma L_s v_{sd} + D_d, \qquad u_{sq} = \sigma L_s v_{sq} + D_q, \tag{12} \]

with

\[ D_d = -\sigma L_s \omega\, i_{sq} + \frac{M}{L_r}\frac{d}{dt}\Psi_r, \qquad D_q = +\sigma L_s \omega\, i_{sd} + \frac{M}{L_r}\,\omega\,\Psi_r. \tag{13} \]

Omega ($\omega$) Estimator:

\[ \omega = P_p\,\omega_r + \frac{M R_r}{\Psi_r L_r}\, i_{sq}. \tag{14} \]

Park Transformation:

\[ \begin{bmatrix} i_{sd} \\ i_{sq} \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} i_{s\alpha} \\ i_{s\beta} \end{bmatrix}. \tag{15} \]

Inverse Park Transformation:

\[ \begin{bmatrix} u^*_{s\alpha} \\ u^*_{s\beta} \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} u_{sd} \\ u_{sq} \end{bmatrix}. \tag{16} \]

In the above equations, for $x$ standing for any variable such as the voltage $u_s$, the current $i_s$, or the rotor flux $\Psi_r$, we have the following.

($x^*$) Input reference corresponding to $x$.

($\varepsilon_x$) Error signal corresponding to $x$.

($k_{px}$, $k_{ix}$) Proportional and integral parameters of the PI controller of $x$.

($x_a$, $x_b$, $x_c$) The $a$, $b$, and $c$ three-phase components of $x$ in the stationary reference frame.

($x_\alpha$, $x_\beta$) The $\alpha$ and $\beta$ two-phase components of $x$ in the stationary reference frame.

($x_d$, $x_q$) The $d$ and $q$ components of $x$ in the synchronously rotating frame.

The RFOC scheme features vector transformations (Clarke and Park), four PI regulators, and a space-vector PWM generator (SVPWM). This algorithm is of interest for its good performance, and because it has a fair level of complexity that benefits greatly from a high-performance FPGA implementation. In fact, FPGAs make it possible to execute the loop of a complicated control algorithm in a matter of a few microseconds. The first prototype of such a controller has been developed using the method and platform described here, and has been implemented entirely in FPGA logic [10].
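To make the data flow concrete, here is a floating-point C sketch of one iteration of (2)–(16), of the kind that serves as a golden reference model before the fixed-point XSG translation. It is an illustration, not the paper's FPGA code: the motor parameters and PI gains are placeholders, and the decoupling terms (12)–(13) are omitted for brevity.

#include <math.h>

typedef struct { double kp, ki, integ; } pi_t;

static double pi_step(pi_t *pi, double err, double dt)
{
    pi->integ += err * dt;                    /* integral term */
    return pi->kp * err + pi->ki * pi->integ;
}

/* One RFOC step: measured phase quantities in, (u*_salpha, u*_sbeta) out. */
void rfoc_step(double isa, double isb, double usa, double usb,
               double wr, double wr_ref, double psi_ref, double dt,
               double *us_alpha_ref, double *us_beta_ref)
{
    /* Placeholder gains and motor parameters (assumptions). */
    static pi_t pi_spd = { 1.0, 10.0, 0.0 };
    static pi_t pi_flx = { 1.0, 10.0, 0.0 };
    static pi_t pi_id  = { 1.0, 100.0, 0.0 };
    static pi_t pi_iq  = { 1.0, 100.0, 0.0 };
    static double psi_sa = 0.0, psi_sb = 0.0;     /* stator flux integrators */
    const double Rs = 1.0, Ls = 0.1, Lr = 0.1, M = 0.09;
    const double sigma = 1.0 - M * M / (Ls * Lr); /* leakage coefficient */

    /* Clarke transform, (8)-(9) */
    double is_al = isa, is_be = isa / sqrt(3.0) + 2.0 * isb / sqrt(3.0);
    double us_al = usa, us_be = usa / sqrt(3.0) + 2.0 * usb / sqrt(3.0);

    /* Rotor flux estimator, (4)-(7) */
    psi_sa += (us_al - Rs * is_al) * dt;
    psi_sb += (us_be - Rs * is_be) * dt;
    double psi_ra = (Lr / M) * (psi_sa - sigma * Ls * is_al);
    double psi_rb = (Lr / M) * (psi_sb - sigma * Ls * is_be);
    double psi_r  = sqrt(psi_ra * psi_ra + psi_rb * psi_rb);
    if (psi_r < 1e-9) psi_r = 1e-9;               /* guard at startup */
    double c = psi_ra / psi_r, s = psi_rb / psi_r; /* cos/sin of (5) */

    /* Park transform, (15) */
    double isd =  c * is_al + s * is_be;
    double isq = -s * is_al + c * is_be;

    /* Outer speed/flux loops (2)-(3), inner current loops (10)-(11);
       decoupling (12)-(13) omitted in this sketch. */
    double isq_ref = pi_step(&pi_spd, wr_ref - wr, dt);
    double isd_ref = pi_step(&pi_flx, psi_ref - psi_r, dt);
    double usd = pi_step(&pi_id, isd_ref - isd, dt);
    double usq = pi_step(&pi_iq, isq_ref - isq, dt);

    /* Inverse Park transform, (16) */
    *us_alpha_ref = c * usd - s * usq;
    *us_beta_ref  = s * usd + c * usq;
}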

Commonly used media prior to the advent of today's large FPGAs, including DSPs alone and/or specialized microcontrollers, led to a total cycle time of more than 100 μs for vector control. This led to switching frequencies in the range of 1–5 kHz, which produced disturbing noise in the audible band. With today's FPGAs, it becomes possible to fit a very large control system on a single chip, and to support very high switching frequencies.

Figure 7: Experimental setup with power electronics, induction motor, and loads (power supply, high-voltage power module, squirrel-cage induction motor, dynamo with optical speed encoder, resistive load, and the cable interfaces to the analog I/O card and the FPGA digital I/O).

6.3. Validation of RFOC Using Cosimulation with XSG. A strong hardware/software cosimulation environment and methodology is necessary to allow validation of the hardware design against a theoretical control system model.

As mentioned in Section 3, the design flow adopted in this research uses the XSG blockset in Matlab/Simulink. The XSG model of the RFOC block is built up from (2)–(16), and the global system architecture is shown in Figure 8, where Gateway-in and Gateway-out blocks provide the necessary interface between the fixed-point FPGA hardware, which includes the RFOC and space-vector pulse width modulation (SVPWM) algorithms, and the floating-point Simulink blocksets, mainly the SimPowerSystems (SPS) models. In fact, to make the simulations more realistic, the three-phase AC induction motor and the corresponding voltage source inverter were modelled in Simulink using the SPS blockset, which is robust and well proven. Note that SVPWM is a widely used technique for three-phase voltage-source inverters (VSIs), and is well suited to AC induction motors.

At runtime, the hardware design (RFOC and SVPWM) is automatically downloaded into the actual FPGA device, and its response can then be verified in real time against that of the theoretical model simulated with the floating-point Simulink blocksets. An arbitrary load is induced by varying the torque load variable $T_l$ as a function of time. SPS receives a reference voltage from the controller through the inverse Park transformation module. This voltage consists of two quadrature voltages ($u^*_{s\alpha}$, $u^*_{s\beta}$), plus the angle (sine/cosine) of the voltage phasor corresponding to the rotor flux orientation (Figure 6).

6.4. Reducing Cosimulation Times. In a closed-loop setting such as RFOC, hardware acceleration is only possible as long as the replaced block does not require a lot of steps for completion. If the XSG design requires more steps to process the data which is sent than what is necessary for the next data to be ready for processing, a costly (time-wise) adjustment has to be made. The Simulink period for a given simulated FPGA clock (one XSG design step) must be reduced, while the rest of the Simulink system runs at the same speed as before. In a fixed-step Simulink simulation environment, this means that the fixed step size must be reduced enough so that the XSG system has plenty of time to complete between two data acquisitions. Obviously, such lengthy simulations should only be launched once the debugging process is finished and the controller is ready to be thoroughly tested.

Once the control algorithm is designed with XSG, the HW/SW cosimulation procedure consists of the following.

(1) Building the interface between Simulink and the FPGA-based cosimulation board.

(2) Making a hardware cosimulation design.

(3) Executing hardware cosimulation.

When using the Simulink environment for cosimulation, one should distinguish between the single-step and free-running modes in order to get much shorter simulation times for debugging purposes.

Single-step cosimulation can improve simulation time when replacing one part of a bigger system. This is especially true when replacing blocks that cannot be natively accelerated by Simulink, like embedded Matlab functions. Replacing a block with an XSG cosimulated design shifts the burden from Matlab to the FPGA, and the block no longer remains the simulation's bottleneck.

Free-running cosimulation means that the FPGA will always be running at full speed. Simulink will no longer be dictating the speed of an XSG step as was the case in single-step cosimulation. With the Virtex-4 ML402 SX XtremeDSP Evaluation Platform, that step will now be a fixed 10 nanoseconds. Therefore, even a very complicated system requiring many steps for completion should have ample time to process its data before the rest of the Simulink system does its work. Nevertheless, a synchronization mechanism should always be used for linking the free-running cosimulation block with the rest of the design, to ensure an exterior start signal will not be mistakenly interpreted as more than one start pulse. Table 3 shows the decrease of simulation time afforded by the free-running mode for the induction motor vector control. This has been implemented using XSG, with the motor and its SVPWM-based drive modeled using the SPS blockset from Simulink. For the same precision and the same amount of data to be simulated (speed variations over a period of 7 seconds), a single-step approach would require 100.7 times longer to complete, thus being an ineffective approach. A more complete discussion of our methodology for rapid testing of an XSG-based controller using free-running cosimulation and SPS has been given in [11].
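The synchronization concern above can be modeled as a simple rising-edge one-shot; the C sketch below is illustrative only (on the FPGA this would be a one-flop edge detector in the free-running clock domain).

/* Returns 1 for exactly one step per rising edge of the start line. */
int start_pulse(int start_level)
{
    static int prev = 0;
    int pulse = start_level && !prev;
    prev = start_level;
    return pulse;
}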

6.5. Timing Analysis. Before actually generating a BIT file to reconfigure the FPGA, and whether the cosimulation is done through JTAG or Ethernet, the design must be able to run at 100 MHz (a step time of 10 nanoseconds).


Figure 8: Induction motor RFOC drive, as modelled with XSG and SPS blocksets (the fixed-point System Generator blockset domain holds the vector control and firing-signal generation; the floating-point Power System blockset domain holds the motor drive, sensors, and speed and rotor flux references; discrete simulation with Ts = 2.5e-6 s).

Table 3: Simulation times and methods.

Type of simulation          Simulation time
Free-running cosimulation   1734 s
Single-step cosimulation    174610 s (about 48 hours)

As long as the design is running inside Simulink, there are never any issues with meeting timing requirements for the XSG model. Once completed, the design will be synthesized and simulated on the FPGA. If the user launches the cosimulation block generation process, timing errors are only reported quite far into the operation. This means that, after waiting for a relatively long delay (sometimes 20–30 minutes, depending on the complexity of the design and the speed of the host computer), the user notices the failure to meet timing requirements, with no extra information to quickly identify the problem. This is why the timing analysis tool should always be run prior to cosimulation. While it might seem a bit time-consuming, this tool will not simply report that the design does not meet requirements; it will give the insight required to fix the timing problems. Once the control algorithm has been fully designed, analysed (timing-wise), and debugged through the aforementioned FPGA-in-the-loop simulation platform, the corresponding NGC binary netlist file or VHDL/Verilog code is automatically generated. These can then be integrated within the SoC architecture using Xilinx Platform Studio (XPS), targeting the AP1000 platform, as described in Section 4.

6.6. Experimental Setup. Figure 7 shows the experimental setup with power electronics, induction motor, and loads. The power supply is taken from a 220 V outlet. The high-voltage power module, from Microchip, is connected to the analog I/O card through a rainbow flex cable, and to the expansion digital I/Os of the AP1000 through another parallel cable. Signals from a 1000-line optical speed encoder are among the digital signals fed to the FPGA. As for the loads, there are both a manually controlled resistive load box and a dynamo coupled to the motor shaft.

From the three motor phases, three currents and three voltages (all prefiltered and prescaled) are fed to the analog I/O board to be sampled. Samples are stored in an internal input buffer until fetched by the controller on the FPGA. Data exchange between the FPGA and the I/O board proceeds through the PLB and the Dual Processor PCI Bus Bridge, to and from the PMC site.

The process of generating SVPWM signals continuously runs in parallel with the controller logic, but the speed at which these signals are generated is greater than the speed required for the vector control processing. As a consequence, these two processes are designed and tested separately before being assembled and tested together.

Power gating and motor speed decoding are continuous processes that have critical clocking constraints beyond the capabilities of bus operation to and from the I/O board. Therefore, even though the PMC66-16AISS8A04 board also provides digital I/O, both the PWM gating signals and the input pulses from the optical speed encoder are passed directly through FPGA pins to be processed by dedicated hardware logic. This is done by plugging a custom-made adapter card with Samtec CON 0.8 mm connectors into the expansion site on the AP1000. While the vector control uses data acquired from the AIO card through a state machine, the PWM signals are constantly fed to the power module (Figure 6). Those signals are sent directly through the general-purpose digital outputs on the AP1000 itself instead of going through the AIO card. This ensures complete control over the speed at which these signals are generated and sent, while targeting a specific operating frequency (16 kHz in our example). This way, the speed calculations required for the vector control algorithm are done using precise clocking, without adding to the burden of the state machine which dictates the communications between the FPGA and the AIO card. The number of transitions found on the signal lines between the FPGA and the speed encoder is used to evaluate the speed at which the motor is operating.
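A sketch of the speed evaluation just described: count encoder-line transitions over a fixed gate time. The x4 quadrature factor and the gate-time parameter are illustrative assumptions; the 1000-line resolution is from the text.

#define ENC_LINES   1000.0 /* optical encoder lines per revolution            */
#define EDGE_FACTOR 4.0    /* transitions per line with A/B quadrature (assumed) */

/* transitions counted during gate_s seconds -> mechanical speed in rad/s */
double encoder_speed(unsigned transitions, double gate_s)
{
    double rev_per_s = transitions / (ENC_LINES * EDGE_FACTOR) / gate_s;
    return 2.0 * 3.141592653589793 * rev_per_s;
}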

6.7. Timing Issues. Completion of one loop cycle of our vector control design takes 122 steps, leading to a computation time of less than 1.5 μs. Note that for a sampling rate of 32 kHz, the SVPWM signal has 100 divisions (two zones of 50 each), which has been chosen as a good compromise between precision and simulation time. The simulation fixed-step size is then 625 nanoseconds, which is already small enough to hinder the performance of simulating the SPS model. Since PWM signal generation is divided into two zones, the 122 vector control steps must complete for every 50 steps of Simulink operations (PWM signal generation and SPS model simulation). The period of the XSG-Simulink system must be adjusted in order for the XSG model to run 2.44 times faster than the other Simulink components. The simulation fixed-step size becomes 2.56 nanoseconds, thus prolonging simulation time. In other words, since the SPS model and PWM signal generation take little time (in terms of steps) to complete, whereas the vector control scheme requires numerous steps, the coupling of the two forces the use of a very small simulation fixed-step size.
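Working through the figures quoted above (an interpretive reconstruction of the arithmetic, not additional data from the paper):

\[
\begin{aligned}
t_{\text{loop}} &= 122 \times 10\ \text{ns} = 1.22\ \mu\text{s} < 1.5\ \mu\text{s},\\
122 / 50 &= 2.44 \quad \text{(XSG steps per Simulink PWM division)},\\
625\ \text{ns} / 244 &\approx 2.56\ \text{ns} \quad \text{(common fundamental step: SPS fires every 244 steps, XSG every 100)}.
\end{aligned}
\]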

7. Conclusion

In this paper, we have discussed our choice, adaptation, and use of a rapid prototyping platform and design flow suitable for the design of on-chip motion controllers and other SoCs with a need for analog interfacing. It supports embedded application software coupled with custom FPGA logic and analog interfacing, and is very well suited to FPGA-in-the-loop control and SoC controller prototyping. Such a platform is suitable for academia and the research community, who often cannot afford the expensive commercial solutions for FPGA-in-the-loop simulation [12, 13].

A convenient FPGA design, simulation, and test procedure, suitable for advanced feedback controllers, has been outlined. It uses the Xilinx System Generator blockset in Matlab/Simulink and a simulated motor drive described with the SPS blockset. SoC integration of the resulting controller is done in Xilinx Platform Studio. Our custom SoC design has been described, with highlights on the state machine for bus interfacing, NGC file integration, BIN file generation, and FPGA configuration.

Application software and driver development for embedded Linux is often needed to provide for PCI and analog I/O card initialization, interfacing, and monitoring. We have provided some pointers here, along with essential information not easily found elsewhere. The proposed design flow and prototyping platform have been applied to the analysis, design, and hardware implementation of a vector controller for three-phase AC induction motors, with very good performance results. The resulting computation time of about 1.5 μs can in fact be considered record-breaking for such a controller.

Acknowledgments

This research is funded by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC). CMC Microsystems provided development tools and support through the System-on-Chip Research Network (SOCRN) program.

References

[1] “Accelerating Canadian competitiveness through microsys-tems: strategic plan 2005–2010,” Tech. Rep., CMC Microsys-tems, Kingston, Canada, 2004.

[2] J. J. Rodriguez-Andina, M. J. Moure, and M. D. Valdes,“Features, design tools, and application domains of FPGAs,”IEEE Transactions on Industrial Electronics, vol. 54, no. 4, pp.1810–1823, 2007.

[3] R. Dubey, P. Agarwal, and M. K. Vasantha, “Programmablelogic devices for motion control—a review,” IEEE Transactionson Industrial Electronics, vol. 54, no. 1, pp. 559–566, 2007.

[4] E. Monmasson and M. N. Cirstea, “FPGA design methodologyfor industrial control systems—a review,” IEEE Transactions onIndustrial Electronics, vol. 54, no. 4, pp. 1824–1842, 2007.

[5] D. Zhang, A stochastic approach to digital control design andimplementation in power electronics, Ph.D. thesis, Florida StateUniversity College of Engineering, Tallahassee, Fla, USA, 2006.

[6] Y.-Y. Tzou and H.-J. Hsu, “FPGA realization of space-vectorPWM control IC for three-phase PWM inverters,” IEEETransactions on Power Electronics, vol. 12, no. 6, pp. 953–963,1997.

Page 22: FPGA Supercomputing Platfroms, Architecture, And Techniques

12 EURASIP Journal on Embedded Systems

[7] A. de Castro, P. Zumel, O. Garcıa, T. Riesgo, and J. Uceda,“Concurrent and simple digital controller of an AC/DCconverter with power factor correction based on an FPGA,”IEEE Transactions on Power Electronics, vol. 18, no. 1, part 2,pp. 334–343, 2003.

[8] "Developing device drivers for Linux Kernel 1.4," Tech. Rep., CMC Microsystems, Kingston, Canada, 2006.

[9] B. K. Bose, Power Electronics and Variable-Frequency Drives: Technology and Applications, IEEE Press, New York, NY, USA, 1996.

[10] J.-G. Mailloux, Prototypage rapide de la commande vectorielle sur FPGA à l'aide des outils Simulink—System Generator, M.S. thesis, Université du Québec à Chicoutimi, Québec, Canada, January 2008.

[11] J.-G. Mailloux, S. Simard, and R. Beguenane, "Rapid testing of XSG-based induction motor vector controller using free-running hardware co-simulation and SimPowerSystems," in Proceedings of the 5th International Conference on Computing, Communications and Control Technologies (CCCT '07), Orlando, Fla, USA, July 2007.

[12] C. Dufour, S. Abourida, J. Belanger, and V. Lapointe, "Real-time simulation of permanent magnet motor drive on FPGA chip for high-bandwidth controller tests and validation," in Proceedings of the 32nd Annual Conference on IEEE Industrial Electronics (IECON '06), pp. 4581–4586, Paris, France, November 2006.

[13] National Instruments, "Creating Custom Motion Control and Drive Electronics with an FPGA-based COTS System," 2006.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 727965, 14 pages
doi:10.1155/2009/727965

Research Article

Parallel Backprojection: A Case Study in High-Performance Reconfigurable Computing

Ben Cordes and Miriam Leeser

Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115, USA

Correspondence should be addressed to Miriam Leeser, [email protected]

Received 22 June 2008; Accepted 18 December 2008

Recommended by Vinay Sriram

High-performance reconfigurable computing (HPRC) is a novel approach to provide large-scale computing power to modern scientific applications. Using both general-purpose processors and FPGAs allows application designers to exploit fine-grained and coarse-grained parallelism, achieving high degrees of speedup. One scientific application that benefits from this technique is backprojection, an image formation algorithm that can be used as part of a synthetic aperture radar (SAR) processing system. We present an implementation of backprojection for SAR on an HPRC system. Using simulated data taken at a variety of ranges, our implementation runs over 200 times faster than a similar software program, with an overall application speedup better than 50x. The backprojection application is easily parallelizable, achieving near-linear speedup when run on multiple nodes of a clustered HPRC system. The results presented can be applied to other systems and other algorithms with similar characteristics.

Copyright © 2009 B. Cordes and M. Leeser. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

In the continuing quest for computing architectures that are capable of solving more computationally complex problems, a new direction of study is high-performance reconfigurable computing (HPRC). HPRC can be defined as a marriage of traditional high-performance computing (HPC) techniques and reconfigurable computing (RC) devices.

HPC is a well-known set of architectural solutions for speeding up the computation of problems that can be divided neatly into pieces. Multiple general-purpose processors (GPPs) are linked together with high-speed networks and storage devices such that they can share data. Pieces of the problem are then distributed to the individual processors and computed, and the answer is assembled from the pieces. Commonly available HPC systems include Beowulf clusters and other supercomputers. Reconfigurable computing uses many of the same concepts as HPC, but at a finer grain. A special-purpose processor (SPP), often a field-programmable gate array (FPGA), is attached to a GPP and programmed to execute a useful function. Special-purpose hardware computes the answer to the problem quickly by exploiting hardware design techniques such as pipelining, the replication of small computation units, and high-bandwidth local memories.

Both of these computing architectures reduce computation time by exploiting the parallelism inherent in the application. They rely on the fact that multiple parts of the overall problem can be computed relatively independently of each other. Though HPC and RC act on different levels of parallelism, in general, applications with a high degree of parallelism are well suited to these architectures.

The idea behind HPRC is to provide a computing architecture that takes advantage of both the coarse-grained parallelism exploited by clustered HPC systems and the fine-grained parallelism exploited by RC systems. In theory, more exploited parallelism means more speedup and faster computation times. In reality, factors such as communications bandwidth may prevent performance from improving as much as is desired.

In this paper, we examine one application that contains a very high degree of parallelism. The backprojection image formation algorithm for synthetic aperture radar (SAR) systems is "embarrassingly parallel", meaning that it can be broken down and parallelized on many different levels. For this reason, we chose to implement backprojection on an HPRC machine at the Air Force Research Laboratory in Rome, NY, USA, as part of an SAR processing system. We present an analysis of the algorithm and its inherent parallelism, and we describe the implementation process along with the design decisions that went into the solution.

Our contributions are as follows.

(i) We implement the backprojection algorithm for SAR on an FPGA. Though backprojection has been implemented many times in the past (see Section 3), FPGA implementations of backprojection for SAR are not well represented in the literature.

(ii) We further parallelize this implementation by developing an HPC application that produces large SAR images on a multinode HPRC system.

The rest of this paper is organized as follows. Section 2 provides some background information on the backprojection algorithm and the HPRC system on which we implemented it. In Section 3, we discuss related research. Section 4 describes the backprojection implementation and how it fits into the overall backprojection application. The performance data and results of our design experiments are analyzed in Section 5. Finally, Section 6 draws conclusions and suggests future directions for research.

Readers who are interested in more detail about this work are directed to the master's thesis on which it is based [1].

2. Background

This section provides supporting information that is useful to understanding the application presented in Section 4. Section 2.1 describes backprojection and SAR, highlighting the mathematical function that we implemented in hardware. Section 2.2 presents details about the HPRC system that hosts our application.

2.1. Backprojection Algorithm. We briefly describe the backprojection algorithm in this section. Further details on the radar processing and signal processing aspects of this process can be found in [2, 3].

Backprojection is an image reconstruction algorithm that is used in a number of applications, including medical imaging (computed axial tomography, or CAT) and synthetic aperture radar (SAR). The implementation we describe is used in an SAR application. For both radar processing and medical imaging applications, backprojection provides a method for reconstructing an image from the data that are collected by the transceiver.

SAR data are essentially a series of time-indexed radar reflections observed by a single transceiver. At each step along the synthetic aperture, or flight path, a pulse is emitted from the source. This pulse reflects off elements in the scene to varying degrees, and is received by the transceiver. The observed response to a radar pulse is known as a "trace".

SAR data can be collected in one of two modes, "strip-map" or "spotlight". These modes describe the motion of the radar relative to the area being imaged. In the spotlight mode of SAR, the radar circles around the scene. Our application implements the strip-map mode of SAR, in which the radar travels along a straight and level path.

Regardless of mode, given a known speed at which the radar pulse travels, the information from the series of time-indexed reflections can be used to identify points of high reflectivity in the target area. By processing multiple traces instead of just one, a larger radar aperture is synthesized and thus a higher-resolution image can be formed.

The backprojection image formation algorithm has two parts. First, the radar traces are filtered according to a linear time-invariant system. This filter accounts for the fact that the airplane on which the radar dish is situated does not fly in a perfectly level and perfectly straight path. Second, after filtering, the traces are "smeared" across an image plane along contours that are defined by the SAR mode; in our case, the flight path of the plane carrying the radar. Coherently summing each of the projected images provides the final reconstructed version of the scene.

Backprojection is a highly effective method of processing SAR images. It is computationally complex, much like traditional Fourier-based image formation techniques. However, backprojection contains a high degree of parallelism, which makes it suitable for implementation on reconfigurable devices.

The operation of backprojection takes the form of a mapping from projection data p(t, u) to an image f(x, y). A single pixel of the image f corresponds to an area of ground containing some number of objects which reflect radar to a certain degree. Mathematically, this relationship is written as

    f(x, y) = \int p(i(x, y, u), u) \, du,                        (1)

where i(x, y, u) is an indexing function indicating, at a given u, those t that play a role in the value of the image at location (x, y). For the case of SAR imaging, the projection data p(t, u) take the form of the filtered radar traces described above. Thus, the variable u corresponds to the slow-time location of the radar, and t is the fast-time index into that projection. Fast-time variables are related to the speed of radar propagation (i.e., the speed of light), while slow-time variables are related to the speed of the airplane carrying the radar. The indexing function i takes the following form for SAR:

    i(x, y, u) = \chi(y \pm x \tan\phi) \cdot \frac{\sqrt{x^2 + (y - u)^2}}{c},    (2)

where c is the speed of light, φ is the beamwidth of the radar, and χ(a, b) is equal to 1 for a ≤ u ≤ b and 0 otherwise. Here x and y describe the two-dimensional offset between the radar and the physical spot on the ground corresponding to a pixel, and can thus be used in a simple distance calculation, as seen in the right-hand side of (2).

In terms of implementation, we work with a discretized form of (1) in which the integral is approximated as a Riemann sum over a finite collection of projections u_k, k ∈ {1, ..., K}, and is evaluated at the centers of image pixels (x_i, y_j), i ∈ {1, ..., N}, j ∈ {1, ..., M}. Because the evaluation of the index function at these discrete points will generally not result in a value of t which is exactly at a sample location, interpolation could be performed to increase accuracy.
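To make the discretization concrete, the following C sketch implements (1) and (2) with nearest-neighbor rounding in place of interpolation. It is illustrative only: the function and variable names are ours, the data are complex floats rather than the fixed-point integers used by the hardware described later, and coordinates are assumed pre-normalized so that one fast-time sample equals one unit of distance (the Δt = 1 normalization described in Section 4.3, with c folded in).

    #include <math.h>

    /* Illustrative complex type; the hardware uses fixed-point integers. */
    typedef struct { float re, im; } cpx;

    /* Nearest-neighbor discretization of f(x,y) = \int p(i(x,y,u),u) du.
     * p:   K projections of T fast-time samples each (p[k*T + t])
     * img: N*M pixels (img[i*M + j]), accumulated in place
     * Coordinates are normalized so one fast-time sample = one unit
     * of distance (delta_t = 1). */
    void backproject(const cpx *p, int K, int T,
                     cpx *img, int N, int M,
                     float du, float tan_phi)
    {
        for (int k = 0; k < K; k++) {          /* slow-time u */
            float u = k * du;
            for (int i = 0; i < N; i++) {      /* range x */
                float x = (float)i;
                for (int j = 0; j < M; j++) {  /* azimuth y */
                    float y = (float)j;
                    /* beamwidth window chi: skip pixels outside the cone */
                    if (fabsf(y - u) > x * tan_phi) continue;
                    /* index function i(x,y,u): round distance to a sample */
                    int t = (int)(sqrtf(x * x + (y - u) * (y - u)) + 0.5f);
                    if (t < T) {
                        img[i * M + j].re += p[k * T + t].re;
                        img[i * M + j].im += p[k * T + t].im;
                    }
                }
            }
        }
    }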

2.2. HPRC Architecture. This project aimed at exploiting the full range of resources available on the heterogeneous high-performance cluster (HHPC) at the Air Force Research Laboratory in Rome, NY, USA [4]. Built by HPTi [5], the HHPC features a Beowulf cluster of 48 heterogeneous computing nodes, where each node consists of a dual 2.2 GHz Xeon PC running Linux and an Annapolis Microsystems WildStar II FPGA board.

The WildStar II features two Virtex-II FPGAs and connects to the host Xeon general-purpose processors (GPPs) via the PCI bus. Each FPGA has access to 6 MB of SRAM, divided into six banks of 1 MB each, and a single 64 MB SDRAM bank. The Annapolis API supports a master-slave paradigm for control and data transfer between the GPPs and the FPGAs. Applications for the FPGA can be designed either through traditional HDL-based design and synthesis tools, as we have done here, or by using Annapolis's CoreFire [6] module-based design suite.

The nodes of the HHPC are linked together in three ways. The PCs are directly connected via gigabit Ethernet as well as through Myrinet MPI cards. The WildStar II boards are also directly connected to each other through a low-voltage differential signaling (LVDS) I/O daughter card, which provides a systolic interface over which each FPGA board may talk to its nearest neighbor in a ring. Communication over Ethernet is supplied by the standard C library under Linux. Communication over Myrinet is achieved with an installation of the MPI message-passing standard, though MPI can also be directed to use Ethernet instead. Communicating through the LVDS interconnect involves writing communication modules for the FPGA manually. In this project, we relied on Myrinet to move data between nodes. This architecture represents perhaps the most direct method for adding reconfigurable resources to a supercomputing cluster. Each node's architecture is similar to that of a single-node reconfigurable computing solution. Networking hardware which interfaces well to the Linux PCs is included to create the cluster network. The ability to communicate between FPGAs is included but remains difficult for the developer to employ. Other HPRC platforms, such as those developed by Cray and SRC, may employ different interconnection methods, programming methods, and communication paradigms.

3. Related Work

Backprojection itself is a well-studied algorithm. Most researchers have focused on implementing backprojection for computed tomography (CT) medical imaging applications; backprojection for synthetic aperture radar (SAR) on FPGAs is not well represented in the literature.

The precursor to this work is that of Coric et al. [7]. Backprojection for CT uses the "spotlight" mode of imaging, in which the sensing array is rotated around the target area. (Contrast this with the "strip-map" mode described in Section 2.1.) Other implementations of backprojection for CT on FPGAs have been published [8].

CT backprojection has also been implemented on several other computing devices, including GPUs [9] and the Cell Broadband Engine [10]. Results are generally similar (within a factor of 2) to those achieved on FPGAs.

Of the implementations of backprojection for SAR, almost none has been designed for FPGAs. Soumekh et al. have published on implementations of SAR in general and backprojection in particular [11], as well as the Soumekh reference book on the subject [2], but they do not examine the use of FPGAs for computation. Some recent work on backprojection for SAR on parallel hardware has come from Halmstad University in Sweden [12, 13]; their publications lay important groundwork but have not been implemented except in software and/or simulation.

Backprojection is not the only application that has been mapped to HPRC platforms, though signal processing is traditionally a strength of RC, and so large and complex signal processing applications like backprojection are common. With the emergence of HPRC, scientific applications are also seeing significant research effort. Among these are such applications as hyperspectral dimensionality reduction [14], molecular dynamics [15, 16], and cellular automata simulations [17].

Another direction of HPRC research has been the development of libraries of small kernels that are useful as building blocks for larger applications. The Vforce framework [18] allows for portable programming of RC systems using a library of kernels. Other developments include libraries of floating-point arithmetic units [19], work on FFTs [20], and linear algebra kernels such as BLAS [21, 22].

Several survey papers [23, 24] address the trends that can be found among the reported results. The transfer of data between GPP and FPGA can significantly impact performance. The ability to determine and control the memory access patterns of the FPGA and the on-board memories is critical. Finally, sacrificing the accuracy of the results in favor of using lighter-weight operations that can be more easily implemented on an FPGA can be an effective way of increasing performance.

4. Experimental Design

In this section, we describe an implementation of the backprojection image formation algorithm on a high-performance reconfigurable computer. Our implementation has been designed to provide high-speed image formation services and support output data distribution via a publish/subscribe [25] methodology. Section 4.1 describes the system on which our implementation runs. Section 4.2 explores the inherent parallelism in backprojection and describes the high-level design decisions that steered the implementation. Section 4.3 describes the portion of the implementation that runs in software, and Section 4.4 describes the hardware.


Figure 1: Block diagram of Swathbuckler system: 2 GHz input data pass through an ADC and a 1:8 time DeMUX (250 MHz × 8) into a front-end FPGA, then through an 8:24 Ethernet switch to processors #1 through #24, which perform filtering and image formation. (Adapted from [26].)

4.1. System Background. In Section 2.2, we described the HHPC system. In this section, we will explore more deeply the aspects of that system that are relevant to our experimental design.

4.1.1. HHPC Features. Several features of the Annapolis WildStar II FPGA boards are directly relevant to the design of our backprojection implementation. In particular, the host-to-FPGA interface, the on-board memory bandwidth, and the available features of the FPGA itself guided our design decisions.

Communication between the host GPP and the WildStar II board is over a PCI bus. The HHPC provides a PCI bus that runs at 66 MHz with 64-bit datawords. The WildStar II on-board PCI interface translates this into a 32-bit interface running at 133 MHz. By implementing the DMA data transfer mode to communicate between the GPP and the FPGA, the on-board PCI interface performs this translation invisibly and without significant loss of performance. A 133 MHz clock is also a good and achievable clock rate for FPGA hardware, so most of the hardware design can be run directly off the PCI interface clock. This simplifies the design since there are fewer clock domains (see Section 4.4.1).
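As a rough check (our arithmetic, ignoring bus protocol overhead), the two interface configurations carry essentially the same theoretical peak bandwidth, which is why the translation costs little:

    66\ \text{MHz} \times 64\ \text{bits} \approx 133\ \text{MHz} \times 32\ \text{bits}
    \approx 4.2\ \text{Gbit/s} \;(\approx 530\ \text{MB/s}).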

The WildStar II board has six on-board SRAM memories (1 MB each) and one SDRAM memory (64 MB). It is beneficial to be able to read one datum and write one datum in the same clock cycle, so we prefer to use multiple SRAMs instead of the single larger SDRAM. The SRAMs run at 50 MHz and feature a 32-bit dataword (plus four parity bits), but they use a DDR interface. The Annapolis controller for the SRAM translates this into a 50 MHz 72-bit interface. Both features are separately important: we will need to cross from the 50 MHz memory clock domain to the 133 MHz PCI clock domain, and we will need to choose the size of our data such that they can be packed into a 72-bit memory word (see Section 4.2.4).

Finally, the Virtex-II 6000 FPGA on the WildStar II has some useful features that we use to our advantage. A large amount of on-chip memory is available in the form of BlockRAMs, which are configurable in width and depth but can hold at most 2 KB of data each. One hundred forty-four of these dual-ported memories are available, each of which can be accessed independently. This makes BlockRAMs a good candidate for storing and accessing input projection data (see Sections 4.2.4 and 4.4.3). BlockRAMs can also be configured as FIFOs and, due to their dual-ported nature, can be used to cross clock domains.

4.1.2. Swathbuckler Project. This project was designed to fit in as part of the Swathbuckler project [26–28], an implementation of synthetic aperture radar created by a joint program between the American, British, Canadian, and Australian defense research project agencies. It encompasses the entire SAR process including the aircraft and radar dish, signal capture and analog-to-digital conversion, filtering, and image formation hardware and software.

Our problem as posed was to increase the processing capabilities of the HHPC by increasing the performance of the portions of the application seen on the right-hand side of Figure 1. Given that a significant amount of work had gone into tuning the performance of the software implementation of the filtering process [26], it remained for us to improve the speed at which images could be formed. According to the project specification, the input data are streamed into the microprocessor main memory. In order to perform image formation on the FPGA, it is then necessary to copy data from the host to the FPGA. Likewise, the output image must be copied from the FPGA memory to the host memory so that it can be made accessible to the publish/subscribe software. These data transfer times are included in our performance measurements (see Section 5).

4.2. Algorithm Analysis. In this section, we dissect the backprojection algorithm with an eye toward implementing it on an HPRC machine. There are many factors that need to be taken into account when designing an HPRC application. First and foremost, an application that does not have a high degree of parallelism is generally not a good candidate. Given a suitable application, we then decide how to divide the problem along the available levels of parallelism in order to determine what part of the application will be executed on each available processor. This includes GPP/FPGA assignment as well as dividing the problem across the multiple nodes of the cluster. For the portions of the application run on the FPGAs, data arrays must be distributed among the accessible memories. Next, we look at some factors to improve the performance of the hardware implementation, namely, data formats and computation strength reduction. We conclude by examining the parameters of the data collection process that affect the computation.

4.2.1. Parallelism Analysis. In any reconfigurable application design, performance gains due to implementation in hardware inevitably come from the ability of reconfigurable hardware (and, indeed, hardware in general) to perform multiple operations at once. Extracting the parallelism in an application is thus critical to a high-performance implementation.

Equation (1) shows the backprojection operation in terms of projection data p(t, u) and an output image f(x, y). That equation may be interpreted to say that for a particular pixel f(x, y), the final value can be found from a summation of contributions from the set of all projections p(t, u) whose corresponding radar pulse covered that ground location. The value of t for a given u is determined by the mapping function i(x, y, u) according to (2). There is a large degree of parallelism inherent in this interpretation.

(1) The contribution from projection p(u) to pixel f(x, y) is not dependent on the contributions from any other projection p(u′), u′ ≠ u, to that same pixel f(x, y).

(2) The contribution from projection p(u) to pixel f(x, y) is not dependent on the contribution from p(u) to any other pixel f(x′, y′), x′ ≠ x, y′ ≠ y.

(3) The final value of a pixel is not dependent on the value of any other pixel in the target image.

It can be said, therefore, that backprojection is an "embarrassingly parallel" application, which is to say that it lacks any data dependencies. Without data dependencies, the opportunity for parallelism is vast, and it is simply a matter of choosing the dimensions along which to divide the computation that best match the system on which the algorithm will be implemented.

4.2.2. Dividing the Problem. There are two ways in which parallel applications are generally divided across the nodes of a cluster.

(1) Split the data. In this case, each node performs the same computation as every other node, but on a subset of data. There may be several different ways that the data can be divided.

(2) Split the computation. In this case, each node performs a portion of the computation on the entire dataset. Intermediate sets of data flow from one node to the next. This method is also known as task-parallel or systolic computing.

While certain supercomputer networks may make the task-parallel model attractive, our work with the HHPC indicates that its architecture is more suited to the data-parallel mode. Since internode communication is accomplished over a many-to-many network (Ethernet or Myrinet), passing data from one node to the next as implied by the task-parallel model will potentially hurt performance. A task-parallel design also implies that a new FPGA design must be created for each FPGA node in the system, greatly increasing design and verification time. Finally, the number of tasks available in this application is relatively small and would not occupy the number of nodes that are available to us.

Figure 2: Division of target image across multiple nodes. (The figure shows the radar flight path along the azimuth/y dimension, the beamwidth 2φ, a subimage spanning [xi, xj] in the range/x dimension, and the corresponding span [ti, tj] of the projection data p(t, u).)

Given that we will create a data-parallel design, there are several axes along which we considered splitting the data. One method involves dividing the input projection data p(t, u) among the nodes along the u dimension. Each node would hold a portion of the projections p(t, [ui, uj]) and calculate that portion's contribution to the final image. However, this implies that each node must hold a copy of the entire target image in memory, and furthermore, that all of the partial target images would need to be added together after processing before the final image could be created. This extra processing step would also require a large amount of data to pass between nodes. In addition, the size of the final image would be limited to that which would fit on a single FPGA board.

Rather than dividing the input data, the preferred method divides the output image f(x, y) into pieces along the range (x) axis (see Figure 2). In theory, this requires that every projection be sent to each node; however, since only a portion of each projection will affect the slice of the final image being computed on a single node, only that portion must be sent to that node. Thus, the amount of input data being sent to each node can be reduced to p([ti, tj], u). We refer to the portion of the final target image being computed on a single node, f([xi, xj], y), as a "subimage".

Figure 2 shows that tj is slightly beyond the time index that corresponds to xj. This is due to the width of the cone-shaped radar beam. The dotted line in the figure shows a single radar pulse taken at slow-time index y = u. The minimum distance to any part of the subimage is at the point (xi, u), which corresponds to fast-time index ti in the projection data. The maximum distance to any part of the subimage, however, is along the outer edge of the cone to the point (xj, u ± w), where w is a factor calculated from the beamwidth angle of the radar and xj. Thus, the fast-time index tj is calculated relative to xj and w instead of simply xj. This also implies that the [ti, tj] range for two adjacent nodes will overlap somewhat, or (equivalently) that some projection data will be sent to more than one node.
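A small sketch of how a node's fast-time window might be derived from its subimage bounds, under the same Δt = 1 normalization as before. The helper name is ours, and the reading w = xj · tan φ is our interpretation of "a factor calculated from the beamwidth angle of the radar and xj".

    #include <math.h>

    /* Hypothetical helper: fast-time sample range [t_i, t_j] needed by
     * a node computing subimage columns x in [x_i, x_j] (normalized so
     * one fast-time sample = one unit of distance). */
    void node_time_range(float x_i, float x_j, float tan_phi,
                         int *t_i, int *t_j)
    {
        /* Nearest point of the slice: directly broadside, at range x_i. */
        *t_i = (int)floorf(x_i);
        /* Farthest point: the corner of the cone, offset w = x_j*tan_phi
         * in azimuth, hence a slant distance beyond x_j itself. */
        float w = x_j * tan_phi;
        *t_j = (int)ceilf(sqrtf(x_j * x_j + w * w));
    }

Adjacent nodes' windows computed this way overlap near their shared boundary, matching the observation above that some projection data must be sent to more than one node.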

Since the final value of a pixel does not depend on the values of the pixels surrounding it, each FPGA needs to hold only the subimage that it is responsible for computing. That portion is not affected by the results on any other FPGA, which means that the postprocessing accumulation stage can be avoided. If a larger target image is desired, subimages can be "stitched" together simply by concatenation.

In contrast to the method where input data are divided along the u dimension, the size of the final target image is not restricted by the amount of memory on a single node, and furthermore, larger images can be processed by adding nodes to the cluster. This is commonly referred to as coarse-grained parallelism, since the problem has been divided into large-scale independent units. Coarse-grained parallelism is directly related to the performance gains that are achieved by adapting the application from a single-node computer to a multinode cluster.

4.2.3. Memory Allocation. The memory devices used to store the input and output data on the FPGA board may now be determined. We need to store two large arrays of information: the target image f(x, y) and the input projection data p(t, u). On the WildStar II board, there are three options: an on-board DRAM, six on-board SRAMs, and a variable number of BlockRAMs, which reside inside the FPGA and can be instantiated as needed. The on-board DRAM has the highest capacity (64 MB) but is the most difficult to use and only has one read/write port. BlockRAMs are the most flexible (two read/write ports and a flexible geometry) and simple to use, but have a small (2 KB) capacity.

For the target image, we would like to be able to both read and write one target pixel per cycle. It is also important that the size of the target image stored on one node be as large as possible, so memories with larger capacity are better. Thus, we will use multiple on-board SRAMs to store the target image. By implementing a two-memory storage system, we can provide two logical ports into the target image array. During any given processing step, one SRAM acts as the source for target pixels, and the other acts as the destination for the newly computed pixel values. When the next set of projections is sent to the FPGA, the roles of the two SRAMs are reversed.

Owing to the 1 MB size of the SRAMs in which we store the target image data, we are able to save 2^19 pixels. We choose to arrange this into a target image that is 1024 pixels in the azimuth dimension and 512 in the range dimension. Using power-of-two dimensions allows us to maximize our use of the SRAM, and keeping the range dimension small allows us to reduce the amount of projection data that must be transferred.

For the projection data, we would like to have many small memories that can each feed one of the projection adder units. BlockRAMs allow us to instantiate multiple small memories in which to hold the projection data; each memory has two available ports, meaning that two adders can be supported in parallel. Each adder reads from one SRAM and writes to another; since we can support two adders, we could potentially use four SRAMs.

4.2.4. Data Formats. Backprojection is generally accomplished in software using a complex (i.e., real and imaginary parts) floating-point format. However, since the result of this application is an image which requires only values from 0 to 255 (i.e., 8-bit integers), the loss of precision inherent in transforming the data to a fixed-point/integer format is negligible. In addition, using an integer data format allows for much simpler functional units.

Given an integer data format, it remains to determine how wide the various datawords should be. We base our decision on the word width of the memories. The SRAM interface provides 72 bits of data per cycle, comprising two physical 32-bit datawords plus four bits of parity each. The BlockRAMs are configurable, but generally provide power-of-two-sized datawords.

Since backprojection is in essence an accumulation operation, it makes sense for the output data (target image pixels) to be wider than the input data (projection samples). This reduces the likelihood of overflow error in the accumulation. We, therefore, use 36-bit complex integers (18-bit real and 18-bit imaginary) for the target image, and 32-bit complex integers for the projection data.

After backprojection, a complex magnitude operator is needed to reduce the 36-bit complex integers to single 18-bit real integers. This operator is implemented in hardware, but the process of scaling data from 18-bit integers to the 8-bit image is left to the software running on the GPP.
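A sketch of the corresponding host-side conversion, under assumed conventions: a saturating float-to-18-bit fixed-point conversion, and one 36-bit complex pixel packed into the low bits of a 64-bit word. The paper does not specify the exact bit layout of the 72-bit SRAM word (which holds two such pixels), so this layout and the function names are illustrative only.

    #include <stdint.h>

    /* Saturating float -> signed 18-bit conversion; the scale factor is
     * chosen by the caller. 18-bit two's-complement range is
     * [-131072, 131071]. */
    static int32_t to_fix18(float v, float scale)
    {
        int32_t x = (int32_t)(v * scale);
        if (x >  131071) x =  131071;
        if (x < -131072) x = -131072;
        return x & 0x3FFFF;            /* keep the low 18 bits */
    }

    /* Pack one 36-bit complex pixel (18-bit real, 18-bit imaginary)
     * into the low 36 bits of a 64-bit word; the real 72-bit SRAM word
     * holds two such pixels. Bit layout is an assumption. */
    static uint64_t pack_pixel(float re, float im, float scale)
    {
        uint64_t r = (uint64_t)to_fix18(re, scale);
        uint64_t i = (uint64_t)to_fix18(im, scale);
        return (i << 18) | r;
    }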

4.2.5. Computation Analysis. The computation to be performed on each node consists of three parts. The summation from (1) and the distance calculation from (2) represent the backprojection work to be done. The complex magnitude operation is similar to the distance calculation.

While adders are simple to replicate in large numbers, the hardware required to perform multiplication and square root is more costly. If we were using floating-point data formats, the number of functional units that could be instantiated would be very small, reducing the parallelism that we can exploit. With integer data types, however, these units are relatively small, fast, and easily pipelined. This allows us to maintain a high clock rate and one-result-per-cycle throughput.

4.2.6. Data Collection Parameters. The conditions under which the projection data are collected affect certain aspects of the backprojection computation. In particular, the spacing between samples in the p(t, u) array and the spacing between pixels in the f(x, y) array imply constant factors that must be accounted for during the distance-to-time index calculation (see Section 4.4.3).

For the input data, Δu indicates the distance (in meters) between samples in the azimuth dimension. This is equivalent to the distance that the plane travels between each outgoing pulse of the radar. Often, due to imperfect flight paths, this value is not regular. The data filtering that occurs prior to backprojection image formation is responsible for correcting for inaccuracies due to the actual flight path, so that a regular spacing can be assumed.

As the reflected radar data are observed by the radar receiver, they are sampled at a particular frequency ω. That frequency translates to a range distance Δt between samples equal to c/2ω, where c is the speed of light. The additional factor of 1/2 accounts for the fact that the radar pulse travels the intervening distance, is reflected, and travels the same distance back. Owing to the fact that the airplane is not flying at ground level, there is an additional angle of elevation that is included to determine a more accurate value for Δt.

For the target image (output data), Δx and Δy simply correspond to the real distance between pixels in the range and azimuth dimensions, respectively. In general, Δx and Δy are not necessarily related to Δu and Δt, and can be chosen at will. In practice, setting Δy = Δu makes the algorithm computation more regular (and thus more easily parallelizable). Likewise, setting Δx = Δt reduces the need for interpolation between samples in the t dimension, since most samples will line up with pixels in the range dimension. Finally, setting Δx = Δy provides for square pixels and an easier-to-read aspect ratio in the output image.
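As a worked form of the sampling relationship above (a sketch of ours; the elevation-angle correction the paper mentions is not spelled out there, so it is omitted here):

    #include <math.h>

    /* Slant-range spacing between fast-time samples for a sampling
     * frequency omega (in Hz). The factor of 1/2 accounts for the
     * out-and-back travel of the pulse. An elevation-angle correction
     * would further adjust this value. */
    static double range_sample_spacing(double omega_hz)
    {
        const double c = 299792458.0;   /* speed of light, m/s */
        return c / (2.0 * omega_hz);
    }

For example, at a 500 MHz sampling rate this gives Δt ≈ 0.3 m per fast-time sample.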

The final important parameter is the minimum range from the radar to the target image, known as Rmin. This is related to the ti parameter, and is used by the software to determine what portion of the projection data is applicable to a particular node.

4.3. Software Design. We now describe the HPRC implementation of backprojection. As with most FPGA-based applications, the work that makes up the application is divided between the host GPP and the FPGA. In this section, we will discuss the work done on the GPP; in Section 4.4, we continue with the hardware implemented on the FPGA.

The main executable running on the GPP begins by using the MPI library to spawn processes on several of the HHPC nodes. Once all MPI jobs have started, the host code configures the FPGA with the current values of the flight parameters from Section 4.2.6. In particular, the values of Δx, Δy, and Rmin (the minimum range) are sent to the FPGA. However, in order to avoid the use of fractional numbers, all of these parameters are normalized such that Δt = 1. This allows the hardware computation to be in terms of fast-time indices in the t domain instead of ground distances.

Next, the radar data are read. In the Swathbuckler system, this input data would be streamed directly into memory and no separate "read" step would be required. Since we are not able to integrate directly with Swathbuckler, our host code reads the data from a file on the shared disk. These data are translated from complex floating-point format to integers. The host code also determines the appropriate range of t that is relevant to the subimage being calculated by this node (see Section 4.2.2).

Figure 3: Block diagram of backprojection hardware unit. (Inside the Virtex-II 6000 FPGA, the on-board PCI interface connects through control and status registers and DMA receive/transmit controllers to the datapath: address generators, projection adders, and a complex magnitude unit, backed by four 1 MB SRAM target memories.)

The host code then loops over the u domain of the projection data. A chunk of the data is sent to the FPGA and processed. The host code waits until the FPGA signals that processing is complete, and then transmits the next chunk of data. When all projection data have been processed, the host code requests that the final target image be sent from the FPGA. The pixels of the target image are scaled, rearranged into an image buffer, and an image file is optionally produced using a library call to the GTK+ library [29].
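This control flow can be summarized in a short host-side sketch. The fpga_* calls below are hypothetical placeholders standing in for the proprietary Annapolis API; they are not the actual function names.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for the Annapolis WildStar II API. */
    extern void fpga_write_reg(int reg, uint32_t value);      /* PIO  */
    extern void fpga_dma_send(const void *buf, size_t bytes); /* DMA  */
    extern void fpga_wait_done(void);            /* poll a status reg */
    extern void fpga_dma_recv(void *buf, size_t bytes);       /* DMA  */

    void form_subimage(const uint32_t *proj, int num_chunks,
                       size_t chunk_bytes,
                       uint32_t *image, size_t img_bytes)
    {
        for (int k = 0; k < num_chunks; k++) {
            /* Send the next group of projections to the FPGA. */
            fpga_dma_send((const uint8_t *)proj + (size_t)k * chunk_bytes,
                          chunk_bytes);
            fpga_wait_done();    /* FPGA signals the step is complete */
        }
        /* Retrieve the accumulated subimage for scaling on the GPP. */
        fpga_dma_recv(image, img_bytes);
    }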

After processing, the target subimages are simply held in the GPP memory. In the Swathbuckler system, subimages are distributed to consumers via a publish/subscribe mechanism, so there is no need to assemble all the subimages into a larger image.

4.3.1. Configuration Parameters. Our backprojection implementation can be configured using several compile-time parameters in both the host code and the VHDL code that describes the hardware. In software, the values of Δx and Δy are set in the header file and compiled in. The value of Rmin is specific to a dataset, so it is read from the file that contains the projection data.

It is also possible to set the dimensions of the subimage (1024 × 512 by default), though the hardware would require significant changes to support this.

The hardware VHDL code allows two parameters to be set at compile time (see Section 4.4.3). N is the number of projection adders in the design, and R is the size of the projection memories (R × 1024 words). Once compiled, the values of these parameters can be read from the FPGA by the host code.

4.4. FPGA Design. The hardware that is instantiated on the FPGA boards runs the backprojection algorithm and computes the values of the pixels in the output image. A block diagram of the design is shown in Figure 3. References to blocks in this figure are printed in monospace.

4.4.1. Clock Domains. In general, using multiple clock domains in a design adds complexity and makes verification significantly more difficult. However, the design of the Annapolis WildStar II board provides for one fixed-rate clock on the PCI interface, and a separate fixed-rate clock on the SRAM memories. This is a common attribute of FPGA-based systems.

To simplify the problem, we run the bulk of our design at the PCI clock rate (133 MHz). Since Annapolis VHDL modules refer to the PCI interface as the "LAD bus", we call this the L-clock domain. Every block in Figure 3, with the exception of the SRAMs themselves and their associated Address Generators, is run from the L-clock.

The SRAMs are run from the memory clock, or M-clock, which is constrained to run at 50 MHz. Between the Target Memories and the Projection Adders, there is some interface logic and a FIFO. This is not shown in Figure 3, but exists to cross the M-clock/L-clock domain boundary.

BlockRAM-based FIFOs, available as modules in the Xilinx CORE Generator [30] library, are used to cross clock domains. Since each of the ports on the dual-ported BlockRAMs is individually clocked, the read and the write can happen in different clock domains. Control signals are automatically synchronized to the appropriate clock; that is, the "full" signal is synchronized to the write clock and the "empty" signal to the read clock. Using FIFOs whenever clock domains must be crossed provides a simple and effective solution.

4.4.2. Control Registers and DMA Input. The Annapolis API, like many FPGA control APIs, allows for communication between the host PC and the FPGA with two methods: "programmed" or memory-mapped I/O (PIO), which is best for reading and writing one or two words of data at a time; and direct memory access (DMA), which is best for transferring large blocks of data.

The host software uses PIO to set control registers on the FPGA. Projection data are placed in a specially allocated memory buffer, and then transmitted to the FPGA via DMA. On the FPGA, the DMA Receive Controller receives the data and arranges them in the BlockRAMs inside the Projection Adders.

4.4.3. Datapath. The main datapath of the backprojection hardware is shown in Figure 4. It consists of five parts: the Target Memory SRAMs that hold the target image, the Distance-To-Time Index Calculator (DIC), the Projection Data BlockRAMs, the Adders that perform the accumulation operation, and the Address Generators that drive all of the memories. These devices all operate in a synchronized fashion, though there are FIFOs in several places to temporally decouple the producers and consumers of data, as indicated in Figure 4 with segmented rectangles.

Figure 4: Block diagram of hardware datapath. (Address generators drive Target Memory A and Target Memory B (1 MB SRAMs) and the Projection Data BlockRAMs (2 KB each); the distance-to-time calculator indexes the projection data, which feed projection adders 1 through N, with FIFOs decoupling the stages.)

Address Generators. There are three data arrays that must be managed in this design: the input target data, the output target data, and the projection data. The pixel indices for the two target data arrays (Target Memory A and Target Memory B in Figure 4) are managed directly by separate address generators. The address generator for the Projection Data BlockRAMs also produces pixel indices; the DIC converts the pixel index into a fast-time index that is used to address the BlockRAMs.

Because a single read/write operation to the SRAMs produces/consumes two pixel values, the address generators for the SRAMs run for half as many cycles as the address generator for the BlockRAMs. However, address generators run in the clock domain relevant to the memory that they are addressing, so n/2 SRAM addresses take slightly longer to generate at 50 MHz than n BlockRAM addresses at 133 MHz.
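Concretely (our arithmetic):

    \frac{n/2}{50\ \text{MHz}} = 10n\ \text{ns}
    \quad\text{versus}\quad
    \frac{n}{133\ \text{MHz}} \approx 7.5n\ \text{ns},

so the SRAM-side address generation takes roughly 1.33 times longer than the BlockRAM side for the same n pixels.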

Because of the use of FIFOs between the memories and the adders, the address generators for Target Memory A and the Projection Data BlockRAMs can run freely. FIFO control signals ensure that an address generator is paused in time to prevent it from overflowing the FIFO. The address generator for Target Memory B is incremented whenever data are available from the output FIFO.

Distance-To-Time Index Calculator. The Distance-To-Time Index Calculator (DIC) implements (2), which comprises two parts. At first glance, each of these parts involves computation that requires a large amount of hardware and/or time to calculate. However, a few simplifying assumptions make this problem easier and reduce the amount of needed hardware.

Rather than implementing a tangent function in hardware, we rely on the fact that the beamwidth φ of the radar is a constant. The host code performs the tan φ function and sends the result to the FPGA, where it is then used to calculate χ(a, b). This value is used both on a coarse-grained level, to narrow the range of pixels which are examined for each processing step, and on a fine-grained level, to determine whether or not a particular pixel is affected by the current projection (see Figure 2).


The right-hand side of (2) is a distance function (√(x² + y²)) and a division. The square root function is executed using an iterative shift-and-subtract algorithm. In hardware, this algorithm is implemented with a pipeline of subtractors. Two multiplication units handle the x² and y² functions. Some additional adders and subtractors are necessary to properly align the input data to the output data according to the data collection parameters discussed in Section 4.2.6. We used pipelined multipliers and division units from the Xilinx CORE Generator library; adders and subtractors are described with VHDL arithmetic operators, allowing the synthesis tools to generate the appropriate hardware.

The distance function and the computation of χ(·) occur in parallel. If the χ(·) function determines that the pixel is outside the affected range, the adder input is forced to zero.

Projection Data BlockRAMs. The output of the DIC is a fast-time index into the p(t, u) array. Each Projection Data BlockRAM holds the data for a particular value of u. The fast-time index t is applied to retrieve a single value of p(t, u) that corresponds to the pixel that was input by the address generator. This value is stored in a FIFO, to be synchronized with the output of the Target Memory A FIFO.

Projection Data memories are configured to hold 2 K datawords by default, which should be sufficient for an image with 1 K range pixels. This number is a compile-time parameter in the VHDL source and can be changed. The resource constraint is the number of available BlockRAMs.

Projection Adder. As the FIFOs from the Projection Data memories and the Target Memory are filled, the Projection Adder reads datawords from both FIFOs, adds them together, and passes them to the next stage in the pipeline (see Figure 4).

The design is configured with eight adder stages, meaning eight projections can be processed in one step. This number is a compile-time parameter in the VHDL source and can be changed. The resource constraint is a combination of the number of available BlockRAMs (because the Projection Data BlockRAMs and FIFOs are duplicated) and the amount of available logic (to implement the DIC).

The number of adder stages implemented directly impacts the performance of our application. By computing the contribution of multiple projections in parallel, we exploit the fine-grained parallelism inherent in the backprojection algorithm. Fine-grained parallelism is directly related to the performance gains achieved by implementing the application in hardware, where many small execution units can be implemented that all run at the same time on different pieces of data.

4.4.4. Complex Magnitude and DMA Output. When all projections have been processed, the final target image data reside in one of the Target Memory SRAMs. The host code then requests that the image data be transferred via DMA to the host memory. This process occurs in three steps.

First, an Address Generator reads the data out of the SRAM in the correct order. Second, the data are converted from complex to real. The Complex Magnitude operator performs this function with a distance calculation (√(re² + im²)). We instantiate another series of multipliers, adders, and subtractors (for the integer square root) to perform this operation. Third, the real-valued pixels are passed to the DMA Transmit Controller, where they are sent from the FPGA to the host memory.

5. Experimental Results

After completing the design, we conducted a series of experiments to determine the performance and accuracy of the hardware. When run on a single node, a detailed profile of the execution time of both the software and hardware programs can be determined, and the effects of reconfigurable hardware design techniques can be studied. Running the same program on multiple nodes shows how well the application scales to take advantage of the processing power available on HPC clusters. In this section, we describe the experiments and analyze the collected results.

5.1. Experimental Setup. Our experiments consist of running programs on the HHPC and measuring the run time of individual components as well as the overall execution time. There are two programs: one which forms images by running backprojection on the GPP (the "software" program), and one which runs it in hardware on an FPGA (the "hardware" program).

We are concerned with two factors: speed of execution and accuracy of solution. We will consider not only the execution time of the backprojection operation by itself, but also the execution time of the entire program, including peripheral operations such as memory management and data scaling. In addition, we examine the full application run time.

In terms of accuracy, the software program computes its results in floating point while the hardware uses integer arithmetic. We will examine the differences between the images produced by these two programs in order to establish the error introduced by the data format conversion.

The ability of our programs to scale across multiple nodes is an additional performance metric. We measure the effectiveness of exploiting coarse-grained parallelism by comparing the relative performance of both the software program and the hardware implementation when run on one node and when run on many nodes.

5.1.1. Software Design. All experiments were conducted on the HHPC system as described in Section 2.2. Nodes on the HHPC run the Linux operating system, RedHat release 7.3, using kernel version 2.4.20.

Both the software program and the calling framework for the hardware implementation are written in C and produce an executable that is started from the Linux shell. The software program executes entirely on the GPP: memory buffers are established, projection data are read from disk, the backprojection algorithm is run, and output data are transformed from complex to real. The final step involves rearranging the output data and scaling it, then writing a PNG image to disk.

The hardware program begins by establishing memory buffers and initializing the FPGA. Projection data are read from disk into the GPP memory. Those data are then transferred to the FPGA, where the backprojection algorithm is run in hardware. Output data are transformed from complex to real on the FPGA, then transferred to the GPP memory. The GPP then executes the same rearrangement and scaling step as the software program.

To control the FPGA, the hardware program uses an API that is provided by Annapolis to interface to the WildStar II boards. The FPGA hardware was written in VHDL and synthesized using version 8.9 of Synplify Pro. Hardware place and route used the Xilinx ISE 9.1i suite of CAD tools.

The final step of both programs, where the target data are written as a PNG image to disk by the GPP, uses the GTK+ [29] library version 2.0.2. For the multinode experiments, version 1.2.1.7b of the MPICH [31] implementation of MPI is used to handle internode startup, communication, and synchronization.

The timing data presented in this section were collected using timing routines that were inserted into the C code. These routines use Linux system calls to display timing information. The performance of the timing routines was determined by running them several times in succession with no code in between. The overhead of the timing routines was shown to be less than 100 microseconds, so timing data are presented as accurate to the millisecond. Unless noted otherwise, applications were run five times and an average (arithmetic mean) was taken to arrive at the presented data.
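One plausible form such a routine could take (a sketch of ours; the paper does not name the exact system call it used, and gettimeofday() is only one candidate on Linux 2.4):

    #include <stdio.h>
    #include <sys/time.h>

    /* Elapsed time between two gettimeofday() samples, in ms. */
    static double elapsed_ms(struct timeval a, struct timeval b)
    {
        return (b.tv_sec - a.tv_sec) * 1000.0 +
               (b.tv_usec - a.tv_usec) / 1000.0;
    }

    /* Usage:
     *   struct timeval t0, t1;
     *   gettimeofday(&t0, NULL);
     *   ... code under test ...
     *   gettimeofday(&t1, NULL);
     *   printf("%.3f ms\n", elapsed_ms(t0, t1));
     */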

We determine accuracy both qualitatively (i.e., by examining the images with the human eye) and quantitatively, by computing the difference in pixel values between the two images.

5.1.2. Test Data. Four sets of data were used to test our programs. The datasets were produced using a MATLAB simulation of SAR taken from the Soumekh book [2]. This MATLAB script allows the parameters of the data collection process to be configured (see Section 4.2.6). When run, it generates the projection data that would be captured by an SAR system imaging that area. A separate C program takes the MATLAB output and converts it to an optimized file that can be read by the backprojection programs.

Each dataset contains four point-source (i.e., 1 × 1 pixel in size) targets that are distributed randomly through the imaged area. The imaged area for each set is of a similar size, but situated at a different distance from the radar. Targets are also assigned a random reflectivity value that indicates how strongly they reflect radar signals.

5.2. Results and Analysis. In general, owing to the high degree of parallelism inherent in the backprojection algorithm, we expect a considerable performance benefit from implementation in hardware even on a single node. For the multinode program, the lack of any need to transfer data between the nodes implies that the performance should scale linearly with the number of nodes.

Table 1: Single-node experimental performance.

    Component               Software   Hardware   Ratio
    Backprojection          76.4 s     351 ms     217:1
    Complex magnitude       73 ms      15 ms      4.9:1
    Form image (software)   39 ms      340 ms     1:8.7
    Total                   76.5 s     706 ms     108:1

Table 2: Single-node backprojection performance by dataset.

    Dataset   Software   Hardware   BP speedup   App speedup
    1         24.5 s     146 ms     167.4        49.8
    2         30.4 s     169 ms     179.5        61.9
    3         47.7 s     268 ms     177.5        75.5
    4         76.5 s     351 ms     217.6        108.4

5.2.1. Single Node Performance. The first experiment involves running backprojection on a single node. This allows us to examine the performance improvement due to fine-grained parallelism, that is, the speedup that can be gained by implementing the algorithm in hardware. For this experiment, we ran the hardware and software programs on all four of the datasets. Table 1 shows the timing breakdown for dataset no. 4; Table 2 shows the overall results for all four datasets. Note that dataset no. 1 is closest to the radar, and dataset no. 4 is furthest away.

In Table 1, Software and Hardware refer to the run time of a particular component; Ratio is the ratio of software time to hardware time, showing the speedup or slowdown of the hardware program. Backprojection is the running of the core algorithm. Complex Magnitude transforms the data from complex to real integers. Finally, Form Image scales the data to the range [0:255] and creates the memory buffer that is used to create the PNG image.

There are a number of significant observations that can be made from the data in Table 1. Most importantly, the process of running the backprojection algorithm is greatly accelerated in hardware, running over 200x faster than our software implementation. It is important to emphasize that this includes the time required to transfer projection data from the host to the FPGA, which is not required by the software program. Many of the applications discussed in Section 3 exhibit only modest performance gains due to the considerable amount of time spent transferring data. Here, the vast degree of fine-grained parallelism in the backprojection algorithm that can be exploited by FPGA hardware allows us to achieve excellent performance compared to a serial software implementation.

The complex magnitude operator also runs about 5x faster in hardware. In this case, the transfer of the output data from the FPGA to the host is overlapped with the computation of the complex magnitude. This commonly used technique allows the data transfer time to be "hidden", preventing it from affecting overall performance.

However, the process of converting the backprojection output into an image buffer that can be converted to a PNG image (Form Image) runs faster when executed as part of the software program. This step is performed in software regardless of where the backprojection algorithm was executed. The difference in run time can be attributed to memory caching. When backprojection occurs in software, the result data lie in the processor cache. When backprojection occurs in hardware, the result data are copied via DMA into the processor main memory, and must be loaded into the cache before the Form Image step can begin.

We do not report the time required to initialize either the hardware or software program, since in the Swathbuckler system it is expected that initialization can be completed before the input data become available.

Table 2 shows the single-node performance of both programs on all four datasets. Note that the reported run times are only the times required by the backprojection operation. Thus, column four, BP Speedup, shows the factor of speedup (software:hardware ratio) for only the backprojection operation. Column five, App Speedup, shows the factor of speedup for the complete application including all of the steps shown in Table 1.

These results show that the computation time of the backprojection algorithm is data dependent. This is directly related to the minimum range of the projection data. According to Figure 2, as the subimage gets further away from the radar, the width of the radar beam is larger. This is reflected in the increased limits of the χ(a, b) term of (2), which are a function of the tangent of the beamwidth φ and the range. A larger range implies more pixels are impacted by each projection, resulting in an increase in processing time. The hardware and software programs scale at approximately the same rate, which is expected since they are processing the same amount of additional data at longer ranges.
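To make the beamwidth dependence concrete, the following Python sketch checks whether a pixel falls inside the beam using the y ± x tan φ limits described here. It is an illustration only, not the paper's implementation; the coordinate conventions and the function name are our assumptions.

```python
import math

def in_beam(y, x, y_center, phi):
    """Return True if a pixel at cross-range y and range x lies within
    the radar beam centered at y_center, i.e., inside y_center +/- x*tan(phi).
    A sketch of the chi(a, b) windowing; phi is the half beamwidth in radians."""
    half_width = x * math.tan(phi)
    return (y_center - half_width) <= y <= (y_center + half_width)

# A larger range x widens the window, so more pixels are touched per
# projection, matching the data-dependent run times in Table 2.
```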

More notable is the increase in application speedup; this can be explained by considering that the remainder of the application is not data dependent and stays relatively constant as the minimum range varies. Therefore, as the range increases and the amount of data to process increases, the backprojection operation takes up a larger percentage of the run time of the entire application. For software, this increase in proportion is negligible (99.5% to 99.8%), but for the hardware, it is quite large (12.6% to 25.0%). As the backprojection operator takes up more of the overall run time, the relative gains from implementing it in hardware become larger, resulting in the increasing speedup numbers seen in the table.

5.2.2. Single Node Accuracy. Qualitatively, the hardware and software images look very similar, with the hardware images perhaps slightly darker near the point source target. This is due to the quantization imposed by using integer data types. As discussed in Section 5.2.1, a longer range implies a wider radar beam.

Table 3: Image accuracy by dataset.

Dataset | Errors (pixels) | Errors (%) | Max error | Max error (%) | Mean error
1       | 4916            | 0.9%       | 18        | 7.0%          | 1.55
2       | 13036           | 2.5%       | 19        | 7.4%          | 1.73
3       | 18706           | 3.6%       | 12        | 4.6%          | 1.56
4       | 29093           | 5.6%       | 16        | 6.2%          | 1.64

The χ(a, b) function from (2) determines how wide the beam is at a given range. When computed in fixed point for the hardware program, y ± x tan φ returns slightly different values than when the software program computes it in floating point. Thus, there is a slight smearing or blurring of the point source. Recall that dataset no. 4 has a longer range than the other datasets; appropriately, the smearing is most notable in that dataset.

Quantitatively, the two images can be compared pixel-by-pixel to determine the differences. For each dataset, Table 3 presents error in terms of the differences between the image produced by the software program and the image produced by the hardware program.

The second column shows the number of pixels that are different between the two images. There are 1024×512 pixels in the image, so the third column shows the percent of overall image pixels that are different. The maximum and arithmetic mean error are shown in the last two columns. Recall that our output images are 256 gray scale PNG files; the magnitude of error is given by err(x, y) = |hw(x, y) − sw(x, y)|.
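As a rough illustration, the pixel-wise comparison behind Table 3 can be computed as in the sketch below. The function name is hypothetical, and whether the mean error is taken over differing pixels or all pixels is an assumption on our part.

```python
import numpy as np

def image_error_stats(hw, sw):
    """Compare two 8-bit grayscale images per err(x, y) = |hw - sw|.
    Returns counts in the spirit of Table 3; `hw` and `sw` are
    equally sized NumPy arrays (1024x512 in the paper)."""
    err = np.abs(hw.astype(np.int16) - sw.astype(np.int16))
    diff = err > 0
    return {
        "errors": int(diff.sum()),                    # differing pixels
        "errors_pct": 100.0 * diff.sum() / err.size,  # share of all pixels
        "max_error": int(err.max()),                  # worst-case difference
        "mean_error": float(err[diff].mean()) if diff.any() else 0.0,
    }
```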

Again, errors can be attributed to the difference in the computed width of the radar beam between the software and hardware programs. For comparison, a version of each program was written that does not include the χ(a, b) function and instead assumes that every projection contributes to every pixel (i.e., an infinite beamwidth). In this case, the images are almost identical; the number of errors drops to 0.1%, and the maximum error is 1. Thus, the error is not due to quantization of processed data; the computation of the radar beamwidth is responsible.

5.2.3. Multinode Performance. The second experiment involves running backprojection on multiple nodes simultaneously, using the MPI library to coordinate. These results show how well the application scales due to coarse-grained parallelism, that is, the speedup that can be gained by dividing a large problem into smaller pieces and running each piece separately. For this experiment, we create an output target image that is 64 times the size of the image created by a single node. Thus, when run on one node, 64 iterations are required; for two nodes, 32 iterations are required, and so on. Table 4 shows the results for a single dataset.

For both the software and hardware programs, five trials were run. For each trial, the time required to run backprojection and form the resulting image on each node was measured, and the maximum time reported. Thus, the overall run time is equal to the run time of the slowest node. The arithmetic mean of the times (in seconds) from the five trials is presented, with standard deviation. The mean run time is compared to the mean run time for one node in order to show the speedup factor.

Table 4: Multinode experimental performance.

      | Software                     | Hardware
Nodes | Mean   | Std dev | Speedup  | Mean | Std dev | Speedup
1     | 1943.2 | 6.15    | 1.0      | 25.0 | 0.01    | 1.0
2     | 983.2  | 10.46   | 2.0      | 13.4 | 0.02    | 1.9
4     | 496.0  | 4.60    | 3.9      | 7.8  | 0.02    | 3.9
8     | 256.5  | 5.85    | 7.6      | 4.0  | 0.06    | 6.0
16    | 128.4  | 1.28    | 15.1     | —    | —       | —


Results are not presented for a 16-node trial of the hardware program. During our testing, it was not possible to find 16 nodes of the HHPC that were all capable of running the hardware program at once. This was due to hardware errors on some nodes, and inconsistent system software installations on others.

The mostly linear distribution of the data in Table 4 shows that for the backprojection application, we have achieved a nearly ideal parallelization. This can be attributed to the lack of data passing between nodes, combined with an insignificant amount of overhead involved in running the application in parallel with MPI. The hardware program shows a similar curve, except for N = 8 nodes, where the speedup drops off slightly. At run times under five seconds, the MPI overhead involved in synchronizing the nodes between each processing iteration becomes significant, resulting in a slight slowdown (6x speedup compared to the ideal 8x).
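The coarse-grained division of work is simple enough to sketch: each node processes an independent slice of the 64 subimage iterations, with a synchronization between iterations. The following mpi4py sketch is our illustration of that structure, not the authors' code; the kernel function is a hypothetical placeholder.

```python
from mpi4py import MPI  # run under an MPI launcher, e.g. mpirun -n 8

def run_backprojection_iteration(it):
    """Hypothetical placeholder for forming one subimage."""
    pass

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

TOTAL_ITERATIONS = 64  # target image is 64x the single-node image

# No projection data is exchanged between nodes, so each node's share of
# iterations runs independently and speedup stays nearly linear. Assumes
# nprocs divides 64 so every node reaches the same number of barriers.
for it in range(rank, TOTAL_ITERATIONS, nprocs):
    run_backprojection_iteration(it)
    comm.Barrier()  # per-iteration synchronization; its cost shows at N = 8
```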

The speedup provided by the hardware program is further described in Table 5. Compared to one node running the hardware program, we have already seen the nearly linear speedup. Compared to an equal number of nodes running the software program, the hardware consistently performs around 75x faster. Again, for N = 8, there is a slight drop off in speedup owing to the MPI overhead for short run times. Finally, we show that when compared to a single node running the software program, the combination of fine- and coarse-grained parallelism results in a very large performance gain.

6. Discussion

The results from Section 5.2.3 show that excellent speedup can be achieved by implementing the backprojection algorithm on an HPRC machine. As HPRC architectures improve and more applications are developed for them, designers will continue to search for ways to carve out more and more performance. Based on the lessons learned in Section 3 and our work on backprojection, in this section we suggest some directions for future research.

6.1. Future Backprojection Work. This project was developed with an eye toward implementation as a part of the Swathbuckler SAR system (see Section 4.1.2).

Table 5: Speedup factors for hardware program.

      | Ratio compared to
Nodes | 1 hardware | N software | 1 software
1     | 1.0        | 77.8       | 77.8
2     | 1.9        | 75.8       | 149.8
4     | 3.9        | 76.5       | 299.8
8     | 6.0        | 61.1       | 463.0

Owing to the classified nature of that project, additional work beyond the scope of this project is required to integrate our backprojection implementation into that project. To determine the success of this aspect of the project, we would need to compare backprojection to the current Swathbuckler image formation algorithm, both in terms of run time as well as image quality.

Despite excellent speedup results, there are further avenues for improvement of our hardware. The Wildstar II boards feature two identical FPGAs, so it may be possible to process two images at once. If the data transfer to one FPGA can be overlapped with computation on the other, significant speedup is possible. It may also be possible for each FPGA to create a larger target image using more of the on-board SRAMs.

An interesting study could be performed by porting backprojection to several other HPRC systems, some of which can be targeted by high-level design languages. This would be the first step toward developing a benchmark suite for testing HPRC systems; however, without significant tool support for application portability between HPRC platforms, this process would be daunting.

6.2. HPRC Systems and Applications. One common theme among the FPGA applications mentioned in Section 3 is data transfer. Applications that require a large amount of data to be moved between the host and the FPGA can eliminate most of the gains provided by increased parallelism. Backprojection does not suffer from this problem because the amount of parallelism exploited is so high that the data transfer is a relatively small portion of the run time, and some of the data transfers can be overlapped with computation. These are common and well-known techniques in HPRC application design.


This leads us to two conclusions. First, when considering porting an application to an HPRC system, it is important to consider whether the amount of available parallelism is sufficient to provide good speedup. Tools that can analyze an application to aid designers in making this decision are not generally available.

Second, it is crucial for the speed of the data transfers to be as high as possible. Early HPRC systems such as the HHPC use common bus architectures like PCI, which do not provide very high bandwidth. This limits the effectiveness of many applications. More recent systems such as the SRC-7 have included significantly higher bandwidth interconnect, leading to improved data transfer performance and increasing the number of applications that can be successfully ported. Designers of future HPRC systems must continue to focus on ways to improve the speed of these data transfers.

It is also noteworthy that the backprojection application presented here was developed using hand-coded VHDL, with some functional units from the Xilinx CoreGen library [30]. Writing applications in an HDL provides the highest amount of flexibility and customization, which generally implies the highest amount of exploited parallelism. However, HDL development time tends to be prohibitively high. Recent research has focused on creating programming languages and tools that can be used to increase programmer productivity, but applications developed with these tools have not provided speedups comparable to those of hand-coded HDL applications. The HPRC community would benefit from the continued improvement of development tools such as these.

Finally, each HPRC system has its own programming method that is generally incompatible with other systems. The standardization of programming interfaces would make the development of design tools easier, and would also increase application developer productivity when moving from one machine to the next. Alternately, tools to support portability of HPRC applications such as the VForce project [18] would also help HPRC developers.

7. Conclusions

In conclusion, we have shown that backprojection is an excellent choice for porting to an HPRC system. Through the design and implementation of this application, we have explored the benefits and difficulties of HPRC systems in general, and identified several important features of both these systems and applications that are candidates for porting. Backprojection is an example of the class of problems that demand larger amounts of computational resources than can be provided by desktop or single-node computers. As HPRC systems and tools mature, they will continue to help in meeting this demand and making new categories of problems tractable.

Acknowledgments

This work was supported in part by the Center for Subsurface Sensing and Imaging Systems (CenSSIS) under the Engineering Research Centers Program of the National Science Foundation (Award no. EEC-9986821) and by the DOD High Performance Computer Modernization Program. The authors would also like to thank Dr. Richard Linderman and Prof. Eric Miller for their input to this project as well as Xilinx and Synplicity Corporations for their generous donations.

References

[1] B. Cordes, Parallel backprojection: a case study in high-performance reconfigurable computing, M.S. thesis, Department of Electrical and Computer Engineering, Northeastern University, Boston, Mass, USA, 2008.

[2] M. Soumekh, Synthetic Aperture Radar Signal Processing with MATLAB Algorithms, John Wiley & Sons, New York, NY, USA, 1999.

[3] A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging, IEEE Press, New York, NY, USA, 1988.

[4] V. W. Ross, “Heterogeneous high performance computer,” in Proceedings of the High Performance Computing Modernization Program Users Group Conference (HPCMP ’05), pp. 304–307, Nashville, Tenn, USA, June 2005.

[5] High Performance Technologies Inc., Cluster Computing, January 2008, http://www.hpti.com/.

[6] Annapolis Microsystems, Inc., CoreFire FPGA Design Suite, January 2008, http://www.annapmicro.com/corefire.html.

[7] S. Coric, M. Leeser, E. Miller, and M. Trepanier, “Parallel-beam backprojection: an FPGA implementation optimized for medical imaging,” in Proceedings of the 10th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’02), pp. 217–226, Monterey, Calif, USA, February 2002.

[8] N. Gac, S. Mancini, and M. Desvignes, “Hardware/software 2D-3D backprojection on a SoPC platform,” in Proceedings of the ACM Symposium on Applied Computing (SAC ’06), pp. 222–228, Dijon, France, April 2006.

[9] X. Xue, A. Cheryauka, and D. Tubbs, “Acceleration of fluoro-CT reconstruction for a mobile C-arm on GPU and FPGA hardware: a simulation study,” in Medical Imaging 2006: Physics of Medical Imaging, M. J. Flynn and J. Hsieh, Eds., vol. 6142 of Proceedings of SPIE, pp. 1494–1501, San Diego, Calif, USA, February 2006.

[10] O. Bockenbach, M. Knaup, and M. Kachelrieß, “Implementation of a cone-beam backprojection algorithm on the cell broadband engine processor,” in Medical Imaging 2007: Physics of Medical Imaging, vol. 6510 of Proceedings of SPIE, pp. 1–10, San Diego, Calif, USA, February 2007.

[11] L. Nguyen, M. Ressler, D. Wong, and M. Soumekh, “Enhancement of backprojection SAR imagery using digital spotlighting preprocessing,” in Proceedings of the IEEE Radar Conference, pp. 53–58, Philadelphia, Pa, USA, April 2004.

[12] A. Hast and L. Johansson, Fast factorized back-projection in an FPGA, M.S. thesis, Halmstad University, Halmstad, Sweden, 2006, http://hdl.handle.net/2082/576.

[13] A. Ahlander, H. Hellsten, K. Lind, J. Lindgren, and B. Svensson, “Architectural challenges in memory-intensive, real-time image forming,” in Proceedings of the 36th International Conference on Parallel Processing (ICPP ’07), p. 35, Xian, China, September 2007.

[14] E. El-Ghazawi, E. El-Araby, A. Agarwal, J. LeMoigne, and K. Gaj, “Wavelet spectral dimension reduction of hyperspectral imagery on a reconfigurable computer,” in Proceedings of the International Conference on Military and Aerospace Programmable Logic Devices (MAPLD ’04), Washington, DC, USA, September 2004.

[15] S. R. Alam, P. K. Agarwal, M. C. Smith, J. S. Vetter, and D. Caliga, “Using FPGA devices to accelerate biomolecular simulations,” Computer, vol. 40, no. 3, pp. 66–73, 2007.

[16] J. S. Meredith, S. R. Alam, and J. S. Vetter, “Analysis of a computational biology simulation technique on emerging processing architectures,” in Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS ’07), pp. 1–8, Long Beach, Calif, USA, March 2007.

[17] J. L. Tripp, M. B. Gokhale, and A. A. Hansson, “A case study of hardware/software partitioning of traffic simulation on the Cray XD1,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 1, pp. 66–74, 2008.

[18] N. Moore, A. Conti, M. Leeser, and L. S. King, “VForce: an extensible framework for reconfigurable supercomputing,” Computer, vol. 40, no. 3, pp. 39–49, 2007.

[19] X. Wang, S. Braganza, and M. Leeser, “Advanced components in the variable precision floating-point library,” in Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM ’06), pp. 249–258, Napa, Calif, USA, April 2006.

[20] K. D. Underwood, K. S. Hemmert, and C. Ulmer, “Architectures and APIs: assessing requirements for delivering FPGA performance to applications,” in Proceedings of the ACM/IEEE Conference on Supercomputing (SC ’06), Tampa, Fla, USA, November 2006.

[21] M. C. Smith, J. S. Vetter, and S. R. Alam, “Scientific computing beyond CPUs: FPGA implementations of common scientific kernels,” in Proceedings of the 8th International Conference on Military and Aerospace Programmable Logic Devices (MAPLD ’05), Washington, DC, USA, September 2005.

[22] L. Zhuo and V. K. Prasanna, “High performance linear algebra operations on reconfigurable systems,” in Proceedings of the ACM/IEEE Conference on Supercomputing (SC ’05), p. 2, IEEE Computer Society, Seattle, Wash, USA, November 2005.

[23] M. Gokhale, C. Rickett, J. L. Tripp, C. Hsu, and R. Scrofano, “Promises and pitfalls of reconfigurable supercomputing,” in Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA ’06), pp. 11–20, Las Vegas, Nev, USA, June 2006.

[24] M. C. Herbordt, T. VanCourt, Y. Gu, et al., “Achieving high performance with FPGA-based computing,” Computer, vol. 40, no. 3, pp. 50–57, 2007.

[25] P. Th. Eugster, P. A. Felber, R. Guerraoui, and A.-M. Kermarrec, “The many faces of publish/subscribe,” ACM Computing Surveys, vol. 35, no. 2, pp. 114–131, 2003.

[26] S. Rouse, D. Bosworth, and A. Jackson, “Swathbuckler wide area SAR processing front end,” in Proceedings of the IEEE Radar Conference, pp. 1–6, New York, NY, USA, April 2006.

[27] R. W. Linderman, “Swathbuckler: wide swath SAR system architecture,” in Proceedings of the IEEE Radar Conference, pp. 465–470, Verona, NY, USA, April 2006.

[28] S. Tucker, R. Vienneau, J. Corner, and R. W. Linderman, “Swathbuckler: HPC processing and information exploitation,” in Proceedings of the IEEE Radar Conference, pp. 710–717, New York, NY, USA, April 2006.

[29] GTK+ Project, March 2008, http://www.gtk.org/.

[30] Xilinx, Inc., CORE Generator, March 2008, http://www.xilinx.com/products/design_tools/logic_design/design_entry/coregenerator.htm.

[31] Argonne National Laboratories, MPICH, March 2008, http://www.mcs.anl.gov/research/projects/mpich2/.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 258921, 11 pages
doi:10.1155/2009/258921

Research Article

Performance Analysis of Bit-Width Reduced Floating-Point Arithmetic Units in FPGAs: A Case Study of Neural Network-Based Face Detector

Yongsoon Lee,1 Younhee Choi,1 Seok-Bum Ko,1 and Moon Ho Lee2

1 Electrical and Computer Engineering Department, University of Saskatchewan, Saskatoon, SK, Canada S7N 5A9
2 Institute of Information and Communication, Chonbuk National University, Jeonju, South Korea

Correspondence should be addressed to Seok-Bum Ko, [email protected]

Received 4 July 2008; Revised 16 February 2009; Accepted 31 March 2009

Recommended by Miriam Leeser

This paper implements a field programmable gate array- (FPGA-) based face detector using a neural network (NN) and the bit-width reduced floating-point arithmetic unit (FPU). The analytical error model, using the maximum relative representation error (MRRE) and the average relative representation error (ARRE), is developed to obtain the maximum and average output errors for the bit-width reduced FPUs. After the development of the analytical error model, the bit-width reduced FPUs and an NN are designed using MATLAB and VHDL. Finally, the analytical (MATLAB) results, along with the experimental (VHDL) results, are compared. The analytical results and the experimental results show conformity of shape. We demonstrate that incremented reductions in the number of bits used can produce significant cost reductions including area, speed, and power.

Copyright © 2009 Yongsoon Lee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Neural networks have been studied and applied in various fields requiring learning, classification, fault tolerance, and associative memory since the 1950s. Neural networks are frequently used to model complicated problems that are difficult to describe with equations by analytical methods. Applications include pattern recognition and function approximation [1]. The most popular neural network is the multilayer perceptron (MLP) trained using the error back propagation (BP) algorithm [2]. Because of the slow training in MLP-BP, however, it is necessary to speed up the training time. A very attractive solution is to implement it on field programmable gate arrays (FPGAs).

For implementing MLP-BP, each processing element must perform multiplication and addition. Another important calculation is an activation function, which is used to calculate the output of the neural network. One of the most important considerations for implementing a neural network on FPGAs is the arithmetic representation format. It is known that floating-point (FP) formats are more area efficient than fixed-point ones to implement artificial neural networks with the combination of addition and multiplication on FPGAs [3].

The main advantage of the FP format is its wide range. The wide range is good for neural network systems because the system requires a large range when the learning weights are calculated or changed [4]. Another advantage of the FP format is the ease of use. A personal computer uses the floating-point format for its arithmetic calculation. If the target application uses the FP format, the effort of converting to another arithmetic format is not necessary.

FP hardware offers a wide dynamic range and high computation precision, but it occupies large fractions of total chip area and energy consumption. Therefore, its usage is very limited. Many embedded microprocessors do not even include a floating-point unit (FPU) due to its unacceptable hardware cost.

A bit-width reduced FPU solves this complexity problem [5, 6]. An FP bit-width reduction can provide a significant saving of hardware resources such as area and power. It is useful to understand the loss in accuracy and the reduction in costs as the number of bits in an implementation of floating-point representation is reduced. Incremented reductions in the number of bits used can produce useful cost reductions. In order to determine the required number of bits in the bit-width reduced FPU, analysis of the error caused by reduced precision is essential. Precision reduced error analysis for neural network implementations was introduced in [7]. A formula that estimates the standard deviation of the output differences of fixed-point and floating-point networks was developed in [8]. Previous error analyses are useful to estimate possible errors. However, it is necessary to know the maximum and average possible errors caused by a reduced-precision FPU for a practical implementation.

Therefore, in this paper, the error model is developed using the maximum relative representation error (MRRE) and average relative representation error (ARRE), which are representative indices to examine the FPU accuracy.

After the error model for the reduced precision FPU is developed, the bit-width reduced FPUs and the neural network for face detection are designed using MATLAB and Very high speed integrated circuit Hardware Description Language (VHDL). Finally, the analytical (MATLAB) results are compared with the experimental (VHDL) results.

Detecting a face in an image means to find its position in the image plane and its size. There has been extensive research in the field, mostly in the software domain [9, 10]. There have been a few studies of hardware face detector implementations on FPGAs [11, 12], but most of the proposed solutions are not very compact and the implementations are not purely on hardware. In our previous work, an FPGA-based stand-alone face detector to support a face recognition system was suggested and showed that an embedded system could be made [13].

Our central contribution here is to examine how a neural network-based face detector can employ the minimal number of bits in an FPU to reduce hardware resources, yet maintain a face detector's overall accuracy.

This paper is outlined as follows. In Section 2, the FPGA implementation of the neural network face detector using the bit-width reduced FPUs is described. Section 3 explains how representation errors theoretically affect a detection rate in order to determine the required number of bits for the bit-width reduced FPUs. In Section 4, the experimental results are presented, and then they are compared to the analytical results to verify if both results match closely. Section 5 draws conclusions.

2. A Neural Network-Based Face Detector Using a Bit-Width Reduced FPU in an FPGA

2.1. General Review on MLP. A neural network model can be categorized into two types: single layer perceptron and multilayer perceptron (MLP). A single layer perceptron has only two layers: the input layer and the output layer. Each layer contains a certain number of neurons. The MLP is a neural network model that contains multiple layers, typically three or more layers including one or more hidden layers. The MLP is a representative method of supervised learning.

Each neuron in one layer receives its input from the neurons in the previous layer and broadcasts its output to the neurons in the next layer.

Figure 1: A two-layer MLP architecture (400 input nodes; 300 hidden nodes with activation function F; weights between the layers; output y1).

Every processing node in one particular layer is usually connected to every node in the previous layer and the next layer. The connections carry weights, and the weights are adjusted during training. The operation of the network consists of two stages: forward pass and backward pass or back-propagation. In the forward pass, an input pattern vector is presented to the network and the output of the input layer nodes is precisely the components of the input pattern. For successive layers, the input to each node is then the sum of the products of the incoming vector components with their respective weights.

The input to a node j is given simply by

\[ \mathrm{input}_j = \sum_i w_{ji}\,\mathrm{out}_i, \tag{1} \]

where \(w_{ji}\) is the weight connecting node i to node j and \(\mathrm{out}_i\) is the output from node i.

The output of a node j is simply

\[ \mathrm{out}_j = f(\mathrm{input}_j), \tag{2} \]

which is then sent to all nodes in the following layer. This continues through all the layers of the network until the output layer is reached and the output vector is computed. The input layer nodes do not perform any of the above calculations. They simply take the corresponding value from the input pattern vector. The function f denotes the activation function of each node, and it will be discussed in the following section.

It is known that 3 layers having 2 hidden layers are better than 2 layers to approximate any given function [14]. However, a 2-layer MLP is used in this paper, as shown in Figure 1. The output error equation of the first layer (15) and the error equation of the second layer (21) are different. However, the error equation of the second layer (21) and the error equation of the other layers (22) are of the same form.


Figure 2: Estimation (5), f(x) = 0.75x, of the activation function (3), f(x) = 2/(1 + e^{−2x}) − 1, plotted over x ∈ [−3, 3].

Figure 3: Block diagram of the neural network in an FPGA: a neural network top module containing a multiplication-and-accumulation (MAC) unit, built from FPU multiplication (using the FPGA's hard multiplier IP) and FPU addition (modified from the LEON processor FPU).

Therefore, a 2-layer MLP is enough to be examined in this paper. 400 and 300 neurons were used for the input and first layers, respectively, in this experiment.

After the face data enters the input node, it is calculated by the multiplication-and-accumulation (MAC) with weights. Face or non-face data is determined by comparing output results with the thresholds. For example, if the output is larger than the threshold, it is considered as face data. Here, on the FPGA, this decision is easily made by checking a sign bit after subtracting the threshold from the output result.
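The sign-bit test can be mimicked in software as below. This is our illustrative sketch of the decision rule, not the FPGA logic itself, and it treats an exact tie as a face.

```python
import struct

def is_face(output, threshold):
    """Decide face/non-face by inspecting the IEEE 754 sign bit of
    (output - threshold), mirroring the subtract-and-test on the FPGA."""
    diff = output - threshold
    sign_bit = struct.pack('>f', diff)[0] >> 7  # MSB of the float32 encoding
    return sign_bit == 0  # sign bit clear => output >= threshold => face

print(is_face(0.8, 0.5), is_face(0.2, 0.5))  # True False
```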

2.2. Estimation of Activation Function. An activation function is used to calculate the output of the neural network. The learning procedure of the neural network requires the differentiation of the activation function to renew the weights value. Therefore, the activation function has to be differentiable. A sigmoid function, having an “S” shape, is used for the activation function, and a logistic or a hyperbolic tangent function is commonly used as the sigmoid function. The hyperbolic tangent function and its antisymmetric feature were better than the logistic function for learning ability in our experiment. Therefore, the hyperbolic tangent sigmoid transfer function was used, as shown in (3).

Figure 4: Block diagram of the 5-stage pipelined FPU (stages S1–S5: data fetch, pre-normalize, adder, post-normalize, round/normalize; FPU multiplication completes within 2 stages).

Table 1: MRRE and ARRE of five different FPUs.

Unit  | β, e, m* | Range             | MRRE (ulp) | ARRE
FPU32 | 2, 8, 23 | 2^(2^8−1) = 2^255 | 2^−23      | 0.3607 × 2^−23
FPU24 | 2, 6, 17 |                   | 2^−17      | 0.3607 × 2^−17
FPU20 | 2, 6, 13 |                   | 2^−13      | 0.3607 × 2^−13
FPU16 | 2, 6, 9  | 2^(2^6−1) = 2^63  | 2^−9       | 0.3607 × 2^−9
FPU12 | 2, 6, 5  |                   | 2^−5       | 0.3607 × 2^−5

*β: radix, e: exponent bits, m: mantissa bits.

Table 2: Timing results of the neural network-based FPGA face detector by different FPUs.

Bit-width | Max. clock (MHz) | 1/f (ns) | Time**/frame (ms) | Frame rate***
FPU64*    | 8.5              | 117      | 50                | 20
FPU32     | 48               | 21.7     | 8.7               | 114.4
FPU24     | 58 (+21%)        | 17.4     | 7.4               | 135.9
FPU20     | 77 (+60%)        | 13       | 5.5               | 182.1
FPU16     | 80 (+67%)        | 12.5     | 5.3               | 189.8
FPU12     | 85 (+77%)        | 11.7     | 5                 | 201.8

*General PCs use the 64-bit FPU. **Operating time = (1/max. clock) × 423,163 (total cycles). ***Frame rate = 1000/operating time.

The first-order derivative of the hyperbolic tangent sigmoid transfer function can be easily obtained as (4). MATLAB provides the commands “tansig” and “dtansig”:

\[ f(x) = \frac{2}{1 + e^{-2x}} - 1 = \frac{1 - e^{-2x}}{1 + e^{-2x}}, \tag{3} \]

\[ f'(x) = 1 - f(x)^2, \tag{4} \]

where \(x = \mathrm{input}_j\) in (2).

The activation function can be estimated by using different methods. The Taylor and polynomial methods are effective, and guarantee the highest speed and accuracy among these methods.

The polynomial method is used to estimate the activation function in this paper, as seen in (5) and (6), because it is simpler than the Taylor approximation.

A first-degree polynomial estimation of an activation function is

\[ f(x) = 0.75x. \tag{5} \]


Figure 5: Block diagram of the design environment: MATLAB performs learning (preprocessing of face and non-face data, NN learning program, saving weights) and detection; the saved weights and input data feed memory models for the NN detector, which is simulated in MODELSIM and designed/synthesized in Xilinx ISE, followed by performance test and verification.

A first-order derivative is

\[ f'(x) = 0.75. \tag{6} \]

Figure 2 shows the estimation (5) of an activation function (3).
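A quick numeric check of how closely the first-degree polynomial (5) tracks the hyperbolic tangent sigmoid (3) can be done as follows; this comparison is ours, for illustration only.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 5)
tansig = 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0  # equation (3)
poly = 0.75 * x                                # equation (5)
print(np.round(tansig - poly, 3))              # approximation error over [-1, 1]
```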

2.3. FPU Implementation. Figure 3 shows the simplified block diagram of the implemented neural network in an FPGA. The module consists of control logic and an FPU.

The implemented FPU is IEEE 754 Standard [15] compliant. The FPU in this system has two modules: multiplication and addition. Figure 4 shows the block diagram of the 5 stage-pipelined FP addition and multiplication unit implemented in this paper. A commercial IP core, an FP adder of the LEON processor [16], is used and modified to make the bit size adjustable. A bit-width reduced floating-point multiplication unit is designed using a multiplier and a hard intellectual property (IP) core in an FPGA to improve speed. Consequently, the multiplication was performed within 2 cycles of the total stages as shown in Figure 4.

2.4. Implementation of the Neural Network-Based FPGA Face Detector Using MATLAB and VHDL. Figure 5 shows the total design flow using MATLAB and VHDL. The MATLAB program consists of two parts: learning and detection. After the learning procedure, weights data are fixed and saved to a file. The weights file is saved to a memory model file for FPGA and VHDL simulation. MATLAB also provides input test data to the VHDL program and analyzes the result from the result file of the MODELSIM simulation program. Preprocessing includes mask, resizing, and normalization.

The Olivetti face database [17] is chosen for this study. The Olivetti face database consists of mono-color face and non-face images, so it is easy to use.

Figure 6: Data classification result of the neural network (x-axis: input data number; y-axis: neural network output/threshold; face data: #1–#60, non-face data: #61–#220).

Figure 7: Error model for a general neural network (nodes i, j, k, l; data X_i, O_j, O_k, O_l; weights W_ij, W_jk, W_kl; errors ε_j, ε_k, ε_l).


Figure 8: First derivative graph of the activation function, f′(x) = 1 − [2/(1 + e^{−2x}) − 1]^2, shown over narrow and wide ranges of x.

Table 3: Area results of the neural network-based FPGA face detector by different FPUs.

Bit-width | No. of slices | No. of FFs | No. of LUTs
FPU32     | 1077          | 771        | 1952
FPU24     | 878 (−18.5%)  | 637        | 1577
FPU20     | 750 (−30.4%)  | 569        | 1356
FPU16     | 650 (−39.7%)  | 501        | 1167
FPU12     | 556 (−48.4%)  | 433        | 998

Table 4: Area results of 32/24/20/16/12-bit FP adders.

FP adder bit-width | Memory (Kbits) | NN area (slices) | FP adder area (slices)
32                 | 3760           | 1077             | 486
24                 | 2820 (−25%)    | 878              | 403 (−17%)
20                 | 2350 (−37%)    | 750              | 300 (−38%)
16                 | 1880 (−50%)    | 650              | 250 (−49%)
12                 | 1410 (−63%)    | 556              | 173 (−64%)

Some other databases, which are large, in color, or mixed with other pictures, are difficult to use for this error analysis purpose due to the necessity of more preprocessing such as cropping, data classification, and color model conversion.

Figure 6 shows the classification result of 220 face and non-face data. The x-axis shows the data number: face data from 1 to 60 and non-face data from 61 to 220.

Table 5: Power consumption of the neural network-based FPGA face detector by the different FPUs (unit: mW).

Bit-width | CLBs | RAM (width) | Multiplier (block) | I/O | Power*
FPU32     | 2    | 17 (36)     | 9 (5)              | 67  | 306
FPU24     | 2    | 17 (36)     | 7 (4)              | 49  | 286 (−6.5%)
FPU20     | 2    | 17 (36)     | 4 (2)              | 45  | 279 (−8.8%)
FPU16     | 2    | 8 (18)      | 4 (2)              | 36  | 261 (−14.7%)
FPU12     | 1    | 8 (18)      | 4 (2)              | 29  | 253 (−17.3%)

*Total power = subtotal + 211 mW (basic power consumption of the FPGA).

Table 6: Comparison of different FP adder architectures (5 pipeline stages).

Adder type | Slices       | FFs | LUTs | Max. freq. (MHz)
LEON IP    | 486          | 269 | 905  | 71.5
LOP        | 570 (+17%)   | 294 | 1052 | 102 (+42.7%)
2-path     | 1026 (+111%) | 128 | 1988 | 200 (+180%)

Table 7: Specifications of the neural network-based FPGA face detector.

Feature                 | Specification
FPU bit-width           | 32, 24, 20, 16, 12
Frequency               | 48/58/77/80/85 MHz
Slices (Xilinx Spartan) | 1077/878/750/650/556 (FPU32/FPU24/FPU20/FPU16/FPU12)
Arithmetic unit         | IEEE 754 single precision with bit-width reduced FPU
Networks                | 2 layers (400/300/1 nodes)
Input data size         | 20×20 (400-pixel image)
Operating time          | 8.7/7.4/5.5/5.3/5 ms/frame
Frame rate              | 114/136/182/190/201 frames/second

The y-axis shows the output value of the neural network. The neural network is trained to pursue the desired value “1” for face and “−1” for non-face.

3. Error Analysis of the Neural Network Caused by Reduced-Precision FPU

3.1. MRRE and ARRE. The number of bits in the FPU is important for the area, operating speed, and power [13]. Therefore, it is important to decide the minimal number of bits in floating-point hardware to reduce hardware costs, yet maintain an application's overall accuracy. A maximum relative representation error (MRRE) [18] is used as one of the indices of floating-point arithmetic accuracy, as shown in Table 1. Note that “e” and “m” represent exponent and mantissa, respectively. The MRRE can be obtained as follows:

\[ \mathrm{MRRE} = \frac{1}{2} \times \mathrm{ulp} \times \beta, \tag{7} \]

where ulp is a unit in the last position and β is the exponent base.


Table 8: Difference between (3) and (5) in face detection rate (MATLAB).

Threshold  | 0.1   | 0.2   | 0.3   | 0.4   | 0.5   | 0.6   | 0.7   | 0.8   | 0.9   | 1.0
Tansig (3) | 34.09 | 34.55 | 37.27 | 45.91 | 53.64 | 61.36 | 73.09 | 77.73 | 75    | 72.73
Poly (5)   | 35    | 39.09 | 45.91 | 53.64 | 62.73 | 70    | 72.27 | 77.27 | 78.18 | 77.27
Abs diff   | 0.91  | 4.54  | 8.64  | 7.73  | 9.09  | 8.64  | 0.82  | 0.46  | 3.18  | 4.54

Avg. error: 4.9

Table 9: Detection rate of PC software face detector.

Threshold | 0.1    | 0.2   | 0.3    | 0.4    | 0.5   | 0.6    | 0.7   | 0.8    | 0.9    | 1
Face      | 60     | 60    | 60     | 53     | 50    | 43     | 29    | 21     | 17     | 10
Rate      | 100    | 100   | 100    | 88.33  | 83.33 | 71.67  | 48.33 | 35     | 28.33  | 16.67
Nface     | 17     | 26    | 41     | 65     | 88    | 111    | 130   | 149    | 155    | 160
Rate      | 10.625 | 16.25 | 25.625 | 40.625 | 55    | 69.375 | 81.25 | 93.125 | 96.875 | 100
Total     | 35     | 39.09 | 45.91  | 53.64  | 62.73 | 70     | 72.27 | 77.27  | 78.18  | 77.27

An average relative representation error (ARRE) can be considered for practical use:

\[ \mathrm{ARRE} = \frac{\beta - 1}{\ln \beta} \times \frac{1}{4} \times \mathrm{ulp}. \tag{8} \]
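As a worked check of (7) and (8), the sketch below reproduces the MRRE and ARRE values of Table 1; it is our illustration, with the mantissa widths taken from that table.

```python
import math

def mrre(m_bits, beta=2):
    # MRRE = (1/2) * ulp * beta, per (7), with ulp = 2**-m for an m-bit mantissa
    return 0.5 * 2.0 ** -m_bits * beta

def arre(m_bits, beta=2):
    # ARRE = ((beta - 1) / ln(beta)) * (1/4) * ulp, per (8)
    return (beta - 1) / math.log(beta) * 0.25 * 2.0 ** -m_bits

for name, m in [("FPU32", 23), ("FPU24", 17), ("FPU20", 13), ("FPU16", 9), ("FPU12", 5)]:
    print(name, mrre(m), arre(m))  # ARRE/MRRE is about 0.3607, as in Table 1
```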

3.2. Output Error Estimation of the Neural Network. The FPU representation error increases with repetitive multiplication and addition in the neural network. The difference in output can be expressed using the following equations with the notation depicted in Figure 7.

The error of the 1st layer is the difference between the output by a finite precision arithmetic (\(O_j^f\)) and the ideal output (\(O_j\)), and it can be described as

\[ \varepsilon_j = O_j^f - O_j = f\left(\sum_{i=1}^{n} W_{ij}^f X_i^f\right) + \varepsilon_\Phi - f\left(\sum_{i=1}^{n} W_{ij} X_i\right), \tag{9} \]

where \(\varepsilon_j\) represents the hidden layer error (\(\varepsilon_k\) represents the total error generated between the hidden layer and the output layer in Figure 7), W represents the weights, X represents input data, and O represents the output of the hidden layer. \(\varepsilon_\Phi\) is the summation of other possible errors, defined by

\[ \varepsilon_\Phi = \varepsilon_f + \text{Multiplication Error}\ (\varepsilon_*) + \text{Summation Error}\ (\varepsilon_+) + \text{Other Calculation Errors}. \tag{10} \]

\(\varepsilon_f\) is the nonlinear function error by Taylor estimation; \(\varepsilon_f\) is very small and negligible. Therefore, \(\varepsilon_f\) becomes 0.

Other calculation errors occur when the differential of the activation is calculated (i.e., f′(x) = 0.75 × sum), and the final face determination is calculated as follows: f(x) = f′(x) + (−0.5).

The multiplication error, \(\varepsilon_*\), is not considered in this paper. The multiplication unit assigns twice the size of the bits to save the result data. For example, multiplication of 16 bits × 16 bits needs 32 bits. This large register reduces the error, thus the \(\varepsilon_*\) error is negligible.

However, the summation error, \(\varepsilon_+\), is not negligible and is added to the error term \(\varepsilon_\Phi\). The multiplication error (\(\varepsilon_*\)) and the addition error (\(\varepsilon_+\)) are bounded by the MRRE (assuming rounding mode = truncation) as given by (11) and (12):

\[ \varepsilon_* < \left|\text{Multiplication Result} \times (-\mathrm{MRRE})\right|, \tag{11} \]

where the negative sign (−) describes the direction. For example, for the \(\varepsilon_*\) of “4 × 5 = 20”: \(\varepsilon_* = 20 \times (-\mathrm{MRRE})\);

\[ \varepsilon_+ < \left|\text{Addition Result} \times (-\mathrm{MRRE})\right|. \tag{12} \]

For example, for the \(\varepsilon_+\) of “4 + 5 = 9”: \(\varepsilon_+ = 9 \times (-\mathrm{MRRE})\).

Note that the maximum error caused by the truncation rounding scheme is bounded as

\[ \varepsilon_t < \left|x \times \left(-2^{-\mathrm{ulp}}\right)\right| = \left|x \times (-\mathrm{MRRE})\right|. \tag{13} \]

The error caused by the round-to-the-nearest scheme is bounded as

\[ \varepsilon_n < \left|x \times 2^{-\mathrm{ulp}-1}\right| = \left|x \times \tfrac{1}{2} \times (-\mathrm{MRRE})\right|. \tag{14} \]

Truncation rounding creates a negative error and the round-to-nearest scheme creates a positive error. The total error can be reduced by almost 50% by the round-to-nearest scheme [18].
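To see the bounds (13) and (14) numerically, the toy quantizer below keeps a given number of mantissa bits under truncation or round-to-nearest. It is a simplified model of the FP datapath, written by us for illustration only.

```python
import math

def quantize(x, m_bits, mode="truncate"):
    """Keep roughly m_bits of mantissa for x (radix 2), truncating or
    rounding to nearest; a toy model for the bounds (13) and (14)."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (e - m_bits)
    n = x / scale
    n = math.trunc(n) if mode == "truncate" else round(n)
    return n * scale

x = 20.123  # close to the "4 x 5 = 20" example in the text
for m in (9, 5):  # FPU16 and FPU12 mantissa widths
    et = abs(x - quantize(x, m, "truncate"))
    en = abs(x - quantize(x, m, "nearest"))
    print(m, et, en, x * 2.0 ** -m)  # truncation error is roughly twice nearest
```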

From (9), the terms \(W^f\) and \(X^f\) are the weights data and input data, respectively, including the reduced-precision error. They are described by \(W^f = W + \varepsilon_W\) and \(X^f = X + \varepsilon_X\). Therefore, the multiplication of weights and input data is denoted by \(W^f X^f = (W + \varepsilon_W)(X + \varepsilon_X)\).

Equations (16) and (18) are obtained by applying the first-order Taylor's series approximation as given by [7, 8]:

\[ f(x + h) - f(x) = h f'(x). \tag{15} \]

From (9), the error of the first layer, \(\varepsilon_j\), is given by

\[ \varepsilon_j = f\left(\sum_{i=1}^{n} W_{ij} X_i + h_1\right) - f\left(\sum_{i=1}^{n} W_{ij} X_i\right) + \varepsilon_+ = h_1 \times f'\left(\sum_{i=1}^{n} W_{ij} X_i\right) + \varepsilon_+, \tag{16} \]


Table 10: Detection rate of reduced-precision FPUs (VHDL).

Threshold     | 0.1   | 0.2   | 0.3   | 0.4   | 0.5   | 0.6   | 0.7   | 0.8   | 0.9   | 1     | Avg. detection rate error
FPU64 (PC)    | 35    | 39.09 | 45.91 | 53.64 | 62.73 | 70    | 72.27 | 77.27 | 78.18 | 77.27 |
FPU32 NN      | 35    | 39.09 | 45.91 | 53.64 | 62.73 | 70    | 72.27 | 76.82 | 78.18 | 77.27 | 0
FPU24 NN      | 35    | 39.09 | 45.91 | 53.64 | 62.73 | 70    | 72.27 | 76.82 | 78.18 | 77.27 | 0
FPU20 NN      | 35    | 39.09 | 46.36 | 53.64 | 63.18 | 70    | 73.64 | 76.82 | 77.73 | 76.82 | 0.36
FPU18 NN      | 35    | 41.36 | 47.73 | 56.82 | 65.46 | 69.55 | 74.55 | 77.73 | 77.27 | 74.09 | 1.73
FPU16 NN      | 35.91 | 44.55 | 53.18 | 66.36 | 70.46 | 76.36 | 78.18 | 74.55 | 72.73 | 72.73 | 5.91
|FPU64−FPU16| | 0.91  | 5.45  | 7.27  | 12.73 | 7.73  | 6.36  | 5.91  | 2.73  | 5.46  | 4.55  | 5.91

where

\[ h_1 = \sum_{i=1}^{n} \left(\varepsilon_{X_i} W_{ij} + \varepsilon_{W_{ij}} X_i + \varepsilon_{W_{ij}} \varepsilon_{X_i}\right). \tag{17} \]

The error of the second layer can also be found as

\[ \varepsilon_k = O_k^f - O_k = f\left(\sum_{j=1}^{m} O_j^f W_{jk}^f\right) - f\left(\sum_{j=1}^{m} O_j W_{jk}\right) + \varepsilon_+. \tag{18} \]

By replacing \(O_j^f\) and \(W_{jk}^f\) with \((O_j + \varepsilon_j)\) and \((W_{jk} + \varepsilon_{W_{jk}})\), (18) becomes

\[ \varepsilon_k = f\left(\sum_{j=1}^{m} \left(O_j + \varepsilon_j\right)\left(W_{jk} + \varepsilon_{W_{jk}}\right)\right) - f\left(\sum_{j=1}^{m} O_j W_{jk}\right) + \varepsilon_+. \tag{19} \]

Simply,

\[ \varepsilon_k = f\left(\sum_{j=1}^{m} O_j W_{jk} + h_2\right) - f\left(\sum_{j=1}^{m} O_j W_{jk}\right) + \varepsilon_+ \approx h_2 \times f'\left(\sum_{j=1}^{m} O_j W_{jk}\right) + \varepsilon_+, \tag{20} \]

where

\[ h_2 = \sum_{j=1}^{m} \left(\varepsilon_{W_{jk}} O_j + \varepsilon_j W_{jk} + \varepsilon_{W_{jk}} \varepsilon_j\right). \tag{21} \]

Neglecting the second-order term \(\varepsilon_{W_{jk}} \varepsilon_j\),

\[ \varepsilon_k \approx \left(\sum_{j=1}^{m} \left(\varepsilon_{W_{jk}} O_j + \varepsilon_j W_{jk}\right)\right) f'\left(\sum_{j=1}^{m} O_j W_{jk}\right) + \varepsilon_+. \tag{22} \]

The error (22) can be generalized for the lth layer in a similar way:

\[ \varepsilon_l \approx \left(\sum_{k=1}^{o} \left(\varepsilon_{W_{kl}} O_k + \varepsilon_k W_{kl}\right)\right) f'\left(\sum_{k=1}^{o} O_k W_{kl}\right) + \varepsilon_+. \tag{23} \]

3.3. Output Error Estimation by MRRE and ARRE. The error equation can be rewritten using the MRRE in the error term to find the maximum output error caused by reduced precision. The average error can be estimated in the practical application by replacing the MRRE with ARRE (= 0.3607 × MRRE).

From (16), the output error of the first layer is described as

\[ \varepsilon_j \approx \left(\sum_{i=1}^{n} \left(\varepsilon_W X + \varepsilon_X W\right)\right) f'\left(\sum_{i=1}^{n} W_{ij} X_i\right) + \varepsilon_+. \tag{24} \]

The \(\varepsilon_W(\max)\) and \(\varepsilon_X(\max)\) terms can be defined by \(\varepsilon_W(\max) = W \times -\mathrm{MRRE}\) and \(\varepsilon_X(\max) = X \times -\mathrm{MRRE}\). Thus, from (24), the error \(\varepsilon_j\) is bounded so that

\[ \varepsilon_j < \sum_{i=1}^{n} \left[\left(W_{ij} \times -\mathrm{MRRE}\right) \times X_i + \left(X_i \times -\mathrm{MRRE}\right) \times W_{ij}\right] \times f'\left(\sum_{i=1}^{n} W_{ij} X_i\right) + \varepsilon_+ ; \tag{25} \]

\[ \varepsilon_j < \left(-2\,\mathrm{MRRE} \sum_{i=1}^{n} W_{ij} X_i\right) f'\left(\sum_{i=1}^{n} W_{ij} X_i\right) + \varepsilon_+, \tag{26} \]

where

\[ \varepsilon_+ \approx \sum_{i=1}^{n} \left(X_i W_{ij}\right) \times -\mathrm{MRRE}. \tag{27} \]
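A direct transcription of the first-layer bound (26)–(27) into code can serve as a sanity check. The sketch below is ours, treats the signed MRRE terms as magnitudes, and uses a caller-supplied derivative f′.

```python
import numpy as np

def first_layer_error_bound(W, X, mrre, f_prime):
    """Magnitude bound on the first-layer output error per (26) and (27):
    |eps_j| < |2 * MRRE * sum(W*X) * f'(sum(W*X))| + |eps_plus|."""
    s = float(np.dot(W, X))
    eps_plus = abs(s) * mrre                 # summation error bound, (27)
    return 2 * mrre * abs(s) * abs(f_prime(s)) + eps_plus

# Example with the polynomial activation (5), whose derivative is 0.75 per (6):
bound = first_layer_error_bound(np.full(400, 0.01), np.full(400, 0.5),
                                2.0 ** -9, lambda s: 0.75)
print(bound)  # grows with the FPU's MRRE, i.e., with fewer mantissa bits
```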

Finally, the output error of the second layer, \(\varepsilon_k\), is also described from (22) as shown in (28), where the error of the weights can also be written as \(\varepsilon_{W_{jk}}(\max) = \left|W_{jk} \times -\mathrm{MRRE}\right|\):

\[ \varepsilon_k < \sum_{j=1}^{m} \left[\left(W_{jk} \times -\mathrm{MRRE} \times O_j\right) + \left(\varepsilon_j \times W_{jk}\right)\right] \times f'\left(\sum_{j=1}^{m} O_j W_{jk}\right) + \varepsilon_+, \tag{28} \]

where

\[ \varepsilon_+ \approx \sum_{j=1}^{m} \left(O_j W_{jk}\right) \times -\mathrm{MRRE}. \tag{29} \]


Table 11: Results of output error on a neural network-based FPGA face detector.

          | Calculation         | Experiment
Bit-width | MRRE    | ARRE      | max
FPU32     | 4E-05   | 2.89E-05  | 1.93E-05
FPU24     | 0.0026  | 0.0018    | 0.0012
FPU20     | 0.0410  | 0.0296    | 0.0192
FPU18     | 0.1641  | 0.1184    | 0.0766
FPU16     | 0.6560  | 0.4733    | 0.2816
FPU14     | 2.62    | 1.891     | 0.9872
FPU12     | 10.4    | 7.5256    | 1.0741

3.4. Relationship between MRRE and Output Error. In order to observe the relationship between the MRRE and the output error, (28) is rewritten as (30). By using (26),

\[ \varepsilon_k < \left[-\mathrm{MRRE} \times A \times f'\left(\sum_{j=1}^{m} O_j W_{jk}\right) + \varepsilon_+\right], \tag{30} \]

where

\[ A = \sum_{j=1}^{m} \left[\left(W_{jk} \times O_j\right) + 2 \times \left(\sum_{i=1}^{n} W_{ij} X_i\right) \times f'\left(\sum_{i=1}^{n} W_{ij} X_i\right) \times W_{jk}\right]. \tag{31} \]

Some properties are derived from (26) and (30) for the output error. The differential of the summations affects the output error proportionally:

\[ \varepsilon_j \propto f'\left(\sum_{i=1}^{n} W_{ij} X_i\right) \text{ from (24)}, \qquad \varepsilon_k \propto f'\left(\sum_{j=1}^{m} O_j W_{jk}\right) \text{ from (26)}. \tag{32} \]

The output of the well-learned neural network system goes to the desired value as shown in Figure 2. In that case, the differential value goes to 0 as shown in Figure 8. It means the well-learned neural network system has less output error.

One more finding is that the output error is also proportional to the MRRE.

From (30),

\[ \varepsilon_k \propto \mathrm{MRRE}, \tag{33} \]

where \(\mathrm{MRRE} = 2^{-\mathrm{ulp}}\) (assuming “rounding mode = truncation”). Therefore, (33) can be described as

\[ \varepsilon_k \propto 2^{-\mathrm{ulp}}. \tag{34} \]

Finally, it is concluded that an n-bit reduction in the FPU creates 2^n times the error. If one bit is reduced, for example, the output error is doubled (e.g., 2^1 = 2).

Figure 9: Comparison between analytical output errors and experimental output errors (x-axis: FPU bits, 12–32; curves: Calculation (MRRE), Calculation (ARRE), Experiment (max), Experiment (mean)).

Figure 10: Comparison between analytical output errors and experimental output errors on a log2 scale.

After putting the MRRE between FPU32 and the other reduced-precision FPU bit-widths into the error terms in (26) and (28) using MATLAB and real face data, the total accumulated error of the neural network is obtained as shown in Table 11.

4. Result and Discussion

4.1. FPGA Synthesis Results. The FPGA-based face detector using the neural network and the reduced-precision FPU is implemented in this paper. The logic circuits of the neural network-based FPGA face detector are synthesized using the FPGA design tool, Xilinx ISE, on a Spartan-3 XC3S4000 [19]. To verify the error model, first of all, the neural network on a PC is designed using MATLAB. Next, the weights and test-bench data are saved as a file to verify the VHDL code.

After simulation, area and operating speed are obtained by synthesizing the logic circuits. The FPU uses the same calculation method, floating-point arithmetic, as the PC, so it is easy to verify and easy to change the neural network's structure.


4.1.1. Timing. The implemented neural network-based FPGA face detector (FPU16) took only 5.3 milliseconds to process 1 frame at 80 MHz, which is 9 times faster than the 50 milliseconds (i.e., 40 milliseconds for loading time + 10 milliseconds for calculation time) required for a PC (Pentium 4, 1.4 GHz), as shown in Table 2. As the total FPU representation bits decrease, the maximum clock frequency increases considerably, from 21% (FPU24) to 67% (FPU16) compared to FPU32.

The remaining question is to examine if a bit-width reduced FPU can still maintain a face detector's overall accuracy. For this purpose, the detection rate error for bit-width reduced FPUs will be discussed in Section 4.2.2.

4.1.2. Area. As shown in Table 3, only 2% (650/27648) and 4% (1077/27648) of the total available slices (3S4000) are used for FPU16 and FPU32, respectively. Therefore, the stand-alone embedded face detection system including preprocessing, FPU, and other extra functions can be easily implemented on a small inexpensive FPGA.

As the bit-width decreases, the number of slices is decreased from 18.5% (FPU24) to 39.7% (FPU16) compared to FPU32.

Bit reduction of the FP adder leads to an area reduction and a faster operating clock speed. For example, a 50% bit reduction from the 32-bit FP adder to the 16-bit FP adder results in a 49% reduction of the adder area (250/486) and a 50% reduction of the memory (1880/3760), as shown in Table 4. It is possible to use the XILINX FPGA 3S4000, which provides 2160 Kbits of memory (block RAM: 1728 Kb, distributed memory: 432 Kb), when the FPU16 is necessary.

The number of slices of the floating-point adder varies from 31% (FP12: 173/556) to 45% (FP32: 486/1077) of the total size of the neural network, as shown in Table 4.

4.1.3. Power. The results of power consumption are shown in Table 5. The power consumptions are obtained using the Xilinx Web Power Tool [20].

As the bit-width decreases, the power consumption decreases. For example, bit reduction from the FPU32 to the FPU16 reduces the total power by 14.7% (FPU32: 306 mW, FPU16: 261 mW) through RAM, multiplier, and I/O, as shown in Table 5.

Changes in the logic cells do not affect the power as considerably as hardwired IP such as the memory and multiplier do. See the number of configurable logic blocks (CLBs) in Table 5.

4.1.4. Architectures of FP Adder. The neural network system and the FPU hardware performance are greatly affected by the FP addition [21]. The bit-width reduced FP addition is modified for this study from the commercial IP, the LEON processor. The LEON FPU uses a standard adder architecture [16]. The system performance and the clock speed can be further improved by the leading-one-prediction (LOP) algorithm and the 2-path (close-and-far path) algorithm, respectively [18].

In our experiment, FP addition based on the LOP algorithm increases the maximum clock frequency by 42.7% compared to the performance of the commercial IP, LEON. The FP addition based on the 2-path algorithm [18, 22] increases the area by 111%, but improves the maximum clock frequency by 180% compared to the performance of the commercial IP, LEON, as shown in Table 6.

4.1.5. Specification. The specification of the implemented neural network-based FPGA face detector is summarized in Table 7.

4.2. Detection Rate Error. Two factors affect the detection rate error. One is the polynomial estimation error, shown in Figure 2, which occurs when the activation function is estimated through the polynomial equation. The other is the error caused by the bit-width reduced FPU.

4.2.1. Detection Rate Error by Polynomial Estimation. To reduce the error caused by polynomial estimation, the polynomial equation (35) can be more elaborately modified as shown in (36). The problem is that (36) is not differentiable at ±1, and also the error (30) will be identically 0 (i.e., f′(sum) = (±1)′ = 0) for |sum| > 1, which will make error analysis difficult:

\[ f(\mathrm{sum}) = 0.75 \times \mathrm{sum}, \tag{35} \]

\[ f(\mathrm{sum}) = \begin{cases} 0.75 \times \mathrm{sum}, & |\mathrm{sum}| < 1, \\ 1, & \mathrm{sum} \geq 1, \\ -1, & \mathrm{sum} \leq -1. \end{cases} \tag{36} \]

Therefore, the simplified polynomial (35) is used in this paper. It is confirmed through MATLAB simulation that this polynomial approximation results in an average 4.9% error in the detection rate compared with the result of (3) in our experiment, as shown in Table 8.

4.2.2. Detection Rate Error by Reduced-Precision FPU. Table 9 is obtained after the threshold value is changed from 0.1 to 1. When the threshold is 0.5, the face detection rate is 83% and the non-face detection rate is 55%. When the threshold is 0.6, the face and non-face detection rates are almost the same, at 71.67% and 69.4%, respectively. As the threshold value goes to “1” (i.e., as the horizontal red line goes up in Figure 6), the face detection rate decreases. It means an input image is less likely to pass, which is good for security. Therefore, the threshold needs to be chosen accordingly depending upon the application. The result of Table 9 is used in the second column (FPU64 (PC)) of Table 10.

Table 10 shows the detection rate error (i.e., |detection rate of FPU64 (PC software) − detection rate of reduced-precision FPUs|) caused by reduced-precision FPUs. The detection rate is changed from FPU64 (PC) to FPU16 by only 5.91% (i.e., |72.27 − 78.18|).

Table 11 and Figure 9 show the output error (|neural network output on PC − the output of VHDL|). Figure 10 is the log graph (base 2) of Figure 9.

Analytical results are found to be in agreement with simulation results as shown in Figure 10. The analytical MRRE results and the maximum experimental results show conformity of shape.


The analytical ARRE results and the minimum experimental results also show conformity of shape.

As the n bits in the FPU are reduced within the range from 32 bits to 14 bits, the output error is incremented by 2^n times. For example, a 2-bit reduction from FPU16 to FPU14 makes 4 times (2^n with n = 16 − 14 = 2, so 2^2 = 4) the error.
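This 2^n trend can be read directly off Table 11; the quick check below (our own arithmetic, using the calculated MRRE column) shows each 2-bit reduction multiplying the error by roughly four.

```python
# Calculated MRRE output errors from Table 11, keyed by FPU bit-width.
errors = {32: 4e-05, 24: 0.0026, 20: 0.0410, 18: 0.1641, 16: 0.6560, 14: 2.62}

print(errors[14] / errors[16])  # ~4: a 2-bit reduction => 2**2 times the error
print(errors[16] / errors[18])  # ~4 as well
print(errors[18] / errors[20])  # ~4 again
```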

Due to the small number of fraction bits (e.g., 5 bits in FPU12), no meaningful results are obtained under 14 bits. Therefore, at least 14 bits should be employed to achieve an acceptable face detection rate. See Figures 9 and 10.

5. Conclusion

In this paper, the analytical error model was developed using the maximum relative representation error (MRRE) and the average relative representation error (ARRE) to obtain the maximum and average output errors for the bit-width reduced FPUs.

After the development of the analytical error model, the bit-width reduced FPUs and the neural network were designed using MATLAB and VHDL. Finally, the analytical (MATLAB) results were compared with the experimental (VHDL) results.

The analytical results and the experimental results showed conformity of shape. According to both results, as the n bits in the FPU are reduced within the range from 32 bits to 14 bits, the output error is incremented by 2^n times.

The operating speed was significantly improved in the FPGA-based face detector implementation using a reduced-precision FPU. For example, it took only 5.3 milliseconds in the FPU16 to process one frame, which is 9 times faster than the 50 milliseconds (40 milliseconds for loading time + 10 milliseconds for calculation time) of the PC (Pentium 4, 1.4 GHz). It was found that bit reduction from FPU32 to FPU16 reduced the size of the memory and arithmetic units by 50% and the total power consumption by 14.7%, while still maintaining 94.1% face detection accuracy. The developed error analysis for bit-width reduced FPUs will be helpful to determine the specification for an embedded neural network hardware system.

Acknowledgments

The authors would like to acknowledge the Natural Science and Engineering Research Council of Canada (NSERC) / the University of Saskatchewan's Publications Fund, the Korea Research Foundation, and a Korean Federation of Science and Technology Societies grant funded by the South Korean government (MOEHRD, Basic Research Promotion Fund) for supporting this research, and to thank the reviewers for their valuable suggestions.

References

[1] M. Skrbek, “Fast neural network implementation,” Neural Network World, vol. 9, no. 5, pp. 375–391, 1999.

[2] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, MIT Press, Cambridge, Mass, USA, 1986.

[3] X. Li, M. Moussa, and S. Areibi, “Arithmetic formats for implementing artificial neural networks on FPGAs,” Canadian Journal of Electrical and Computer Engineering, vol. 31, no. 1, pp. 30–40, 2006.

[4] H. K. Brown, D. D. Cross, and A. G. Whittaker, “Neural network number systems,” in Proceedings of International Joint Conference on Neural Networks (IJCNN ’90), vol. 3, pp. 903–908, San Diego, Calif, USA, June 1990.

[5] J. Kontro, K. Kalliojarvi, and Y. Neuvo, “Use of short floating-point formats in audio applications,” IEEE Transactions on Consumer Electronics, vol. 38, no. 3, pp. 200–207, 1992.

[6] J. Tong, D. Nagle, and R. Rutenbar, “Reducing power by optimizing the necessary precision/range of floating-point arithmetic,” IEEE Transactions on VLSI Systems, vol. 8, no. 3, pp. 273–286, 2000.

[7] J. L. Holt and J.-N. Hwang, “Finite precision error analysis of neural network hardware implementations,” IEEE Transactions on Computers, vol. 42, no. 3, pp. 281–290, 1993.

[8] S. Sen, W. Robertson, and W. J. Phillips, “The effects of reduced precision bit lengths on feed forward neural networks for speech recognition,” in Proceedings of IEEE International Conference on Neural Networks, vol. 4, pp. 1986–1991, Washington, DC, USA, June 1996.

[9] R. Feraud, O. J. Bernier, J.-E. Viallet, and M. Collobert, “A fast and accurate face detector based on neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 1, pp. 42–53, 2001.

[10] H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23–38, 1998.

[11] T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin, and W. Wolf, “Embedded hardware face detection,” in Proceedings of the 17th IEEE International Conference on VLSI Design, pp. 133–138, Mumbai, India, January 2004.

[12] M. Sadri, N. Shams, M. Rahmaty, et al., “An FPGA based fast face detector,” in Global Signal Processing Expo and Conference (GSPX ’04), Santa Clara, Calif, USA, September 2004.

[13] Y. Lee and S.-B. Ko, “FPGA implementation of a face detector using neural networks,” in Canadian Conference on Electrical and Computer Engineering (CCECE ’07), pp. 1914–1917, Ottawa, Canada, May 2006.

[14] D. Chester, “Why two hidden layers are better than one,” in Proceedings of International Joint Conference on Neural Networks (IJCNN ’90), vol. 1, pp. 265–268, Washington, DC, USA, January 1990.

[15] IEEE Std 754-1985, “IEEE standard for binary floating-point arithmetic,” Standards Committee of the IEEE Computer Society, New York, NY, USA, August 1985.

[16] LEON Processor, http://www.gaisler.com.

[17] Olivetti & Oracle Research Laboratory, The Olivetti & Oracle Research Laboratory Face Database of Faces, http://www.cam-orl.co.uk/facedatabase.html.

[18] I. Koren, Computer Arithmetic Algorithms, A K Peters, Natick, Mass, USA, 2nd edition, 2001.

[19] XILINX, “Spartan-3 FPGA Family Complete Data Sheet,” Product Specification, April 2008.

[20] XILINX Spartan-3 Web Power Tool Version 8.1.01, http://www.xilinx.com/cgi-bin/power_tool/power_Spartan3.

[21] G. Govindu, L. Zhuo, S. Choi, and V. Prasanna, “Analysis of high-performance floating-point arithmetic on FPGAs,” in Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS ’04), pp. 149–156, Santa Fe, NM, USA, April 2004.

[22] A. Malik, Design trade-off analysis of floating-point adder in FPGAs, M.S. thesis, Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatoon, Canada, 2005.

Page 48: FPGA Supercomputing Platfroms, Architecture, And Techniques

Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 382983, 13 pages
doi:10.1155/2009/382983

Research Article

Accelerating Seismic Computations Using Customized Number Representations on FPGAs

Haohuan Fu,1 William Osborne,1 Robert G. Clapp,2 Oskar Mencer,1 and Wayne Luk1

1 Department of Computing, Imperial College London, London SW7 2AZ, UK
2 Department of Geophysics, Stanford University, CA 94305, USA

Correspondence should be addressed to Haohuan Fu, [email protected]

Received 31 July 2008; Accepted 13 November 2008

Recommended by Vinay Sriram

The oil and gas industry has an increasingly large demand for high-performance computation over huge volumes of data. Compared to common processors, field-programmable gate arrays (FPGAs) can boost the computation performance with a streaming computation architecture and support for application-specific number representation. With hardware support for reconfigurable number format and bit width, reduced precision can greatly decrease the area cost and I/O bandwidth of the design, thus multiplying the performance with concurrent processing cores on an FPGA. In this paper, we present a tool to determine the minimum number precision that still provides acceptable accuracy for seismic applications. By using the minimized number format, we implement core algorithms in seismic applications (the FK step in downward continued-based migration and 3D convolution in reverse time migration) on FPGA and show speedups ranging from 5 to 7 by including the transfer time to and from the processors. Provided sufficient bandwidth between CPU and FPGA, we show that a further increase to 48X speedup is possible.

Copyright © 2009 Haohuan Fu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Seismic imaging applications in the oil and gas industry involve terabytes of data collected from the field. For each data sample, the imaging algorithm usually tries to improve the image quality by performing more costly computations. Thus, there is an increasingly large demand for high-performance computation over huge volumes of data. Among all the different kinds of imaging algorithms, downward continued-based migration [1] is the most prevalent high-end imaging technique today, and reverse time migration appears to be one of the dominant imaging techniques of the future.

Compared to conventional microprocessors, FPGAs apply a different streaming computation architecture. Computations we want to perform are mapped into circuit units on the FPGA board. Previous work has already achieved 20X acceleration for prestack Kirchhoff time migration [2] and 40X acceleration for subsurface offset gathers [3].

Besides the capability of performing computations in a parallel way, FPGAs also support application-specific number representations. Since all the processing units and connections on the FPGA are reconfigurable, we can use different number representations, such as fixed-point, floating-point, logarithmic number system (LNS), residue number system (RNS), and so forth, with different bit-width settings. Different number representations lead to different complexities of the arithmetic units, and thus different costs and performances of the resulting circuit design [4]. Switching to a number representation that fits a given application better can sometimes greatly improve the performance or reduce the cost.

A simple case of switching number representations is to trade off the precision of the number representation against the speed of the computation. For example, by reducing the precision from 32-bit floating-point to 16-bit fixed-point, the number of arithmetic units that fit into the same area can be increased by scores of times, and the performance of the application improves significantly. Meanwhile, we also need to watch for possible degradation of accuracy in the computation results: we need to check whether the accelerated computation using reduced precision still generates meaningful results.

To solve the above problem in the seismic application domain, we develop a tool that performs an automated precision exploration of different number formats, and figures out the minimum precision that can still generate good enough seismic results.


Table 1: Fixed-point number format.

Integer part: m bits           Fractional part: f bits
x_{m-1} x_{m-2} ... x_0        x_{-1} ... x_{-f+1} x_{-f}

Table 2: Floating-point number format.

Sign: 1 bit    Exponent: m bits    Mantissa: f bits
S              M                   F

By using the minimized number format, we implement core algorithms in seismic applications (the complex exponential step in downward continued-based migration and 3D convolution in reverse time migration) on FPGA and show speedups ranging from 5 to 7 by including the transfer time to and from the processors. Provided sufficient bandwidth between CPU and FPGA, we show that a further increase to 48X speedup is possible.

2. Background

2.1. Number Representation. As mentioned in Section 1, precision and range are key resources to be traded off against the performance of a computation. In this work, we look at two different types of number representation: fixed-point and floating-point.

Fixed-Point Numbers. The fixed-point number has two parts, the integer part and the fractional part, in the format shown in Table 1.

When it uses a sign-magnitude format (the first bit defines the sign of the number), its value is given by $(-1)^{x_{m-1}} \cdot \sum_{i=-f}^{m-2} x_i \cdot 2^i$. It may also use a two's-complement format to indicate the sign.

Floating-Point Numbers. According to the IEEE-754 standard, floating-point numbers can be divided into three parts: the sign bit, the exponent, and the mantissa, as shown in Table 2.

Their values are given by $(-1)^S \times 1.F \times 2^M$. The sign bit defines the sign of the number. The exponent part uses a biased format: its stored value equals the sum of the original value and the bias, which is defined as $2^{m-1} - 1$. The extreme values of the exponent (0 and $2^m - 1$) are used for special cases, such as values of zero and ±∞. The mantissa is an unsigned fractional number, with an implied "1" to the left of the radix point.
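As a concrete illustration of the two value formulas above, the following C++ sketch (our addition; the function names and bit layouts are ours, chosen for illustration) evaluates a sign-magnitude fixed-point word and an IEEE-style floating-point triple for arbitrary m and f:

#include <cmath>
#include <cstdint>
#include <iostream>

// Value of a sign-magnitude fixed-point number: the top bit (x_{m-1}) is
// the sign, followed by m-1 integer bits and f fractional bits.
double fixedPointValue(uint64_t bits, int m, int f) {
    uint64_t magnitude = bits & ((1ULL << (m + f - 1)) - 1); // drop sign bit
    double sign = ((bits >> (m + f - 1)) & 1) ? -1.0 : 1.0;
    return sign * std::ldexp((double)magnitude, -f); // magnitude * 2^-f
}

// Value of a floating-point number with m exponent bits and f mantissa
// bits: (-1)^S * 1.F * 2^(M - bias), bias = 2^(m-1) - 1. The special
// exponent values (M == 0 and M == 2^m - 1) are ignored in this sketch.
double floatingPointValue(uint64_t S, uint64_t M, uint64_t F, int m, int f) {
    int bias = (1 << (m - 1)) - 1;
    double mantissa = 1.0 + std::ldexp((double)F, -f); // implied leading 1
    return (S ? -1.0 : 1.0) * std::ldexp(mantissa, (int)M - bias);
}

int main() {
    // 1 sign + 3 integer + 4 fractional bits: 0101.1000 in binary = 5.5
    std::cout << fixedPointValue(0b01011000, 4, 4) << "\n"; // prints 5.5
    // S=0, M=16 with 5 exponent bits (bias 15), F=8 with 4 mantissa bits:
    // 1.5 * 2^1 = 3
    std::cout << floatingPointValue(0, 16, 8, 5, 4) << "\n"; // prints 3
}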

2.2. Hardware Compilation Tool. We use a stream compiler (ASC) [5] as our hardware compilation tool to develop a range of different solutions for seismic applications. ASC was developed following research at Stanford University and Bell Labs, and is now commercialized by Maxeler Technologies. ASC enables the use of FPGAs as highly parallel stream processors. ASC is a C-like programming environment for FPGAs. ASC code makes use of C++ syntax and ASC semantics, which allow the user to program on the architecture level, the arithmetic level, and the gate level. ASC provides the productivity of high-level hardware design tools and the performance of low-level optimized hardware design. On the arithmetic level, PAM-Blox II provides an interface for custom arithmetic optimization. On the higher level, ASC provides types and operators to enable research on custom data representation and arithmetic. ASC hardware types are HWint, HWfix, and HWfloat. Utilizing these data types, we build libraries such as a function evaluation library, or develop special circuits to solve particular computational problems such as graph algorithms. Algorithm 1 shows a simple example of an ASC description for a stream architecture that doubles the input and adds "55."

// ASC code starts here
STREAM_START;

// Hardware Variable Declarations
HWint in (IN);
HWint out (OUT);
HWint tmp (TMP);

STREAM_LOOP (16);
tmp = (in << 1) + 55;
out = tmp;

// ASC code ends here
STREAM_END;

Algorithm 1: A simple ASC example.


The ASC code segment shows HWint variables and the familiar C syntax for equations and assignments. Compiling this program with "gcc" and running it creates a netlist, which can be transformed into a configuration bitstream for an FPGA.

2.3. Precision Analysis. There exist a number of research projects that focus on precision analysis, most of which are static methods that operate on the computational flow of the design and use techniques based on range and error propagation to perform the analysis.

Lee et al. [6] present a static precision analysis technique which uses affine arithmetic to derive an error model of the design and applies simulated annealing to find minimum bit widths that satisfy the given error requirement. A similar approach is shown in a bit-width optimization tool called Précis [7].

These techniques are able to perform an automated precision analysis of the design and provide optimized bit widths for the variables. However, they are not quite suitable for seismic imaging algorithms. The first reason is that seismic imaging algorithms usually involve numerous iterations, which can lead to overestimation of the error bounds and derive a meaningless error function. Secondly, the computation in the seismic algorithms does not have a clear error requirement: we can only judge the accuracy of the computation from the generated seismic image. Therefore, we choose to use a dynamic simulation method to explore different precisions, detailed in Sections 3.3 and 3.4.


2.4. Computation Bottlenecks in Seismic Applications. Downward-continued migration comes in various flavors, including common azimuth migration [8], shot profile migration, source-receiver migration, plane-wave or delayed-shot migration, and narrow azimuth migration. Depending on the flavor of the downward continuation algorithm, there are four potential computation bottlenecks.

(i) In many cases, the dominant cost is the FFT step. The dimensionality of the FFT varies from 1D (tilted plane-wave migration [9]) to 4D (narrow azimuth migration [10]). The FFT cost is often dominant due to its n log(n) cost ratio, n being the number of points in the transform, and the noncache-friendly nature of multidimensional FFTs.

(ii) The FK step, which involves evaluating (or looking up) a square root function and performing a complex exponential, is a second potential bottleneck. The high operational count per sample can eat up significant cycles.

(iii) The FX step, which involves a complex exponential, or sine/cosine multiplication, has a similar, but computationally less demanding, profile. Subsurface offset gathers for shot profile or plane-wave migration, particularly 3D subsurface offset gathers, can be an overwhelming cost. The large op-count per sample and the noncache-friendly nature of the data usage pattern can be problematic.

(iv) For finite difference-based schemes, a significant convolution cost can be involved.

The primary bottleneck of reverse time migration is applying the finite-difference stencil. In addition to the large operation count (5 to 31 samples per cell), the access pattern has poor cache behavior for real-size problems. Beyond applying the 3D stencil, the next most dominant cost is implementing damping boundary conditions. Methods such as perfectly matched layers (PMLs) can be costly [11]. Finally, if you want to use reverse time migration for velocity analysis, subsurface offset gathers need to be generated. The same cost profile that exists in downward continued-based migration exists for reverse time migration.

In this paper, we focus on two of the above computation bottlenecks: one is the FK step in downward continued-based migration, which includes a square root function and a complex exponential operation; the other is the 3D convolution in reverse time migration. We perform automated precision exploration of these two computation cores, so as to figure out the minimum precision that can still generate accurate enough seismic images.

3. A Tool for Number Representation Exploration

FPGA-based implementations have the advantage over current software-based implementations of being able to use customizable number representations in their circuit designs. On a software platform, users are usually constrained to a few fixed number representations, such as 32/64-bit integers and single/double-precision floating-point, while the reconfigurable logic and connections on an FPGA enable the users to explore various kinds of number formats with arbitrary bit widths. Furthermore, users are also able to design the arithmetic operations for these customized number representations, and thus can provide a highly customized solution for a given problem.

In general, to provide a customized number representation for an application, we need to determine the following three things.

(i) Format of the Number Representation. There are existing FPGA applications using fixed-point, floating-point, and logarithmic number system (LNS) [12]. Each of the three number representations has its own advantages and disadvantages over the others. For instance, fixed-point has simple arithmetic implementations, while floating-point and LNS provide a wide representation range. It is usually not possible to figure out the optimal format directly; exploration is needed to guide the selection.

(ii) Bit Widths of Variables. This problem is generally referred to as bit-width or word-length optimization [6, 13]. We can further divide this into two different parts: range analysis considers the problem of ensuring that a given variable inside a design has a sufficient number of bits to represent the range of the numbers, while in precision analysis, the objective is to find the minimum number of precision bits for the variables in the design such that the output precision requirements of the design are met.

(iii) Design of the Arithmetic Units. The arithmetic operations of each number system are quite different. For instance, in LNS, multiplication, division, and exponential operations become as simple as addition or shift operations, while addition and subtraction become nonlinear functions to approximate (a toy model of LNS arithmetic is sketched below). The arithmetic operations of regular data formats, such as fixed-point and floating-point, also have different algorithms with different design characteristics. On the other hand, evaluation of elementary functions plays a large part in seismic applications (trigonometric and exponential functions). Different evaluation methods and configurations can be used to produce evaluation units with different accuracies and performance.
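To make the LNS remark concrete, here is a toy C++ model (ours, for illustration only; a real LNS unit stores the logarithm as a fixed-point word and approximates the nonlinear addition function with tables): multiplication reduces to an addition of logarithms, while addition requires evaluating log2(1 + 2^d):

#include <cmath>
#include <iostream>

// Toy LNS value: a sign flag plus the base-2 logarithm of the magnitude.
struct Lns {
    bool negative;
    double log2mag; // in hardware this would be a fixed-point word
};

Lns fromDouble(double x) { return { x < 0, std::log2(std::fabs(x)) }; }
double toDouble(const Lns& a) {
    return (a.negative ? -1.0 : 1.0) * std::exp2(a.log2mag);
}

// Multiplication is just an addition of logs (and an XOR of signs).
Lns mul(const Lns& a, const Lns& b) {
    return { a.negative != b.negative, a.log2mag + b.log2mag };
}

// Addition of same-sign values needs the nonlinear function
// log2(1 + 2^d), which hardware approximates with lookup tables.
Lns add(const Lns& a, const Lns& b) { // same sign assumed
    double d = b.log2mag - a.log2mag;
    return { a.negative, a.log2mag + std::log2(1.0 + std::exp2(d)) };
}

int main() {
    Lns x = fromDouble(6.0), y = fromDouble(7.0);
    std::cout << toDouble(mul(x, y)) << "\n"; // prints 42
    std::cout << toDouble(add(x, y)) << "\n"; // prints 13
}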

This section presents our tool, which tries to figure out the above three design options by exploring all the possible number representations. The tool is partly based on our previous work on bit-width optimization [6] and comparison between different number representations [14, 15].

Figure 1 shows our basic work flow to explore different number representations for a seismic application. We manually partition the Fortran program into two parts: one part runs on CPUs, and we try to accelerate the other part (the target code) on FPGAs.


Figure 1 (flow chart): Basic steps to achieve a hardware design with customized number representations. The Fortran program for seismic processing is manually partitioned into Fortran code executing on processors and Fortran code targeting an FPGA; the FPGA-targeted code is profiled for range information (max/min values) and distribution information; it is then mapped to a circuit design (arithmetic operations and function evaluation), and the circuit design description is translated into bit-accurate simulation code, giving a value simulator with reconfigurable settings; exploration with different configurations (number representations, bit-width values, etc.) yields the final design with a customized number representation.

The partition is based on two metrics: (1) the target code shall consume a large portion of the processing time of the entire program, otherwise the acceleration does not bring enough performance improvement to the entire application; (2) the target code shall be suitable for a streaming implementation on FPGA, and thus highly probable to accelerate. After partition, the first step is to profile the target code to acquire information about the range of values and their distribution that each variable can take. In the second step, based on the range information, we map the Fortran code into a hardware design described in ASC format, which includes implementation of arithmetic operations and function evaluation. In the third step, the ASC description is translated into bit-accurate simulation code and merged into the original Fortran program to provide a value simulator for the original application. Using this value simulator, explorations can be performed with configurable settings such as different number representations, different bit widths, and different arithmetic algorithms. Based on the exploration results, we can determine the optimal number format for this application with regard to certain metrics such as circuit area and performance.

3.1. Range Profiling. In the profiling stage, the major objective is to collect range and distribution information for the variables. The idea of our approach is to instrument every target variable in the code, adding function calls to initialize data structures for recording range information and to modify the recorded information when the variable value changes.

Figure 2: Four points (a, b, c, d, placed around 0 on the number axis) to record in the profiling of range information.


For the range information of the target variables (variables to map into the circuit design), we keep a record of four specific points on the axis, shown in Figure 2. The points a and d represent the values far away from zero, that is, the maximum absolute values that need to be represented. Based on their values, the integer bit width of fixed-point numbers can be determined. Points b and c represent the values close to zero, that is, the minimum absolute values that need to be represented. Using both the minimum and maximum values, the exponent bit width of floating-point numbers can be determined.

For the distribution information of each target variable, we keep a number of buckets to store the frequency of values in different intervals. Figure 3 shows the distribution information recorded for the real part of variable "wfld" (a complex variable). In each interval, the frequency of positive and negative values is recorded separately. The results show that, for the real part of variable "wfld," in each interval the frequencies of positive and negative values are quite similar, and the major distribution of the values falls into the range 10^-1 to 10^4.

The distribution information provides a rough metric for the users to make an initial guess about which number representations to use. If the values of the variables cover a wide range, floating-point and LNS number formats are usually more suitable. Otherwise, fixed-point numbers shall be enough to handle the range.
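A minimal C++ sketch of this instrumentation (the class and method names are ours, for illustration; the actual tool instruments Fortran code): each profiled variable records the boundary points of Figure 2 and a set of decade buckets like those of Figure 3:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <limits>
#include <map>

// Records range (min/max absolute values) and a decade histogram for one
// profiled variable, as described in Section 3.1.
class RangeRecorder {
    double maxAbs_ = 0, minAbs_ = std::numeric_limits<double>::infinity();
    std::map<int, long> posBuckets_, negBuckets_; // decade index -> frequency
public:
    void record(double v) {
        double a = std::fabs(v);
        if (a == 0) { posBuckets_[-11]++; return; } // reserved zero bucket
        maxAbs_ = std::max(maxAbs_, a);
        minAbs_ = std::min(minAbs_, a);
        int decade = (int)std::ceil(std::log10(a)); // a in (10^(d-1), 10^d]
        (v > 0 ? posBuckets_ : negBuckets_)[decade]++;
    }
    // Integer bits suggested by the recorded maximum for a fixed-point format.
    int fixedPointIntegerBits() const {
        return (int)std::ceil(std::log2(maxAbs_ + 1)) + 1; // +1 for sign bit
    }
    void dump() const {
        std::printf("max |v| = %g, min |v| = %g\n", maxAbs_, minAbs_);
        for (const auto& [d, n] : posBuckets_)
            std::printf("bucket %d: +%ld\n", d, n);
    }
};

Calls to record() are conceptually inserted at every assignment to a target variable; fixedPointIntegerBits() then gives the integer bit width implied by the recorded maximum.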

3.2. Circuit Design: Basic Arithmetic and Elementary Function Evaluation. After profiling range information for the variables in the target code, the second step is to map the code into a circuit design described in ASC. As a high-level FPGA programming language, ASC provides hardware data types, such as HWint, HWfix, and HWfloat. Users can specify the bit-width values for hardware variables, and ASC automatically generates corresponding arithmetic units for the specified bit widths. It also provides configurable options to specify different optimization modes, such as AREA, LATENCY, and THROUGHPUT. In the THROUGHPUT optimization mode, ASC automatically generates a fully pipelined circuit. These features make ASC an ideal hardware compilation tool for retargeting a piece of software code onto the FPGA hardware platform.

With support for fixed-point and floating-point arithmetic operations, the target Fortran code can be transformed into ASC C++ code in a straightforward manner. We also have interfaces provided by ASC to modify the internal settings of these arithmetic units.

Besides basic arithmetic operations, evaluation of elementary functions takes a large part in seismic applications.


Figure 3 (histogram; x-axis: bucket index from -12 to 8, y-axis: frequency up to 5e+07, positive and negative values plotted separately): Range distribution of the real part of variable "wfld." The leftmost bucket with index = -11 is reserved for zero values. The other buckets with index = x store the values in the range (10^(x-1), 10^x].

For instance, in the first piece of target code we try to accelerate, the FK step, a large portion of the computation is to evaluate the square root and sine/cosine functions. To map these functions into efficient units on the FPGA board, we use a table-based uniform polynomial approximation approach, based on Dong-U Lee's work on optimizing hardware function evaluation [16]. The evaluation of the two functions can be divided into three different phases [17].

(i) Range reduction: reduce the range of the input variable x into a small interval that is convenient for the evaluation procedure. The reduction can be multiplicative (e.g., x' = x/2^(2n) for the square root function) or additive (e.g., x' = x - 2πn for sine/cosine functions).

(ii) Function evaluation: approximate the value of the function using a polynomial within the small interval.

(iii) Range reconstruction: map the value of the function in the small interval back into the full range of the input variable x.

To keep the whole unit small and efficient, we use degree-one polynomials so that only one multiplication and one addition are needed to produce the evaluation result. Meanwhile, to keep the approximation error small, the reduced evaluation range is divided into uniform segments. Each segment is approximated with a degree-one polynomial, using the minimax algorithm. In the FK step, the square root function is approximated with 384 segments in the range [0.25, 1] with a maximum approximation error of 4.74 × 10^-7, while the sine and cosine functions are approximated with 512 segments in the range [0, 2π] with a maximum approximation error of 9.54 × 10^-7.
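As a software model of such an evaluation unit, the C++ sketch below (ours) builds uniform degree-one segments for the square root on [0.25, 1] and evaluates with one multiplication and one addition. For simplicity it fits each segment at its endpoints, whereas the actual design uses minimax coefficients, so its maximum error (just under 1e-6) is roughly twice the 4.74 × 10^-7 quoted above:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Uniform-segment, degree-one polynomial evaluator for sqrt(x) on
// [0.25, 1], modelled after the table-based units described above.
struct SqrtTable {
    static constexpr int kSegments = 384;
    std::vector<double> c0, c1; // per-segment coefficients: c0 + c1*x

    SqrtTable() : c0(kSegments), c1(kSegments) {
        double w = 0.75 / kSegments; // segment width on [0.25, 1]
        for (int i = 0; i < kSegments; ++i) {
            double x0 = 0.25 + i * w, x1 = x0 + w;
            // Endpoint interpolation (the real unit uses minimax fits).
            c1[i] = (std::sqrt(x1) - std::sqrt(x0)) / w;
            c0[i] = std::sqrt(x0) - c1[i] * x0;
        }
    }
    double eval(double x) const { // x must lie in [0.25, 1)
        int i = (int)((x - 0.25) / (0.75 / kSegments));
        return c0[i] + c1[i] * x; // one multiply, one add
    }
};

int main() {
    SqrtTable t;
    double maxErr = 0;
    for (double x = 0.25; x < 1.0; x += 1e-5)
        maxErr = std::max(maxErr, std::fabs(t.eval(x) - std::sqrt(x)));
    std::printf("max error = %g\n", maxErr); // just under 1e-6
}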

3.3. Bit-Accurate Value Simulator. As discussed in Section 3.1, based on the range information, we are able to determine the integer bit width of fixed-point numbers, and partly determine the exponent bit width of floating-point numbers (as the exponent bit width relates not only to the range but also to the accuracy). The remaining bit widths, such as the fractional bit width of fixed-point and the mantissa bit width of floating-point numbers, are predominantly related to the precision of the calculation. In order to find the minimum acceptable values for these precision bit widths, we need a mechanism to determine whether a given set of bit-width values produces satisfactory results for the application.

In our previous work on function evaluation and other arithmetic designs, we set a requirement on the absolute error of the whole calculation, and use a conservative error model to determine whether the current bit-width values meet the requirement or not [6]. However, a specified requirement for absolute error does not work for seismic processing. To find out whether the current configuration of precision bit widths is accurate enough, we need to run the whole program to produce the seismic image, and find out whether the image contains the correct pattern information. Thus, to enable exploration of different bit-width values, a value simulator for different number representations is needed to provide bit-accurate simulation results for the hardware designs.

Since the simulator must produce bit-accurate results matching the corresponding hardware design, it also needs to be efficiently implemented, as we need to run the whole application (which takes days using the whole input dataset) to produce the image.

In our approach, the simulator works with ASC-format C++ code. It reimplements the ASC hardware data types, such as HWfix and HWfloat, and overloads their arithmetic operators with the corresponding simulation code. For HWfix variables, the value is stored in a 64-bit signed integer, while another integer is used to record the fractional point. The basic arithmetic operations are mapped into shifts and arithmetic operations of the 64-bit integers. For HWfloat variables, the value is stored in an 80-bit extended-precision floating-point number, with two other integers used to record the exponent and mantissa bit widths. To keep the simulation simple and fast, the arithmetic operations are processed using floating-point values. However, to keep the result bit accurate, during each assignment, by performing corresponding bit operations, we decompose the floating-point value into mantissa and exponent, truncate according to the exponent and mantissa bit widths, and combine them back into the floating-point value.
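The truncate-on-assignment idea can be modelled in a few lines of C++ (a simplified sketch of ours; the actual simulator overloads the full operator set of HWfix/HWfloat and uses 80-bit extended precision): arithmetic runs at high precision, and every assignment re-quantizes the value to the simulated exponent and mantissa widths:

#include <cmath>
#include <cstdio>

// Simplified model of a reduced-precision float: values are computed in
// double, then every assignment truncates the mantissa to 'mantBits' and
// clamps the exponent to 'expBits' (approximating the hardware behavior;
// denormals and rounding modes are ignored in this sketch).
double quantize(double v, int expBits, int mantBits) {
    if (v == 0) return 0;
    int e;
    double m = std::frexp(v, &e);            // v = m * 2^e, |m| in [0.5, 1)
    m = std::trunc(m * std::ldexp(1.0, mantBits + 1))
            / std::ldexp(1.0, mantBits + 1); // keep mantBits precision bits
    int bias = (1 << (expBits - 1)) - 1;
    if (e > bias + 1) return v > 0 ? HUGE_VAL : -HUGE_VAL; // overflow
    if (e < -bias + 2) return 0;                           // underflow to 0
    return std::ldexp(m, e);
}

int main() {
    // Simulate a 6-bit-exponent, 16-bit-mantissa multiply-accumulate chain.
    double acc = 0;
    for (int i = 1; i <= 4; ++i)
        acc = quantize(acc + quantize(0.1 * i, 6, 16), 6, 16);
    std::printf("%.10f\n", acc); // close to 1.0, minus truncation error
}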

3.4. Accuracy Evaluation of Generated Seismic Images. As mentioned above, the accuracy of a generated seismic image depends on the pattern contained inside, which estimates the geophysical status of the investigated area. To judge whether the image is accurate enough, we compare it to a "target" image, which is processed using single-precision floating-point and assumed to contain the correct pattern.

To perform this pattern comparison automatically, we use techniques based on prediction error filters (PEFs) [18] to highlight differences between two images. The basic work flow of comparing image a to image b (assume image a is the "target" image) is as follows.


(i) Divide image a into overlapping small regions of 40 × 40 pixels, and estimate PEFs for these small regions.

(ii) Apply these PEFs to both image a and image b to get the results a′ and b′.

(iii) Apply algebraic combinations of the images a′ and b′ to acquire a value indicating the image differences.

By the end of the above work flow, we achieve a single value which describes the difference between the generated image and the "target" image. For convenience of discussion afterwards, we call this value the "difference indicator" (DI).

Figure 4 shows a set of different seismic images calculated from the same dataset, and their DI values compared to the image with the correct pattern. The image showing the correct pattern is calculated using single-precision floating-point, while the other images are calculated using fixed-point designs with different bit-width settings. All these images are results of the bit-accurate value simulator mentioned above.

If the generated image contains no information at all (as shown in Figure 4(a)), the comparison does not return a finite value. This is mostly because a very low precision is used for the calculation: the information is lost during numerous iterations and the result contains only zeros or infinities. If the comparison result is in the range of 10^4 to 10^5 (Figures 4(b) and 4(c)), the image contains a random pattern which is far different from the correct one. With a comparison result in the range of 10^3 (Figure 4(d)), the image contains a pattern similar to the correct one, but information in some parts is lost. With a comparison result in the range of 10^2 or smaller, the generated image contains almost the same pattern as the correct one.

Note that the DI value is calculated from algebraic operations on the two images being compared. The magnitude of the DI value is only a relative indication of the difference between the two images. The actual usage of the DI value is to figure out the boundary between the images that contain mostly noise and the images that provide useful patterns of the earth model. From the samples shown in Figure 4, in this specific case, the DI value of 10^2 is a good guidance value for acceptable accuracy of the design. From the bit-width exploration results shown in Section 4, we can see that the DI value of 10^2 also happens to be a precision threshold, where the image turns from noise into an accurate pattern as the bit width increases.

3.5. Number Representation Exploration. Based on all the above modules, we can now perform exploration of different number representations for the FPGA implementation of a specific piece of Fortran code.

The current tool supports two different number representations, fixed-point and floating-point numbers (the value simulator for LNS is still in progress). For all the different number formats, the users can also specify arbitrary bit widths for each different variable.

There are usually a large number of different variables involved in one circuit design. In our previous work, we usually apply heuristic algorithms, such as ASA [19], to find a close-to-optimal set of bit-width values for the different variables. The heuristic algorithms may require millions of test runs to check whether a specific set of values meets the constraints or not. This is acceptable when the test run is only a simple error function and can be processed in nanoseconds. In our seismic processing application, depending on the problem size, it takes half an hour to several days to run one test set and achieve the resulting image. Thus, heuristic algorithms become impractical.

A simple and straightforward method to solve the problem is to use a uniform bit width over all the different variables, and either iterate over a set of possible values or use a binary search algorithm to jump to an appropriate bit-width value. Based on the range information and the internal behavior of the program, we can also try to divide the variables in the target Fortran code into several different groups, and assign a different uniform bit width to each group. For instance, in the FK step, there is a clear boundary: the first half performs square, square root, and division operations to calculate an integer value, and the second half uses the integer value as a table index, and performs sine, cosine, and complex multiplications to get the final result. Thus, in the hardware circuit design, we divide the variables into two groups based on which half they belong to. Furthermore, in the second half of the function, some of the variables are trigonometric values in the range [-1, 1], while the other variables represent the seismic image data and scale up to 10^6. Thus, they can be further divided into two parts and assigned bit widths separately.
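A sketch of the binary search alternative in C++ (ours; runSimulation stands in for a complete run of the application with the chosen uniform bit width, returning the DI of the generated image):

#include <cstdio>
#include <functional>

// Find the smallest uniform bit width whose simulated image has an
// acceptable difference indicator (DI). Assumes DI is monotonically
// non-increasing in bit width, a reasonable assumption given the
// threshold behavior reported in Section 4.
int minAcceptableBitWidth(int lo, int hi, double diThreshold,
                          const std::function<double(int)>& runSimulation) {
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        // One call = one full run of the application with 'mid'-bit
        // variables; this can take hours to days, so O(log n) calls matter.
        if (runSimulation(mid) <= diThreshold)
            hi = mid;     // accurate enough: try narrower
        else
            lo = mid + 1; // too noisy: need more bits
    }
    return lo;
}

int main() {
    // Hypothetical stand-in for the simulator: DI drops past a threshold.
    auto fakeSim = [](int bits) { return bits >= 16 ? 100.0 : 1e5; };
    std::printf("%d bits\n", minAcceptableBitWidth(10, 20, 200.0, fakeSim));
}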

4. Case Study I: The FK Step in Downward Continued-Based Migration

4.1. Brief Introduction. The code shown in Algorithm 2 is the computationally intensive portion of the FK step in a downward continued-based migration. The governing equation for the FK step is the double square root equation (DSR) [20]. The DSR equation describes how to downward continue a wave-field U one depth step Δz. The equation is valid for a constant velocity medium v and is based on the wave numbers of the source k_s and receiver k_g. The DSR equation can be written as (1), where ω is the frequency. The code takes the approach of building a priori a relatively small table of the possible values of vk/ω. The code then performs a table lookup that converts a given vk/ω value to an approximate value of the square root.

In practical applications, "wfld" contains millions of data items. The computation pattern of this function makes it an ideal target to map to a streaming hardware circuit on an FPGA.

4.2. Circuit Design. The mapping from the software code to a hardware circuit design is straightforward for most parts. Figure 5 shows the general structure of the circuit design. Compared with the software Fortran code shown in Algorithm 2, one big difference is the handling of the sine and cosine functions. In the software code, the trigonometric functions are calculated outside the five-level loop and stored in a lookup table. In the hardware design, to take advantage of the parallel calculation capability provided by the numerous logic units on the FPGA, the calculation of the sine/cosine functions is merged into the processing core of the inner loop.


Figure 4: Examples of seismic images (x from 2000 to 8000, y from 0 to 3000) with different Difference Indicator (DI) values: (a) DI = Inf; (b) DI = 10^5; (c) DI = 10^4; (d) DI = 10^3; (e) DI = 10^2; (f) DI = 10; (g) correct pattern (full-precision seismic image). "Inf" means that the approach does not return a finite difference value. "10^x" means that the difference value is in the range [1 × 10^x, 1 × 10^(x+1)).


! generation of table step%ctable
do i = 1, size (step%ctable)
   k = ko * step%dstep * dsr%phase (i)
   step%ctable (i) = dsr%amp (i) * cmplx (cos(k), sin(k))
end do

! the core part of function wei_wem
do i4 = 1, size (wfld, 4)
   do i3 = 1, size (wfld, 3)
      do i2 = 1, size (wfld, 2)
         do i1 = 1, size (wfld, 1)
            k = sqrt (step%kx (i1, i3) ** 2 + step%ky (i2, i4) ** 2)
            itable = max (1, min (int (1 + k/ko/dsr%d), dsr%n))
            wfld (i1, i2, i3, i4, i5) = wfld (i1, i2, i3, i4, i5) * step%ctable (itable)
         end do
      end do
   end do
end do

Algorithm 2: The code for the major computations of the FK step.

Table 3: Profiling results for the ranges of typical variables in function "wei_wem." "wfld_real" and "wfld_img" refer to the real and imaginary parts of the "wfld" data. "Max" and "Min" refer to the maximum and minimum absolute values of the variables.

Variable   step%x    ko         wfld_real    wfld_img
Max        0.377     0.147      3.918e6      3.752e6
Min        0         7.658e-3   4.168e-14    5.885e-14

Three function evaluation units are included in this design to produce values for the square root, cosine, and sine functions separately. As mentioned in Section 3.2, all three functions are evaluated using degree-one polynomial approximation with 384 or 512 uniform segments:

\[
U(\omega, k_s, k_g, z + \Delta z)
= \exp\left[-i\,\frac{\omega}{v}\left(\sqrt{1-\left(\frac{v k_g}{\omega}\right)^2} + \sqrt{1-\left(\frac{v k_s}{\omega}\right)^2}\,\right)\Delta z\right] U(\omega, k_s, k_g, z). \tag{1}
\]

The other task in the hardware circuit design is to map the calculation into arithmetic operations of certain number representations. Table 3 shows the value range of some typical variables in the FK step. Some of the variables (in the part of square root and sine/cosine function evaluations) have a small range within [0, 1], while other values (especially "wfld" data) have a wide range from 10^-14 to 10^6. If we use floating-point or LNS number representations, their wide representation ranges are enough to handle these variables. However, if we use fixed-point number representations in the design, special handling is needed to achieve acceptable accuracy over wide ranges.

The first issue to consider in fixed-point designs is the enlarged error caused by the division after the evaluation of the square root ($\sqrt{\text{step\%x}^2 + \text{step\%y}^2}/ko$).

Figure 5: General structure of the circuit design for the FK step. The inputs step%kx and step%ky feed sqrt_sum = step%kx² + step%ky²; a function evaluation unit computes sqrt_res = sqrt(sqrt_sum); the table index is itable = max(1, min(sqrt_res/ko/dsr%d, dsr%n)); k = ko × step%dstep × dsr%phase(itable) drives two function evaluation units a = cos(k) and b = sin(k); finally wfld is updated as wfld × cmplx(a, b) × dsr%amp(itable).

The values of step%x, step%y, and ko come from the software program as input values to the hardware circuit, and contain errors propagated from previous calculations or caused by the truncation/rounding into the specified bit width on hardware. Suppose the error in the square root result sqrt_res is E_sqrt, and the error in variable ko is E_ko. Assuming that the division unit itself does not introduce extra error, the error in the division result is given by E_sqrt · sqrt_res/ko + E_ko · (sqrt_res/ko²).


According to the profiling results, ko holds a dynamic range from 0.007658 to 0.147, and sqrt_res has a maximum value of 0.533 (variables step%x and step%y have similar ranges). In the worst case, the error from sqrt_res can be magnified by 70 times, and the error from ko magnified by approximately 9000 times.

To solve the problem of enlarged errors, we perform shifts at the input side to keep the three values step%x, step%y, and ko in a similar range. The variable ko is shifted by a distance d1 so that the value after shifting falls in the range [0.5, 1). The variables step%x and step%y are shifted by another distance d2 so that the larger of the two also falls in the range [0.5, 1). The difference between d1 and d2 is recorded so that after the division, the result can be shifted back into the correct scale. In this way, sqrt_res has a range of [0.5, 1.414] and ko has a range of [0.5, 1]. Thus, the division only magnifies the errors by a factor of 3 to 6. Meanwhile, as the three variables step%x, step%y, and ko are originally in single-precision floating-point representation in software, when we pass their values after shifts, a large part of the information stored in the mantissa can be preserved. Thus, better accuracy is achieved through the shifting mechanism for fixed-point designs.
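The shifting mechanism can be modelled as follows (a C++ sketch of ours using doubles, where the point is the range control rather than the precision; the hardware applies the same idea to fixed-point words):

#include <cmath>
#include <cstdio>

// Normalize a nonzero x into [0.5, 1) by a power-of-two shift; returns
// the mantissa x' and the shift distance d such that x = x' * 2^d.
double normalize(double x, int& d) {
    return std::frexp(x, &d); // frexp gives exactly this decomposition
}

// Divide sqrt_res by ko using the shifting mechanism from Section 4.2:
// both operands are normalized into [0.5, 1), divided, and the quotient
// is shifted back by the recorded difference of shift distances.
double shiftedDivide(double sqrt_res, double ko) {
    int d1, d2;
    double koN = normalize(ko, &d1);       // ko = koN * 2^d1
    double srN = normalize(sqrt_res, &d2); // sqrt_res = srN * 2^d2
    double q = srN / koN;                  // quotient stays in (0.5, 2)
    return std::ldexp(q, d2 - d1);         // shift back to the true scale
}

int main() {
    double a = 0.533, b = 0.007658;
    std::printf("%g vs %g\n", shiftedDivide(a, b), a / b); // identical
}

In the fixed-point hardware, the normalized operands occupy the full word width, which is what limits the error magnification of the division to the factor of 3 to 6 quoted above.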

Figure 6 shows experimental results on the accuracy of the table index calculation when using shifting compared to not using shifting, with different uniform bit widths. The possible range of the table index result is from 1 to 2001. As it is the index for tables of smooth sequential values, an error within five indices is generally acceptable. We use the table index results calculated with single-precision floating-point as the true values for error calculation. When the uniform bit width of the design changes from 10 to 20, designs using the shifting mechanism show a stable maximum error of 3, and an average error around 0.11. On the other hand, the maximum errors of designs without shifting vary from 2000 to 75, and the average errors vary from approximately 148 to 0.5. These results show that the shifting mechanism provides much better accuracy for the table index calculation in fixed-point designs.

The other issue to consider is the representation of the "wfld" data variables. As shown in Table 3, both the real and imaginary parts of the "wfld" data have a wide range from 10^-14 to 10^6. Generally, fixed-point numbers are not suitable to represent such wide ranges. However, in this seismic application, the "wfld" data is used to store the processed image information. It is more important to preserve the pattern information shown in the data values than the data values themselves. Thus, by omitting the small values and using the limited bit width to store the information contained in large values, fixed-point representations still have a big chance of achieving an accurate image in the final step. In our design, for convenience of bit-width exploration, we scale down all the "wfld" data values by a ratio of 2^-22 so that they fall into the range [0, 1).

4.3. Bit-Width Exploration Results. The original software Fortran code of the FK step performs the whole computation using single-precision floating-point. We first replace the original Fortran code of the FK step with a piece of C++ code using double-precision floating-point to generate a full-precision image to compare with.

Figure 6 (log-scale plot of maximum/average error versus uniform bit width): Maximum and average errors for the calculation of the table index when using and not using the shifting mechanism in fixed-point designs, with different uniform bit-width values from 10 to 20.

After that, to investigate the effect of different number representations for variables in the FK step on the accuracy of the whole application, we replace the code of the FK step with our simulation code, which can be configured with different number representations and different bit widths, and generate results for different settings. The approach for accuracy evaluation, introduced in Section 3.4, is used to provide DI values that indicate the differences between the patterns of the resulting seismic images and the pattern in the full-precision image.

4.3.1. Fixed-Point Designs. In the first step, we apply a uniform bit width over all the variables in the design. We change the uniform bit width from 10 to 20. With a uniform bit width of 16, the design provides a DI value around 100, which means that the image contains a pattern almost the same as the correct one.

In the second step, as mentioned in Section 3.5, according to their characteristics in range and operational behavior, we can divide the variables in the design into different groups and apply a uniform bit width within each group. In the hardware design for the FK step, the variables are divided into three groups: SQRT, the part from the beginning to the table index calculation, which includes the evaluation of the square root; SINE, the part from the end of SQRT to the evaluation of the sine and cosine functions; and WFLD, the part that multiplies the complex values of "wfld" data with a complex value consisting of the sine and cosine values (for phase modification) and a real value (for amplitude modification). To perform the accuracy investigation, we keep two of the bit-width values constant, and change the other one gradually to see its effect on the accuracy of the entire application.

Figure 7(a) shows the DI values of the generated images when we change the bit width of the SQRT part from 6 to 20.


The bit widths of the SINE and WFLD parts are set to 20 and 30, respectively. Large bit widths are used for the other two parts so that they do not contribute much to the errors and the effect of the SQRT variables' bit width can be extracted out. The case of SQRT bit widths shows a clear precision threshold at the bit-width value of 10. When the SQRT bit width increases from 8 bits to 10 bits, the DI value falls from the scale of 10^5 to the scale of 10^2. The significant improvement in accuracy is also demonstrated in the generated seismic images. The image on the left of Figure 7(a) is generated with the 8-bit design. Compared to the "true" image calculated with single-precision floating-point, the upper part of the image is mainly noise signals, while the lower part starts to show a pattern similar to the correct one. The difference between the qualities of the lower and upper parts is because of the imaging algorithm, which calculates the image from a summation of a number of points at the corresponding depth. In acoustic models, there are generally more sample points when we go deeper into the earth. Therefore, using the same precision, the lower part shows a better quality than the upper part. The image on the right of Figure 7(a) is generated with the 10-bit design, and already contains almost the same pattern as the "true" image.

In a similar way, we perform the exploration for the other two parts, and acquire precision thresholds of 10, 12, and 16 for the SQRT, SINE, and WFLD parts, respectively. However, as the above results are acquired with two out of the three bit widths set to very large values, the practical solution shall be slightly larger than these values. Meanwhile, constrained by the current I/O bandwidth of 64 bits per cycle, the sum of the bit widths for the SQRT and WFLD parts shall be less than 30. We perform further experiments for bit-width values around the initial guess point, and find that bit widths of 12, 16, and 16 for the three parts provide a DI value of 131.5 and also meet the bandwidth requirement.

4.3.2. Floating-Point Designs. In the floating-point design of the FK step, we perform an exploration of different exponent and mantissa bit widths. Similar to the fixed-point designs, we use a uniform bit width for all the variables. When we investigate one of them, we keep the other one at a constant high value.

Figure 7(b) shows the case where we change the exponent bit width from 3 to 10, while we keep the mantissa bit width at 24. There is again a clear cut at the bit width of 6. When the exponent bit width is smaller than 6, the DI value of the generated image is at the level of 10^5. When the exponent bit width increases to 6, the DI value decreases to around 1.

With a similar exploration of the mantissa bit width, we figure out that an exponent bit width of 6 and a mantissa bit width of 16 provide the minimum bit widths needed to achieve a DI value around 10^2. Experiment confirms that this combination produces an image with a DI value of 43.96.

4.4. Hardware Acceleration Results. The hardware acceleration tool used in this project is the FPGA computing platform MAX-1, provided by Maxeler Technologies [21]. It contains a high-performance Xilinx Virtex IV FX100 FPGA, which consists of 42176 slices, 376 BRAMs, and 192 embedded multipliers. Meanwhile, it provides a high-bandwidth interface of PCI Express X8 (2 GB per second) to the software side residing in CPUs.

Table 4: Speedups achieved on FPGA compared to software solutions (Xilinx Virtex IV FX100 FPGA versus an Intel Xeon CPU at 1.86 GHz).

Size of dataset   Software time   FPGA time   Speedup
43056             5.32 ms         0.84 ms     6.3
216504            26.1 ms         3.77 ms     6.9

Table 5: Resource cost of the FPGA design for the FK step in downward continued-based migration.

Type of resource       Used units   Percentage
Slices                 12032        28%
BRAMs                  59           15%
Embedded multipliers   16           10%


Based on the exploration results for the different number representations, the fixed-point design with bit widths of 12, 16, and 16 for the three different parts is selected for our hardware implementation. The design produces images containing the same pattern as the double-precision floating-point implementation, and has the smallest bit-width values, that is, the lowest resource cost among all the different number representations.

Table 4 shows the speedups we can achieve on FPGA compared to software solutions running on an Intel Xeon CPU at 1.86 GHz. We experiment with two different sizes of datasets. For each of the datasets, we record the processing time over 10,000 runs and calculate the average as the result. Speedups of 6.3 and 6.9 times are achieved for the two different datasets, respectively.

Table 5 shows the resource cost of implementing the FK step on the FPGA card. It utilizes 28% of the logic units, 15% of the BRAMs (memory units), and 10% of the arithmetic units. Considering that a large part (around 20%) of the used logic units are circuits handling PCI-Express I/O, there is still much potential to put more processing cores onto the FPGA card and to gain even higher speedups.

5. Case Study II: 3D Convolution in Reverse Time Migration

3D convolution is one of the major computation bottlenecks in reverse time migration algorithms. In this paper, we implemented a 6th-order acoustic modeling kernel to investigate the potential speedups on FPGAs. The 3D convolution uses a kernel with 19 elements. Once each line of the kernel has been processed, it is scaled by a constant factor.

One of the key challenges in implementing 3D convolution is how to keep fast access to all the data elements needed for the 19-point operation. As the data items are generally stored along one direction, when you want to access the data items in a 3D pattern, you need to either buffer a large amount of data items or access them in a very slow nonlinear pattern.


Figure 7: Exploration of fixed-point and floating-point designs with different bit widths. (a) DI values of the generated images (log scale, 1E+00 to 1E+06) for SQRT bit widths from 4 to 22 in a fixed-point design, alongside reduced-precision seismic images for 8-bit and 10-bit fixed-point and the "true" single-precision floating-point image; (b) DI values (log scale, 1E-01 to 1E+06) for exponent bit widths from 2 to 12 in a floating-point design, alongside reduced-precision images for 5-bit and 6-bit exponents and the "true" image.


In our FPGA design, we solve this problem by buffering the current block we process into the BRAM FIFOs. ASC provides a convenient interface to automatically buffer the input values into BRAMs, and the users can access them by specifying the cycle number at which the value was read in. Thus, we can easily index into the stream to obtain values already sent to the FPGA and perform the 3D operator.
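The fixed-offset indexing can be modelled in C++ as follows (our sketch; the names and weight layout are ours). A 6th-order star stencil reads, relative to the current position in the linearized nx × ny × nz volume, neighbours at offsets ±r, ±r·nx, and ±r·nx·ny for r = 1..3, which is exactly the 19-point pattern; on the FPGA these offsets address BRAM-buffered stream history instead of an array:

#include <cstdio>
#include <vector>

// 19-point (6th-order, radius-3) star stencil over a linearized volume.
// On the FPGA, the same fixed offsets index into BRAM-buffered stream
// history; here they index into a plain array.
void stencil19(const std::vector<float>& in, std::vector<float>& out,
               int nx, int ny, int nz, const float w[4] /* w[0..3] */) {
    const int sx = 1, sy = nx, sz = nx * ny; // strides per axis
    for (int z = 3; z < nz - 3; ++z)
        for (int y = 3; y < ny - 3; ++y)
            for (int x = 3; x < nx - 3; ++x) {
                int i = (z * ny + y) * nx + x;
                float acc = w[0] * in[i]; // centre point
                for (int r = 1; r <= 3; ++r) // 18 neighbours, 6 per radius
                    acc += w[r] * (in[i - r * sx] + in[i + r * sx] +
                                   in[i - r * sy] + in[i + r * sy] +
                                   in[i - r * sz] + in[i + r * sz]);
                out[i] = acc; // one result per "cycle" in the FPGA version
            }
}

int main() {
    int n = 16;
    std::vector<float> in(n * n * n, 1.0f), out(n * n * n, 0.0f);
    const float w[4] = {0.5f, 0.1f, 0.05f, 0.01f}; // illustrative weights
    stencil19(in, out, n, n, n, w);
    std::printf("%f\n", out[(8 * n + 8) * n + 8]); // 0.5 + 6*(0.16) = 1.46
}

Note that the ±r·nx·ny offsets imply buffering roughly 3·nx·ny stream elements on-chip, which is why the block size is limited by the available BRAMs, as discussed below.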

Compared to the 3D convolution processed on CPUs, the FPGA has two major advantages. One is the capability of performing computations in parallel. We exploit the parallelism of the FPGA to calculate one result per cycle. When ASC assigns the elements to BRAMs, it does so in such a way as to maximize the number of elements that can be obtained from the BRAMs every cycle. This means that consecutive elements of the kernel must not, in general, be placed in the same BRAM. The other advantage is the support for application-specific number representations. By using 20-bit fixed-point (the minimum bit-width setting that provides acceptable accuracy), we can greatly reduce the area cost and thus put more processing units onto the FPGA.

We test the convolution design on a data size of 700 × 700 × 700. Computing over the entire volume at once (as is the case when a high-performance processor is used) requires a large local memory (in the case of the processor, a large cache). The FPGA has limited resources on-chip (376 BRAMs, each of which can hold 512 32-bit values). To solve this problem, we break the large dataset into cubes and process them separately. To utilize all of our input and output bandwidth, we assign 3 processing cores to the FPGA, resulting in 3 inputs and 3 outputs per cycle at 125 MHz (constrained by the throughput of the PCI-Express bus). This gives us a theoretical maximum throughput of 375 M results per second.

The disadvantage of breaking the problem into smaller blocks is that the boundaries of each block are essentially wasted (although a minimal amount of reuse can occur), because they must be resent when the adjacent block is calculated. We do not consider this a problem since the blocks we use are at least 100 × 100 × 700, which means only a small proportion of the data is resent.

In software, the convolution executes in 11.2 seconds on average. The experiment was carried out using a dual-processor machine (each a quad-core Intel Xeon at 1.86 GHz) with 8 GB of memory.

In hardware, using the MAX-1 platform, we perform the same computation in 2.2 seconds and obtain a 5 times speedup. The design uses 48 DSP blocks (30%), 369 RAMB16 blocks (98%), and 30,571 slices (72%) on the Virtex IV chip. This means that there is room on the chip to substantially increase the kernel size. For a larger kernel (31 points), the speedup should be virtually linear, resulting in an 8 times speedup compared to the CPU implementation.

6. Further Potential Speedups

One of the major constraints on achieving higher speedups on FPGAs is the limited bandwidth between the FPGA card and the CPU. With the current PCI-Express interface provided by the MAX-1 platform, in each cycle we can only read 8 bytes into the FPGA card and write back 8 bytes to the system.

An example is the implementation of the FK step, described in Section 4. As shown in Algorithm 2, in our current designs, we take step%kx, step%ky, and both the real and imaginary parts of wfld as inputs to the circuit on the FPGA, and take the modified real and imaginary parts of wfld as outputs. Therefore, although there is much space on the FPGA card to support multiple cores, the interface bandwidth can only support one single core, giving a speedup of around 7 times.

However, in the specific case of the FK step, there are further techniques we can utilize to gain more speedup. From the code in Algorithm 2, we can see that wfld varies with all four loop indices, while step%kx and step%ky only vary with two of the four loop indices. To take advantage of this characteristic, we can divide the processing of the loop into two parts: in the first part, we use the bandwidth to read in the step%kx and step%ky values, without doing any calculation; in the second part, we can devote the bandwidth to reading in wfld data only, and start the processing as well. In this pattern, suppose we are processing a 100 × 100 × 100 × 100 four-level loop; the bandwidth can support two cores processing concurrently while spending 1 out of every 100 cycles to read in the step%kx and step%ky values in advance. In this way, we are able to achieve a speedup of 6.9 × 2 × 100/101 ≈ 13.7 times. Furthermore, assuming there is unlimited communication bandwidth, the cost of BRAMs (15%) becomes the major constraint. We can then put 6 concurrent cores on the FPGA card and achieve a speedup of 6.9 × 7 ≈ 48 times.

Another possibility is to put as much computation as possible onto the FPGA card, and reduce the communication cost between FPGA and CPU. If multiple portions of the algorithm are performed on the FPGA without returning to the CPU, the additional speedup can be considerable. For instance, as mentioned in Section 2, the major computation cost in downward continued-based migration lies in the multidimensional FFTs and the FK step. If the FFT and the FK step can reside simultaneously on the FPGA card, the communication cost between the FFT and the FK step can be eliminated completely. In the case of 3D convolution in reverse time migration, multiple time steps can be applied simultaneously.

7. Conclusions

This paper describes our work on accelerating seismic applications by using customized number representations on FPGAs. The focus is to improve the performance of the FK step in downward continued-based migration and the acoustic 3D convolution kernel in reverse time migration. To investigate the tradeoff between precision and speed, we develop a tool that performs an automated precision exploration of different number formats and determines the minimum precision that can still generate good enough seismic results. By using the minimized number format, we implement the FK step in downward continued-based migration and the 3D convolution in reverse time migration on FPGA, and show speedups ranging from 5 to 7 when including the transfer time to and from the processors. We also show that there is further potential to accelerate these applications by over 10 or even 48 times.

Acknowledgments

The support from the Center for Computational Earth and Environmental Science, the Stanford Exploration Project, the Computer Architecture Research Group at Imperial College London, and Maxeler Technologies is gratefully acknowledged. The authors would also like to thank Professor Martin Morf and Professor Michael Flynn for their support and advice.


Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 507426, 6 pages
doi:10.1155/2009/507426

Research Article

An FPGA Implementation of a Parallelized MT19937 Uniform Random Number Generator

Vinay Sriram and David Kearney

University of South Australia, Reconfigurable Computing Laboratory, School of Computer and Information Science, Mawson Lakes Campus, Adelaide, SA 5085, Australia

Correspondence should be addressed to Vinay Sriram, [email protected]

Received 20 August 2008; Revised 16 February 2009; Accepted 21 April 2009

Recommended by Miriam Leeser

Recent times have witnessed an increase in the use of high-performance reconfigurable computing for accelerating large-scale simulations. A characteristic of such simulations, like infrared (IR) scene simulation, is the use of large quantities of uncorrelated random numbers. It is therefore of interest to have a fast uniform random number generator implemented in reconfigurable hardware. While there have been previous attempts to accelerate the MT19937 pseudouniform random number generator using FPGAs, we believe that we can substantially improve the previous implementations to develop a higher throughput and more area-time efficient design. Because random number generators lend themselves to parallel implementation, designs that have both a small area footprint and high throughput are to be preferred to ones that achieve high throughput at a significant extra cost in area. In this paper, we first present a single port design and then present an enhanced 624 port hardware implementation of the MT19937 algorithm. The 624 port hardware implementation, when implemented on a Xilinx XC2VP70-6 FPGA chip, has a throughput of 119.6 × 10^9 32 bit random numbers per second, which is more than 17x that of the previously best published uniform random number generator. Furthermore, it has the lowest area-time metric of all the currently published FPGA-based pseudouniform random number generators.

Copyright © 2009 V. Sriram and D. Kearney. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Reconfigurable computing is increasingly being seen as an attractive solution for accelerating simulations that require fast generation of large quantities of random numbers. Although random numbers are often only a very small part of these algorithms, an inability to generate them fast enough can cause a bottleneck in the reconfigurable computing implementation of the simulation. For example, in the simulated generation of infrared scenes that take into account the effects of a turbulent atmosphere and the effects of CCD camera sensor electronic noise, each 352 × 352 scene generated by the simulation requires more than 1.87 × 10^6 gaussian random numbers. A real-time simulation sequence at 15 scenes/s thus needs more than 28.1 × 10^6 random samples generated per second. Since a typical software uniform generator [1] can only manage 10 × 10^6 per second, roughly three PCs would be needed to keep up with this rate.
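These rates follow directly from the figures quoted above, as the short Python check below shows (the per-scene count and frame rate are from the text; the 10 × 10^6 samples/s software figure is the one cited from [1]):

    samples_per_scene = 1.87e6       # gaussian samples per 352 x 352 scene
    scenes_per_second = 15           # real-time IR simulation rate
    required = samples_per_scene * scenes_per_second
    software_rate = 10e6             # samples/s for a software generator [1]
    print(required)                  # ~2.8e7 samples needed per second
    print(required / software_rate)  # ~2.8 PCs' worth of software generation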

A key requirement of infrared (IR) scene simulation is the necessity to generate large sequences of random numbers from a single seed. Not all random number generators are capable of doing this (e.g., see those presented in [2]). Moreover, in order to achieve the required high throughput, it is important to exploit the algorithm's internal parallelism (i.e., by splitting the algorithm into independent subsequences) as well as its external parallelism (i.e., through parallel implementations of the algorithm). It has been recommended in [3] and reinforced in [4] that, in order to prevent possible correlations between output sequences in parallel implementations of the same algorithm using different initial seeds, it is necessary to use a random number generator with a period greater than 2^200. In summary, the requirements of an FPGA optimized uniform random number generator for IR scene simulation are as follows:

(1) should be a seeded random number generator (so that the same sequence may be regenerated);


[Figure 1: Single port version. Three pipeline stages: stage 1, the seed generator; stage 2, the seed value modulator, built around a dual port RAM holding 624 32 bit seeds and a FIFO buffer of 397 seeds; stage 3, the output generator producing the random number.]

(2) have the ability to generate a large quantity of random numbers from one seed;

(3) can be split into many independent subsequences;

(4) have a very large period;

(5) generate random numbers quickly;

(6) satisfy statistical tests for randomness;

(7) able to generate parallel streams of uncorrelated random numbers.

2. Survey of FPGA-Based Uniform Random Number Generators

As discussed in the previous section, IR scene simulation [5] requires fast generation of large sequences of random numbers from a single seed. From the extensive literature in the field of software pseudouniform random number generators, two algorithms that achieve this are the generalized feedback shift register and the MT19937. They both have the ability to generate large sequences of random numbers from a single seed and can be split into independent subsequences to allow for a more parallelized implementation in hardware. An additional benefit of these algorithms is that they have large periods. It has been recommended in [1] and reinforced in [4] that, in order to prevent possible problems with correlations when implementing the same algorithm with different initial seeds in parallel, the algorithm needs to have a period in excess of 2^200. The MT19937 algorithm has a period of 2^19937, which therefore allows for parallel implementation of the MT19937 algorithm with different initial seeds.

There are currently two FPGA optimized implementations of the MT19937: a single port design, see [6], and a multiport design presented in [7]. The well-known generalized feedback shift register has been modified for FPGA implementation in [8] to achieve the smallest area-time design to date. Thus it is of interest to see whether a hardware implementation of a 624 port MT19937 algorithm can be made more competitive. This is the subject of investigation of this paper. The remainder of the paper is organized as follows. In Section 3 the MT19937 algorithm is briefly described. In Sections 4 and 5 we present single port and 624 port hardware implementations of the MT19937 algorithm. In Section 6, diehard test results of the two hardware implementations, along with performance comparisons of these implementations against other published FPGA-based pseudouniform random number generators, are presented.

3. MT19937

The origins of the MT19937 algorithm lie in the Tausworthe generator, a linear feedback shift register that produces long sequences of binary bits; see [9]. The period of this generator depends on its characteristic polynomial, which is irreducible: the period is the smallest integer n for which x^n + 1 is divisible by the characteristic polynomial. The recurrence has the following form:

x_{n+1} = (A_1 x_n + A_2 x_{n−1} + · · · + A_k x_{n−k+1}) mod 2,  (1)

where x_i, A_i ∈ {0, 1} for all i. Although this algorithm produces uniform random bits, it is slow. It was later modified by Lewis and Payne in [10], who created a variant known as the generalized feedback shift register:

x_i = x_{i−p} ⊕ x_{i−q},  (2)

where each x_i is a vector of size w with components 0 or 1. The maximum possible period of this generator, 2^p − 1, is achieved when the primitive trinomial x^p + x^q + 1 divides x^n − 1 for n = 2^p − 1 as the smallest such n; the maximum period can be obtained by selecting n as a Mersenne prime. It was later identified that the effectiveness of this algorithm, that is, the randomness of the numbers produced, was dependent on the selection of the initial seeds, and that the algorithm required n words of working area, which was memory consuming. This discovery led Matsumoto and Kurita in 1994 to revise the algorithm and develop the twisted generalized feedback shift register II in [11]. This generator used linear combinations of only relatively few bits of the preceding numbers and was thus considerably faster; it was named TT800. Later, Matsumoto and Nishimura in 1998 further revised the TT800 to admit a Mersenne-prime period, and this new algorithm was called the MT19937; see [12].
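Recurrences (1) and (2) are easy to prototype in software. In the Python sketch below, the tap polynomial x^4 + x^3 + 1 (so p = 4, q = 3) is an arbitrary small illustration chosen because it is primitive; it is not a parameter used by any generator in this paper:

    def tausworthe_bits(taps, state, count):
        """Recurrence (1): each new bit is a mod-2 linear combination of the
        previous k bits. taps = [A_1, ..., A_k]; state holds the last k bits,
        most recent first."""
        out = []
        for _ in range(count):
            bit = sum(a * x for a, x in zip(taps, state)) % 2
            out.append(bit)
            state = [bit] + state[:-1]
        return out

    def gfsr_words(p, q, state, count, w=8):
        """Recurrence (2): each new w-bit word is the XOR of the words p and
        q steps back, i.e., w Tausworthe sequences running in parallel."""
        out = []
        for _ in range(count):
            word = (state[-p] ^ state[-q]) & ((1 << w) - 1)
            out.append(word)
            state.append(word)
        return out

    print(tausworthe_bits([1, 0, 0, 1], [1, 0, 0, 0], 15))  # period 15 = 2^4 - 1
    print(gfsr_words(4, 3, [1, 2, 3, 4], 6))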

The MT19937 algorithm generates sequences of uniformly distributed pseudorandom integers of 32 (or, in a variant, 53) bits in the range [0, 2^w − 1). It is based on the following linear recurrence, where the x_i are word vectors, A is a w × w matrix chosen so that multiplication by A reduces to a one-bit shift and a conditional XOR with the word vector a, x_k^u denotes the upper w − r bits of x_k, and x_{k+1}^l denotes the lower r bits of x_{k+1}. The proof of the algorithm is provided in [12]:

x_{k+n} = x_{k+m} ⊕ (x_k^u | x_{k+1}^l) A,  (3)

where k = 0, 1, . . . .
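Since this recurrence, the 624-word seed pool, and the tempering network drive the hardware design of the following sections, a compact software reference helps to fix the details. The Python sketch below uses the standard published MT19937 constants (n = 624, m = 397, the twist constant 0x9908B0DF, and the tempering shifts and masks that also appear in Figure 2); it is our illustrative rendering of the published algorithm, not the authors' hardware description:

    # Software reference sketch of MT19937 (standard published constants).
    N, M = 624, 397
    UPPER, LOWER = 0x80000000, 0x7FFFFFFF
    MASK32 = 0xFFFFFFFF

    def init_pool(seed):
        """Stage 1 behaviour: expand one user seed into a pool of 624 words."""
        mt = [seed & MASK32]
        for i in range(1, N):
            mt.append((1812433253 * (mt[i - 1] ^ (mt[i - 1] >> 30)) + i) & MASK32)
        return mt

    def twist(mt):
        """Seed value modulator: recurrence (3) applied to every pool word."""
        for i in range(N):
            y = (mt[i] & UPPER) | (mt[(i + 1) % N] & LOWER)
            mt[i] = mt[(i + M) % N] ^ (y >> 1) ^ (0x9908B0DF if y & 1 else 0)

    def temper(y):
        """Output generator: the shift-and-mask network of Figure 2."""
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        return (y ^ (y >> 18)) & MASK32

    mt = init_pool(5489)
    twist(mt)
    print([temper(x) for x in mt[:4]])  # the first four 32 bit outputs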

4. Single Port Version

This section describes our first hardware implementation of MT19937, which we call the single port version.


[Figure 2: Internal logic of stage 2 and stage 3. The stage 2 network combines Seed[i], Seed[i + 1], and Seed[i + 397] through masks (0x80000000L, 0x7FFFFFFFL), a right shift by 1, and a multiplexer keyed by the low bit (0x1L) selecting between the constants mag1 and mag2, producing New seed[i]; the stage 3 output generator tempers the result with shifts of 11, 7, and 15 and the masks 0x9D2C5680UL and 0xEFC60000UL, yielding the uniform random number.]

Generation of random numbers is carried out in three stages, namely, the seed generator, the seed value modulator, and the output generator. This is illustrated in Figure 1.

Typically the user provides one number as a seed; however, the MT19937 algorithm works with a pool of 624 seeds, so the seed generator stage expands the single user input into 624 seeds. In stage two (the seed value modulator), which is the core of the algorithm, three values, seed[i], seed[i + 1], and seed[i + 396], are read from the pool and, based on the computation defined in the algorithm, seed[i] is updated. In the final stage, the output generator reads one of the pool values and generates the output uniform random number from it.

The logic used to generate values out of stages 2 and 3 is shown in Figure 2. The simplest form of parallelism for MT19937 is to perform stages 2 and 3 in parallel, and this is illustrated in Figure 2. Note that it is not possible to pipeline the output generator more finely, because its processing rate is tied to the seed value modulator, which can only be pipelined into 3 stages. In other words, the seed value modulator is a bottleneck in the design. It should be pointed out that if the data comes from a dual port BRAM, only one value can be read and one written in the same clock cycle. Since we need three values to be read, we use 3 dual port BRAMs. We then need logic to decide which BRAM to write into. This write-back selection logic forms another stage in the seed value modulator, which now has 4 stages. Not shown in Figure 1 is the logic that determines the BRAM address to be read from and written to. The single port version generates one new random number per clock cycle. In Figure 2, mag1, mag2, and the hex numbers are constants given in the algorithm definition.
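The stage overlap can be visualized with a small cycle-level Python model. This is a sketch only: update() below is a simplified stand-in for the full Figure 2 seed value modulator, the final right shift by 18 in temper() is the standard MT19937 tempering constant (not visible in the figure), and BRAM addressing is ignored. It only illustrates how one tempered number emerges per clock once the pipeline has filled:

    MASK32 = 0xFFFFFFFF

    def update(a, b, c):                    # simplified stand-in for stage 2
        return (c ^ ((a ^ b) >> 1)) & MASK32

    def temper(y):                          # stage 3: the Figure 2 network
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        return (y ^ (y >> 18)) & MASK32

    pool = [(i * 2654435761) & MASK32 for i in range(624)]
    stage3_reg = None                       # pipeline register between stages
    for clock in range(6):
        out = temper(stage3_reg) if stage3_reg is not None else None
        i = clock % 624
        pool[i] = update(pool[i], pool[(i + 1) % 624], pool[(i + 397) % 624])
        stage3_reg = pool[i]                # latched for stage 3 next clock
        print(clock, out)                   # one number per clock after fill-up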

The single port version is similar to the software implementation of the MT19937 algorithm in that it does not provide any significant parallelization in the generation of the seeds.

Table 1: Diehard test results.

Test                     Single-port implementation   624 port implementation
Birthday                 0.348414                     0.467321
OPERM5                   0.634231                     0.892018
Binary Rank (31 × 31)    0.523537                     0.678026
Binary Rank (32 × 32)    0.521654                     0.578317
Binary Rank (6 × 8)      0.235435                     0.457192
Bitstream                0.235891                     0.280648
OPSO                     0.624894                     0.987569
OQSO                     0.235526                     0.678913
DNA                      0.623498                     0.446857
Stream Count-the-1       0.352235                     0.789671
Byte Count-the-1         0.652971                     0.865671
Parking Lot              0.662841                     0.567193
Minimum Distance         0.623121                     0.467819
3D Spheres               0.622152                     0.678991
Squeeze                  0.951623                     0.456719
Overlapping Sums         0.542882                     0.345671
Runs Up                  0.626844                     0.456191
Runs Down                0.954532                     0.898761
Craps                    0.347221                     0.689187

The only parallelism that is exploited is the concurrent execution of the seed value modulator (stage 2) and the output generator (stage 3). It was also found that it was not possible to pipeline the output generator into more than 3 stages, as it is tied to the seed value modulator. Significant improvements in throughput could be achieved by parallelizing stages 2 and 3 in addition to executing them concurrently as shown above.


[Figure 3: 624 port version. (a) Storage of the seed pools: a seed pool and a new seed pool, each holding seeds 0 through 624, with Seed[i] and Seed[i + 396] highlighted. (b) An example of one of the parallel instances of the 624 port design: two seed value modulator/output generator pairs read Seed[i], Seed[i + 1], Seed[i + 2], Seed[i + 396], and Seed[i + 397], produce New seed[i] and New seed[i + 1], and each emits a uniform random number.]

However, the problem with parallelizing stages 2 and 3 is that the seeds are all currently stored in a single dual port BRAM, and it is not possible to carry out multiple reads and multiple writes to a single BRAM in one clock cycle. Previously, in [7], parallelization of both these stages was achieved by dividing the seeds into multiple BRAMs. This, however, significantly increased the area requirements of the design. In the next section we study this problem in more detail and present our new design, which has a high throughput and is area efficient.

5. 624 Port Version

There has previously been an attempt to parallelize the MT19937 algorithm by dividing the seeds into various pools and then replicating stages 2 and 3 to generate multiple outputs; see [7]. However, this was found not to be area-time efficient. A close examination of the design reveals that, in order to parallelize the generation of random numbers, the authors divide the seeds into multiple BRAMs. Although this did increase the throughput, it greatly increased the area of the design as well, because the logic required to generate the necessary BRAM addresses grows in complexity as the seeds are divided across multiple BRAMs.

It is important to note here that the problem is not the parallelization of the generation of the uniform random numbers but the storing of seeds in multiple BRAMs. If the seeds were stored in registers rather than BRAMs, the logic used to generate the BRAM addresses could be saved. The excessive use of BRAMs to store seeds has always been considered problematic. For example, in [13] it was found that the TT800 algorithm suffered in a similar manner when its seeds were distributed across multiple BRAMs: the single port version used 81 Xilinx slices, while the 3 port one used 132 slices, of which 60 were devoted to the complex BRAM address generation logic. We believe that we can parallelize the MT19937 algorithm to the maximum possible 624 ports by storing the seeds in registers rather than BRAMs. In this section, we present a simplified design for a 624 port MT19937 random number generator that uses registers to store seeds.

A careful examination of the addressing scheme shows that the seeds can be divided into groups in such a way that there is no need for the logic in one group to access the seeds in another group. We call these groups seed pools, and they are shown in Figure 3.

We also present a generic model in which each of these seed pools is used to modify seed values and generate new random numbers every clock cycle. On each seed pool, the two stages of the MT19937 pipeline work together to modify each seed value and generate a new one, as illustrated in Figure 3(b). From the point of view of circuit speed and complexity, no register is shared by more than two reading channels and one writing channel. The consequence is that the register access logic is simpler, smaller, and faster.
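The behaviour of the 624 port design can be described functionally: one hardware clock consumes the whole seed pool and emits 624 tempered words at once. The Python model below captures these semantics (the in-place update order reproduces the chained dependency that the parallel hardware resolves within a single clock period); the constants are the standard MT19937 ones, the pool initialization is an arbitrary demonstration value, and the sketch models output semantics rather than the actual register wiring:

    N, M, MASK32 = 624, 397, 0xFFFFFFFF

    def clock(pool):
        """Advance the whole seed pool and return this clock's 624 outputs."""
        for i in range(N):
            y = (pool[i] & 0x80000000) | (pool[(i + 1) % N] & 0x7FFFFFFF)
            pool[i] = pool[(i + M) % N] ^ (y >> 1) ^ (0x9908B0DF if y & 1 else 0)
        out = []
        for y in pool:
            y ^= y >> 11
            y ^= (y << 7) & 0x9D2C5680
            y ^= (y << 15) & 0xEFC60000
            out.append((y ^ (y >> 18)) & MASK32)
        return out

    pool = [(1812433253 * i + 1) & MASK32 for i in range(N)]  # demo pool only
    print(len(clock(pool)))   # 624 numbers per clock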

6. Results

6.1. Test for Randomness. As a preliminary test, the output of the hardware implementations was successfully verified against the output of the software implementation. For a more complete test, the hardware implementations have also been run through the diehard test suite.


Table 2: Period, area, time, and throughput comparisons. Slices are on a Xilinx XC2VP70; "AT" is the area-time rating in slices × sec per 32 bit number (×10^−6); throughput is in 32 bit numbers/sec (×10^9).

Design                 Period     Slices    LUTs    Clock (MHz)   AT        Throughput
MT19937 [This work]
  Single port          2^19937    87        —       319           0.34      0.24
  624 port             2^19937    1253      —       190           0.009     119.6
Software∗ [12]         2^19937    —         —       2800          —         1.017
MT19937 [6]            2^19937    420       —       76            5.5       0.076
MT19937 [7]
  SMT‡                 2^19937    149       —       128.02        1.16      0.12
  PMT52‡               2^19937    2,850     —       71.63         0.76      3.7
  FMT52‡               2^19937    11,463    —       157.60        1.45      8.2
  PMT52in‡             2^19937    2,914     —       62.24         0.9       3.2
  FMT52in‡             2^19937    5,925     —       74.16         0.77      3.8
LUT [8]
  4-tap, k = 32        2^32       —         33      309           0.06†     0.3
  4-tap, k = 64        2^64       —         65      310           0.05†     0.6
  4-tap, k = 96        2^98       —         97      298           0.05†     1.1
  4-tap, k = 128       2^128      —         127     287           0.05†     1.8
  4-tap, k = 256       2^258      —         257     246           0.06†     1.8
  4-tap, k = 1248      2^1248     —         1249    168           0.09†     6.7
  3-tap, k = 32        2^32       —         33      302           0.06†     0.31
  3-tap, k = 64        2^64       —         65      319           0.05†     0.64
  3-tap, k = 96        2^98       —         97      308           0.06†     1.2
  3-tap, k = 128       2^128      —         127     287           0.06†     1.7
  3-tap, k = 256       2^258      —         257     243           0.07†     1.9
  3-tap, k = 1248      2^1248     —         1249    173           0.09†     6.7

∗ Software implementation was on a Pentium 4 (2.8 GHz) single core processor.
† Each slice consists of 2 LUTs; therefore the area-time rating of these designs equals LUTs/2 × time.
‡ This design was implemented on an Altera Stratix; each Xilinx slice is equivalent to two Altera logic elements.

In Table 1 the results of the diehard tests are presented. The diehard suite produces P-values in the range [0, 1); a P-value is to be above 0.025 and below 0.975 for a test to pass. Both implementations pass this test.
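The pass criterion quoted above is simple to apply programmatically; the small Python check below does so for a few of the Table 1 rows:

    def diehard_pass(p):
        """A diehard P-value passes when it lies inside (0.025, 0.975)."""
        return 0.025 < p < 0.975

    table1 = {"Birthday": (0.348414, 0.467321),
              "OPERM5": (0.634231, 0.892018),
              "Bitstream": (0.235891, 0.280648)}
    for test, (single_port, port_624) in table1.items():
        print(test, diehard_pass(single_port), diehard_pass(port_624))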

6.2. Comparison with Existing FPGA-Based Uniform Random Number Generators. In this section we compare our designs with those that are currently published, on the basis of area-time rating and throughput. In contrasting these solutions we take into account the total resources used, including slices, LUTs, and flip flops.

From Table 2 it should be noted that our 624 port hardware implementation of the MT19937 algorithm, when implemented on a Xilinx XC2VP70-6 FPGA chip, achieves more than 115x the throughput of the same algorithm's implementation in software on a Pentium 4 (2.8 GHz) single core processor. It can also be seen that no other published random number generator in the current literature achieves a throughput greater than 119 × 10^9 32 bit random numbers per second. The closest competitors are the FMT52, 4-tap k = 1248, and 3-tap k = 1248 random number generators, which are still significantly behind. The design presented herein has an AT rating of only 0.009 × 10^−6 slices × sec per 32 bit number for a throughput of 119 × 10^9 random numbers per second. A further criticism of [8] is that the specialized feedback register matrix used in the implementation was not completely published.

Our best implementation, the 624 port MT19937, uses only 1253 Xilinx slices. This is significantly less than all of the other multiport designs currently published, because we use registers to store the seeds and have arranged our seed value modulator and output generator pipelines in an area efficient manner. We therefore do not require any complex BRAM address generation logic or access to BRAMs. As a result we save on area, and since our design is 624 port, we generate 624 uniform random numbers per clock cycle. In a reconfigurable computing implementation where only the random number generation is accelerated in hardware, the 624 port implementation, like all of the other FPGA-based random number generators, is limited by the I/O bandwidth of the FPGA.


7. Conclusion

In this paper we have presented a unique 624 port MT19937 hardware implementation. While hardware implementations of uniform random number generators have been published, none seems to offer both high throughput and area-time efficiency. It was demonstrated that the 624 port design presented in this paper is a high throughput, area-time efficient, FPGA optimized pseudouniform random number generator with a large period and the ability to generate large quantities of uniform random numbers from a single seed, thus making it suitable for use in a reconfigurable computing implementation of real-time IR scene simulation.

Acknowledgment

Research undertaken for this report has been assisted with an international scholarship from the Maurice de Rohan fund. This support is acknowledged and greatly appreciated.

References

[1] P. L’Ecuyer, “Random number generation,” in Handbook of Simulation, J. Banks, Ed., chapter 4, pp. 93–137, John Wiley & Sons, New York, NY, USA, 1998.

[2] W. H. Press, B. P. Flannery, et al., Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, Cambridge, UK, 1986.

[3] A. Srinivasan, M. Mascagni, and D. Ceperley, “Testing parallel random number generators,” Parallel Computing, vol. 29, no. 1, pp. 69–94, 2003.

[4] P. L’Ecuyer and R. Panneton, “Fast random number generators based on linear recurrences modulo 2: overview and comparison,” in Proceedings of the Winter Simulation Conference, pp. 110–119, IEEE Press, 2005.

[5] V. Sriram and D. Kearney, “High speed high fidelity infrared scene simulation using reconfigurable computing,” in Proceedings of the IEEE International Conference on Field Programmable Logic and Applications, IEEE Press, Madrid, Spain, August 2006.

[6] V. Sriram and D. A. Kearney, “An area time efficient field programmable mersenne twister uniform random number generator,” in Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms, CSREA Press, June 2006.

[7] S. Konuma and S. Ichikawa, “Design and evaluation of hardware pseudo-random number generator MT19937,” IEICE Transactions on Information and Systems, vol. E88-D, no. 12, pp. 2876–2879, 2005.

[8] D. B. Thomas and W. Luk, “High quality uniform random number generation through LUT optimised linear recurrences,” in Proceedings of the IEEE International Conference on Field Programmable Technology (FPT ’05), pp. 61–68, Singapore, December 2005.

[9] R. Tausworthe, “Random numbers generated by linear recurrence modulo two,” Mathematics of Computation, vol. 19, pp. 201–209, 1965.

[10] T. Lewis and W. Payne, “Generalized feedback shift register pseudorandom number algorithm,” Journal of the ACM, vol. 20, no. 3, pp. 456–468, 1973.

[11] M. Matsumoto and Y. Kurita, “Twisted GFSR generators II,” ACM Transactions on Modeling and Computer Simulation, vol. 4, no. 3, pp. 254–266, 1994.

[12] M. Matsumoto and T. Nishimura, “Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator,” ACM Transactions on Modeling and Computer Simulation, vol. 8, no. 1, pp. 3–30, 1998.

[13] V. Sriram and D. Kearney, “Towards a multi-FPGA infrared simulator,” The Journal of Defense Modeling and Simulation: Applications, Methodology, Technology, vol. 4, no. 4, pp. 50–63, 2007.