An Independent Analysis of Floating-Point DSP Design Flow and Performance on Altera 28-Nm FPGAs

7/29/2019 An Independent Analysis of Floating-Point DSP Design Flow and Performance on Altera 28-Nm FPGAs

1/14

2012 Berkeley Design Technology, Inc. Page 1

An Independent Analysisof

Floating-point DSP Design Flow and Performance

on Altera 28-nm FPGAsBy the staff of

Berkeley Design Technology, Inc.

October 2012

OVERVIEW

FPGAs are increasingly used as parallel processing engines fordemanding digital signal processing applications. Benchmark results show that onhighly parallelizable workloads, FPGAs can achieve higher performance andsuperior cost/performance compared to digital signal processors (DSPs) andgeneral-purpose CPUs. However, to date, FPGAs have been used almostexclusively for fixed-point DSP designs. FPGAs have not been viewed as aneffective platform for applications requiring high-performance floating-pointcomputations. FPGA floating-point efficiency and performance has been limiteddue to long processing latencies and routing congestion. In addition, thetraditional FPGA design flow, based on writing register-transfer-level hardwaredescriptions in Verilog or VHDL, is not well suited to implementing complexfloating-point algorithms.

Altera has developed a new floating-point design flow intended tostreamline the process of implementing floating-point digital signal processingalgorithms on Altera FPGAs, and to enable those designs to achieve higherperformance and efficiency than previously possible. Rather than building adatapath consisting of elementary floating-point operators (for example,multiplication followed by addition followed by squaring), the floating-pointcompiler generates a fused datapath that combines elementary operators into asingle function or datapath. In doing so, it eliminates the redundancies present intraditional floating-point FPGA designs. In addition, the Altera design flow is ahigh-level, model-based design flow using Alteras DSP Builder AdvancedBlockset with MATLAB and Simulink from MathWorks. Altera expects that byworking at a high level, FPGA designers will be able to implement and verifycomplex floating-point algorithms more quickly than would be possible withtraditional HDL-based design.

BDTI performed an independent analysis of Alteras floating-point DSPdesign flow. BDTIs objective was to assess the performance that can be obtainedon Altera FPGAs for demanding floating-point DSP applications, and to evaluatethe ease-of-use of Alteras floating-point DSP design flow. This paper presentsBDTIs findings, along with background and methodology details.

WP-01187-1.0


2/14


Contents

1. Introduction ....................................................... 22. Implementation ................................................. 43. Design Flow and Tool Chain .......................... 84. Performance Results ....................................... 115. Conclusions...................................................... 136. References ........................................................ 141. IntroductionTwo Floating-point Design Examples

Advances in digital chips are enabling complexalgorithms that were previously limited to researchenvironments to move into the realm of everydayembedded computing applications. For example,for many years, linear algebra (and specificallysolving for systems involving large sets ofsimultaneous linear equations) has been usedmainly in research environments, where large-scalecompute resources are available and real-timecomputation is usually not required. Solving forlarge systems involves either matrix inversion orsome kind of matrix decomposition. In additionto being computationally demanding, thesetechniques can suffer from numeric instability ifsufficient dynamic range is not used. Therefore,efficient and accurate implementation of suchalgorithms is only practical in floating-pointdevices.

Altera recently introduced floating-pointcapability in the DSP Builder Advanced Blockset

tool chain to simplify implementation of floating-point DSP algorithms on Altera FPGAs, whileimproving performance and efficiency of floating-point designs compared to traditional FPGAdesign techniques. In a previous white paper [1],BDTI analyzed and evaluated the performanceand efficiency of the Altera Quartus II software

v11.0 tool chain on a single-channel floating-point Cholesky solver implementation example,synthesized for the 40-nm Stratix IV and Arria IVFPGAs.

In this paper, we evaluate the effectiveness of

Alteras approach using the newer Quartus IIsoftware v12.0 tool chain and assess theperformance of Alteras 28-nm Stratix V and Arria

V FPGAs. For this evaluation, BDTI focused onsolving a large set of simultaneous linear equationsusing two types of decompositions: a multi-channel Cholesky matrix decomposition and theQR decomposition using the Gram-Schmidtprocess. These decompositions combined withforward and backward substitutions constitute a

solution for the vector x in a simultaneous set oflinear equations of the formAx = B.

Matrix decomposition is used in manyadvanced military radar applications such asSpace-Time Adaptive Processing (or STAP) and

various estimation problems in digitalcommunications. The QR decomposition iscommonly used for any general m-by-n matrix A,

while the Cholesky decomposition is the preferredalgorithm for a square, symmetric, and positivedefinite matrix for its high computationalefficiency. Both decompositions use verycomputationally demanding algorithms andrequire high data precision, making floating-pointmath a necessity. In addition, the two examplesstudied in this paper use vector dot products andnested loops at the core of their algorithm, and

these operations are found in a wide range ofdigital signal processing applications involvinglinear algebra and finite impulse response (FIR)filters.

In the examples described in this paper,simultaneous sets of complex-data linearequations are solved in both methods and theresults are reported. Using the QR solver asshown in Section 4, an Altera Stratix V FPGA iscapable of performing 315 matrix decompositions

NOTATION AND DEFINITIONS

MBold capital letter denotes a matrix.zBold small letter denotes a vector.

L*The conjugate transpose of matrix L.

l*The conjugate transpose of element l.

Hermitian Matrix A square matrix with complexentries that is equal to its own conjugate transpose.This is the complex extension to a real symmetricmatrix.

Positive Definite Matrix A Hermitian matrix M is

positive definite ifz*Mz > 0 for all non-zero complexvectors z. The quantity z*Mz is always real because Mis a Hermitian matrix for the purposes of this paper.

Orthonormal Matrix A matrix Q is orthonormal ifQTQ = I where I is the identity matrix.

Cholesky Decomposition A factorization of aHermitian positive definite matrix M into a lower

triangular matrix L and its conjugate transpose L* such

that M = LL*.

QR Decomposition A factorization of a matrix M ofsize m-by-n into an orthonormal matrix Q of size m-by-

n and an upper triangular matrix R of size n-by-n suchthat M = QR.

Fmax The maximum frequency of an FPGA design.


3/14


per second of size 400400, running at 203 MHzand achieving 162109 floating-point operationsper second (GFLOPS).

Both design examples evaluated in this paperare provided as part of the DSP Builder software

v12.0 tool chain, or they may be requested [email protected].

Floating-point Design FlowTraditionally, FPGAs have not been the

platform of choice for demanding floating-pointapplications. Although FPGA vendors haveoffered floating-point primitive libraries, theperformance of FPGAs in floating-pointapplications has been very limited. Theinefficiency of traditional floating-point FPGAdesigns is partially due to the deeply pipelinednature and wide arithmetic structures of thefloating-point operators, which create largedatapath latencies and routing congestion. In turn,the latencies create hard-to-manage problems indesigns with high data dependencies. The finalresult is often a design with a low operatingfrequency.

The Altera DSP Builder Advanced Blocksettool flow attacks these issues at both thearchitectural level and the system design level. The

Altera floating-point compiler fuses large portionsof the datapath into a single floating-pointfunction instead of building them up bycomposition of elemental floating-point operators.It does this by analyzing the bit growth in thedatapath and choosing the optimum inputnormalization to allocate enough precisionthrough the datapath in order to eliminate asmany normalization and denormalization steps aspossible, as shown in Figure 1. The IEEE 754format is only used at datapath boundaries; withindatapaths, larger mantissa widths are used toincrease the dynamic range and reduce the needfor denormalization and normalization stepsbetween successive operators. Normalization anddenormalization functions use barrel shifters of upto 48 bits wide for a single precision floating-pointnumber. This consumes a significant amount oflogic and routing resources and is the main reason

why floating-point implementations on FPGAshave not been efficient. The fused datapathmethodology eliminates a significant number ofthese barrel shifters. Multiplications involving thelarger precision mantissas use Alteras 27-bit27-bit multiplier mode in Stratix V and Arria Vdevices. Figure 1(b) shows the fused datapathmethodology for the simple case of a two-adderchain as compared to the traditionalimplementation shown in Figure 1(a). The fuseddatapath in Figure 1(b) eliminates the inter-operator redundancy by removing the

denormalization of the output of the first adderand the normalization of the input of the secondadder. The elimination of the extra logic androuting, and the use of hard multipliers maketiming and latency across complex datapathspredictable. Both single- and double- precisionIEEE 754 floating-point algorithms areimplemented with reduced logic and higherperformance. Altera claims that a fused datapathcontains 50% less logic and 50% less latency thanthe equivalent datapath constructed out ofelementary operators [2]. And, as a result of the

wider internal data representation, typically theoverall data accuracy is higher than that achievedusing a library with elementary IEEE 754 floating-point operators.

The Altera floating-point DSP design flowincorporates the Altera DSP Builder AdvancedBlockset, Alteras Quartus II software tool chain,the ModelSim simulator, as well as MATLAB andSimulink from MathWorks. The Simulinkenvironment allows the designer to operate at the

Figure 1 (a) Traditional floating-point

implementation (b) Fused datapath

Normalize

Normalize

Denormalize

Denormalize

Adder1

Adder2

In 1 In 2

In 3

Out

(a)

IEEE 754

IEEE 754

Normalize

Denormalize

Adder1

Adder2

In 1 In 2 In 3

Out

(b)

Modifiedformat

IEEE 754


4/14


algorithmic behavioral level to describe, debug,and verify complex systems. Simulink featuressuch as data type propagation and vector dataprocessing are incorporated into the DSP Builder

Advanced Blockset, enabling a designer toperform quick algorithmic design spaceexploration.

In order to evaluate the efficiency andperformance of Alteras floating-point designflow, BDTI used the DSP Builder AdvancedBlockset tool flow to validate two examples, bothsolving a set of simultaneous linear equationsusing complex data-type in single-precisionfloating-point representation. One solution isachieved using the Cholesky decomposition, whilethe other uses QR decomposition via the Gram-Schmidt process. Section 2 of the paper describesthe implementation of the two floating-pointexamples. Section 3 presents BDTIs experience

with the design flow and tool chain. Section 4presents the performance of the implementationon two different Altera FPGAs: the high-endmedium-size Stratix V 5SGSMD5K2F40C2Ndevice and the mid-range Arria V5AGTFD7K3F40I3N device. Finally, Section 5presents BDTIs conclusions.

2. ImplementationBackground

Sets of linear equations of the form Ax = barise in many applications. Whether it is anoptimization problem involving linear leastsquares, a Kalman filter for a prediction problem,or MIMO communications channel estimation,the problem remains one of finding a numericalsolution for a set of linear equations of the formAx = b. For a general matrix of size m-by-n, wherem is the height of the matrix and n its width, QRdecomposition may be used to solve for vector x.

The algorithm decomposesAinto an orthonormalmatrix Q of size m-by-n and an upper triangularmatrix Rof size n-by-n. Since Q is orthonormal,QTQ = I and Rx = QTb. Given that Ris an upper

triangular matrix, x can easily be solved bybackward substitution without even inverting theoriginal matrixA. In the QR example in this paper

we work with over-determined matrices with mn.When matrix A is symmetric and positive

definite, such as covariance matrices that arise inmany problems, the Cholesky decomposition andsolver are commonly used. The algorithm findsthe inverse of matrix Athus solving for vector x

in x= b. The Cholesky decomposition is atleast twice as efficient as QR decompositiondepending on the algorithms used for thedecomposition. Since these decompositionalgorithms are recursive in nature and involvedivision, a large numeric dynamic range becomes anecessity as the matrix size increases. Mostimplementations, even for matrix sizes as small as44 in MIMO channel estimation, for example,are performed using floating-point operations. Forlarger systems requiring high throughput, such asthe ones found in military applications, therequired floating-point operation rate has typicallybeen prohibitive for embedded systems.Frequently, designers either abandon the wholealgorithm for a sub-optimum solution or resort tousing multiple high-performance floating-pointprocessors, raising cost and design effort.

Architectural OverviewCholesky Solver

In our design example, the Cholesky solver isimplemented in the FPGA as two subsystemsoperating in parallel in a pipelined fashion. Thefirst subsystem executes the Choleskydecomposition and forward substitutionsteps 1and 2 in the sidebar titled The Cholesky Solver. Thesecond subsystem executes the backwardsubstitutionstep 3 in the same sidebar. Since theinput matrix is Hermitian and the decompositiongenerates complex conjugate transposed triangular

matrices, memory utilization is optimized byloading only the lower triangular half of the inputmatrixAand overwriting it as the lower triangularmatrix L is generated. Both subsystems arepipelined, utilizing an input stage and a processingstage to allow processing to occur in one area of amemory while the other half is used for loadingnew data. The output of the decomposition andforward substitution stage goes into the inputstage of the backward substitution, as shown inFigure 2.

The core of the decomposition is the complexvector dot product engine (also referred to as the

vector multiplier) which computes equations (3)and (4). For the Stratix V device, a vector size(VS) of up to 90 complex-data elements is used,

whereas for the Arria V device, a vector size of upto 45 complex-data elements is implemented. The

vector size also corresponds to the number ofparallel memory reads needed to supply the dotproduct engine with a new set of data every clockcycle and thus determines the width and


5/14


partitioning of the on-chip dual-port memory. Forimplementation reasons, the storage of a matrix of

a given size is partitioned into ceil (N/VS) memorybanks, where ceil() is the ceiling function andNisthe size of the matrix.

The Cholesky solver design is a multi-channelimplementation. The maximum number ofchannels is a compile-time parameter and islimited only by the available memory on thedevice. The memory partitioning is identical to asingle-channel implementation except thatmultiple copies of the same structure are used.

The decomposition is performed one elementat a time, column-wise, starting from the top leftcorner, proceeding in a vertical zigzag fashion, andending at the bottom right corner of the matrix asshown in Figure 3(a). The diagonal element ofeach column is calculated first, followed by all thenon-diagonal elements below it in the samecolumn before moving to the diagonal element atthe top of the next column to the right. The

schedule of events and iterations is controlled witha four-level nested for loop. The outermost loopimplements column-wise processing; the secondloop implements bank-wise processing; the thirdinner loop processes the rows; and the innermostloop processes the multiple channels. Placing thechannel processing as the innermost loopeffectively turns the floating-point accumulatorinto a time-shared accumulator and hides itslatency better. The NestedLoopsblock in the DSPBuilder Advanced Blockset integrates up to threenested loops in a single processing block, making

the implementation faster and more resourceefficient when compared to a similar function thatis implemented as three separate for-loop blocks.

The block abstracts away the intricate loop controlsignals, reduces design and debug times, andmakes the overall loop structure more readable.

The dot product engine operates on the rowsof the matrix and calculates up to vector sizemultiplications in the summation term ofequations (3), (4), and (6) simultaneously in onecycle. A circular memory structure is used at theinput of the dot product engine to cycle over therows of the multiple input matrices. For vectordot products shorter than vector size, unusedterms are masked out and are not included in thesummation. For dot products longer than vectorsize, partial sums of products are calculated andsaved at bank boundaries until the output of allbanks for a given element in that row are availablefor a final summation, as shown in Figure 3(b).

The summation of the bank outputs is performedin a single accumulator loop using the floating-

THE CHOLESKY SOLVER

The recursive Cholesky algorithm used to solve for vector

x in Ax = b has three steps:

Step 1. Decomposition, i.e. finding the lower triangularmatrix L, where A = LL*

(1)for i = 2 to n,

(2)for j = 2 to (i-1),

(3)end

(4)

end

Note the dependencies in the equations above. Thediagonal elements in eq. (4) depend only on elements totheir left in the same row. Non-diagonal elements dependon elements to their left in the same row, and on the

elements to the left of the corresponding diagonal elementabove them.

Step 2. Forward substitution, i.e. solving for y in theequation Ly = b,

(5)for i = 2 to n,

(6)end

Step 3. Backward substitution, i.e. solving for x in the

equation L* x = y,

(7)for i = n-1 to 1,

/ (8)end

where,

n= the dimension of matrix

A

lij= element at row i columnj of matrix L

aij= element at row i columnj of matrix A

yi = element at row i of vector y

bi = element at row i of vector b

xi = element at row i of vector x

The output of step 1 is the Cholesky decomposition, andthe output of step 3 is the solution x of the linear equation

Ax = b. Note that the algorithm indirectly finds the inverseof matrix A to solve for x = A-1 b.


6/14


point adder block from the DSP BuilderAdvanced Blockset. This feedback loop has a

latency of 13 cycles. By swapping the for Banksloop and the for Rows loop relative to the orderthat one would traditionally have in a softwareimplementation, and adding multi-channelprocessing, the floating-point accumulator latencyis hidden and hardware utilization is improved.

The DSP Builder Advanced Blocksetautomatically calculates loop delays to address thistype of latency. The tool can be instructed tocalculate and fill in the minimum delay bychecking the Minimum delaybox in the Loop Delayblock. Paths that incur identical delays may beassigned an equilvalence group number and thetool will then assign the same delay value to allmembers in the group. Although not used in theexamples evaluated in this paper, the DSP Builder

Advanced Blockset provides an ApplicationSpecific Floating-Point Accumulator where theuser can customize the accumulator to optimize

its speed and resource requirements byconfiguring parameters such as maximum inputsize and required accumulation accuracy. Thisblock allows for single-cycle-per-sampleaccumulation of a single stream of floating-pointnumbers at a high clock rate.

The second subsystem performs backwardsubstitution. This subsystem has its own input andoutput memory blocks. Like the Cholesky forwardsubstitution subsystem, it is pipelined into aninput stage and a processing stage. Since thecomplexity of the backward substitution is on the

order of N2

compared to N3

for thedecomposition, vector processing is not employedfor the dot product. Instead, a single complexmultiplier is used for the dot product which isenough to keep pace with the Choleskydecomposition and the forward substitutionsubsystem.

QR SolverThe QR decomposition and solver is

implemented as two subsystems operating inparallel in a pipelined fashion as shown in Figure4. The first subsystem executes steps 1 and 2 in

the sidebar titled The QR Solver, whereas thesecond subsystem executes the backwardsubstitution in step 3. The backward substitutionsubsystem is identical to the one used in theCholesky solver. In contrast to Cholesky whereequations (1) to (6) are implemented as a singlefour-deep nested loop, the QR solver employs asimple finite state machine (FSM) to cycle throughfour main operations; finding the magnitudesquare of a vector in equation (2), the dot product

THE QRSOLVER

Solving for x in Ax = b via QR decomposition using the

Gram-Schmidt process has three steps:

Step 1. Decomposition of matrix A of size m-by-n, i.e.finding the matrices Q of size m-by-n, and R of size n-by-

n, where A = QR.

(1)for k = 1 to n,

_ (2) _ (3)

for i = (k+1) to n, and k n, (4)

end

t = k + 1, and k n, (5)for i = 1 to m, and k < n,

_ (6)end

end

Note that the orthonormal matrix Q = [q1 q2 qn], is not

explicitly calculated but rather its orthogonal form, U = [u1u2 un], is found and used in this restructured set of

equations, where qi = .

Step 2. Calculating d, where d = QTb.

for k = 1 to n,

(7)end

Step 3. Backward substitution, i.e. solving for x in the

equation Rx = d.

for i = n-1 to 1,

/ (8)end

where,

m = the row dimension of matrix A

n = the column dimension of matrix Aui=i

th column of matrix U

uij= element at row i columnj of matrix U

rij= element at row i columnj of matrix Rai = i

th column of matrix Aaij= element at row i columnj of matrix A

di = element at row i of vector d

bi = element at row i of vector b

xi = element at row i of vector x

The QR solver finds the solution x of the linear equationsAx = b without finding the inverse of matrix A which may

be undefined.


7/14


of two vectors in equation (4), subtracting a dotproduct from a vector in equation (6), and the dotproduct in equation (7) to find vector d. The

NestedLoop block in the DSP Builder AdvancedBlockset is used in all phases of the FSM togenerate all control and event signals for thedatapath.

The common processing block in the fouroperations listed above is the dot product engine.In order to increase performance, vectorprocessing is used to calculate the dot product.Similar to the Cholesky design, the vector size is acompile-time parameter. To reuse this engine in allfour states of the FSM, a data multiplexer is usedat its input and controlled by the FSM eventcontroller. The multiplexer selects the correctinputs for the dot product engine for each state ofthe FSM. The details of the dot product engineand the floating-point accumulator are similar tothose in the Cholesky design and hence are notpresented here.

The memory needed for the QRdecomposition is optimized by reusing the maincore memory block that initially holds matrix A

and input vector b. Processing happens column-wise, left to right, in the core memory. Once acolumn is consumed and no longer needed, it isoverwritten by the corresponding column of anew matrix. By the time the old matrix A isdecomposed, the original contents of this memoryblock are replaced by the new matrix, thusmaintaining the capability of back-to-back matrixprocessing without any stalls.

In the first two states of the FSM, elements ofthe R matrix are generated element-by-elementrow-wise starting with the diagonal element foreach row. In the subtraction state of the FSM, thecolumns of matrix A in the core memory arerecursively updated and replaced by the partially

calculated vectors. For example, at column k,all columns k+1 to nare updated in memory bysubtracting the scaled version of from each of

VS...

VS

Bank boundaries at multiples ofvector size (VS). Partial sum of

products are generated at these

points and saved until all rowsbelow are processed (for bank

loop).

eij

j

.

.

.

i

Last

First

(a) (b

Figure 3 (a) Processing sequence (b) Computation of diagonal element eij includes two partial

sum-of-products at j=VS and j=2*VS, plus the last remaining partial dot product section

Figure 2 Cholesky process pipelining and memory reuse

Input

Input

Decomposition/Forward

Decomposition/Forward

Input

Processing in the upper triangular memory

space

Processing in thelower triangular

memory space

Decomposition / Forward Substitution Subsystem

Backward

Backward

Backward Substitution Subsystem

Processing in the lower triangular

memory space

Processing in the upper triangular

memory space

Time

Input

Input

Input


8/14


the columns k+1 to n. This process recursivelycalculates the vectors to and is a moreefficient version of equation (6) for hardwareimplementation. Processing column-wise, in thefourth state of the FSM, each newly calculated

is multiplied with input vector b to generate asingle element of column vector d. The Qmatrix is not explicitly generated, but itsorthogonal columns, , are generated, used insubsequent phases of the FSM, and overwritten bythe corresponding column, , of the new matrixin the next cycle of the FSM. Figure 5 shows thememory organization and processing sequence ofthe QR decomposition subsystem.

The outputs of the first subsystem are the Rmatrix and the column vector d. The Rmatrix isan upper triangular matrix and is generated row-

wise left to right as shown in equations (3) and (4).A rectangular memory structure is used in a ping-pong manner. While the upper triangular sectionis being processed by the backward substitutionsubsystem, the lower triangular section is gettingfilled by the decomposition subsystem, and vice

versa. The d column vector is appended to the Rmatrix and is generated row-wise top to bottom.

The output of the backward substitution, is thesolution vector x of the linear equationsAx = b.

The QR solver may be implemented in a multi-

channel format similar to the Cholesky solver toimprove utilization, reduce latency, and increasethe throughput of the design. The throughputimprovement would mainly come from theeffective reduction in the latency of the floating-point accumulator.3. Design Flow and Tool ChainEvaluation Methodology

For this evaluation, Altera provided BDTIwith implementations of the Cholesky and the QR

solvers created using the DSP Builder AdvancedBlockset. Altera provided BDTI engineers a PCwith the necessary Altera and MathWorks toolsinstalled. BDTI engineers then examined the

Altera designs, simulated and synthesized them, allunder the Simulink environment. Additionally, the

Figure 4 QR process pipelining and memory reuse

Time

Input 1

Input 1

Main core input

memory space Decomposition and Vector dCalculation Subsystem

Back. 2

Backward SubstitutionSubsystem

Decomposition sequenceof the input matrices

Dec. 1

Back. 1

Input 2 Input 3 Input 4

Dec. 2 Dec. 3 Dec. 4

Input 3

Input 2

Back. 3

Input 4 Back. 4Processing in the lowertriangular memory space

Processing in the upper

triangular memory space

New Aa

1 a

2 a

3

Current

vector uk

Modifiedu

k+1to u

n

Main core memory

Column-wise processing

Row-wisep

rocessing

R matrix d vector

d1

d2

d3

dk

Toptobottomp

rocessing

r1

r2r

3

rk

Decomposition

subsystem

Figure 5 Memory organization of the QR decomposition subsystem


9/14


synthesized designs were run on two separatedevices: the Stratix V and Arria V FPGAs. In theprocess, BDTI evaluated the Altera design flowand the performance of the two design examples.

Since the boards used for this evaluationlacked the ability to generate the stimuli for theexample designs, each of the two designs containsa stimulus generation block that is implementedusing DSP Builder Advanced Blockset blocks andcompiled with the application design under test(DUT). In order to minimize the impact of thestimulus block on DUT performance and FPGAresource usage, matrixAis generated on-the-fly bythe stimulus block from a much smaller set ofrandom data. This small set of data is generated byMATLAB m-file scripts, loaded into the stimulusblock memory and compiled with the design.

These MATLAB scripts use the same algorithm asthe stimulus block in the DUT to generate the

double-precision floating-point format referencedata for the solution vector x to measure the errorperformance of the Simulink models and thedesigns running on the FPGAs. In the Choleskydesign this algorithm guarantees Hermitianpositive definiteness for matrix A, whereas in theQR solver design it guarantees the creation of a

well-conditionedAmatrix.Four configurations of vector, matrix, and

channel sizes were implemented for the Choleskysolver design, three configurations on the Stratix

V FPGA and two configurations on the Arria V

device, with one common configuration betweenthe two. For the QR solver design, fourconfigurations were implemented for the Stratix VFPGA, and two of these were also implementedon the Arria V device. All configurations wereevaluated for FPGA resource utilization,achievable clock rate, throughput, and functionalcorrectness. FPGA design constraints, such asclock rate, device selection, and speed grade arespecified in the Simulink environment.

Throughput and performance were evaluatedat the Simulink model and the hardware platformlevels. The Simulink model of both designs isinstrumented to display both the forwardprocessing and the total of forward and backwardprocessing cycles, and the Quartus II softwareruns give the Fmax achieved for each configuration.Each configuration is then downloaded to thehardware platform, its operating frequency set toFmax, and processing is started. The solution

vectors x of each solver are captured for eachconfiguration and compared against the

corresponding MATLAB generated double-precision floating-point references.

A post-simulation script calculates thedifference between the Simulink IEEE 754 single-precision floating-point output and the MATLABgenerated double-precision floating-pointreference vectors. Similarly, a script calculates thedifference between the captured output of thehardware simulations and that of the MATLABgenerated double-precision floating-pointreferences.

Evaluation of the Tool ChainSimulink is built upon and requires the

MATLAB framework. In the Simulinkenvironment, the evaluated designs use blocksfrom the Altera DSP Builder Advanced Blockset,

which is a separate blockset from the standardDSP Builder library. The DSP Builder Advanced

Blockset is geared towards block-basedimplementation of DSP algorithms and datapathsand uses a higher level of abstraction than thestandard DSP Builder library, which comprisesmore general and elementary blocks. The DSPBuilder Advanced Blockset library contains over50 common trigonometric, arithmetic, andBoolean functions in addition to the morecomplex fast Fourier transform (FFT) and FIRfilter building blocks. Notable additions inQuartus II software v12.0 are the low-latencysquare root, the nested for loop, and a

customizable floating-point accumulator blocks.Elements from the standard Blockset and the DSPBuilder Advanced Blockset cannot be mixed in adatapath structure at the same hierarchy level; onlyblocks from the DSP Builder Advanced Blocksetsupport the floating-point compiler. Blocks fromthe standard Blockset are not optimized forfloating-point processing. In addition, althoughimporting of hand-coded HDL is available for thestandard Blockset, it is not available in the DSPBuilder Advanced Blockset since the tool cannotperform optimizations at the HDL level. Ingeneral, the block-based design-entry approach is

well suited for DSP algorithms; however, due tothe lack of constructs such as caseor switch in theblock library, a text-based approach is moreintuitive for designs that are control-intensive andinvolve state machines.

Starting a simulation with DSP BuilderAdvanced Blockset compiles the Simulink model,generates HDL code and constraints for the

Altera Quartus II software environment, builds a


10/14


test bench and script files for the ModelSimenvironment, and simulates the Simulink model.

The time required to run the simulations forvarious configurations ranged from 3 minutes to28 minutes depending on matrix size on a 3 GHzIntel Xeon W3550 PC. The Simulink simulationgenerates detailed resource utilization estimates

without the need to run a Quartus II softwarecompilation, thus helping the designer to quicklydetermine the device size needed. When using ahardware development kit, the user must supplypin-out assignments based on the board layout;the user can re-use the pin-out constraints in theprovided .qsf file in the Quartus II softwareproject folder for the design, or use the pinplanner in Quartus II software to assign andmanage pin-outs from scratch.

Experiments were performed on the model toevaluate the ease of algorithm exploration and the

corresponding HDL generation. Input parameters,such as vector dot product size, matrix size, anddata type were changed in the stimulus block andsimulations run. In all cases, the correct RTL code

was generated within minutes and simulationoutputs matched the MATLAB reference.

All configurations were synthesized using theQuartus II design software, which may belaunched directly from the Simulink environment.

A designer may use the Quartus II software inpush-button mode with either default or user-selected optimization parameters, or use the

Design Space Explorer (DSE) tool. Available aspart of the Quartus II software, DSEautomatically runs multiple router passes using adifferent seed for each pass. The route with thebest clock rate is saved. This is an automaticprocess requiring no user intervention but takesmuch longer than the push-button method. TheQuartus II software push-button runs rangedfrom 1 hour to 6.5 hours depending on the designsize.

The higher abstraction level employed by theDSP Builder Advanced Blockset design flowallows for a faster algorithmic space explorationand simulation cycles, thus reducing the overalltime to reach a final optimized design. However,this advantage is not inherent to Simulink in thesame way in which, for example, data-typepropagation is inherent to Simulink. In order toexploit the design space exploration advantagesoffered by the block-based design approach overhand-written RTL, additional steps are required bythe designer when creating the Simulink model. In

particular, the model must be structured to enablea parameter-driven algorithmic space exploration.For the examples analyzed in this paper, themodels are implemented to enableexperimentation with different matrix sizes, vectorsizes, and (in the case of the Cholesky solver)number of parallel channels. Once a model hasbeen created incorporating this type of flexibility,performance and the resource usage estimates for

various design configurations can be explored byvarying these parameters. An understanding ofhardware design is also required to achieve goodthroughput rates and resource utilization, asexemplified in the floating-point accumulatorblock discussed in Section 2 of this paper.

Training for the DSP Builder AdvancedBlockset design flow entails a 4-hour class by

Altera and approximately 10 hours of on-linetutorials and demos. In addition, BDTI spent

approximately 90 hours exploring the tool andboth models for hands-on experience. The timeand effort required to get up to speed with thetool chain will depend on the skills andbackground of the designer. A seasoned engineer

with both Simulink block-based design and FPGAhardware design experience will likely find theDSP Builder Advanced Blockset approachefficient and easy to use. For an FPGA designer

with little or no knowledge of MATLAB andSimulink, designing at a higher level of abstractionmay represent a new way of thinking and thus an

initial challenge, entailing a significant learningcurve. Once the methodology is mastered, thedesigner can achieve significantly faster designcycles than a HDL approach. One can focus onimplementing the algorithm and not worry abouthardware design details such as pipelining. Designand verification time is significantly reduced sincethe majority of functional simulation and

verification is done in the Simulink environment.The RTL output from the Simulink compilationmay be run in the ModelSim software for a fullfunctional simulation.

The learning curve may be less steep for anengineer with system-level design background

who has little or no skills in hardware design.Although the tool chain integrates hardwarecompilation, synthesis, routing, and automaticscript generation within the Simulink environmentand abstracts away many complex design conceptssuch as data pipelining and signal vectorizing,some knowledge of hardware design is still neededto complete an implementation.


11/14


4. Performance ResultsThis section presents the results of BDTIs

independent evaluation of the Altera Choleskyand QR solvers floating-point implementationexamples.

All designs used Alteras DSP Builder

Advanced Blockset v12.0, with MathWorksrelease R2011b, and built with Quartus II designsoftware v12.0 SP1. RTL simulations were doneusing ModelSim 10.1. The designs were built fortwo Altera 28-nm FPGAs: the high-end medium-size Stratix V 5SGSMD5K2F40C2 device, and themid-range Arria V 5AGTFD7K3F40I3N device.

The Stratix V FPGA used in this analysis, features345.2K ALUTs, 1,590 2727-bit variable-precision multipliers, and 2,014 M20K memoryblocks. The Arria V FPGA features 380.4K

ALUTs, 1,156 2727-bit variable-precisionmultipliers, and 2,414 M10K memory blocks. Thehardware platforms used for RTL evaluation werethe DSP Development Kit, Stratix V Edition, andthe Arria V FPGA Development Kit. TheModelSim software was used for oneconfiguration to assess ease of use of the toolfrom the Simulink environment.

Combined, a total of eleven cases weresimulated and built for both designs on the twodevices. Resource utilization, performance, andaccuracy results were recorded for each case.

Table 1 lists the resource utilization and clock

speed achieved for the Cholesky and QR solversfor each configuration. The Cholesky solverdesign provides a maximum matrix size parameter.

At runtime, matrix sizes smaller than themaximum design size may be used. For theresource utilization results presented in Table 1,each configuration was synthesized with themaximum matrix size parameter equal to thematrix size under evaluation in order to obtain theactual resources consumed by the reported matrixsize. The resources used by the stimuli blocks

were not included in the totals. It is worthwhile tonote that none of the configurations evaluated inthis paper used the FPGAs to capacity. To achievethe best Fmax in a reasonable amount of synthesisand place-and-route time in the Quartus IIsoftware, identical preset optimization parameters

were used for each design to improve speed. Wechose the 6/9090/45 configuration for the

Cholesky design example to run the Quartus IIsoftwares Design Space Explorer (DSE) to assessthe speed improvement and time taken forsynthesis as compared to the push-button mode.In this case, a speed improvement of 12.5% wasachieved, however the Quartus II software runtime increased from 2 hours to 7.5 hours tosynthesize the design.

The FPGA resource utilization is consistentwith expectations for the designs underevaluation. Memory use is dominated by matrix

Example

Device Configuration

(Channel Size/Matrix Size/Vector Size)

ALUT(K)(Used /

% ofTotal)

Registers(K)

(Used /% of Total)

DSP Blocks(Variable-Precision 2727

Multipliers used/% of Total)

M20K (Stratix)/M10K (Arria)

(Blocks used /% of Total)

Fmax,(MHz)P: Push-

button usedD: DSE used

Cholesky

StratixV 1 / 360360 / 90 198 / 57% 339 / 49% 391 / 25% 1411 / 70% 189 (P)

20 / 6060 / 60 135 / 39% 235 / 34% 268 / 17% 955 / 48% 234 (P)

64 / 3030 / 30 74 / 22% 124 / 18% 146 / 9% 793 / 39% 288 (P)

ArriaV 6 / 9090 / 45 104 / 27% 179 / 24% 214 / 19% 1094 / 45%

176 (P)198 (D)

64 / 3030 / 30 73 / 19% 121 / 16% 154 / 13% 1694 / 70% 185 (P)

QR

StratixV 1 / 400400 / 100 184 / 53% 377 / 55% 428 / 27% 1566 / 78% 203 (P)1 / 200100 / 100 180 / 52% 375 / 54% 428 / 27% 504 / 25% 207 (P)

1 / 200100 / 50 96 / 28% 201 / 29% 228 / 14% 281 / 14% 260 (P)

1 / 10050 / 50 95 / 28% 198 / 29% 227 / 14% 230 / 12% 259 (P)

ArriaV 1 / 200100 / 50 97 / 25% 202 / 27% 238 / 21% 372 / 15% 171 (P)

1 / 10050 / 50 95 / 25% 200 / 26% 237 / 21% 245 / 10% 170 (P)

Table 1 Resource utilization and clock speed


12/14


storage and is proportional to the matrix size andthe number of channels for a multi-channeldesign. The DSP blocks usage increases linearly

with the vector size. The vector multiplier requires4 variable-precision DSP blocks per 27-bit 27-bit complex valued floating-point multiplication.Given a vector size of 60 complex floating-point

values, 240 DSP blocks are required for the vectordot product engine.

Table 2 shows the performance of theCholesky and the QR solvers for all

configurations. The performance for each case isgiven at the Fmax reported in Table 1. Thethroughput is calculated by dividing Fmax by thecycles consumed by the solver forward subsystemexecution. Since the backward substitutionsubsystem executes in parallel and with lowerlatency than the forward subsystem, the overallthroughput is not affected by the former. For themulti-channel Cholesky solver throughput, thisresult is multiplied by the number of channels thatare processed in parallel (Channel Size parameterin Table 2). The overall latency for each case is

calculated by dividing the total cycles taken by theexecution of the forward and backwardsubsystems by Fmax.The choice of vector sizerelative to matrix size is a compromise and isapplication dependent. If the vector size is muchsmaller than the matrix size, the design will beresource efficient at the expense of latency, asshown for the QR solver 200100 matrix sizeconfigurations with different vector sizes.

The multi-channel Cholesky design improvesupon the single-channel design that was analyzedin the previous paper by BDTI. In the single-channel implementation, latencies, such as thosefound in the floating-point accumulator werepartially hidden by rearranging the processingorder in the algorithm. As reported in reference[1], the efficiency of the single-channelimplementation depended mostly on the matrixand vector sizes. Looking at the throughputcolumn of Table 2, the multi-channel

implementation has significant benefits inprocessing efficiency, particularly for smaller sizematrices and vector sizes. Multi-channelprocessing improves throughput by completelyhiding the implementation latencies described inSection 2 of this paper. For a given matrix and

vector sizes, a multi-channel implementation willdeliver a higher peak throughput than its single-channel counterpart.

The last column in the table shows the numberof real-data floating-point operations per secondin units of 109 (GFLOPS) for each of the

configurations. The number of operationsrequired by each solver depends on thedecomposition algorithm used. The reportednumbers were derived from the actualimplementation of the Cholesky solver and QRsolver algorithms in floating-point complex-dataformat on the two FPGAs used for thisevaluation. For the Cholesky solver, the numberof real-data floating-point operations isapproximated to the second order term 4n3/3 +

Table 2 Performance Results

Example Device Configuration

(Channel Size/Matrix Size/Vector Size)

ThroughputReported by

Simulink(kMatrices/sec)

OverallLatency

(sec)@Fmax (MHz)

GFLOPS(Real

Data Type)

Cholesky

Stratix V 1 / 360360 / 90 1.43 1112 @ 189 91

20 / 6060 / 60 118.35 330 @ 234 39

64 / 3030 / 30 544.28 222 @ 288 26

Arria V 6 / 9090 / 4531.3135.22

347 @ 176308 @ 198

3438

64 / 3030 / 30 349.62 344 @ 185 16

QR

Stratix V 1 / 400400 / 100 0.315 3970 @ 203 162

1 / 200100 / 100 8.76 167.0 @ 207 141

1 / 200100 / 50 6.17 204.5 @ 260 99

1 / 10050 / 50 32.82 43.3 @ 259 66

Arria V 1 / 200100 / 50 4.05 311 @ 171 65

1 / 10050 / 50 21.54 66 @ 170 44


13/14


12n2, whereas for the QR solver 8mn2 + 6.5n2 +mnis used.

Table 3 shows the error performance of theCholesky and QR solvers for both the Simulinksimulation and the design implementation runningon the hardware development boards using single-precision floating-point operations. The error iscalculated by comparing the output of each of theSimulink and the hardware platform simulations

with the double-precision floating-point referencefor the solution vectors x generated by MATLAB.For the multi-channel Cholesky solver cases, onlythe error performance of a single randomlychosen channel is reported for brevity. Althoughthe error performance is input data dependent, onaverage, the RTL implementation benefits fromthe fused datapath methodology and achieves astatistically equal or higher precision than thestandard IEEE 754 single-precisionimplementation as demonstrated by comparingthe Frobenius Norm in columns (4) and (5) of

Table 3. We use the Frobenius Norm to get ameasure of the overall error magnitude in the

resultant vector and is given by: | |

Where N is the size of the vector, e is thedifference vector between observed x and itsMATLAB generated golden reference, and iis theindex of the elements in vector e. The maximumnormalized error is given by:

max _ _/ _

5. ConclusionsIn this paper, we evaluated a new approach to

implementation of floating-point DSP algorithmson FPGAs using Alteras DSP Builder AdvancedBlockset design flow. This design flowincorporates the Altera DSP Builder AdvancedBlockset, Alteras Quartus II software tool chain,and ModelSim simulator, as well as MATLAB andSimulink from MathWorks. This approach allowsthe designer to work at the algorithmic behaviorallevel in the Simulink environment. The tool chaincombines and integrates the algorithm modelingand simulation, RTL generation, synthesis, placeand route, and design verification stages within theSimulink environment. This integration enablesquick development and rapid design spaceexploration both at the algorithmic level and at theFPGA level, and ultimately reduces overall designeffort. Once the algorithm is modeled anddebugged at a high level, the design can besynthesized, and targeted to an Altera FPGA.

For the purpose of this evaluation, the design

examples were single-precision complex-dataIEEE 754 floating-point Cholesky and QR solversmodeled in Simulink using the Altera DSP Builder

Advanced Blockset. The largest design examplewe evaluated was a QR solver for a complex-valued floating-point matrix of size 400400 and avector size of 100. Running at 203 MHz thisexample processes 162 GFLOPS. The reportedGFLOPS values in Table 2 are for the actual

Example

Device

Configuration(Reported Channel Number /

Matrix Size/Vector Size)

MathWorks SimulinkIEEE 754

Floating-PointSingle-Precision Error

(Frobenius Norm /MaximumNormalized Error)

Alteras DSP BuilderSynthesized RTL

Floating-PointSingle-Precision Error

(With Fused DatapathMethodology)

(Frobenius Norm /Maximum Normalized Error)

Cholesky Stratix

V

1 / 360360 / 90 2.11e-6 / 1.02e-4 1.16e-6 / 8.58e-57 / 6060 / 60 4.24e-7 / 8.59e-6 1.82e-7 / 2.62e-653 / 3030 / 30 7.48e-8 / 2.08e-6 3.84e-8 / 1.15e-6

Arria V3 / 9090 / 45 4.08e-7 / 9.72e-6 1.99e-7 / 5.52e-6

63 / 3030 / 30 8.93e-8 / 2.38e-6 5.91e-8 / 1.24e-6

QR

StratixV

1 / 400400 / 100 4.53e-6 / 1.45e-4 5.15e-6 / 1.03e-41 / 200100 / 100 1.24e-6 / 1.13e-5 9.97e-7 / 8.15e-61 / 200100 / 50 8.38e-7 / 6.70e-6 8.97e-7 / 4.15e-6

1 / 10050 / 50 9.13e-7 / 4.68e-6 6.96e-7 / 4.94e-6

Arria V1 / 200100 / 50 9.27e-7 / 2.33e-5 9.31e-7 / 9.95e-6

1 / 10050 / 50 9.13e-7 / 4.68e-6 6.96e-7 / 4.94e-6

Table 3 Error performance of the Simulink model and the synthesized RTL compared to the

MATLAB double-precision floating-point reference


14/14


implementation of the Cholesky solver and theQR solver algorithms in floating-point complex-data format on the two FPGAs. For a validcomparison with other competing platforms, thesame algorithms should be implemented on theseplatforms and their performance measured. Allreported performance results were achieved usingthe Altera DSP Builder Advanced Blockset toolflow with no hand optimization or floor planning.Starting from a high-level block-based design inSimulink, the tool chain automatically pipelined,generated the RTL code and synthesized thedesign to achieve usable speeds and resourceutilization. The Altera floating-point design flowsimplifies the process of implementing complexfloating-point DSP algorithms on an FPGA bystreamlining the tools under a single platform.

With its fused datapath methodology, complexfloating-point datapaths are implemented with

higher performance and efficiency than previouslypossible.

The new approach analyzed in this paper alsoentails a significant learning curve for using theDSP Builder Advanced Blockset. This is especiallytrue for a designer not familiar with MATLABand Simulink. The block-based design-entryapproach may present an initial challenge for atraditional hardware designer. In addition, in orderto exploit the advantages offered by the block-based design approach over hand-written RTL,additional steps are required by the designer when

creating the Simulink model. For example, toenable experimentation with different matrix andvector sizes, as is done in the two design examplesin this paper, the Simulink model was structuredto incorporate a parameter-driven design toexplore the various design configurations.

Currently, designers using the DSP BuilderAdvanced Blockset must limit themselves to theelements provided by the blockset in order toachieve optimized performance. Elements fromthe standard DSP Builder Blockset are notoptimized with the floating-point compiler norcan they be mixed with the Advanced Blockset atthe same hierarchy level. Hand-coded HDLblocks may only be imported into the StandardBlockset. Additionally, the DSP Builder AdvancedBlockset is geared towards DSP implementationsand may have limited use for designs involvingheavy control and state machines.

The next version of the DSP BuilderAdvanced Blockset, expected to be released by theend of 2012, will include floating-point extensions.

The designer will no longer be limited to the twostandard IEEE 754 single-precision and double-precision formats, but will have the choice of atotal of seven different precisions ranging from 16to 64 bits (exponent plus mantissa). Using the newEnhanced Precision Supportblock in the DSP BuilderAdvanced Blockset, the designer may choose thedata-width that best fits their application.

6. References[1] Berkeley Design Technology, Inc., 2011. AnIndependent Analysis of Alteras FPGA Floating-point DSP Design Flow. Available for downloadat http://www.altera.com/literature/wp/wp-01166-bdti-altera-floating-point-dsp.pdf.

[2] S.S. Demirsoy, M. Langhammer, 2009. Fuseddatapath floating point implementation ofCholesky decomposition. FPGA 09, February,

2009.

An Independent Analysis of Floating-Point DSP Design Flow and Performance on Altera 28-Nm FPGAs

Documents

Transcript of An Independent Analysis of Floating-Point DSP Design Flow and Performance on Altera 28-Nm FPGAs