Building a High Performance Bit Serial Processor in an FPGA

Raymond J. Andraka
The Andraka Consulting Group, 16 Arcadia Drive, North Kingstown, RI 02852-1666 USA
Andraka Consulting Group 1995
1996 On-Chip System Design Conference

Transcript of Building a High Performance Bit Serial Processor in an FPGA

Page 1: Building a High Performance Bit Serial Processor in an FPGA

The Andraka Consulting Group, 16 Arcadia Drive, North Kingstown, RI 02852-1666 USA

Andraka Consulting Group 1995

Building a High Performance Bit Serial Processor in an FPGA

Raymond J. Andraka

Andraka Consulting Group, 16 Arcadia Drive, North Kingstown, RI 02852-1666 USA

1996 On-Chip System Design Conference

Page 2: Building a High Performance Bit Serial Processor in an FPGA


Abstract

This paper describes the advantages and pitfalls of a bit serial architecture by studying the design of a vector magnitude processor inside a radar signal processor. The design combines bit serial arithmetic with a CORDIC algorithm to process 8 million 12 bit vectors per second inside a single FPGA. The bit serial architecture has the advantage of a very compact design solution that avoids many of the place and route problems commonly associated with FPGAs. The bit clock required to obtain the required data rate pushes the upper limits of today's FPGAs. Therefore, this paper also showcases the high performance FPGA design techniques needed to make a bit serial design more attractive. Finally, the paper discusses how to extend the techniques to develop any bit serial processor.

Authors/Speakers

Raymond J. Andraka

Current Activities

Ray Andraka is the chairman of the Andraka Consulting Group, a digital hardware design firm specializing in high performance FPGA designs. His current consulting activities include supporting reconfigurable computer research, applying FPGAs to next generation general aviation avionics, and developing the electronics package for a new medical device. Ray is also writing application notes describing design techniques for FPGAs.

Author Background

Ray Andraka has an MSEE from the University of Massachusetts at Lowell and a BSEE from Lehigh University. He has originated and improved dozens of designs in Xilinx, AMD, Altera, Actel, Atmel, QuickLogic, and NSC FPGAs over the past 8 years. Many of these designs were for high performance signal processing applications. His signal processing experience includes over 5 years designing pipelined radar signal processors for Raytheon and 3 years of signal detection and reconstruction algorithm development for the US Air Force. He also spent 2 years developing image readers and processors for G-Tech. He presented a paper at the 1993 EE-Times PLD Conference describing his design of a 27 tap 12 bit FIR filter in an FPGA. That paper is the basis for designs in at least two FPGA vendors' macro libraries.

Page 3: Building a High Performance Bit Serial Processor in an FPGA


The Andraka Consulting Group: "Specialists at Maximizing FPGA performance"

Building a High Performance Bit Serial Processor in an FPGA

the Andraka Consulting Group

Raymond J. Andraka

The Andraka Consulting Group, 16 Arcadia Drive

North Kingstown, RI 02852-1666
401/884-7930  Fax 401/884-7950  Email [email protected]

Current Activities
Chairman, the Andraka Consulting Group. ACG is a hardware design firm specializing in high performance FPGA designs. Services include new designs as well as evaluation and improvement of existing designs. Current work includes reconfigurable logic computers in DSP applications, medical electronics, and digital avionics. I am also writing application notes for FPGA vendors.

Author Background
MSEE, University of Mass., June 1992; BSEE, Lehigh University, Jan 1984
High performance FPGA designs, Andraka Consulting Group, 1994-present
Image readers, point of sale terminals and network equipment, smart card readers and display controllers in FPGAs, G-Tech 1993-1995
Radar signal processor design and development, Raytheon 1988-1993
Data interception and reconstruction algorithm development, US Air Force 1984-1988

Overview

• Problem Statement

• Solution

• Analysis

• Summary

In years gone by, hardware was expensive. A common design technique to reduce the costly hardware was to process data one bit at a time, reusing the same hardware for each bit. Today the cost per gate is a fraction of a percent of what it was then. With cheap hardware, parallel designs have become so commonplace that bit serial solutions are often overlooked for applications where they may be the best choice. This is especially true when using an FPGA in signal processing applications. In those cases the limited size and routing capacity of the devices is pushed to its limit, often yielding unsatisfactory results.

A bit serial approach to the design of these processors can alleviate many of the problems associated with FPGAs, including the headaches caused by the slow and limited routing resources. This paper presents the design tips and techniques to successfully build a high performance bit serial processor in an FPGA, beginning with the logic behind choosing a bit serial approach. In order to highlight the possibilities, the real world design of a radar vector magnitude processor is used to provide examples of the techniques discussed.

To better understand the problem, it is necessary to first review the function of the vector magnitude processor and the constraints on the design. In doing that, we will select and develop the most suitable algorithm. Once the algorithm has been defined, we will discuss how to efficiently implement it in hardware.

Page 4: Building a High Performance Bit Serial Processor in an FPGA


CW Radar Needs Vector Magnitude

• 12 Bit complex input

• 12 Bit accuracy

• 8 MHz data

• 95% Fault detection

[Block diagram: Range correlator → FFT → Buffer memory → Magnitude → Threshold → Target computer]

Most of the processing in the radar signal processor is performed on complex data in Cartesian form. The target computer at the end of the process requires the data in polar form. A vector magnitude processor is used to convert from Cartesian space to polar form. The FFT simultaneously outputs two 12 bit words representing the in-phase (I) and quadrature (Q) data every 125 ns. The target computer uses the vector magnitude and the range-doppler position of each point in the FFT output to determine presence, size, distance, direction and speed of the target. To minimize process noise, the vector magnitude has to be accurate to 12 bits. A hardware threshold function is added between the magnitude and the target computer to discard data with magnitudes below a programmable floor. This significantly lightens the processing load of the target computer.

The Problem

• Tight board space - single chip

• Low volume - no ASIC

• DSP microprocessor not adequate

• Parallel solutions won’t fit FPGA

The radar processor needs to fit in a fairly tight space, and the power budget is small. For these reasons, the vector magnitude processor, which is only a small part of the overall process, needs to be as small as practical. The board space allocated to the processor is roughly 6 square inches. This mandates a compact solution, and a single chip solution is preferred.

The total production run is only around 120 units including spares. A solution using an ASIC is therefore only acceptable as a last resort, as the non-recurring engineering costs do not get spread out over enough units to bring the cost down to a reasonable level.

The performance of general purpose DSP microprocessors is not adequate in this application. Even with advances in DSP processor design such as on chip cache, RISC code, out of order execution, and branch prediction, these processors still suffer from the inherent serial nature of their instruction streams. Computationally intense algorithms like the vector magnitude require many instructions per data point because there is no dedicated hardware to perform the function. Vector magnitude requires at least 20 instructions per point to compute. Keeping up with the 8 MHz data stream would require a 6 ns instruction cycle.

The logic for a conventional parallel solution to the vector magnitude problem only fits in the largest FPGAs. The routing complexity for such a design exceeds the capability of most of those parts.

Page 5: Building a High Performance Bit Serial Processor in an FPGA


Solution Process

• Define, test & decompose algorithm

• Hardware implementation

– Bit serial appropriate?
– Design & optimize data path
– Design control path

• Evaluate design

As with any problem, finding a solution begins with understanding the problem. We are already aware of the performance and size constraints. In order to evaluate the options we also need to know what the processor has to do. We will first evaluate various methods of computing the magnitude, then select the algorithm best suited to a hardware implementation. The evaluation includes modeling the favored algorithm to verify suitability and to obtain test vectors for the hardware. Once the algorithm is selected, it is decomposed to provide insight into how best to implement it in hardware. After ascertaining that a bit serial approach is reasonable, the hardware detail design is done beginning with the data path. Finally, the completed design is evaluated for function and timing.

Alternative Algorithms

• SQRT(X² + Y²)

• Larger + 1/2 smaller

• Look-up table

• Specialty chips

• CORDIC

The most obvious method of computing a vector magnitude is by using the familiar Pythagoras algorithm: r = sqrt(x² + y²). The square root function is very difficult to implement in hardware, and the two squares each require complexity approaching that of a multiplier. The Pythagoras approach is not very promising for a compact hardware solution.

A well known approximation of vector magnitude is the sum of the larger vector component and half of the smaller component. Later on, you may recognize that this is essentially the same as doing the first two iterations of the CORDIC algorithm. From an error analysis performed on the CORDIC algorithm, this only yields 3 bits of accuracy in the result. Our application requires a full 12 bits of accuracy. This approach is clearly not adequate.
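A quick numerical check confirms the accuracy claim for this approximation. This sketch is my own illustration (the `approx_mag` name is hypothetical, not from the paper):

```python
import math

def approx_mag(x, y):
    """Larger-plus-half-smaller magnitude approximation."""
    a, b = abs(x), abs(y)
    return max(a, b) + 0.5 * min(a, b)

# Sweep unit vectors over one octant (symmetry covers the rest) and
# record the worst relative error against the true magnitude of 1.0.
worst = 0.0
for k in range(10001):
    theta = (math.pi / 4) * k / 10000
    x, y = math.cos(theta), math.sin(theta)
    worst = max(worst, abs(approx_mag(x, y) - 1.0))

print(f"worst relative error: {worst:.4f}")   # about 0.118, i.e. roughly 3 bits
```

The worst case, at the angle atan(1/2), overshoots by sqrt(1.25) - 1 ≈ 11.8%, consistent with only about 3 bits of accuracy.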

For smaller data widths, a lookup table can be used to convert the input data to a magnitude. In this case, the input is 24 bits wide (two 12 bit signed components) and the output needs to have 12 bits. Using an input sign reduction, the input can easily be reduced to 22 bits. The lookup ROM needs to be 2^22 (4M) words deep by 12 bits wide. Even with input sign reduction, this is a very large lookup table, and would require a board full of ROMs to implement.

There are a few specialty chips such as the TRW TMC2330 and the GEC Plessey PDSP16330 designed specifically for polar conversions. While these chips are available, they are typically very expensive. All of these parts are low volume products with single sources of supply. In the final analysis, the selected solution provides more function (the threshold function and much of the entire processor's timing logic was pulled into the magnitude processor design) at a substantially lower cost.

Page 6: Building a High Performance Bit Serial Processor in an FPGA


What is CORDIC Magnitude?

• Rotate vector to axis

• Series of decreasing fixed angles

• Number of iterations determines accuracy

• Angles chosen for simple hardware

[Figure: vector of magnitude r at angle α from the x axis; r² = x² + y², x = r·cos(α)]

CORDIC (COordinate Rotation DIgital Computer) algorithms are a class of iterative algorithms using only shifts and adds to realize the result. The specific algorithm for computing vector magnitude was the first CORDIC algorithm, and was the result of work in the late 1950's by Jack Volder[1]. Later work extended the CORDIC class to cover computation of many of the transcendental functions including all of the trigonometric functions, the hyperbolic trigonometric functions, exponentials and the inverses of all of these. The same architecture can also be used, although somewhat inefficiently, to perform multiplication and division. I also demonstrated a square root CORDIC algorithm in 1991 as part of my graduate work. The chief advantage of the CORDIC algorithm is that the hardware is relatively simple: vector magnitude hardware is about the same complexity as an array multiplier!

The trigonometric CORDIC functions all work by rotating a vector through an angle. For magnitude, the vector is rotated to the x axis, and the resulting x component is interpreted as the magnitude. Each iteration of the algorithm uses a successively smaller angle. The higher the number of iterations, the closer the rotated vector gets to the target angle, and therefore the closer the result is to the true magnitude.

The key to the CORDIC algorithm is that the discrete rotation angles are carefully selected to allow implementation using only shifts and adds.

CORDIC Algorithm Explained

• Coordinate rotation in a plane:
  – x' = x·cos(δ) - y·sin(δ)
  – y' = y·cos(δ) + x·sin(δ)

• Rearranges to:
  – x' = cos(δ)[x - y·tan(δ)]
  – y' = cos(δ)[y + x·tan(δ)]

The rotation of a vector in a plane through an arbitrary angle, δ, is mathematically described by:

x’ = x cos(δ) - y sin(δ)

and y’ = y cos(δ) + x sin(δ).

Rearranging these yields:

x’ = cos(δ) [x - y tan(δ)]

and y’ = cos(δ) [y - x tan(δ)].

If we always rotate the vector by an angle of ±δ, the cos(δ) term becomes a constant regardless of the rotation decision. That constant can be treated as a scaling constant at the end of the computation. If the rotation angle is further restricted so that tan(δ) = ±2^-i, then the rotation can be implemented using only a shift and an add (or subtract). By executing a series of successively smaller rotations, any arbitrary rotation angle can be achieved. The cosine terms of the successive rotations accumulate as a constant that is dependent only upon the number of iterations performed. The constant at each iteration is cos(atan(2^-i)), which reduces to 1/sqrt(1 + 2^-2i). The aggregate constant is the product of the constants from each iteration, which approaches 0.607253 as the number of iterations goes to infinity. In the case of the radar, the constant is simply factored into an aggregate processing gain attributed to the entire signal processor.
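The aggregate constant quoted above is easy to verify numerically. This small sketch (the function name is mine) computes the product of the per-iteration constants:

```python
import math

def cordic_gain_constant(n):
    """Aggregate scale constant: product over i of cos(atan(2^-i)) = 1/sqrt(1 + 2^-2i)."""
    k = 1.0
    for i in range(n):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k

for n in (1, 2, 4, 7, 24):
    print(n, cordic_gain_constant(n))
# converges toward 0.607253; its reciprocal (~1.647) is the processor gain
```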

Page 7: Building a High Performance Bit Serial Processor in an FPGA


CORDIC Equations

• If δ is chosen so tan(δ) = ±2^-i then:

  – x_i+1 = k_i (x_i - d_i y_i 2^-i)

  – y_i+1 = k_i (y_i + d_i x_i 2^-i)

• Where:

  – d_i = -1 if y_i > 0, +1 otherwise

  – k_i = 1/sqrt(1 + 2^-2i)

The decision function, d_i, is used to make the vector converge on the x axis. If the angle before rotation is positive (y is positive), a clockwise rotation is made, otherwise the rotation is counterclockwise. The sign of y before each iteration is easy to obtain, and works well to control the progress of the process.
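The decision-driven iteration can be sketched in floating point. This is my own Python model, not the paper's hardware: the per-iteration k_i factors are omitted, as in the hardware, and the aggregate constant is applied once at the end.

```python
import math

def cordic_magnitude(x, y, iterations=24):
    """Floating-point model of the CORDIC magnitude iteration."""
    if x < 0:                          # angular reduction for the left half plane
        x, y = -x, -y
    k = 1.0
    for i in range(iterations):
        d = -1 if y > 0 else 1         # the decision function d_i
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    return x * k                       # x converges to magnitude / k

print(cordic_magnitude(3.0, 4.0))      # ≈ 5.0
```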

CORDIC Notes

• Works for -π/2 ≤ δ ≤ π/2

– x<0 requires angular reduction

• Output scaled by product of k_i's

• Decisions describe rotation

The CORDIC magnitude algorithm will work for any angle with a magnitude less than the sum of the angles subtended at each iteration. That sum is approximately 100 degrees for 3 or more iterations. For convenience, we limit the input to the CORDIC processor to ±90 degrees. Vectors in the left half plane (negative x) are easily accommodated by a simple angular reduction: negate the x and y components if the x input is negative, effectively doing a 180 degree rotation. This reduction is included in my hardware.

The CORDIC processor has a constant processing gain attributed to the constant cosine terms. This results in the output being scaled by approximately 1.647. In the case of the radar, this is lumped with several other fixed gains in the signal processor and dealt with in the post processor. Most applications can treat the gain in the same manner. In applications where the unscaled magnitude is needed, a multiplier is required to correct the scale.

The direction of each partial rotation is determined by a binary decision function, and the angle of rotation is fixed for a given iteration. It is therefore possible to determine the total angular displacement solely from the sequence of decisions made in computing the magnitude. A small ROM can be used to convert the decision vector to an angular displacement in any angular measurement system. Of course the components of the decision vector need to be aligned in time using appropriate delays to create the ROM address. Alternatively, an additional adder/subtractor can be added to the processor which either adds or subtracts the appropriate fixed angle at each iteration to accumulate the total angle in whatever units are convenient. The decision vector itself comprises an angular unit system referred to as Binary Angular Measures, or BAMs.
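The decision-vector-to-angle relationship can be demonstrated with a short sketch of my own: run the magnitude iterations, record each decision, and accumulate the corresponding fixed angles (the adder/subtractor approach described above).

```python
import math

def cordic_angle_from_decisions(x, y, iterations=24):
    """Recover the input angle from the CORDIC decision sequence.

    Each decision d_i selects a rotation of d_i*atan(2^-i); since the vector
    is driven onto the x axis, the input angle is minus the total rotation.
    """
    decisions = []
    for i in range(iterations):
        d = -1 if y > 0 else 1
        decisions.append(d)
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    return -sum(d * math.atan(2.0 ** -i) for i, d in enumerate(decisions))

theta = 0.5                            # radians, inside ±pi/2
angle = cordic_angle_from_decisions(math.cos(theta), math.sin(theta))
print(angle)                           # ≈ 0.5
```

The residual error is bounded by the last rotation angle, atan(2^-23) here, so the recovered angle matches to several decimal places.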

The fact that the decision vector describes the angle of rotation hints that a vector could be rotated through an arbitrary angle by controlling the rotation with the desired angle instead of the decision variable used for magnitude. Obviously, that angle has to be expressed in BAMs. This can be accomplished by either using a small ROM or by using an additional adder/subtractor as discussed above. In addition to general vector rotation, such a processor could be used to compute the sine and cosine of an angle (a unit vector placed on the x axis rotated through the desired angle will yield the sine and cosine). The processor gain can even be compensated for by scaling the initial unit vector by the reciprocal of the processor gain! Similar strategies can be used to arrive at the other trigonometric functions and their inverses. This is exactly the mechanism used by early 'scientific' calculators to compute the trigonometric functions.

Page 8: Building a High Performance Bit Serial Processor in an FPGA


Angular Error Analysis

• Max err(α) = δ_n

• Err(r) = r(1 - cos(ε))

• Accuracy improves by 2 bits/iteration

• 7 iterations needed

[Figure: rotated vector left off the x axis by a residual angle ε; the shortfall in the x projection is labeled "error"]

The output from the magnitude processor is the projection of the rotated vector onto the x axis. When the rotated vector lies exactly on the x axis, there is no error due to the angle. For most angles, however, the rotated vector angle will not be exactly zero. In that case, the computed output is reduced by the cosine of the angular error. The worst case angular error occurs when the next to last iteration produces a vector with a zero angle. The last iteration then moves the vector away from the axis by the angle of the last rotation. The magnitude result is bounded by the true magnitude and the cosine of the last rotation angle times the true magnitude. Note the error is biased; it always reduces the result. The desired accuracy determines the number of iterations required. The tabular analysis of the angular error shows a 2 bit per iteration improvement in the accuracy of the result.

Maximum errors:

 i   2^-i    rot angle     mag err      phase err    mag err          bits   scale
             atan(2^-i)    1-cos(a)     (deg)        (x full scale)          factor
             (rad)         (%)
 0   1.000   0.785398      29.2893      45           1199.691          11    0.707107
 1   0.500   0.463648      10.5573      26.56505      432.4262          9    0.632456
 2   0.250   0.244979       2.9857      14.03624      122.2963          7    0.613572
 3   0.125   0.124355       0.7722       7.125016      31.62982         5    0.608834
 4   0.063   0.062419       0.1947       3.576334       7.976639        3    0.607648
 5   0.031   0.03124        0.0488       1.789911       1.998536        1    0.607352
 6   0.016   0.015624       0.0122       0.895174       0.499908       -1    0.607278
 7   0.008   0.007812       0.0031       0.447614       0.124994       -3    0.607259
 8   0.004   0.003906       0.0008       0.223811       0.03125        -5    0.607254
 9   0.002   0.001953       0.0002       0.111906       0.007812       -7    0.607253
10   0.001   0.000977       0.0000       0.055953       0.001953       -9    0.607253
11   0.000   0.000488       0.0000       0.027976       0.000488      -11    0.607253

Truncation Error Analysis

• Limited precision causes error

• Right-shifted data truncated

• 3 extra LSBs needed

• Round final result to drop LSBs

The result of the angular error analysis assumes infinite precision in the arithmetic. Further errors are caused by the truncation of the LSBs in the right shifted terms at each iteration. An approximation of the maximum error to be expected due to truncation can be found by summing the errors at each iteration when all the truncated bits are '1's. The tabulated results below show that for 7 iterations, the error can be as large as 3 least significant bits. To reclaim the accuracy, the processor needs to carry three additional bits of precision. The input data should be padded on the right with three zero digits to keep from losing the LSB contributions. The output can be taken from bit 3 and up to maintain the 12 bit path outside of the processor. To retain 1/2 LSB accuracy, a number equal to the bit 2 weight should be added to the result to round the result rather than truncating it.

 i   2^-i    truncation   accumulated        bits
             error        truncation error
 0   1.000   0.000         0.000              0
 1   0.500   0.500         0.500              1
 2   0.250   0.750         1.250              1
 3   0.125   0.875         2.125              2
 4   0.063   0.938         3.063              2
 5   0.031   0.969         4.031              3
 6   0.016   0.984         5.016              3
 7   0.008   0.992         6.008              3
 8   0.004   0.996         7.004              3
 9   0.002   0.998         8.002              4
10   0.001   0.999         9.001              4
11   0.000   1.000        10.000              4
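The accumulated truncation column in the table above can be reproduced directly (a small sketch of my own): a right shift by i bits can discard at most (2^i - 1)/2^i = 1 - 2^-i LSBs, when every truncated bit is a 1.

```python
# Sum of worst-case truncation losses through iteration i, in LSBs.
def accumulated_truncation_error(i):
    return sum(1.0 - 2.0 ** -k for k in range(i + 1))

for i in range(12):
    print(i, round(accumulated_truncation_error(i), 3))
```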

Page 9: Building a High Performance Bit Serial Processor in an FPGA


Overflow Error Analysis

• Avoid internal overflows

• Extra MSBs allow growth

• Need one extra MSB

• Drop MSB in result

The processor described in this paper is a fixed precision integer arithmetic processor. As such, it is subject to overflows. While the effects of an overflow are predictable, the output can be very confusing and can be devastating to downstream processing. Overflows can be treated by using saturating arithmetic to minimize the ill effects. Overflows can be avoided entirely by adding sufficient 'guard' bits above the most significant bit to allow enough headroom to accommodate the maximum possible growth. The result can be truncated (or rounded) to retain the desired word width. Saturating arithmetic requires a considerable amount of hardware to realize and is not a perfect solution. In parallel systems, additional precision often requires more hardware than saturating arithmetic, so that approach becomes attractive. In contrast, the precision in bit serial systems can be extended simply by increasing the length of the input word (which may require extra delay registers). The only real penalty is the increased number of clocks required to process each word.

If both vector component inputs are allowed to go to full scale simultaneously, the magnitude of the resultant vector is 1.414 times full scale. The CORDIC process has a gain of approximately 1.6, which when combined with the maximum vector magnitude yields a maximum output of 2.33 times the full scale input. The data grows in the CORDIC process monotonically, so the maximum level is at the output. Without further knowledge of the input, the processor would require 2 'guard' bits to accommodate the maximum growth without overflowing. In our system, the combined effect of the limiting in the range correlator and the scaling performed in the FFT and correlator limit the magnitude of the input vector sufficiently to require only one guard bit.

The processor works on two's complement data, but by definition, the output is always positive. The guard bit in the output is the sign of the output, and is therefore always zero. Provided the target computer is expecting unsigned data, the extra bit carried through the process can be dropped at the output with no consequence.

Analysis of the algorithm shows that the X data path is always positive after the input angle reduction, so it could use unsigned arithmetic as long as a zero sign bit is provided to the Y path elements. The Y component needs to retain sign, but does reduce in magnitude as the algorithm progresses, so the number of bits in Y could be reduced after the initial rotation. In a parallel system these nuances can save a significant amount of hardware. A serial system, however, is much easier to deal with and uses less hardware if it has a fixed precision throughout.

Analysis Summary

• 7 Iterations

• 16 Bit internal precision

  – 3 Extra LSBs

  – 1 Extra MSB

• Round result to eliminate LSBs

• Drop result sign bit

The various error analyses dictate the form of the CORDIC processor we need. The desired accuracy requires 7 iterations, which translates into 7 CORDIC stages. Errors caused by truncation require 16 bit precision internally to maintain the accuracy. The sign bit can be dropped at the output since the output is always positive, and the three extra LSBs can be eliminated by rounding to obtain the desired 12 bit output.

Page 10: Building a High Performance Bit Serial Processor in an FPGA


Model the Algorithm

• C implementation

• Word size should match hardware

• Proof of algorithm and accuracy

• Provides test vectors for hardware

For all but the most simple algorithms, I highly recommend modeling the algorithm in a high level language such as C before proceeding with the hardware design. This is especially true for iterative algorithms like the CORDIC since a seemingly small error can lead to drastically different results. The model should accurately reflect the precision and process throughout the computation. Therefore, in the case of the CORDIC, it should model each iteration. The important result from the modeling is to observe any ill effects from truncation, overflow and other boundary conditions, as well as to verify the basic function of the algorithm. This will allow you to correct algorithmic problems before investing time in detailed hardware design. A very favorable benefit of modeling the algorithm is that it provides a source of test vectors and intermediate results for tracing problems while simulating and debugging the hardware.
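The modeling advice can be illustrated with a bit-accurate sketch of the processor described by the analysis summary: 12 bit two's complement inputs, 3 padding LSBs plus 1 guard MSB (16 bits internal), 7 iterations, and a rounded 12 bit output. The paper's model was written in C; this Python reconstruction is my own, and the function name, test values, and exact rounding point are assumptions rather than the original code.

```python
import math

GAIN = 1.0
for i in range(7):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))   # ~1.647 processor gain

def cordic_mag_fixed(x, y):
    """Hypothetical bit-accurate model: >> truncates, matching hardware shifts."""
    assert -2048 <= x < 2048 and -2048 <= y < 2048   # 12 bit inputs; note the
    # single guard bit relies on the system limiting input magnitude, as the
    # overflow analysis explains; full-scale x and y together would overflow.
    if x < 0:                     # angular reduction
        x, y = -x, -y
    x <<= 3                       # pad 3 zero LSBs
    y <<= 3
    for i in range(7):
        d = -1 if y > 0 else 1
        x, y = x - d * (y >> i), y + d * (x >> i)
    return (x + 4) >> 3           # add the bit 2 weight to round, drop 3 LSBs

x, y = 1000, 1000
print(cordic_mag_fixed(x, y), round(GAIN * math.hypot(x, y)))
```

Comparing the model against GAIN times the true magnitude over the input range is exactly the kind of truncation/overflow check the text recommends.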

CORDIC Process

[Figure: CORDIC process flow, stages i = 0 to n-1. X in and Y in each first pass through a ± element controlled by sign(X), performing the angular reduction (Q = ±A). The data then flows through a chain of cross-coupled add/subtract stages, each computing Q = F ± G·2^-i with the add/subtract decision taken from the sign of the Y path, sign(F). The final X output is rounded to produce Mag.]

Mapping out the process provides clues as to how to implement the algorithm in hardware. The iterative process is drawn as a process flow, which is easily translated to a straight-through or pipelined design.

For applications where required performance is low, the algorithm can actually be performed by looping the data through the same hardware several times. The early 'scientific' calculators implement CORDIC algorithms using hardware loops and bit serial techniques to get away with an amazingly small amount of hardware. In our application, the desired throughput demands a pipelined design.

The pipelined 7 iteration design requires 17 adder-subtractors with pipeline registers inserted at convenient points. Most of the adder-subtractors need to process 16 bit data (the first in each path could be reduced to 12 bits, and each after that needs to increase precision by one bit to a maximum of 16 bits). Total pipeline latencies up to several milliseconds are acceptable, as delays on this order do not affect the operation of the radar.

The logic for a parallel design will fit in some of the larger FPGAs such as the larger Xilinx 4K series parts. Unfortunately, the design cannot be routed in most of those parts, as they do not have sufficient routing resources to provide both the straight and crossed interconnect between the CORDIC stages. The parallel design should route in a Xilinx XC4025 FPGA as long as the design is properly floorplanned and no other logic is included in the device (this results in less than 20% logic utilization). The maximum performance of the parallel design implemented in the Xilinx XC4025 is less than four times the performance of the basic bit serial design presented in this paper!

Page 11: Building a High Performance Bit Serial Processor in an FPGA


Bit Serial Approach: A Quick Primer

• Process data one bit at a time

• Large hardware savings

• Clock per bit instead of per word

• Time-hardware product improved

Today, most computing is done on a word at a time basis. These bit-parallel designs operate on the entire width of a data word simultaneously. A bit-serial design operates on the data one bit at a time, reusing the same hardware for each bit. While this is inefficient timewise, the hardware savings can be enormous.

The majority of the processes we do on a data stream involve passing intermediate results across the width of the data word (carries for example). The propagation delays associated with this perpendicular flow can be significant, and directly influence the overall performance of a word wide system. In serial systems, the intermediate result can generally be held in a register until the bit that needs it arrives. The clock to clock propagation delays in a serial system are usually a small fraction of the delays in an equivalent parallel system. This means the time-hardware product of the serial system will be significantly better than that of a parallel system (fewer than n serial processors are required to match the performance of an n bit parallel processor). The signal routing in a serial design also tends to be very localized. In technologies like FPGAs where the routing resources are limited, this provides a substantial advantage. The local routing also helps to diminish the delay associated with the interconnect in FPGAs.
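As a concrete illustration of the primer (my own sketch, not from the paper), a bit serial adder is just a full adder with its carry held in a register between clocks:

```python
# Software model of a bit serial adder: one full adder plus a carry
# flip-flop, consuming two LSB-first bit streams one bit per clock.
def serial_add(a_bits, b_bits):
    carry = 0                     # the carry register, reset before each word
    out = []
    for a, b in zip(a_bits, b_bits):
        s = a + b + carry
        out.append(s & 1)         # sum bit shifted out this clock
        carry = s >> 1            # carry held for the next (higher order) bit
    return out

def to_bits(value, width):        # LSB-first two's complement helper
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    v = sum(b << i for i, b in enumerate(bits))
    return v - (1 << len(bits)) if bits[-1] else v

print(from_bits(serial_add(to_bits(5, 8), to_bits(7, 8))))   # 12
```

The carry never crosses the word in one clock; it simply waits in its register for the next bit, which is why the clock to clock path stays so short.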

Is Bit Serial Appropriate?

• Algorithm adaptable to bit serial?
• Data rate achievable?
• Bit clock available?
• Pipeline latency acceptable?

A few questions need to be addressed when considering a bit serial solution. First, the algorithm needs to be able to be expressed as a serial procedure. Most algorithms can be serialized, although some are considerably more difficult than others to realize. Decomposition of the algorithm and transitive analysis on the subfunctions can be helpful in mapping out a serial strategy.

A second consideration when contemplating a serial solution is whether the data rate is achievable in the target technology. Modern FPGAs can realistically achieve register to register times as low as 8 ns in useful serial designs. The bit rate is therefore limited to around 125 MHz. The overall data rate is the bit rate divided by the number of bits in the stream, including any overhead bits used for control. Expect to add one bit interval for resetting serial elements before each word. If the resulting rate is too low and enough unused logic exists, it may be possible to use two or more processors in a parallel interleave arrangement to boost the data rate.
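As a worked example of the rate formula (my own numbers, combining the ~125 MHz practical bit clock with the 16 bit internal word from the error analysis and the one-bit reset interval mentioned here):

```python
# Achievable serial data rate: bit clock divided by bits per word,
# including the overhead bit interval for resetting serial elements.
bit_clock_hz = 125e6
word_bits = 16 + 1                 # 16 data bits + 1 reset/control interval

data_rate_hz = bit_clock_hz / word_bits
print(f"one processor:   {data_rate_hz / 1e6:.2f} MHz")        # ~7.35 MHz
print(f"two interleaved: {2 * data_rate_hz / 1e6:.2f} MHz")    # ~14.71 MHz
```

Under these assumptions a single serial processor falls just short of the 8 MHz requirement, which is exactly the situation where the parallel interleave arrangement pays off.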

If the device can achieve the required performance, a bit clock must be available to clock the processor. In systems where the high bit clock is not available, a phase lock loop arrangement for synthesizing the clock can be used. The high frequency bit clock needed for high performance serial systems demands careful attention to transmission line effects and electromagnetic interference.

Finally, the application for the processor must be able to tolerate any pipeline delay introduced by the serial processor. The latency in a parallel system is frequently as high or higher than the equivalent serial system, so this is rarely a concern.

Page 12: Building a High Performance Bit Serial Processor in an FPGA


Bit Serial Conversion

• Decompose algorithm
• Use standard serial solutions
• Transitive analysis for unusual functions
• Relative delays for bit shifting

The conversion of a parallel design to a bit serial algorithm begins with separating the algorithm into its component parts. Once that is done, many of the more common functions can be directly implemented using readily available serial solutions. Solutions to the more unusual functions can be identified using transitive analysis.

Transitive analysis can be very valuable for understanding the vagaries of implementing unusual functions in serial logic. Transitive analysis is simply the practice of determining which output bits are affected by the value of each input bit. This not only indicates the preferred shift direction, but also indicates which bits need to be held for future computations, and hints at the complexity of the function.
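
For small word widths, transitive analysis can be automated by brute force. This sketch (a hypothetical helper, not from the paper) flips each input bit of a 4 bit addition and records which output bits change, confirming that addition is left transitive:

```python
# Sketch of brute force transitive analysis: flip one input bit and
# record which output bits change across all input combinations.
def affected_outputs(func, n_bits, in_bit):
    """Return the set of output bit positions influenced by in_bit of a."""
    affected = set()
    for a in range(1 << n_bits):
        for b in range(1 << n_bits):
            diff = func(a, b) ^ func(a ^ (1 << in_bit), b)
            affected |= {o for o in range(n_bits + 1) if diff & (1 << o)}
    return affected

add = lambda a, b: a + b
# left transitive: the lowest affected output equals the input's order
for i in range(4):
    assert min(affected_outputs(add, 4, i)) == i
```

Running the same probe on a bitwise AND returns only the output of the same order, classifying it as intransitive.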

Since the data is presented a bit at a time, bit shifting is accomplished by delaying one bitstream relative to another. Most of the time this requires some special treatment to handle the misaligned word boundaries. An example of such shifting is found in the magnitude processor presented in this paper.

Transitive Analysis

• Left transitive functions - lsb first

• Right transitive functions - msb first

• Intransitive functions - trivial

• Full transitive functions - difficult

• Ripple transitive - preferred form

The majority of arithmetic functions are left transitive, meaning only outputs of equal or higher order than (to the left of) the input bit are influenced by the value of that input. Left transitive functions therefore favor right shifting, so that the LSBs are presented first. These functions include all of the unary arithmetic operators, addition, subtraction, and Gray code and BCD conversions.

Right transitive functions have outputs that are affected only by inputs of higher order than the output. The most notable example of a right transitive function is the compare. If your processor is primarily performing sorting or other compare intensive operations, MSB first may be the preferred shift direction. It is worth noting that compares can be done in a predominantly left transitive system by subtracting and detecting sign and overflow in the result. This requires more logic than the MSB first compare, but is easier than switching shift direction in the middle of a process.

Intransitive functions are those whose output is influenced only by inputs of the same order. The bitwise logical operators all fall into this class. These functions are trivial to implement in serial systems.

Functions that sample several or all of the input bits to produce an output that does not depend on input bit position can be treated as either left or right transitive. Examples are tally adders, parity trees, and zero detection circuits.

Fully transitive functions are those in which every input affects every output. I include in this class partially transitive functions that cannot be classified as either right or left transitive, since these are treated the same way. Functions in this class include bit rotations, bit field operations, and arbitrary encode/decode functions. These functions can be difficult to translate to bit serial mechanisms; sometimes they are best handled in the parallel realm. Fortunately, most of the functions you will deal with can be classified as left or right transitive, and a little ingenuity often finds alternative solutions for those that are not. Translation of this difficult class is beyond the scope of this paper.

Page 13: Building a High Performance Bit Serial Processor in an FPGA

Ripple transitive functions with a distance of n are those whose output can be formed from the current input and the n previous inputs and outputs. Previous here refers to either the next higher or next lower bit, depending on the direction of the ripple transitivity. This class of functions is interesting because many left or right transitive functions can be expressed more compactly in a ripple transitive form. The unary arithmetic operations, as well as add and subtract, can all be expressed as left ripple transitive functions with a distance of 1, meaning each output can be expressed as a function of an output from the previous bit and the current inputs. Multiplication can also be expressed in a ripple transitive form, which is the basis of the serial-by-parallel multiplier. Since this form has the effect of simplifying the hardware, it is the preferred form. The ripple transitive form is often the intuitively obvious form of the function.
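
The distance 1 left ripple transitive form of addition is easy to model in software. In this sketch (illustrative only, with LSB first bit lists standing in for the bitstreams), the carry is the single registered state bit, exactly as in the serial adder developed below:

```python
# Sketch of a bit serial adder: each output bit is a function of the
# current inputs and the registered carry from the previous bit time.
def serial_add(a_bits, b_bits):
    """Bit serial add of two equal length LSB first bit lists."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)               # full adder sum
        carry = (a & b) | (carry & (a ^ b))     # full adder carry, registered
    out.append(carry)                           # final carry out
    return out

def to_bits(x, n):
    return [(x >> k) & 1 for k in range(n)]     # LSB first

bits = serial_add(to_bits(13, 8), to_bits(29, 8))
value = sum(b << k for k, b in enumerate(bits))  # 42 = 13 + 29
```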

Shift-Add/Subtract Element

[Figure: shift-add/subtract element. A full adder (inputs A, B, carry C) with its carry output registered and fed back to the carry input, preceded by a selectable 2's complement stage on the G input; computes Y = F ± G·2^-i.]

The development of the shift-add/subtract element used throughout the CORDIC processor illustrates the conversion process. Notice that the carry from the full adder is registered (effectively a one bit shift) and returned to the carry input of the adder. This is an implementation of the ripple transitive form of the adder, and it is a stock solution.

The 2^-i shift is accomplished using delays to realign the bits in F relative to the bits in G. The data is presented LSB first. In order to shift G i bits to the right, the bits in G must arrive at the processing element i bit times before the corresponding bits in F. By inserting the delay on the F path, the F data is shifted to the left relative to G, which is the same as the G data being shifted to the right relative to F.
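
The relative delay shift can be modeled the same way. In this sketch (LSB first bit lists and illustrative names, not the paper's circuit), delaying F by i bit times and summing the streams yields F + G·2^-i once the result is read in the delayed frame:

```python
# Sketch: shifting by relative delay. Delaying F makes the undelayed
# G arrive i bit times early, i.e. shifted right by i relative to F.
def serial_add(a, b):
    carry, out = 0, []
    for x, y in zip(a, b):
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return out

def to_bits(x, n):
    return [(x >> k) & 1 for k in range(n)]   # LSB first

i = 2
F = to_bits(13, 10)
G = to_bits(44, 10)
Fd = [0] * i + F[:-i]                  # i flip-flop delay queue on F
S = serial_add(Fd, G)                  # sum stream, in the delayed F frame
result = sum(b << k for k, b in enumerate(S)) >> i   # rescale: F + G/2**i
```

Note this toy model sidesteps the word boundary problem discussed next by padding with zeros rather than with a neighboring word's bits.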

There are two problems with the element as shown here. First, the carry flip-flops need to be cleared before each new data word so that the previous word's carry out is not added to the current word. There is a similar clearing concern in the 2's complement stage. The other problem is that when intentional bit misalignments are introduced, the word boundaries also become misaligned. This causes the dangling LSBs of the next right shifted word to be processed with the MSBs of the other data stream(s). Extra logic needs to be added to eliminate the overlap between words on shifted inputs. In most cases, the best way to do this is to replace the dangling LSBs of the right shifted word with either zeros, for unsigned systems, or the extended sign of the previous word, for two's complement systems. Note also that extra pipeline delays need to be added to the F path to match any delays in the 2's complement stage so that the data retains its intended alignment.

Page 14: Building a High Performance Bit Serial Processor in an FPGA

Improved Shift-Add/Subtract

[Figure: improved shift-add/subtract element. The 2's complement stage is folded into the adder: a controlled inversion on the G input, muxes for the synchronous carry preset/reset (RN) and sign extension (SX) controls, and the full adder with registered carry feedback; computes Y = F ± G·2^-i.]

The function of the two's complement stage in the original design can be decomposed and partially combined with the adder that follows it. The two's complement is found by inverting the operand and adding one to the result. Setting the carry bit in the adder (essentially forcing a carry in) before the data arrives causes the sum to be increased by one. Therefore, we can invert the input to be subtracted and set the carry to realize subtraction. The familiar serial algorithm (pass the data until the first '1' bit, then invert) is in fact just a logic reduction of a serial adder with one input at zero and the carry set before the operation begins. To create a selectable add/subtract, you simply need to control the inversion on the subtracted input and either set or reset the carry flip-flop as appropriate before the operation begins. The inversion is easily controlled using an exclusive-OR gate.
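
The decomposition can be checked with a small behavioral model. This sketch (illustrative names, not the FPGA netlist) applies the controlled inversion and the preset carry described above to realize a selectable serial add/subtract:

```python
# Sketch: selectable serial add/subtract. Subtraction inverts G (XOR
# with the mode bit) and presets the carry to 1 before the word starts,
# realizing two's complement -G = ~G + 1 inside the adder itself.
def serial_addsub(f_bits, g_bits, subtract):
    carry = 1 if subtract else 0        # synchronous preset/reset of carry FF
    out = []
    for f, g in zip(f_bits, g_bits):
        g ^= subtract                   # controlled inversion (XOR gate)
        out.append(f ^ g ^ carry)
        carry = (f & g) | (carry & (f ^ g))
    return out

def to_bits(x, n):
    return [(x >> k) & 1 for k in range(n)]   # LSB first

diff = serial_addsub(to_bits(29, 8), to_bits(13, 8), subtract=1)
value = sum(b << k for k, b in enumerate(diff)) & 0xFF   # 29 - 13 = 16
```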

Synchronous design is essential to achieving the required performance. The flip-flop direct sets and resets are often not recognized as being asynchronous. These should not be used in the design (they can be connected as a global reset for power-up initialization, but don't waste routing resources connecting them). The word cycle resets and sets should be done with added logic to realize synchronous behavior. These controls do not necessarily need to be introduced right at the targeted register; in our design, that could cause unnecessary routing congestion. The effect of a set or reset can be had by forcing the value of the data upstream of where the reset is needed. The carry output of the full adder (this is the register that needs to be reset) is the majority function of the adder inputs, so its value can be forced by controlling any two inputs to the adder. The logic used in this implementation forces 1's or 0's into the F and G inputs to set or reset the carry. The injection point for the reset is convenient to the controlled inversion as well.

The drawing shows ghosted flip-flops after the reset and two's complement logic. These registers are not functionally needed, but in many FPGA technologies they should be inserted to break up an otherwise long combinatorial delay. Even with parts that allow enough inputs and logic to implement the adder and reset logic in one cell, it may be advantageous for routing reasons to insert the register to force the use of two cells.

This improved shift-add/subtract element also solves the misaligned word problem introduced by the shift of G relative to F. The SX control passes G data until the sign bit arrives, then holds the value of the sign for the rest of the word, effectively extending the sign over the LSBs of the next data point. Sign extension is considerably more difficult to accomplish in an MSB first system.
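
A behavioral sketch of the SX control (with assumed parameters n for the frame length and i for the shift, not taken from the paper) shows the pass-then-hold action:

```python
# Sketch: SX sign extension. Within each n bit frame of a G stream
# shifted right by i bits, pass G until the sign bit arrives, then
# hold the sign over the i dangling lsbs of the next word.
def sx_filter(g_stream, n, i):
    out, held = [], 0
    for t, g in enumerate(g_stream):
        pos = t % n                # bit position within the word frame
        if pos < n - i:
            held = g               # pass G; the register follows the stream
            out.append(g)
        else:
            out.append(held)       # SX asserted: repeat the held sign
    return out
```

For a negative word (sign bit 1), the last two bit times of an n=8, i=2 frame emit the held 1 regardless of the overlapping input bits.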

The delay queue on the F input still exists, as represented by the z^-i on the F label. I have not included those registers in this drawing, for clarity and to aid in understanding the control logic developed next.

Page 15: Building a High Performance Bit Serial Processor in an FPGA

Processor Controls

• Internal controls – derived from data path

• External controls – periodic, independent of data

• Hybrid controls – separate into components

The serial data path normally includes a number of controls that influence the execution of the function. I classify these as internal and external controls.

The internal controls are those derived from the datastream, and are usually dependent on intermediate values of the data. These controls are generated entirely from within the data path processor. Examples of internal controls are those derived from the sign of the data or from the results of comparisons. Adding the internal controls often requires additional logic and pipeline delays to extract and align the control with the data. These adjustments to the data path can affect the timing of the external controls and may add additional external controls. This is why I treat the internal controls first.

The external controls are not affected by the values of the data stream. They are generally periodic, with a period equal to the word rate, and are created external to the data path by some form of sequencer. This class includes all the timing controls to the data path, such as resets, sign extension, and latch controls. The timing of the external controls is affected by added pipeline delays, so their design is best left until after the data path has been optimized for a particular FPGA.

Occasionally a control is a hybrid of the two classes. In these cases it is usually advantageous to keep the internal and external components of the control separate until the point where they are used. This practice sometimes even allows the logic to be partitioned more efficiently.

Shift-Add/Subtract Internal Control

[Figure: a shift-add/subtract pair with the add/subtract control derived from the sign of the Y path, plus a spreadsheet timeline showing the F data, the A/S control flip-flop, the sign extension flip-flop, and the delayed F stream across the word frame (reset bit R, data bits 0-E, sign bit S).]

The only internal control in the shift-add/subtract element is the add/subtract control. The magnitude CORDIC algorithm uses the sign of the Y component from the previous iteration to select add or subtract (the control is inverted on one path, so one adds when the other subtracts).

The sign bit is the last bit presented, since the word is shifted LSB first. A pipeline delay is required to hold the data for the next stage until the sign is available. The length of the delay pipe depends on the word length and the relative delays between the sign pick-off and the point where the control is applied.

The value of the sign affects the processing of the entire data word, so the sign value has to be held. A simple register and multiplexer arrangement satisfies this, but requires an additional external control signal.

I have found that a timeline is a useful aid for ensuring the proper alignment of control and data. It reveals the delays necessary to keep data and control aligned, as well as the nature of the control signals. I use a spreadsheet for the timeline, as it allows quick reproduction of the data stream and easy modifications.

The delay used to shift the G data with respect to the F data can be absorbed into the delay required to obtain the sign. The G data is taken off a tap in the delay pipe an appropriate number of bits ahead of the F output; the tap is different for each stage in the pipeline. Pipeline reductions like this are frequently possible with the proper selection of control pick-offs. A matching delay is required on the X data path to keep the X and Y data aligned.

Page 16: Building a High Performance Bit Serial Processor in an FPGA

Device Selection

• Registers and IO
• Speed
• Cell architecture - fine grain superior
• Storage technology - RAM
• Tools & availability

Any FPGA is suited to bit serial design. The first consideration is whether there are sufficient logic and I/O resources to support the design. The minimum number of registers needed can be obtained from the groundwork we've already done. In most cases, the logic ahead of each flip-flop is fairly minimal or can be broken into a couple of pipeline stages. Most bit serial designs have very low routing complexity, so the routing resources are not an issue.

For high performance bit serial designs, the cell and routing delays play a very significant role in the success or failure of a design. Examining the timing published for a one bit adder or a similar macro in a vendor's library will give a ballpark figure for the speeds you can expect from each device.

Finer grain architectures tend to be superior for bit serial designs because they offer a higher register to logic ratio. Serial designs tend to have many registers with little or no combinatorial logic in front of them, so much of the logic in coarse grained arrays goes unused. One caveat to using fine grained devices is that some seemingly simple logic functions may require several cells to implement, or may require changes to tailor the algorithm to the logic. Look-up table devices are the easiest to use, but watch the register counts and resist the temptation to use all of the inputs to a cell.

The RAM based FPGAs are a boon to in-circuit testability. The fault isolation requirement for this processor is easily met by loading a test function into the FPGA in place of the vector magnitude. The drawbacks to RAM based parts are that they need to be loaded from an external device or bus before they do anything, and that the device programs are not secure from reverse engineering.

A final consideration in part selection is the price and availability of the part and the usefulness of the tools. The price and availability of the parts are beyond the scope of this paper. For high performance work, the tools must at a minimum include relatively painless floorplanning, an accurate static timing analyzer that operates on the routed design, and the ability to inspect and directly manipulate the placement and routing. A good macro generation facility that allows hand place and route is also high on the should-have list.

Based on these considerations, as well as some external influences, we chose the National Semiconductor CLAy 31™ for the vector magnitude processor.

National Semiconductor CLAy 31

• 3136 registers

• 108 I/O

• ~2ns / cell

[Figure: CLAy 31 cell, showing the A and B direct-connect wires and the local bus (L) connections.]

The CLAy 31 cell is basically a half adder with a register on the sum output. It contains a little extra logic to allow configuration as a 2 input mux and a few other basic functions. Each cell has two direct connections, via the A and B wires, to each of its four nearest neighbors, and connections to horizontal and vertical local busses (L). This cell architecture is extremely well suited to bit serial processors.

A basic serial adder can be made using four cells, although a six cell variant is more routable. The six cell adder has a worst case clock to clock time of 11.5 ns. Pipelining the serial adder reduces the minimum clock period to 5.8 ns.

One of the strengths of the CLAy 31 is the sheer number of registers it contains (3136 cells, with a flip-flop per cell). I've done a 27 tap bit serial correlator [2] in this part, and it is possible to fit a serial FIR filter with over 50 taps.

Page 17: Building a High Performance Bit Serial Processor in an FPGA

High Performance Tune-Up

• Synchronous design
• Use hierarchy
• Alternate logic realizations
• Pipeline to break long delays
• Duplicate logic to reduce fanouts
• Handcraft place & route

The serial architecture derived so far is already well matched to nearly any FPGA. The biggest headaches with FPGAs are usually caused by routing complexity exceeding the available resources; a serial design has minimal routing by nature. Still, several things can be done to tune the design for better results.

First, as I mentioned earlier, the entire design should be synchronous, including the sets and resets. If you have followed my advice so far, this is already taken care of.

Use hierarchy in the design. This keeps the design organized, makes it easier to select macro boundaries, and speeds the design cycle. Some of the automatic place and route tools also look at the hierarchy for hints when placing the design.

Some logic may not map efficiently into the cell architecture of FPGAs that do not use look-up tables. The National Semiconductor device, for example, does not implement OR gates very well (the OR gate in the library must connect one input to a local bus). Look for alternate logic realizations that fit the architecture better.

Add pipeline registers to break up long delays. Sometimes rearranging the order in which signals are combined in combinatorial logic will permit better partitioning. When adding pipeline delays in a path, remember to add an equal number in any parallel paths, including the control timing. A frequently missed source of long delays is the routing; the clock to clock timing can often be improved by "pipelining the wires". This is usually not obvious from the schematics.

Serial designs normally do not have very high fan-outs in the data path. Be aware, though, that even small fan-outs (as low as three) can affect timing enough to ruin some of the fastest designs. Eliminate high fan-outs by duplicating the signals at the source or in a pipelined distribution tree. The external controls are especially susceptible to high fan-out. On those controls, I prefer to pipeline the control, taking taps off the pipe at the correct phase closest to the destination. Doing this will usually reduce the master timing generator to a very simple ring counter.

On very high performance designs it is absolutely critical to understand how the logic fits into the cells and what interconnect is used. I strongly recommend careful floorplanning at this stage. A time saving technique is to design macros for the subfunctions, then place and route them as hard or relationally placed macros. In the design of such pieces, awareness of the locations of the I/O relative to the adjacent functions goes a long way toward a good design. There are always those who will frown on this level of detail, but the human mind is still the best place and route tool there is, and the layout is the most critical link to maximum performance.

Serial processors are generally compact enough that, in cases where a single processor does not meet the required cycle time, additional processors can be used in a parallel interleave fashion. Algorithms that operate on a block of data rather than on individual data points are generally not well suited to parallel interleave acceleration.

Page 18: Building a High Performance Bit Serial Processor in an FPGA

Finished Shift-Add/Subtract

[Figure: schematic of the finished shift-add/subtract element, built from registered cells (FD), registered half adders (FDHA), registered muxes (FDMUX), and XOR cells, with the F and G inputs, the synchronous reset controls (RS, RNF), the sign extension control (SX), and the add/subtract control outputs (ASO, ASNO).]

The finished design of the shift-add/subtract element incorporates a number of the performance tune-ups. None of these improvements would have been apparent without floorplanning.

The design shown has the clock and asynchronous resets omitted for clarity. The asynchronous resets are only for power-up reset. The processor is initialized synchronously using the external controls RNF, RNG (not shown) and RS at each stage.

The traditional OR gate in the carry logic of a full adder has been replaced with an exclusive-OR (the 1-1 input condition is impossible in this case), yielding a much faster and easier to route design. This is a case where the device architecture influences the design.

The single level design of the full adder required a minimum clock to clock time of 11.5 ns. Adding a pipeline register halves the minimum period to 5.8 ns. The full adder inputs had to be swapped around to avoid putting the pipeline register in the carry feedback loop (the non-pipelined six cell design adds the fed-back carry at the first half adder).

The inverted add/subtract control was originally obtained from the AS flip-flop through an inverter. The long route, combined with the multiple loads on the AS signal, caused the delay to exceed the target value of 8 ns. Adding registers, essentially pipelining the wires, brought those delays in line. The delay queue was extended to keep the data and control aligned.

I handcrafted a placed and routed macro containing a pair of the adder-subtractors and the A/S control. Each stage of the processor consists of one instance of this macro, the external control logic (a macro), and two delay queues. By doing the majority of the work in reusable macros, the entire design was completed inside two weeks.

External Controls

• Use a Timeline

• Route controls perpendicular to data

• Keep fanouts low

• Generate control with pipeline

When the data path logic is laid out, run the external controls on long lines or busses extending perpendicular to the general data flow. If the desired control signal connects to more than one column (assuming the data is flowing across), split it into separate controls. In cases where the control has multiple loads in the same column, it may be worthwhile splitting the load to reduce the delay due to fanout.

The design of the external control pipeline should wait until after the data path design has been completed, including layout. Again, a timeline showing the data flow through the processor is invaluable for deriving the controls. The external controls are periodic and should be easy to generate. Use the timeline to find the timing relationships between every external control signal and the data; this provides the phasing information needed to build the control pipeline.

Higher performance can be obtained from the design if the control is treated as a pipeline, where a stimulus is injected into the front and then ripples down the pipe, providing the control as it propagates. This avoids long routing and allows the control to be distributed. The stimulus in this scheme can often be the output of a ring or LFSR counter.
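
A behavioral model of such a rippling control pipeline (illustrative names and parameters, not the actual design) shows how a single ring counter pulse, re-registered down the pipe, delivers each stage's control at the right phase:

```python
# Sketch: a one-hot ring counter injects one pulse per word time; the
# pulse is re-registered down a pipe, and each stage taps the pipe at
# the depth matching its pipelined data's phase.
def control_taps(word_len, n_stages, delay_per_stage, n_cycles):
    ring = [1] + [0] * (word_len - 1)          # ring counter stimulus
    pipe = [0] * (n_stages * delay_per_stage)  # control pipeline registers
    taps = [[] for _ in range(n_stages)]
    for _ in range(n_cycles):
        stim = ring[0]
        ring = ring[1:] + ring[:1]             # rotate the ring counter
        pipe = [stim] + pipe[:-1]              # clock the control pipe
        for s in range(n_stages):
            taps[s].append(pipe[s * delay_per_stage])
    return taps

taps = control_taps(17, 3, 4, 40)
# stage s sees the pulse delay_per_stage*s cycles later, with period 17
```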

Page 19: Building a High Performance Bit Serial Processor in an FPGA

External Control Pipe Section

[Figure: one section of the external control pipe, with reset inputs and outputs (RSI/RSO, RNI/RNO) and sign extension stimulus (SI/SO) generating the RNG, RNF, SX and RS controls, plus a timeline of the RS, RNF/RNG and SX signals for stages 2 through 4 showing the SX pulse lengthening at each stage.]

The external control is provided by a pipeline. The top pipeline provides the reset signals corresponding to the reset bit time in the data; these are phased to match the data everywhere they are needed. The bottom pipeline generates the sign extension control, which grows one bit time longer in each subsequent pipe section. This pipe section is repeated for each CORDIC stage. The first stage input is derived from the parallel-to-serial shift register load pulse. The three outputs on the right connect to the corresponding inputs of the next stage. The layout of the control pipe is designed to match the four cell width of the shift-add/subtract element and delay queues.

Design Verification

• Functional simulation - Viewlogic
– No delays
– Check proper function

• Static timing - FPGA timing tool
– Max register to register delays
– Indicates maximum bit clock

The first step in design verification is functional simulation. Use Viewlogic's Viewsim or an equivalent logic simulator. Don't bother modelling the delays; the goal here is to verify proper function. Use the same vectors used with the C model so that the model outputs can be compared with the simulation results. The simulation data set should minimally cover all of the boundary conditions and all possible combinations of the signs of the input vectors. In the case of the CORDIC processor, a minimal set would include vectors from all four quadrants with all combinations of maximum and zero magnitudes.

After the function is verified, the timing should be checked to make sure the specification is met. If you've done your homework in the design phase, you will already have a good feel for where the timing lies. Nevertheless, a thorough static analysis should be done to verify the system will run under all conditions. Most of the FPGA tools now include timing analyzers that provide worst case delays between the various elements of the design. Be cautious of the timing reports from some of the tools; the tool setup can be complicated and it is easy to miss some paths. Also, make sure the tool is set up for worst case timing, not typical or best case. A hierarchical design using prerouted macros saves a lot of effort in the timing analysis.

The Atmel tools indicate the entire CORDIC processor has a worst case register to register delay of 8 ns in the -2 part. For the 17 bit data word (16 data bits plus the reset bit), this means a 136 ns word cycle time, so a single processor does not meet the requirement. A CORDIC pipeline processor only occupies about 25 percent of the FPGA, including the I/O shift registers. To boost the apparent performance, I used two instances of the processor, interleaving the data so that the apparent minimum cycle is reduced to 68 ns. While there are enough unused logic cells to allow up to four pipelines, the shape of the completed processor is such that only two parallel processors will fit. The design could probably be laid out differently so that at least three and maybe four pipes would fit with similar performance; this design was purposely done with tall, narrow cells to leave space for post processing. If symmetric sampling at the input and/or output is required for all phases, the precision of the processor can be increased to make the cycle time evenly divisible.

I am not a fan of timing simulation for synchronous designs. There are almost always timing situations which are either not set up properly and not detected, or which are not recognized and therefore not simulated. A thorough static analysis will uncover all the potential timing problems.

Page 20: Building a High Performance Bit Serial Processor in an FPGA

Design Assessment

• 30 x 22 cells per processor pipe
• 7.35 MHz data (14.7 MHz w/ 2 pipes)
• 80% of FPGA left usable

The CORDIC processor studied in this paper occupies a rectangular area 22 cells high and 30 cells wide, not including the output rounding and the I/O shift registers. Nearly 80% of the device is left open for the I/O and other uses. The design uses a remarkably high 67% of the cells it covers for logic, and less than 2% of that logic is unregistered. The majority of the remaining cells under the macro are used for wiring.

A single instance of the pipeline can handle word data rates up to 7.35 MHz under worst case conditions. Adding a second pipeline in a parallel interleave arrangement doubles the maximum data rate to 14.7 MHz, and uses a little over 42% of the device.

The pipeline latency of the CORDIC processor is 148 bit clock cycles, which translates to a latency of less than 9 word times. An equivalent parallel processor would need more than 9 pipeline delays to achieve performance any better than the serial processor.

Parallel Comparison

• Upper limit is about 20 MHz
• Adders alone occupy 70% of device
• Not routable
• Minimum 15 clock latency

An estimate of the size and performance of an equivalent parallel processor in the same device can be made using the characteristics of a 16 bit adder. The best performing 16 bit adder in this device has a maximum clock frequency of 20.5 MHz, assuming the inputs are registered right at the adder. That adder fully occupies an 8 x 16 cell rectangular area, and the 17 adders needed in this design alone would cover 70% of the available area in the array. A compact 16 bit adder can be realized in half the area with a 2 x 32 cell footprint, but its performance limits the clock to a mere 13.2 MHz, assuming registered inputs and outputs.

The wiring of the cross terms in the processor requires an abundance of long routes. The routing resources of the FPGA are too limited to simultaneously provide the needed connections and the required logic density; the parallel realization of the processor will not route in this part. Even if it did route, the performance would not be much higher than can be achieved with a serial processor!

With special care and low logic utilization, the parallel design looks like it would be feasible in the largest Xilinx 4K parts. Even in those parts, however, the best performance would only be about 20 MHz, based on the 16 bit adder performance!

Page 21: Building a High Performance Bit Serial Processor in an FPGA

Vector Magnitude Applications

• Extracting polar data from FFT
• Digital coherent receivers
• Robotics and positioning
• Medical, radar and sonar imaging

The CORDIC vector magnitude has application wherever a conversion from cartesian to polar space is required. Potential applications include FFT post processing, digital receivers, motion control and sensing in robotics, and most imaging applications. Minor modifications to the CORDIC processor presented here can provide trigonometric functions of input angles, their inverses, hyperbolic trig functions and their inverses, and exponentials and logarithms, among other functions.

Bit Serial Processor Applications

• Audio signal processing
• Robotics and positioning
• Medical, radar and sonar imaging
• Encryption
• Code conversions

Bit serial processing is applicable in nearly any system where the required data rate is achievable by a serial system. The potential for enormous reductions in hardware makes bit serial design attractive anywhere more bang for the buck is needed. Prime applications are audio, motion control, imaging, encryption, and code conversions.

Bit Serial Designs Work

• Compact

• Efficient

• Fast

• Easy

In this paper I used a real world example from a radar application to demonstrate the techniques used to create a high performance bit serial processor. The simple interconnect and compact logic allow these processors to significantly exceed the processing power per gate of equivalent parallel designs. High speed design techniques make bit serial a viable alternative even in moderately high performance systems!

Page 22: Building a High Performance Bit Serial Processor in an FPGA

Recommended Resources

• Tools
– Viewlogic Workview Plus
– Point tools for selected FPGA
– C compiler

• Other resources
– VLSI Signal Processing: A Bit Serial Approach, P. Denyer & D. Renshaw
– Vendor application notes

• Consulting services, etc.
– Andraka Consulting Group (401/884-7930), [email protected]

[1] J.E. Volder, "Binary Computation Algorithms for Coordinate Rotation and Function Generation," Convair Report IAR-1 148, Aeroelectronics Group, June 1956.
[2] R.J. Andraka, "FIR Filter Fits in an FPGA Using a Bit Serial Approach," Proceedings of the EE Times 3rd Annual PLD Design Conference and Exhibit, March 1993.
[3] C.R. Rupp, "Digital Function Analysis and Synthesis," Course notes for 16.674, University of Massachusetts at Lowell, 1991.
[4] P. Denyer & D. Renshaw, "VLSI Signal Processing: A Bit Serial Approach. Selected Readings," Addison-Wesley, 1985.

CLAy is a registered trademark of National Semiconductor Corporation (CLAy is an acronym for Configurable Logic Array).

Viewsim and Workview Plus are registered trademarks of Viewlogic Systems, Inc.

Xilinx XC4025, XC4013 and XC4010 are registered trademarks of Xilinx.