Final Report


A Polyphase Filter Bank Structure for a Space-based Radar System

Kevin Camera
Changchun Shi

University of California, Berkeley
Department of Electrical Engineering and Computer Science

Final Project Report
EE225C, Fall 2000

Prof. Borivoje Nikolic
Prof. Bob Brodersen

Abstract

This project’s goal was to develop a polyphase filter bank system for use in a space-based radar receiver. The system is composed of 32 FIR filters which are fed by a delayed and downsampled input stream. Each filter passes its output to a 32-point FFT which separates the signal into 32 subchannels. The unique feature of this approach is the downsampling of the input stream before the filters, which reduces both the number of filter taps and the clock rate of the system. Architectural exploration of the components yielded a design with 9-tap transposed-form FIR filters and a fully parallel FFT. At the conclusion of the project, the system occupied 11.98 mm² and dissipated 200 mW at 2.5 V with a total combinational logic delay of 27.73 ns (36.1 MHz). Arbitrarily high throughput can be achieved with pipelining, since all components are feed-forward. Simulink simulation showed that the total SNR through the system is about 50 dB with full-scale randomized input signals.


Introduction

Anyone active in our field (and nowadays, one out of two people outside our field) is keenly aware of Moore’s Law, the observation that the number of transistors that can be integrated on a single die doubles every 18 months. This is a direct result of the reduced feature size of silicon circuit structures, which has the additional side effect of making circuits faster and less power-hungry. So what does this statement of the obvious have to do with VLSI digital signal processing? It means that new applications are continuously being made possible that were previously beyond the computational capability of past devices. Algorithms stranded in the realm of theory for years can now be developed, tested, and used in everyday life. It’s definitely an exciting time to be a designer.

Among the hottest areas of interest today is wireless communications. The fast pace of high-tech life often does not afford the luxury of finding a power outlet, telephone line, and network port to stay in touch with others. Communicating without wires unfortunately requires a lot more work from the electronics, since the air waves are much more prone to interference and unpredictable attenuation than copper wires. This places additional burden on the signal processing components, which must overcome these obstacles and provide a reliable exchange of information despite these effects.

The focus of this project will be a specific application not directly related to end-user communication, but rather to the reception of radar signals in space. However, the system structure designed here (to our knowledge, for the first time) can be applied equally well to more tangible commercial devices. This paper will present the details of the development process and performance of a space-based radar receiver, consisting of a finite impulse response (FIR) filter bank and a fast Fourier transform (FFT). We will first discuss the motivation for looking at this problem, followed by an analysis of existing solutions and their tradeoffs. We then present an overview of the system functionality, including the design decisions and implementation details of both the filter bank and FFT. In the final three sections of the paper, we present the simulation results of the completed system, an analysis of what the results mean in terms of communication performance, and some concluding remarks on the lessons of this project and the direction of future work.

Motivation

Fortunately, a graduate student’s research advisor is an excellent source of motivation, and there was no exception in the case of this work. Prof. Bob Brodersen suggested the project to us as a good match for the course material, and as one that would also assist a DARPA-funded research effort charged with specifying a system for the space radar application. In conversations with the system researchers, they mentioned that a parameterizable hardware implementation of the polyphase filter bank would be extremely helpful in reaching their goal more rapidly. As this course project was intended to produce a parameterizable and reusable signal processing core, the work was mutually beneficial, which increased the project’s value.

With the implementation we define in this paper, many of the structural characteristics of the hardware are configurable through high-level parameters. Thus if the algorithm designers determine that the input word length is insufficient to achieve a certain bit error rate, or that more branches are needed in the filter bank to produce more subchannels, they need only change two variables and resynthesize the hardware to evaluate the new system parameters. Since we are also unaware of any previous implementations of this filter bank structure, this project can be a valuable resource to the signal processing community.

Existing Work

While filter banks have received some attention in the theoretical world (such as in []), there are no known implementations at the time of writing. Therefore, we have no comparisons to make at the system level. However, the two primary components, FIR filters and FFTs, are very well known and have been realized in many forms.

The FIR filter is one of the most well known and frequently used components in signal processing. The most common types of FIR filters are the direct and transposed forms. While the transposed form typically reduces the critical path by retiming delays into the adder chain, it also suffers from significant fan-out on the input, which must be applied to all taps of the filter at once. Any architecture more novel than these two examples is not relevant to this application, as this project does not impose any uniquely difficult requirements on the FIR filters (the core clock rate is only 15 MHz thanks to downsampling at the input). However, because the application uses a bank of filters, the system will require 32 FIRs. Therefore the area and power cost of each filter will be the most important metrics to minimize.
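As a rough illustration of the two forms discussed above, the following Python sketch models a direct-form and a transposed-form FIR filter sample by sample and checks that they produce the same output. The function names and the example coefficients are made up for this illustration, and the model says nothing about hardware area, power, or timing.

```python
def fir_direct(x, h):
    """Direct form: a tapped delay line on the input feeds all multipliers,
    and the products are summed at the output."""
    delay = [0] * len(h)              # delay[k] holds x[n-k]
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]
        y.append(sum(c * d for c, d in zip(h, delay)))
    return y


def fir_transposed(x, h):
    """Transposed form: the input is broadcast to every multiplier and the
    registers sit between the adders."""
    taps = len(h)
    state = [0] * (taps - 1)          # registers in the adder chain
    y = []
    for sample in x:
        products = [c * sample for c in h]
        y.append(products[0] + (state[0] if state else 0))
        # Register k takes product k+1 plus the old value of register k+1.
        state = [products[k + 1] + (state[k + 1] if k + 1 < taps - 1 else 0)
                 for k in range(taps - 1)]
    return y


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 0, 0]
    h = [1, 2, 3]                     # illustrative coefficients only
    print(fir_direct(x, h))
    print(fir_transposed(x, h))
```

In the transposed model every product depends on the current input sample, which is the broadcast that causes the input fan-out mentioned above, while each register-to-register path contains only one multiply and one add.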

The FFT is also a very common processing element, especially in recent multiple antenna and OFDM systems. As such, much research has been conducted on FFT architectures and their relative advantages. The three most common approaches are parallel, column, and pipelined. The column architecture requires O(N) processing elements, where N is the number of points in the FFT, and time multiplexes them log(N) times for each FFT iteration. The pipelined approach is the transposition of this: O[log(N)] processing elements are time multiplexed N times for each iteration. Since the size of the FFT in this application is a mere 32 points, the reduced throughput of these approaches was not worth the area savings. Therefore, only the fully parallel architecture was investigated, which obtains arbitrarily large throughput (achievable by pipelining) at an O[N log(N)] cost in hardware.

System Summary

A critically sampled FFT filter bank structure for use in a space radar system is to be built using a polyphase implementation. The filter bank approach provides an economical way to separate data in the frequency domain. The input x(n) to the filter bank comes from a 480 MHz, 12-bit A/D converter, which operates on a purely real analog IF signal. The sampled signal first goes through serial-to-parallel conversion, which is realized by a 1-to-M demux operation. As shown in Figure 1, this can be realized by using unit delays followed by M-fold downsampling. In our design, the value of M will initially be fixed at 32. Thus the 12-bit, 480 MHz input stream is converted to a set of 32 12-bit words, each generated at a rate of 15 MHz (derived from 480 MHz / 32).


Figure 1: Polyphase implementation of the FFT filter bank
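The demux stage described above can be sketched in a few lines of Python. This is only a behavioral model under the assumption that branch i sees the input delayed by i samples before M-fold downsampling; the function name `polyphase_demux` is hypothetical and does not come from the project files.

```python
def polyphase_demux(x, M):
    """Split x into M branch streams. Branch i is x delayed by i samples
    (zeros shifted in first) and then downsampled by M."""
    branches = []
    for i in range(M):
        delayed = [0] * i + list(x)
        branches.append(delayed[0:len(x):M])
    return branches


if __name__ == "__main__":
    # Small example with M = 4 so the sample ordering is easy to follow.
    for i, branch in enumerate(polyphase_demux(list(range(1, 13)), 4)):
        print("branch", i, branch)

    # Branch word rate for the parameters given in the text.
    input_rate_hz = 480e6
    M = 32
    print("branch rate (MHz):", input_rate_hz / M / 1e6)
```

Each branch receives one word for every M input samples, which is where the 15 MHz branch rate comes from.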

Each of the 32 word-wide outputs of the demux stage goes into an FIR filter whose transfer function is given by Ei(z), i=0…M-1. These filters are purely real and will accept the number of taps as a parameter. The bit widths and accumulator size will also be parameterizable for reuse by other groups. Next, the 32-word vector, each element being the output of an FIR filter, enters an FFT to perform a discrete Fourier transform. At this point, these 32 sub-band outputs xi(n) are ready to be modulated and transmitted via a wireless channel. The underlying theory to support this polyphase implementation is sketched below:

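Let $X_k(z)$ and $X(z)$ be the z-transforms of the output $x_k(n)$ and the input $x(n)$. Then, with the downsamplers moved to the branch outputs and $W_M = e^{-j2\pi/M}$ (the forward DFT convention),

$$X_k(z) = \left[\sum_{i=0}^{M-1} W_M^{ki}\, z^{-i}\, E_i(z^M)\right] X(z) = H_k(z)\, X(z)$$

Equation 1

So $H_k(z)$ is defined to be the z-domain transfer function from x(n) to xk(n), and consequently $H_k(e^{j\omega})$ is the frequency response from the input to the kth output.

Therefore in order to realize a transfer function $H_0(z) = \sum_{n=0}^{ML-1} h(n)\, z^{-n}$, where ML is the number of taps, one just needs to have $E_i(z) = \sum_{l=0}^{L-1} h(lM+i)\, z^{-l}$ for i = 0…M-1.

Furthermore, by doing so, Equation 1 guarantees that $H_k(z) = H_0(z W_M^{-k})$ is automatically realized, where $H_k(e^{j\omega}) = H_0(e^{j(\omega + 2\pi k/M)})$ is a shifted version of $H_0(e^{j\omega})$ with shift $2\pi k/M$.

Figure 2

Figure 2 above shows an example where four sub-bands with the transfer functions shown need to be designed. Due to the nice feature of polyphase filter banks described above, one just needs to design H0(z) and then place its coefficients in the M filters Ei(z), followed by an FFT. This greatly reduces computational complexity compared to the traditional way of designing each Hk(z) separately.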

Being the reverse process of the transmitter, the receiver utilizes an inverse FFT followed by M synthesis FIR filters. An M-to-1 multiplexer finally collects the recovered x’(n), which ideally is equal to x(n) for perfect reconstruction.
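The polyphase identity behind the analysis bank can be checked numerically. The sketch below is an assumption-laden illustration, not the project's Simulink or Module Compiler code: it assumes a forward DFT with W = exp(-j2π/M), takes branch filter i to hold the prototype coefficients h(i), h(i+M), h(i+2M), …, and compares the polyphase structure against filtering with each modulated filter at the full rate followed by downsampling. All function names are hypothetical.

```python
import cmath


def fir(x, h):
    """Causal FIR filter with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def polyphase_components(h, M):
    """Branch filter i holds h[i], h[i+M], h[i+2M], ..."""
    return [h[i::M] for i in range(M)]


def dft_bank_polyphase(x, h, M):
    """Delay and downsample, filter each branch at the low rate, then take
    a DFT across the M branch outputs."""
    W = cmath.exp(-2j * cmath.pi / M)
    branches = []
    for i, e in enumerate(polyphase_components(h, M)):
        delayed = [0] * i + list(x)
        branches.append(fir(delayed[0:len(x):M], e))
    frames = len(branches[0])
    return [[sum(W ** (k * i) * branches[i][m] for i in range(M))
             for m in range(frames)] for k in range(M)]


def dft_bank_direct(x, h, M):
    """Reference: filter with h(n) * W^(k*n) at the full rate, then keep
    every M-th output sample."""
    W = cmath.exp(-2j * cmath.pi / M)
    return [fir(x, [h[n] * W ** (k * n) for n in range(len(h))])[::M]
            for k in range(M)]


if __name__ == "__main__":
    x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8]
    h = list(range(1, 13))            # illustrative prototype, 12 taps
    a = dft_bank_polyphase(x, h, 4)
    b = dft_bank_direct(x, h, 4)
    print(max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)))
```

The polyphase version runs every multiplication at the downsampled rate and designs only one prototype filter, which is the saving the text describes.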

FIR Filter Bank

With default parameters, the filter bank contains 32 branch filters, and each filter contains 9 taps. Both the number of branches and the number of taps are parameterizable. However, to make the illustration clearer, the M=4, taps=3 case is used in this report. Since the Simulink, Module Compiler (MC), and VHDL test bench files are all designed to be fully parameterizable (TB.vhd is generated from the Matlab script TB_gen.m), handling more complex systems is only a matter of increased simulation time.

Two realizations of the filter bank were designed and tested in the fixed-point domain in Simulink. As shown in Figure 3b, system 1 consists of a delay and downsample chain that shifts the phase for each of the M branches, followed by filters operating at a rate M times slower than the input data rate. The alternative approach, system 2, is depicted in Figure 3c. A single-input, multiple-output filter bank processes the input data at the full input rate, followed by an M-fold downsampler for each branch. For these two systems to have the same functionality, their coefficients must be directly related. For example, if in system 1 the branch filters have coefficients given by (Branch 1: [1 2 3], …, Branch 4: [1 5 2]), then the equivalent system 2 has coefficients (Branch 1: [1 0 0 0 2 0 0 0 3 0 0 0], …, Branch 4: [0 0 0 1 0 0 0 5 0 0 0 2]). A proof and detailed discussion can be found in [2]-[5].

a) Final comparison block diagram

b) Filter bank system block diagram, type 2 (for 4 branches)

c) Filter bank system block diagram, type 1 (fully parametrizable)

Figure 3: Block Description
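The coefficient relationship in the example above can be checked with a short Python sketch. It is a behavioral illustration only: the helper names are hypothetical, branches are indexed from 0 in the code (so Branch 1 in the text is i = 0), and the input sequence is arbitrary.

```python
def fir(x, h):
    """Causal FIR filter with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def expand_coefficients(e, i, M):
    """System 2 coefficients for branch i: tap l of the branch filter is
    placed at position i + l*M, with zeros everywhere else."""
    g = [0] * (len(e) * M)
    for l, c in enumerate(e):
        g[i + l * M] = c
    return g


def system1_branch(x, e, i, M):
    """Delay by i, downsample by M, then filter at the low rate."""
    delayed = [0] * i + list(x)
    return fir(delayed[0:len(x):M], e)


def system2_branch(x, e, i, M):
    """Filter at the full rate with the expanded coefficients, then
    downsample by M."""
    return fir(x, expand_coefficients(e, i, M))[::M]


if __name__ == "__main__":
    x = list(range(1, 17))
    print(expand_coefficients([1, 2, 3], 0, 4))   # Branch 1 in the text
    print(expand_coefficients([1, 5, 2], 3, 4))   # Branch 4 in the text
    print(system1_branch(x, [1, 5, 2], 3, 4) == system2_branch(x, [1, 5, 2], 3, 4))
```

System 1 does the same arithmetic on one sample in every M, which is why its filters can be clocked M times slower.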

Furthermore, a floating-point Simulink realization similar to system 2 was also designed to study the finite word length effects of the fixed-point systems. All three systems are combined in Figure 3a for testing.

Although system 1 and system 2 described above have the same functionality, system 1 has a much lower speed requirement in implementation because its filters operate at a frequency M times lower than those of system 2. Therefore only system 1 is implemented in Module Compiler as the global architecture. In Module Compiler, the following micro-architectures are explored:

Operator merge (e.g. merging a MAC into one operator to increase speed and decrease area)
Carry save (using the carry-save method to shorten the critical path delay)
Booth encoding (using Booth encoding to shorten the critical path delay of the multipliers)
Carry look-ahead full adder (employing carry look-ahead adders)

Since the filter coefficients may not be constant, multipliers are used instead of a shift-add chain. Both the direct form and the transposed form FIR filter are considered for each of the combinations above. The FIR Hardware Results table in the Appendix summarizes the Module Compiler results. Comparisons among the different architectures are based on a 9-tap filter for each of the 32 branches. Nine taps are enough for the radar system in which the filter bank is used (9 taps and 32 branches give an equivalent filter with 9 × 32 = 288 taps).

As one can see, the direct form FIR filter in general has lower area and power consumption (unless both opmerge and cs are off), due to the large number of multipliers and adders at the output that opmerge and cs can optimize. The direct form should therefore be used only when opmerge, cs, or both are turned on. For the transposed form FIR filter, the key to saving power and area without losing speed is to use the carry-save method. In fact, the table shows that with carry save on, the other micro-architecture choices make no real difference.

Comparing the direct and transposed forms, one can see that at the same power consumption the transposed form is faster (due to its naturally pipelined structure), while the direct form has less area. So if power is the main concern with the speed set to 15 MHz, the transposed form architecture with carry save is the choice in the class technology.


A block diagram of the transposed form FIR with carry-save is depicted in Figure 4.

Parallel FFT

This block performs a 32-point Fast Fourier Transform on the input sequence. The FFT is an efficient algorithm for computing the discrete Fourier transform which exploits symmetry and periodicity in the complex coefficients to minimize the amount of computation required. As a 32-point FFT, it takes 32 consecutive values from an input sequence and calculates 32 points of the Fourier transform across one period.

Each of the inputs to the FFT is generated by one branch of the filter bank structure. As described above, the 12-bit 480 MHz input stream to the system is interleaved and downsampled at the filter bank, such that the FFT inputs are 12 bits at only 15 MHz. The word width is parameterizable to allow precision adjustments as needed. However, since the A/D limits us to 12 bits, precision can only be gained in the form of increased range and less truncation throughout the datapath. The width of the coefficients is specified separately as its own parameter. The description also allows both the amount of truncation that occurs at each stage in the FFT and the presence of saturation to be passed as parameters. The truncation value determines how many bits the output at each stage is shifted to the right. The saturation value is a string which can only be set to “on” or “off”, and determines whether the outputs of each stage are saturated or allowed to wrap around. All partial results between stages of the FFT are computed with additional fractional bits of precision (the default is 3). Thus, for an input width of 12 bits, all internal results have 15 bits (12 bits above the binary point and 3 bits below). The final output simply strips off the additional fractional bits such that the result is scaled to the same precision as the inputs.
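The truncation and saturation behavior described above can be modeled with integer arithmetic. The sketch below is a minimal software model, assuming two's-complement words; the parameter names `shamt` and `sat` follow the parameter list in the Appendix, while the helper functions themselves are hypothetical and are not taken from the Module Compiler description.

```python
def wrap(value, width):
    """Two's-complement wrap-around to `width` bits."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def saturate(value, width):
    """Clamp to the range representable in `width` signed bits."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(lo, min(hi, value))


def stage_output(value, shamt, width, sat):
    """Right-shift by `shamt` bits (truncation), then fit the result into
    `width` bits by saturating (sat == "on") or wrapping (sat == "off")."""
    shifted = value >> shamt          # arithmetic shift, rounds toward -inf
    return saturate(shifted, width) if sat == "on" else wrap(shifted, width)


if __name__ == "__main__":
    # A stage result too large for 12 bits after a 1-bit shift.
    print(stage_output(5000, 1, 12, "on"))
    print(stage_output(5000, 1, 12, "off"))

    # Extra internal precision: 3 fractional bits are added at the input
    # and stripped again at the output.
    prec = 3
    internal = 1234 << prec
    print(internal >> prec)
```

With saturation off, an overflow flips the sign of the result, whereas saturation limits the error to the clipped amount at the cost of extra comparison logic in every stage.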

Figure 4: Block diagram of transposed FIR

The architecture implemented is fully parallel. This architecture was chosen for its simplicity (to facilitate learning about the FFT itself), and as a suitable base case for comparison. In addition, as mentioned above in the explanation of existing work, the 32-point FFT is small enough that a fully parallel implementation is well worth the area cost to achieve almost unbounded throughput via pipelining. Column and pipelined approaches, while very area efficient for large FFTs, limit the throughput by time multiplexing the reduced hardware. The parallel FFT requires one complex multiplication (four real multiplications and two real additions each) and two complex additions (two real additions each) for each butterfly in the flow graph, as shown in Figure 5. Thus for our 32-point FFT, which has 80 butterflies, a total of 320 real multipliers are needed, along with 320 real adders for the complex additions (not counting the two real additions inside each complex multiplication). This is a direct mapping of the FFT algorithm into hardware.

Figure 5: Signal flow diagram of an FFT butterfly
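The operation counts for the fully parallel FFT follow from the butterfly structure and can be reproduced with a short Python sketch. The decimation-in-time butterfly form used here is an assumption, since the figure itself is not reproduced in this text, and the function names are hypothetical.

```python
def butterfly(a, b, w):
    """Radix-2 butterfly (decimation-in-time form assumed): one complex
    multiplication followed by two complex additions."""
    t = w * b
    return a + t, a - t


def parallel_fft_cost(N):
    """Hardware operator counts for a fully parallel radix-2 FFT of size N
    (N must be a power of two)."""
    stages = N.bit_length() - 1       # log2(N)
    butterflies = (N // 2) * stages
    return {
        "butterflies": butterflies,
        "real_multipliers": 4 * butterflies,
        "real_adders_for_complex_adds": 4 * butterflies,
        "real_adders_inside_complex_mults": 2 * butterflies,
    }


if __name__ == "__main__":
    print(butterfly(1 + 0j, 1 + 0j, 1 + 0j))
    for name, count in parallel_fft_cost(32).items():
        print(name, count)
```

For N = 32 this gives 5 stages of 16 butterflies each, which matches the 320 multipliers quoted above.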

Three iterations of the microarchitecture were implemented to investigate the effects of each improvement. The first used nothing but pure ripple addition for all operations. This was chosen as the absolute simplest design, and one which should have the least area and worst delay. The second design attempted to use the fastest combination of components (see footnote 1): carry-save for all arithmetic, carry-lookahead propagation to generate binary results at the end of each stage, and Booth multipliers to halve the number of partial products. The use of carry-save and carry-lookahead should noticeably increase the area, but this may be partly offset by the reduction in partial products. The final design was not a performance improvement, but rather an accuracy improvement. Saturation was enabled on the third iteration to investigate its effect on the performance of the FFT. The original idea was that it may be essential for the system’s algorithm performance, and its cost in terms of circuit performance should be known for the FFT in advance. It should be noted that an architecture featuring carry-save arithmetic throughout all stages of the FFT was attempted, but obscure errors from Module Compiler prevented it from being realized.

Carry-save with carry-lookahead propagation was much faster than the pure ripple architecture (18.7 ns versus 25.8 ns), with a modest increase in area and power. The surprise was that the addition of saturation to the carry-save architecture resulted in a severe penalty to all features: delay, power, and area. This result reinforced the need to evaluate saturation early, since including it in any system will seriously degrade performance. The extensive system simulation explained in the next sections demonstrates that saturation is not required to achieve a successful implementation of the algorithm, and therefore offers no benefit.

1 Actually, the fast carry lookahead adder was even faster (hence the name), but came with such a severe area penalty that it was not included. The design is not sensitive enough to delay to justify its use.


Figure 6: Serial path through butterfly architectures (parallel paths not shown)

Figure 7: Simulink model of FFT butterfly

In summary, the carry-save arithmetic with no saturation was the clear winner for implementing the parallel FFT. Although we originally intended to let carry-save arithmetic (and consequently the redundant number representation) propagate from the FIR filters all the way through the FFT, Module Compiler errors did not allow us to test this case. The final version of the FFT, complete with additional internal precision and truncation between each stage, was fully verified with Simulink floating-point and fixed-point simulations. For an input width of 12 bits, coefficient width of 8 bits, and internal precision of 15 bits, the finite-precision hardware implementation came within 1.0 of the ideal floating-point FFT for a random input sequence, which is the best accuracy achievable with integer output precision.

Results

Since our blocks are fully parameterizable and a 32-branch simulation takes a long time to run, we have chosen 4 branches to show the essence of the filter bank and FFT design in our results.

Choice of Test Vectors

Test vectors need to check for overflow in the fixed-point domain, show the functionality of the system clearly, and expose the finite word length effects. Following these rules, we have chosen:

Impulse delta(n) (especially useful for analysis in the frequency domain)
Sum of sine waves with various amplitudes
Uniformly distributed random vector

In the following pages of results, the test vector is clearly specified for each test.

FIR Coefficients Determination

We use a raised cosine filter with roll-off = 0.1 and a critically sampled filter bank design. The sidelobes are designed to be below −40 dB. See the following figure for the result. Notice that 109 taps are needed for the whole effective filter. Also, be aware that since each channel output passes through a 4-fold downsampler, the frequency response at each output is expanded by a factor of 4. In our case, all outputs have exactly the same frequency response for an impulse signal.
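For readers who want to experiment with a similar prototype, the textbook raised cosine impulse response can be generated as in the following Python sketch. This is not the gen.m script used in the project, which is not reproduced here. It assumes that the symbol period equals the number of branches (M = 4 samples), applies no windowing, and makes no attempt to verify the sidelobe level; the function name and arguments are hypothetical.

```python
import math


def raised_cosine(ntaps, samples_per_symbol, rolloff):
    """Textbook raised cosine impulse response, centered in the tap vector:
    h(t) = sinc(t) * cos(pi*rolloff*t) / (1 - (2*rolloff*t)^2), t in symbols."""
    center = (ntaps - 1) / 2
    h = []
    for n in range(ntaps):
        t = (n - center) / samples_per_symbol
        sinc = 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)
        denom = 1.0 - (2.0 * rolloff * t) ** 2
        if abs(denom) < 1e-12:
            # Removable singularity at |t| = 1 / (2 * rolloff).
            h.append(math.pi / 4 * sinc)
        else:
            h.append(sinc * math.cos(math.pi * rolloff * t) / denom)
    return h


if __name__ == "__main__":
    M = 4
    h = raised_cosine(109, M, 0.1)
    # Polyphase split: branch i takes h[i], h[i+M], h[i+2M], ...
    for i in range(M):
        print("branch", i, "has", len(h[i::M]), "taps")
```

In a fixed-point design the coefficients would still have to be scaled and quantized to the chosen coefficient width.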

System Description

For the filter bank itself (without the FFT part), we showed two logically equivalent systems in the FIR section earlier. We call them bank system 1 and bank system 2. Each of them has a floating-point and a fixed-point Simulink model. Combined with the FFT block, these give us 2 floating-point systems and 2 fixed-point systems for the whole polyphase filter bank. Moreover, we have another equivalent system described below in Simulink:

Please ask Changchun if you have trouble reading the above plot. He derived this equivalent system himself, although he thinks it should be well known, e.g. [1]-[5].

Our strategy for testing the equivalence of these three systems is to cross-compare them in several test setups, in both the floating-point and fixed-point domains, as shown in test setup 1 (and in test setup 2 later).

Test Setup 1

a) 1st row is system 1 fixed point; 2nd row is system 2 fixed point; 3rd row is system 2 floating point
b) last row of scope shows SNR

Figure 8: Test setup 1


On the following few pages, the scope output for each branch is shown. Each page shows the result corresponding to the input signal at the bottom of the page (with its spectrum also given). Notice that the scope shows absolute values. The second input to each scope corresponds to the difference between system 1 fixed point and system 2 fixed point (strictly zero, as expected). The 4th row shows the difference between system 2 fixed point and system 2 floating point. These differences are within 4, and mostly within 1. All the results are exactly as expected, and they are intuitive to understand. Some of the important results are shown at the bottom of each page.

Input x(n) = 511×sin(π/16·n) + 8×sin((π/16 + π/2)·n) + 8×sin((π/16 + π)·n) + 8×sin((π/16 + 3π/2)·n)
(Plot of H(z) from -2π to 2π with branches 1 to 4 marked. Red: branch number; blue: signal.)

Input x(n) = 8×sin(π/16·n)
(Plot of H(z) from -2π to 2π with branches 1 to 4 marked. Red: branch number; green: signal.)

Input x(n) = 511×sin((π/16 + π/2)·n)
(Plot of H(z) from -2π to 2π with branches 1 to 4 marked. Red: branch number; green: signal.)

Input x(n) = 511×sin((π/16 + 3π/2)·n)
(Plot of H(z) from -2π to 2π with branches 1 to 4 marked. Red: branch number; green: signal.)

Input x(n) = 511×sin((π/16 + π)·n)
(Plot of H(z) from -2π to 2π with branches 1 to 4 marked. Red: branch number; green: signal.)

Input x(n) = Uniform Random(-2000~2000)

SNR due to truncation (not including ADC and finite WL coefficients) is ~50-60 dB.

– Obtaining SNR due to finite WL coefficients.
– Obtaining SNR due to ADC noise at input.
– Tests the functionality of system-3.

Test setup 2


Input x(n) = Uniform Random(-2000~2000)

Output SNR due to finite-precision coefficients alone is ~65 dB; output SNR due to ADC quantization is ~70-80 dB.
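The exact SNR definition used in the Simulink scopes is not spelled out in this report. A common choice, assumed in the sketch below, is the ratio of the reference (floating-point) signal power to the power of the difference between the reference and the fixed-point output, expressed in dB. The function name `snr_db` is hypothetical.

```python
import math


def snr_db(reference, test):
    """SNR in dB, treating (reference - test) as the noise. Assumes both
    sequences have the same length and works for real or complex samples."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - t) ** 2 for r, t in zip(reference, test))
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)


if __name__ == "__main__":
    ideal = [100.0, -100.0, 100.0, -100.0]
    measured = [99.0, -99.0, 99.0, -99.0]      # constant error of 1
    print(snr_db(ideal, measured))
```

Under this definition, an error of about 1 LSB on outputs with an amplitude of a few hundred corresponds to an SNR in the range of 50 dB.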


The following two analyzer figures were obtained before the downsamplers, as shown in test setup 2.


After the downsamplers, we see that each window is expanded and that the windows look the same. However, the data in each branch comes from a different region of the frequency domain.
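The reason the windows look alike after downsampling can be seen from where a tone lands after M-fold downsampling: its normalized frequency is multiplied by M and folded back into one period. The sketch below applies this to the four test tone frequencies used earlier (π/16 plus multiples of π/2) with M = 4; the function name is hypothetical.

```python
import math


def downsampled_frequency(omega, M):
    """Normalized frequency (rad/sample) of a complex tone exp(j*omega*n)
    after keeping every M-th sample."""
    return (M * omega) % (2 * math.pi)


if __name__ == "__main__":
    M = 4
    for k in range(M):
        omega = math.pi / 16 + k * math.pi / 2
        print(k, downsampled_frequency(omega, M) / math.pi)
```

All four tones, one per sub-band, map to the same frequency π/4 at the branch outputs, so the branch spectra have the same shape even though they originate from different input bands.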


Discussion

All system-level simulations behave exactly as expected. The most interesting part of the simulation is seeing the finite WL effects in their various aspects. The following table summarizes the SNR results. It is also clear that under any circumstances an input signal as small as 3 bits still yields an output SNR of at least 10 dB. If we use the full scale of the ADC, we expect to see an SNR higher than 50 dB at the output. Finally, be aware that in a 32-branch system the result would be worse.

Conclusion

The goal of the project was to build a parameterizable polyphase filter bank structure. The design also had to operate under the constraints of a space-based radar system with a 12-bit A/D input operating at 480 MHz. We have accomplished this goal by demonstrating a system which can easily meet the timing requirement, and which can do so at increasingly low power consumption with greater pipeline depths (up to the practical limit where flip-flops dominate the power consumption). The performance of the system also does not come at the expense of accuracy, as the SNR through the entire system can stay above 50 dB when the input is at full scale, including the effects of A/D quantization, finite coefficients, and fixed-point truncation. These features do cost a modest 12 mm² of area, which is not excessive, and will only become less important as technology continues to scale. Future work related to this project would include integration with the system designers, exploration of the hardware parameters in the complete system, and realization of the design in a physical device. Furthermore, analytically modeling and understanding the finite word length effects would be a great exercise.

References

EE225c project proposal and midterm at http://bwrc.eecs.berkeley.edu/People/grad_students/ccshi/classes/ee225c.htm, by Changchun Shi and Kevin Camera.

1. P.P. Vaidyanathan, Multirate Systems And Filter Banks, book, Chapter 4-5, 1993.


2. R. E. Crochiere, and L. R. Rabiner, Multirate Digital Signal Processing, book, Chapter 7, 1983.

3. M. Bellanger, G. Bonnerot, and M. Coudreuse, “Digital Filtering By Polyphase Network: Application to Sample Rate Alteration and Filter Banks”, IEEE Trans. On Acoust. Speech and Signal Proc., vol. ASSP-24, pp. 109-114, April 1976.

4. A. V. Oppenheim, R. W. Schafer, with J. R. Buck, Discrete-Time Signal Processing, book, chapter 6-8, 1999.

5. MC_unit09_sg.pdf file located at class website: https://saidar.eecs.berkeley.edu/~ee225c/restricted/

6. BWRC Simulink-VHDL simulation tutorial: http://bwrc/Research/IC_Design_Flow/Flow/function/tut/vhdlsimtut


Appendix

The following are more specific area, power, and delay results, as well as diagrams for simulation models too detailed for use in the body of the report.

Floorplan

Although the original plan was to try to match the size of the FIR filter bank and the FFT, reducing the number of FIR taps that significantly results in unacceptable noise performance. Therefore, 9-tap FIRs remained, and the aspect ratios alone were matched. The connections between the blocks correspond to each FIR filter output and the 32 real inputs to the FFT. We realize the drawing is not to scale, but wanted to clearly indicate the 32 connection points between the blocks.

FFT Hardware Results

         Carry-save (no saturation)   Ripple      Carry-save (saturation)
Delay    18.7 ns                      25.8 ns     29.1 ns
Power    40 mW                        33 mW       54 mW
Area     1.93 mm2                     1.72 mm2    2.62 mm2

Parameters

Word width – w
Additional internal precision – prec
Right shift between stages (truncation) – shamt
Saturation enable – sat
Carry-propagate adder style – atype
Multiplier style – mtype
Carry-save enable – cs

Verification

Since the final version of the FFT included the correct complex coefficients, increased internal precision, and truncation between stages, another set of verification simulations was performed, this time including a comparison against an ideal floating-point FFT. The same methodology was used as before, where the VHDL output was imported into Simulink for a graphical comparison (courtesy of Prof. Brodersen).

Figure 9: Plot of fixed/floating/VHDL differences

Note in the verification results above that the VHDL and fixed-point results differ by less than 1 at all times. This is an optimal level of accuracy, since the VHDL outputs are truncated to integer precision and do not represent the fractional component. In the floating-point and fixed-point comparison, one sees that the finite precision error is also always less than 1, implying that the word length, coefficient length, and internal precision are adequate.

FIR Hardware Results

Architectures (arch=1 => direct form, arch=2 => transposed form)

Architecture                                           Area (mm2)   Delay (ns)   Power (W)
arch=2,opmerge=yes,cs=on,mut=booth,fat=fastcla         10.0584      9.03         0.16
arch=2,opmerge=yes,cs=on,mut=booth,fat=ripple          10.0584      9.03         0.16
arch=2,opmerge=yes,cs=on,mut=non-booth,fat=fastcla     10.0584      9.03         0.16
arch=2,opmerge=yes,cs=on,mut=non-booth,fat=ripple      10.0584      9.03         0.16
arch=2,opmerge=yes,cs=off,mut=booth,fat=fastcla        13.245696    10.99        0.224
arch=2,opmerge=yes,cs=off,mut=booth,fat=ripple         13.245696    10.99        0.224
arch=2,opmerge=yes,cs=off,mut=non-booth,fat=fastcla    13.245696    10.99        0.224
arch=2,opmerge=yes,cs=off,mut=non-booth,fat=ripple     13.245696    10.99        0.224
arch=1,opmerge=yes,cs=on,mut=booth,fat=fastcla         8.433504     12.49        0.16
arch=1,opmerge=yes,cs=on,mut=booth,fat=ripple          8.433504     12.49        0.16
arch=1,opmerge=yes,cs=on,mut=non-booth,fat=fastcla     8.433504     12.49        0.16
arch=1,opmerge=yes,cs=on,mut=non-booth,fat=ripple      8.433504     12.49        0.16
arch=1,opmerge=yes,cs=off,mut=booth,fat=fastcla        8.481888     11.87        0.16
arch=1,opmerge=yes,cs=off,mut=booth,fat=ripple         8.481888     11.87        0.16
arch=1,opmerge=yes,cs=off,mut=non-booth,fat=fastcla    8.481888     11.87        0.16
arch=1,opmerge=yes,cs=off,mut=non-booth,fat=ripple     8.481888     11.87        0.16
arch=1,opmerge=yes,cs=off,mut=booth,fat=fastcla        8.481888     11.87        0.16

Parameters

Number of branches of the filter bank – M
Number of taps for each filter – taps
Input word length for the bank – wX
Output word length for each filter – wZ
Internal word length used in filters – wI
Input shift (scaling) before entering filter bank – Inshift
Output shift (scaling) when outputting from filter bank – Outshift
Macro-architectures – direct form, transposed form
Micro-architectures – operator merge, carry save, booth encoding, fast carry look ahead

Verification

The FIR filter bank did not change since the midterm report, and therefore the verification results will not be duplicated here. Please refer to Changchun’s midterm report for a review of the hardware verification. Note, however, that gen.m, which generates the FIR coefficients using a raised cosine model, must now be run before coef.m. Then follow remind.m as before.