64_FFT_final

7/30/2019 64_FFT_final


A 0.18 µm VLSI Technology Based 64-Point Fast Fourier Transform Kernel

Duo Ding

Joonsoo Lee

Yousof Mortazavi

Report for final project of VLSI-I, Spring 2007

Abstract: In this report, we present a complete VLSI implementation of a 64-point FFT/IFFT IP core with signed fixed-point 16-bit word length, primarily for IEEE 802.11a wireless local area network applications. Such a kernel could also be integrated into a wide range of modern imaging radar systems and real-time signal processing systems. At the algorithm level, the 64-point FFT is decomposed into a 2-D structure of 8-point FFTs. Compared with a traditional radix-2 64-point FFT, this decomposition greatly reduces the workload of the complex multiplier unit and improves both processing speed and power consumption. Complex multiplications are realized by shifters and adders with double precision, and no RAM cell is required for coefficient storage. The proposed FFT kernel is based on 0.18 µm CMOS technology, simulated in the Synopsys VCS environment, and compiled and synthesized in the design_vision environment. The simulated core area of the chip is 2.0 mm². Dynamic power consumption is 15 mW at a 68 MHz operating frequency and a 1.8 V supply voltage. In summary, our design outperforms the original target specifications, and the overall performance of the FFT kernel is satisfactory.

I. INTRODUCTION

In most of today's wireless communication standards, Orthogonal Frequency Division Multiplexing (OFDM) is used to cope with the multipath fading wireless channel. OFDM is based on the Fast Fourier Transform (FFT), which is computationally intensive, especially with a large number of inputs. At the algorithm level, the complexity of the FFT is O(N log N). As a result, baseband processors need a dedicated FFT processing unit that is both fast and low in power consumption. Power is of primary importance because of the mobility requirements of wireless receivers and of many other handheld real-time signal processing and imaging devices.

In this work, we chose a particularly low-power FFT unit from the literature and implemented it in RTL. The FFT unit is that of [1], which requires only 23 clock cycles to compute and occupies only 6.8 mm² of core area. Compared with other hardware FFT implementations, the work of [1] offers the most attractive specifications for wireless communication and for many other signal processing applications.

This paper is organized as follows. Design decisions are discussed in Section II, and results are presented in Section III. Finally, the paper is summarized and concluded in Section IV.

II. DESIGN DECISIONS

A. Specifications

Figure 1. Interface Diagram

Figure 1 gives a high-level overview of the FFT kernel interface. The input/output ports are described in Table 1:

Signal    Direction  Description
CLK       input      System clock
RESET     input      The FFT kernel resets itself when RESET goes low
X[31:0]   input      Serial input signal at rising edge of CLK, each 32 bits long; it takes 57 clock cycles to start core FFT computation
MODE      input      Controls the FFT/inverse FFT functionality: MODE = 0 outputs FFT, MODE = 1 outputs IFFT
Y[31:0]   output     Serial output signal at rising edge of CLK, each 32 bits long; every 64 sets of outputs represent one period of FFT/IFFT calculation
O_STB     output     Data-ready signal; O_STB = 1 means the output data sets are ready and valid

Table 1. Signal Specifications of FFT Core

Since VLSI-based FFT IP cores are an existing technology, we have a clear set of target specifications. Table 2 lists those of a 64-point FFT core implemented in 0.25 µm technology and published in 2004 [1]:


Target item                      Reference value
Core area                        6.8 mm²
Clock frequency                  20 MHz
Dynamic power                    41 mW
Data representation              Signed fixed-point
No. of flip-flops                7134
Core computational clock cycles  23
Points of FFT                    64
Word length                      16

Table 2. Initial Target Specifications

B. Implementation

(I) A Break-Down of the 64-Point FFT Algorithm: Background

The Discrete Fourier Transform W(k) of a complex time series w(n), where n, k = 0, 1, 2, ..., N-1, is defined as:

$$W(k) = \sum_{n=0}^{N-1} w(n)\, W_N^{nk} \tag{1}$$

The inverse DFT takes the form:

$$w(n) = \frac{1}{N} \sum_{k=0}^{N-1} W(k)\, W_N^{-nk} \tag{2}$$

where $W_N = e^{-j 2\pi / N}$. Suppose that N = MT, k = s + Tt, and n = l + Mm, where s, m = 0, 1, ..., T-1 and l, t = 0, 1, ..., M-1 (all four indices run from 0 to 7 when M = T = 8). Substituting these into equation (1) gives:

$$W(s + Tt) = \sum_{l=0}^{M-1} W_M^{lt}\, W_{MT}^{sl} \sum_{m=0}^{T-1} w(l + Mm)\, W_T^{sm} \tag{3}$$

As equation (3) shows, an MT-point DFT can be broken down into a 2-D structure of M-point and T-point DFTs. Our 64-point FFT kernel is a direct application of this decomposition with M = T = 8:

$$W(s + 8t) = \sum_{l=0}^{7} W_8^{lt}\, W_{64}^{sl} \sum_{m=0}^{7} w(l + 8m)\, W_8^{sm} \tag{4}$$
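The decomposition in equation (4) can be checked numerically. The following is a minimal floating-point sketch, not the hardware data path: the helper names (`twiddle`, `dft`, `dft64_2d`) and the test vector are illustrative assumptions, and the 8-point stages are computed with a direct DFT rather than butterflies.

```python
import cmath

def twiddle(n, e):
    """Return W_n^e = exp(-j*2*pi*e/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)

def dft(x):
    """Direct DFT of a sequence, following equation (1)."""
    n = len(x)
    return [sum(x[i] * twiddle(n, i * k) for i in range(n)) for k in range(n)]

def dft64_2d(w):
    """64-point DFT built from two stages of 8-point DFTs, equation (4)."""
    assert len(w) == 64
    out = [0j] * 64
    # Stage 1: for each l, an 8-point DFT over m of w(l + 8m), indexed by s.
    stage1 = [dft([w[l + 8 * m] for m in range(8)]) for l in range(8)]
    for s in range(8):
        # Inter-stage multiplication by the coefficients W_64^(s*l).
        scaled = [stage1[l][s] * twiddle(64, s * l) for l in range(8)]
        # Stage 2: an 8-point DFT over l, indexed by t.
        stage2 = dft(scaled)
        for t in range(8):
            out[s + 8 * t] = stage2[t]
    return out

# Arbitrary deterministic complex test vector (an assumption for illustration).
samples = [complex((3 * i) % 7 - 3, (5 * i) % 11 - 5) for i in range(64)]
max_diff = max(abs(a - b) for a, b in zip(dft(samples), dft64_2d(samples)))
print(max_diff < 1e-9)
```

The two results agree to floating-point rounding, which shows that the only multiplications outside the 8-point stages are the 64 inter-stage coefficients W_64^(sl).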

(II) Design Blueprint: An Architectural View

The block diagram of our 64-point FFT/IFFT core is shown in Figure 2. We divide the core into four kinds of sub-modules: the Input/Output Unit, two 8-point FFT Units, the Complex Multiplier Unit, and the Internal Register Bank Unit. This structure has advantages over competing proposals, as discussed in the following section.

Figure 2. Block Diagram of the proposed 64 FFT

(III) Pipelining vs. Parallel Working

In this architecture, pipelining and parallel working units are distributed evenly, rather than sharing one physical functional unit and leaving everything else to a pipelining register bank.

A competing proposal for FFT implementation integrates only one butterfly unit and relies on a large register bank to carry the pipelining workload. After discussion, we decided against that proposal for two reasons. (1) Thermal behavior: it might work well for a 16-point FFT unit, but for a 64-point FFT a large share of the workload falls on the pipelining unit alone, making it very hot while the rest of the core stays cool. We consider this a circuit design pitfall to avoid. (2) Scalability: the workload and complexity of the pipelining unit grow dramatically when implemented IP cores are later integrated into more complicated cores.

Because of these concerns and time pressure, we chose the current proposal for our 64-point FFT kernel implementation.

(IV) Design Environment and Technology Library

For logic design and synthesis, we use design_vision installed on the Sun stations of the ECE LRC. For simulation, we use the VCS (Verilog Compiler Simulator) tool suite from Synopsys. For verification and testing, we use the VCS, VirSim, and MATLAB environments. Since we employ a signed fixed-point representation, most of the data format conversions in the Verilog test bench interface with the Fixed-Point Toolbox of MATLAB, with a version higher than 7.0.

The technology library linked for compilation comes from Lab 3 of VLSI-I: HT018.db.

(V) Modular Design

Module 1: Input/Output Unit Design

Table 3 lists the input/output ports of the implemented I/P module.


Table 3. Signal description of I/P module

The I/P unit performs a serial-to-parallel conversion of the input data and interfaces with the first 8-point FFT module under the control of the Control_Counter. It also contains embedded buffers for temporary data storage, since some of the parallel multiplications need more than one clock cycle to complete. This is elaborated in the Multiplier Unit section.

Figure 3. Overall structure of Input Module

The block diagram in Figure 3 illustrates the working principle of the input unit: the necessary combinational logic (C.L. block), the swapping block (SWAP), and internal counters. The swapping unit provides the data path for the IFFT functionality and is controlled by mode, an input listed in Table 3.

The combinational logic block in Figure 3 is controlled by a 5-bit counter, which paces the I/P unit, the O/P unit, and the Multiplier Unit. The counter is also one of the outputs of the FFT processor, where it serves as a strobe port that helps the user follow the internal operation of the 64-point FFT kernel.

Figure 4 gives a closer look at the Parallel Conversion block.

Figure 4. Detail of Parallel Conversion Block

The Parallel Conversion block contains a 32 × 57 register array. Once 8 target points are ready for parallel output, they are fed into the first 8-point FFT block on the rising edge of the system clock, as shown in Figure 4.
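The serial-to-parallel timing can be sketched in software. The model below assumes one sample arrives per clock in natural order and that column l of the first stage uses w(l), w(l+8), ..., w(l+56), as in equation (4); the function name `column_ready_schedule` is hypothetical. Under those assumptions the first column is complete on cycle 57, which may be the origin of the 57-cycle start-up figure in Table 1, although the report does not say so explicitly.

```python
def column_ready_schedule(samples, points=8, stride=8):
    """Map each clock cycle to the column that becomes complete on it.

    Assumes one sample per clock in natural order, so the sample with
    index i is stored on cycle i + 1. Column l needs the samples with
    indices l, l + 8, ..., l + 56.
    """
    buffer = []
    schedule = {}
    # Column 0 is complete once index stride * (points - 1) has arrived.
    first_ready = stride * (points - 1) + 1
    for cycle, sample in enumerate(samples, start=1):
        buffer.append(sample)
        l = cycle - first_ready
        if 0 <= l < stride:
            schedule[cycle] = [buffer[l + stride * m] for m in range(points)]
    return schedule

schedule = column_ready_schedule(list(range(64)))
print(min(schedule))    # 57
print(schedule[57])     # [0, 8, 16, 24, 32, 40, 48, 56]
```

From cycle 57 onward one further column completes per clock, so the eight columns are handed to the first 8-point FFT on cycles 57 through 64.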

Module 2: 8-point FFT Units (1st and 2nd FFT units)

Signal    Direction  Description
data_in   input      255 bits input data
data_out  output     255 bits output data

Table 4. Port descriptions of the 8-point FFT units

Table 4 summarizes the input/output ports of the 8-point FFT unit, which is a purely combinational logic unit.

Figure 5. Flow Chart of Decimation-In-Time FFT (N=8)

Table 3 contents (I/P module signals):

Signal           Description                          I/O
CLK              Clock                                input
RST              Reset signal                         input
Data_start       Enabling signal                      input
Data_in          32-bit input data                    input
mode             FFT/IFFT control                     input
Control counter  Interface port with control counter  input
Start_count      Enabling signal                      output
Data_out         255-bit output data in parallel      output


Figure 6. Basic building block of butterfly structure

As a purely combinational logic unit, the 8-point FFT takes in parallel input data and assigns the corresponding FFT results to the output wires, as shown in Figures 5 and 6. Each output is computed in parallel and no flip-flops are employed. A total of 12 butterfly subunits and 5 complex multiplier subunits are placed. This adds area and power consumption to the butterfly units, yet both the register bank and the multiplier unit benefit greatly from the trade-off, and the overall performance of the kernel remains satisfactory.

Within the 8-point FFT module there are 5 complex multiplications. Our design uses techniques that keep the number of actual complex multiplications to a minimum: in the 8-point FFT unit only one true complex multiplication is used, while all other multiplications are achieved by swapping and reassigning.
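To illustrate why most 8-point twiddle multiplications need no multiplier, here is a small floating-point sketch. In a radix-2 decimation-in-time 8-point FFT the five non-unity twiddle positions are W_8^2 (twice in the second stage) and W_8^1, W_8^2, W_8^3 (in the third stage). Multiplying by W_8^2 = -j is a swap with one negation, and W_8^1 and W_8^3 need only a scaling by 1/sqrt(2) applied to a sum and a difference. The function names are hypothetical and the report's Verilog may organize this differently.

```python
import cmath
import math

INV_SQRT2 = 1.0 / math.sqrt(2.0)

def mul_minus_j(z):
    """(a + jb) * (-j) = b - ja: swap the parts, negate the new imaginary part."""
    return complex(z.imag, -z.real)

def mul_w8_1(z):
    """(a + jb) * (1 - j)/sqrt(2) = ((a + b) + j(b - a)) / sqrt(2)."""
    return complex((z.real + z.imag) * INV_SQRT2, (z.imag - z.real) * INV_SQRT2)

def mul_w8_3(z):
    """W_8^3 = W_8^1 * W_8^2, so reuse the scaling and then swap."""
    return mul_minus_j(mul_w8_1(z))

z = complex(3.0, 4.0)
print(mul_minus_j(z))  # (4-3j)
```

Only the 1/sqrt(2) scaling is a genuine constant multiplication here; everything else is wiring and sign changes.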

Figure 7. Double precision multiplication mechanism

Since a signed fixed-point representation is employed throughout the design, complex multiplications are handled in two categories: for positive numbers, shifting and addition are carried out with 0s shifted into the word; for negative numbers, shifting and addition are carried out with 1s shifted into the word. As shown in Figure 7, the calculation is carried out in double precision: we double the input length before the shifting and addition, then truncate the result back to a 16-bit word length. In simulation, this noticeably improved accuracy.
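The shift-and-add mechanism can be modeled on raw two's complement bit patterns. The sketch below is an assumption-laden illustration, not the report's circuit: it multiplies by an approximation of 1/sqrt(2) built from five shifts (2^-1 + 2^-3 + 2^-4 + 2^-6 + 2^-8 = 0.70703125), a constant chosen here only for demonstration. Positive words are filled with 0s and negative words with 1s when shifted, and the 16-bit input is widened to 32 bits before the additions.

```python
SHIFTS = (1, 3, 4, 6, 8)  # illustrative approximation of 1/sqrt(2)

def to_signed(pattern, bits):
    """Interpret a bit pattern as a two's complement integer."""
    pattern &= (1 << bits) - 1
    return pattern - (1 << bits) if pattern >> (bits - 1) else pattern

def arithmetic_shift_right(pattern, shift, bits):
    """Shift right, filling vacated bits with 0s (positive) or 1s (negative)."""
    mask = (1 << bits) - 1
    pattern &= mask
    shifted = pattern >> shift
    if pattern >> (bits - 1):  # sign bit set: fill the top `shift` bits with 1s
        shifted |= (mask >> (bits - shift)) << (bits - shift)
    return shifted

def mul_inv_sqrt2(x16):
    """Multiply a 16-bit pattern by 0.70703125 using a 32-bit accumulator."""
    wide = (x16 & 0xFFFF) << 16  # double the word length
    acc = 0
    for k in SHIFTS:
        acc = (acc + arithmetic_shift_right(wide, k, 32)) & 0xFFFFFFFF
    return acc >> 16  # truncate back to 16 bits

print(to_signed(mul_inv_sqrt2(1000 & 0xFFFF), 16))   # 707
print(to_signed(mul_inv_sqrt2(-1000 & 0xFFFF), 16))  # -708
```

Because every shifted term is exact in the widened word, the only rounding is the final truncation, which rounds toward negative infinity (hence -708 rather than -707).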

Module 3: Multiplier Unit

Figure 8. Block Diagram of the Multiplier Unit

In the 2-D breakdown of the 64-point FFT algorithm, the outputs of the first 8-point FFT unit have to be multiplied by complex coefficients before they are fed into the second FFT unit. Here, 8 complex numbers have to be dealt with. Techniques are employed so that a minimal number of operations is carried out. As in the 8-point FFT unit, the double-precision method is used, and reusable results are recycled and swapped. For modularity, each of the eight constants is kept in its own sub-module. All operations are governed by control signals. Table 5 summarizes the input/output ports of the multiplier unit:


Signal       Direction  Description
COUNT        input      Controlling 5-bit counter
Input_data   input      255 bits data
Output_data  output     255 bits data

Table 5. Port description of the Multiplier Unit
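One plausible reading of the "eight constants" is that every inter-stage coefficient W_64^e can be rebuilt from the cosine and sine of the first-octant angles 2*pi*k/64 for k = 1, ..., 8 by swapping and negating parts. The sketch below demonstrates that symmetry in floating point; the table layout and the name `w64` are assumptions, not the report's actual sub-module structure. It also counts how many of the 64 coefficients W_64^(sl) in equation (4) differ from 1.

```python
import cmath
import math

# (cos, sin) of 2*pi*k/64 for k = 0..8; the k = 0 entry is the trivial (1, 0).
TABLE = [(math.cos(2 * math.pi * k / 64), math.sin(2 * math.pi * k / 64))
         for k in range(9)]

def w64(e):
    """Rebuild W_64^e = exp(-j*2*pi*e/64) from the first-octant table."""
    quadrant, r = divmod(e % 64, 16)
    if r <= 8:
        c, s = TABLE[r]
    else:
        # cos(x) = sin(pi/2 - x) and sin(x) = cos(pi/2 - x)
        s, c = TABLE[16 - r]
    re, im = c, -s
    for _ in range(quadrant):
        re, im = im, -re  # each quarter turn is a multiplication by -j
    return complex(re, im)

nontrivial = sum(1 for s in range(8) for l in range(8) if (s * l) % 64 != 0)
print(nontrivial)  # 49
```

Of the 64 coefficient positions, the 15 with s = 0 or l = 0 are equal to 1, leaving 49 that require actual arithmetic.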

Module 4: Internal Register Bank

The internal register bank (CB) provides temporary storage for the 64 complex data values coming from the multiplier unit. The CB has 8 wired 255-bit inputs in parallel and 8 wired 255-bit outputs in parallel, which feed directly into the second 8-point FFT unit. At every clock cycle, the appropriate data at the output of the CB is aligned with the target input of the second 8-point FFT unit. Since the second 8-point FFT unit is purely combinational, the 255 bits of input data are processed before the next cycle arrives, so the downward shifting in the CB can proceed every cycle without interruption.

Essentially, the CB unit is the same as the input unit except that it has no swapping blocks or buffering registers.

Table 6 summarizes the input/output ports of the CB unit:



Signal    Direction  Description
CLK       input      System clock
RST       input      Reset signal
Data_in   input      255 bits input
COUNT     input      Control counter
Data_out  output     255 bits output

Table 6. Port summary of CB unit

Module 5: Output Unit

Signal    Direction  Description
CLK       input      System clock
RST       input      Reset signal
Data_in   input      255 bits input
mode      input      FFT/IFFT control
Data_out  output     32 bits output
O_STB     output     Data ready signal

Table 7. Port summary of O/P unit

Table 7 above gives the input/output ports of the O/P unit.

Mirroring the I/P unit, the O/P unit converts the parallel signals back to serial signals and interfaces with the user and/or an LCD display. There are no buffering registers in this module, and the swapping function is selected by the input port mode to send out the FFT or IFFT output.

C. Optimization

As discussed in the previous sections, several optimization techniques were employed during module design. Many versions of the structures, modules, and code were modified and tested before we finalized the design. Here we list the two most important techniques:

(I) Computing Accuracy

Initially, our multiplications were carried out with 16 bits of accuracy, the same word length as the data passing through the FFT kernel. Simulations showed unsatisfactory errors relative to the expected outcomes from the MATLAB 7.0 simulator.

To address this, we doubled the bit length of each word as it enters a complex multiplication block, and then truncated the 32-bit word back to 16 bits before the output port. With this mechanism, the accuracy became quite satisfactory. Further demonstration is given in the Testing and Verifications section that follows.
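The benefit of widening before truncation can be quantified with a small integer model. This sketch assumes a shift-and-add constant multiplier with the illustrative coefficient 0.70703125 (five shifted terms); the actual coefficients and term counts in the design are not given in the report. It compares truncating every shifted term at 16 bits against widening to 32 bits and truncating once.

```python
SHIFTS = (1, 3, 4, 6, 8)                 # 2^-1 + 2^-3 + 2^-4 + 2^-6 + 2^-8
COEFF = sum(2.0 ** -k for k in SHIFTS)   # 0.70703125, an illustrative constant

def mul_single(x):
    """16-bit style: every shifted term is truncated before the addition."""
    return sum(x >> k for k in SHIFTS)

def mul_double(x):
    """32-bit style: widen, add exact shifted terms, truncate once at the end."""
    wide = x << 16
    return sum(wide >> k for k in SHIFTS) >> 16

def worst_error(mul):
    """Largest deviation from the exact product, in units of one 16-bit LSB."""
    return max(abs(mul(x) - x * COEFF) for x in range(-2 ** 15, 2 ** 15))

single_err = worst_error(mul_single)
double_err = worst_error(mul_double)
print(single_err, double_err)  # about 4.29 LSB versus just under 1 LSB
```

In this model the per-term truncation errors add up to more than four LSBs in the worst case, while the widened version never loses a full LSB, which is consistent with the accuracy improvement described above.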

(II) Intermediate Results Recycling

Although breaking down the 64-point FFT greatly reduces the computational complexity, a considerable number of complex operations remains. To further reduce the computation in the 8-point FFT Units, and particularly in the Multiplier Unit, we reuse intermediate results wherever possible, which reduces power and increases speed. This is why our design contains far fewer complex multiplications than a direct 64-point FFT kernel would require.

Further results and figures are provided in the following section, Testing and Verifications.

D. Testing and Verifications

Our test bench for the 64-point FFT kernel interfaces with both the Synopsys and MATLAB environments through file operations. The goal of this section is to show the calculation error; the actual spectra of the test cases are given later in Section III. Our test cases consist of six well-known time series signals from digital signal processing, plus one super test case with 1000 randomized input data sets. With this methodology, we covered more than 64,000 input data points with squared error analysis, while also demonstrating direct applications of our 64-point FFT kernel in static spectrum analysis.
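The comparison step of such a test bench can be sketched as follows. This is not the authors' MATLAB or Verilog code: the Q1.15 scaling, the 1/64 input scaling, and the helper names are assumptions, and the "actual" output is only a reference spectrum rounded to 16 bits, standing in for the hardware output that the real test bench would read from a file.

```python
import cmath

def dft(x):
    """Reference DFT, playing the role of the golden model."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]

def to_q15(value):
    """Quantize a real value in [-1, 1) to a signed 16-bit code (assumed Q1.15)."""
    return max(-32768, min(32767, round(value * 32768)))

def quantize(z):
    """Round both parts of a complex value onto the 16-bit grid."""
    return complex(to_q15(z.real) / 32768, to_q15(z.imag) / 32768)

def squared_errors(actual, expected):
    """Per-bin squared magnitude of the difference."""
    return [abs(a - e) ** 2 for a, e in zip(actual, expected)]

# Rectangular pulse, scaled by 1/64 so the spectrum stays inside [-1, 1).
pulse = [1 / 64 if i < 8 else 0.0 for i in range(64)]
expected = dft(pulse)
actual = [quantize(z) for z in expected]  # stand-in for the hardware output
errors = squared_errors(actual, expected)
print(max(errors) < 5e-10)
```

With rounding to the nearest 16-bit code, each part is off by at most 2^-16, so the per-bin squared error cannot exceed 2^-31 (about 4.7e-10); errors from a real fixed-point data path would add to this floor.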

Test case 1:

The input time series is a rectangular pulse; the detailed test case in 16-bit fixed-point representation is listed in Appendix 2. We analyzed and plotted the squared error of the physical unit output with respect to the reference output of our MATLAB simulator, as follows.

[Plot: Testbench1 (MSE = -2.3283e-010 + 1.397e-009i); x-axis: points in frequency; y-axis: error value, scale ×10⁻⁹]

Figure 9. MSE of the 64-FFT kernel for test case 1

As illustrated in Figure 9, the maximum mean square error is kept below 8 × 10⁻⁹, and 82% of the physical outputs are 100% accurate.

Test case 2:

The input time series is a cosine wave; the detailed test case in 16-bit fixed-point representation is listed in Appendix 2. We analyzed and plotted the mean square error of the physical unit output with respect to the reference output of our MATLAB simulator, as follows.


[Figure 10: error plot for test case 2. Panel title: Testbench2 (MSE = -7.5437e-008 - 3.574e-008i); x-axis: points in frequency; y-axis: error value, scale ×10⁻⁵]


As illustrated in Figure 10, the maximum mean square error is kept below 7 × 10⁻⁶.

Test case 3:

The input time series is a complex exponential (cos + j sin) wave; the detailed test case in 16-bit fixed-point representation is listed in Appendix 2. We analyzed and plotted the mean square error of the physical unit output with respect to the reference output of our MATLAB simulator, as follows.

[Figure 11: error plot for test case 3. Panel title: Testbench3 (MSE = 1.2444e-006 - 5.8627e-007i); x-axis: points in frequency; y-axis: error value, scale ×10⁻⁴]


Test case 4:

The input time series is a real constant function; the detailed test case in 16-bit fixed-point representation is listed in Appendix 2. The mean square error of the physical unit output with respect to the reference output of our MATLAB simulator is plotted below; the accuracy is 100%.

[Figure 12: error plot for test case 4. Panel title: Testbench4 (MSE = 0); x-axis: points in frequency; y-axis: error value]


Test case 5:

The input time series is a truncated pulse series; the detailed test case in 16-bit fixed-point representation is listed in Appendix 2. We analyzed and plotted the mean square error of the physical unit output with respect to the reference output of our MATLAB simulator, as follows.

[Figure 13: error plot for test case 5. Panel title: Testbench5 (MSE = -4.773e-009 + 6.6357e-009i); x-axis: points in frequency; y-axis: error value, scale ×10⁻⁷]


Test case 6:

The input time series is a triangular pulse; the detailed test case in 16-bit fixed-point representation is listed in Appendix 2. We analyzed and plotted the mean square error of the physical unit output with respect to the reference output of our MATLAB simulator, as follows.

[Figure 14: error plot for test case 6. Panel title: Testbench6 (MSE = -3.0268e-009 - 5.1223e-009i); x-axis: points in frequency; y-axis: error value, scale ×10⁻⁷]


As the figures above illustrate, the precision of the proposed 64-point FFT kernel is satisfactory for the six test cases from signal processing applications.

Massive input data test case

Corner cases alone do not make for thorough testing, so we construct 1000 input data sets, concatenate them into one massive input data file, and feed it into the test bench, where 1000 rounds of 64-point FFT are carried out continuously within the implemented FFT kernel. Figure 15 shows the absolute error for the 64,000 input points; the calculation precision is quite satisfactory for a fixed-point representation:

[Plot: Error between MATLAB & Module (random signal); x-axis: number of points, scale ×10⁴; y-axis: absolute error value, scale ×10⁻³]

Figure 15. Absolute Error of the 64-FFT kernel for super test case

III. RESULTS

A. Functionality

For all the test cases used in the Testing and Verifications section, we give here the spectrum plots, which are the actual outputs of our implemented 64-point FFT kernel. The Verilog HDL code for the FFT kernel is listed in Appendix 1.

Test case 1:

The input time series is a rectangular pulse, so the amplitude spectrum is sinc-shaped, as shown in Figure 16 below.

[Plot: Testbench1 (MSE = -2.3283e-010 + 1.397e-009i); x-axis: points in frequency; y-axis: amplitude spectrum; curves: actual, expected]

Figure 16. Amplitude spectrum of test case 1 by our FFT core

Test case 2:

The input time series is a sinusoidal waveform, so the amplitude spectrum consists of two pulses, as shown in Figure 17.

[Figure 17: amplitude spectrum of test case 2. Panel title: Testbench2 (MSE = -7.5437e-008 - 3.574e-008i); x-axis: points in frequency; y-axis: amplitude spectrum; curves: actual, expected]


Test case 3:

The input time series is a complex exponential waveform, so the amplitude spectrum is a single pulse, as shown in Figure 18 below.

[Figure 18: amplitude spectrum of test case 3. Panel title: Testbench3 (MSE = 1.2444e-006 - 5.8627e-007i); x-axis: points in frequency; y-axis: amplitude spectrum; curves: actual, expected]


Test case 4:

The input time series is a constant function, so the amplitude spectrum is a single pulse located at zero frequency, as shown in Figure 19 below.

[Figure 19: amplitude spectrum of test case 4. Panel title: Testbench4 (MSE = 0); x-axis: points in frequency; y-axis: amplitude spectrum; curves: actual, expected]



Test case 5:

The input time series is a truncated pulse series, so the amplitude spectrum is an amplitude-modulated series of pulses in the frequency domain, as shown in Figure 20 below. Such a signal is often used in pulse-Doppler radar imaging systems.

[Figure 20: amplitude spectrum of test case 5. Panel title: Testbench5 (MSE = -4.773e-009 + 6.6357e-009i); x-axis: points in frequency; y-axis: amplitude spectrum; curves: actual, expected]


Test case 6:

The input time series is a triangular waveform, so the amplitude spectrum is a product of two sinc functions, as shown in Figure 21 below.

[Figure 21: amplitude spectrum of test case 6. Panel title: Testbench6 (MSE = -3.0268e-009 - 5.1223e-009i); x-axis: points in frequency; y-axis: amplitude spectrum; curves: actual, expected]


The figures above show that the computed spectra closely follow the expected spectra in all six test cases.

B. Timing / Area Synthesis

For an initial clock period of 20 ns, the synthesized area of our 64-point FFT kernel is 2.0 mm², the timing is 14.8 ns, and the dynamic power consumption is 15 mW. Three rounds of optimization were tried, and Table 8 lists the optimized results. For detailed reports, please refer to Appendix 3. More optimizations could be carried out if time allowed.

Target item    Reference            Our design
Core area      6.8 mm² (0.25 µm)    2.0 mm² (0.18 µm)
Clock freq.    20 MHz               68 MHz
No. registers  7134                 5713
Power          41 mW                15 mW

Table 8. A comparison between our design and reference design
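The clock and latency figures follow from simple arithmetic, sketched below. The inputs are the 14.8 ns synthesized timing, the 23 core computation cycles, and the 20 MHz reference clock quoted in this report; treating the 14.8 ns timing as the minimum clock period is an assumption, and the variable names are illustrative.

```python
CRITICAL_PATH_NS = 14.8      # synthesized timing, assumed to be the minimum period
CORE_CYCLES = 23             # core computational clock cycles
REFERENCE_CLOCK_MHZ = 20     # clock frequency of the reference design

max_clock_mhz = 1000.0 / CRITICAL_PATH_NS
core_latency_ns = CORE_CYCLES * CRITICAL_PATH_NS
reference_latency_ns = CORE_CYCLES * 1000.0 / REFERENCE_CLOCK_MHZ

print(round(max_clock_mhz, 1))      # 67.6
print(round(core_latency_ns, 1))    # 340.4
print(round(reference_latency_ns))  # 1150
```

A 14.8 ns period corresponds to roughly 68 MHz, and 23 cycles at that rate take about 0.34 µs, compared with 1.15 µs for the same cycle count at 20 MHz. These numbers cover only the core computation, not the serial loading and unloading of data.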

C. APR and Physical Layout

The chip is placed and routed in the Cadence/2006 Encounter environment. Connectivity is verified, and no violations or warnings are detected. A snapshot of the physical layout of the 64-point FFT kernel is shown in Figure 22:

Figure 22. APR physical layout of the 64-point FFT core

IV. SUMMARY AND CONCLUSIONS

In this paper, we have described the design and implementation of a serial 64-point FFT suitable for wireless and modern signal/image processing applications. We described the modular design at the register-transfer level (RTL), and synthesized and optimized our modules using Design Vision. We verified our design at various stages, namely at the RTL level and post-synthesis. We used golden-model test benches in which MATLAB generated valid input/output vectors; the input vectors were applied to the FFT and the output was compared against the expected vectors. Our processor passed the functionality tests with more than 64,000 data points.

Our FFT chip operates well beyond the target frequency of 20 MHz and occupies only 2.0 mm² in a 0.18 µm process. Once the serial data is in the FFT unit, only 23 clock cycles are required to produce the output. Therefore, the FFT can be computed in less than a microsecond.

Overall, this work allowed us to go through the design cycle and learn to make important design decisions toward the goals of small area and minimal delay. It was also a great exercise in teamwork and collaboration with colleagues. As a result of this project, we feel we were able to apply many of the concepts learned in the course; hence, the objectives of the course were fulfilled.

REFERENCES

[1] K. Maharatna, E. Grass, and U. Jagdhold, "A 64-Point Fourier Transform Chip for High-Speed Wireless LAN Application Using OFDM," IEEE JSSC, Vol. 39, No. 3, March 2004.
