
A Modified Radix-2^4 SDF Pipelined OFDM Module for FPGA-Based MB-OFDM UWB Systems

M. Santhi, S. Arun Kumar, G. S. Praveen Kalish, K. Murali, S. Siddharth, G. Lakshminarayanan
Department of ECE, National Institute of Technology, Thiruchirapalli.


Abstract - The OFDM module in the MB-OFDM UWB transmitter must operate at 528 MHz. This is a challenging task because the OFDM module in a UWB system has to calculate a 128-point IFFT. Earlier papers used the radix-2^4 SDF algorithm with parallel processing architectures of block size two to achieve the required speed and implemented the module on an ASIC. In this paper a novel scheme, the "modified radix-2^4 SDF algorithm", is proposed for calculating the 128-point IFFT. In the proposed scheme, the order of the twiddle factor sequence differs from that of the earlier radix-2^4 SDF algorithm. The change in the twiddle factor sequence makes the CSD multiplier used for the IFFT calculation easier to implement. It is also proposed that the required speed can be achieved on an FPGA itself, without parallel processing architectures, by pipelining the OFDM module and using LPMs. This reduces area compared to the earlier approach of using parallel processing architectures of block size two. To improve accuracy, the internal word length in the proposed scheme is maintained at 13 bits, which is 7 bits more than the input, to account for overflow at each of the 7 stages of the OFDM module. The proposed scheme, with its increased complexity for better accuracy, is tested on an ALTERA Stratix III EP3SL50F484C2 device. The implementation verifies that the OFDM module achieves a maximum speed of 528 MSamples/s. Since ASICs are in general about three times faster than FPGAs, operating an ASIC-based OFDM module at 528 MHz with the proposed modified radix-2^4 SDF pipelined algorithm should be considerably easier.

Keywords - MB-OFDM, SDF, FFT, FPGA.

I. INTRODUCTION

Ultra wideband (UWB) communication systems, which enable the delivery of data from a rate of 110 Mb/s at a distance of 10 m to a rate of 480 Mb/s at a distance of 2 m, are ideally suited to short-range wireless communications because they can share a frequency band with existing narrowband systems and offer a higher data rate than 802.11 or Bluetooth [1]. One of the communication methods for the IEEE 802.15.3a standard is Multiband Orthogonal Frequency Division Multiplexing (MB-OFDM), which offers 528 MHz bandwidth [2][3]. MB-OFDM-based UWB not only provides reliable high-data-rate transmission in time-dispersive or frequency-selective channels without complex time-domain channel equalizers but also offers high spectral efficiency.

The FFT/IFFT processor is one of the modules with high computational complexity in the physical layer of the UWB system, and the execution time of the 128-point FFT/IFFT in a UWB system is only 312.5 ns. In our processor, power consumption and hardware cost are reduced by using a higher radix FFT algorithm, less memory, and fewer complex multipliers.

This paper is organized as follows. Section II describes the design issues of MB-OFDM UWB communication systems. Section III describes the proposed 128-point radix-2^4 FFT/IFFT algorithm. Section IV describes the proposed 128-point radix-2^4 FFT/IFFT architecture. In Section V, the implementation and performance of the proposed FFT/IFFT architecture are discussed. Conclusions are presented in Section VI.

II. DESIGN ISSUES OF THE FFT PROCESSOR

A block diagram of the proposed physical layer of the OFDM-based UWB system is shown in Fig. 1 [4]. In the UWB system, the data rate ranges from 53.3 Mb/s to 480 Mb/s with code rates of 1/3, 11/32, 1/2, 5/8, and 3/4. The bandwidth of the transmitted signal is 528 MHz and the OFDM symbol duration is 312.5 ns, including 60.61 ns for the cyclic prefix and 9.47 ns for the guard interval [2][3]. Thus, the FFT/IFFT has to compute one OFDM symbol within 312.5 ns, and the throughput required of the 128-point FFT/IFFT is up to 409.6 MSamples/s.
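As a quick check of the timing figures above, the following Python sketch recomputes them from the 528 MHz sample clock. The sample counts of 32 (prefix) and 5 (guard) are an assumption inferred from the quoted durations; they are not stated in this paper.

```python
# Timing figures of Section II, recomputed from the 528 MHz sample clock.
SAMPLE_RATE_HZ = 528e6        # bandwidth / sample clock quoted in the text
FFT_POINTS = 128              # FFT/IFFT size quoted in the text
SYMBOL_DURATION_S = 312.5e-9  # OFDM symbol duration quoted in the text

# Assumed sample counts (inferred from 60.61 ns and 9.47 ns, not from the paper).
PREFIX_SAMPLES = 32
GUARD_SAMPLES = 5


def duration_ns(samples: int) -> float:
    """Duration of `samples` clock periods in nanoseconds."""
    return samples / SAMPLE_RATE_HZ * 1e9


symbol_samples = FFT_POINTS + PREFIX_SAMPLES + GUARD_SAMPLES
required_throughput = FFT_POINTS / SYMBOL_DURATION_S  # samples per second

print(f"prefix duration : {duration_ns(PREFIX_SAMPLES):.2f} ns")
print(f"guard duration  : {duration_ns(GUARD_SAMPLES):.2f} ns")
print(f"symbol duration : {duration_ns(symbol_samples):.1f} ns")
print(f"FFT throughput  : {required_throughput / 1e6:.1f} MSamples/s")
```

The required 409.6 MSamples/s is the rate at which 128 samples must be produced within one 312.5 ns symbol; it is lower than the 528 MHz sample clock because the prefix and guard samples need no transform.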

Various FFT architectures, such as single-memory architecture, dual-memory architecture, pipelined architecture, array architecture, and cached-memory architecture, have been proposed in the last three decades. In our view, the pipelined architecture is the best choice for UWB systems since it can provide a high throughput rate with acceptable hardware cost.

The pipelined FFT architecture typically falls into one of two categories: multipath delay commutator (MDC) and single-path delay feedback (SDF) [5]. In general, the MDC scheme can achieve a higher throughput rate, while the SDF scheme needs less memory and hardware. In addition, higher radix FFT algorithms are difficult to implement in the traditional MDC architecture. Table 1 compares the hardware requirements of various architectures. The proposed architecture, based on the radix-2^4 SDF architecture, was selected for implementation owing to its low hardware cost and greater area efficiency; it also provides sufficient throughput to meet the UWB specifications.

Proceedings of the 2008 International Conference on Computing, Communication and Networking (ICCCN 2008), 978-1-4244-3595-1/08/$25.00 ©2008 IEEE


Fig. 1. Block diagram of the MB-OFDM UWB receiver system

TABLE 1 COMPARISON OF HARDWARE REQUIREMENTS FOR N-LENGTH FFT WITH DIFFERENT ARCHITECTURES

Architecture   Complex Multiplier #   Complex Adder #   Memory size   Control circuit
R2SDF          log2(N)-2              log2(N)           N-1           simple
R2MDC          log2(N)-2              4log4(N)          3N/2-2        simple
R4SDF          log4(N)-1              log4(N)           N-1           medium
R4MDC          3(log4(N)-1)           8log4(N)          5N/2-4        simple
R2^2SDF        log4(N)-1              4log4(N)          N-1           simple
R2^3SDF        log8(N)-1              4log4(N)          N-1           simple
R2^4SDF        log16(N)-1             4log4(N)          N-1           simple

III. PROPOSED RADIX-2^4 SDF ALGORITHM

A Discrete Fourier transform (DFT) of length N (= 128) is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \qquad (1)$$

where W_N, the so-called "twiddle factor", denotes the N-th primitive root of unity, with its exponent evaluated modulo N; k is the frequency index and n is the time index. To derive the radix-2^4 algorithm, consider the first 4 steps of decomposition [6]. Applying a 5-dimensional linear index map, in which the 5th dimension is itself decomposed into a 2-bit and a 1-bit index, we have

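$$n = \left\langle \frac{N}{2} n_1 + \frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + \frac{N}{64} n_5 + n_6 \right\rangle_N$$

$$k = \left\langle k_1 + 2k_2 + 4k_3 + 8k_4 + 32k_5 + 16k_6 \right\rangle_N \qquad (2)$$

The common factor algorithm (CFA) takes the form of

$$X(k_1 + 2k_2 + 4k_3 + 8k_4 + 32k_5 + 16k_6) = \sum_{n_6} \sum_{n_5} \sum_{n_4} \sum_{n_3} \sum_{n_2} \sum_{n_1} x\!\left(\frac{N}{2} n_1 + \frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + \frac{N}{64} n_5 + n_6\right) W_N^{nk}$$

$$= \sum_{n_6} \sum_{n_5} \left[ G(n_5, n_6, k_1, k_2, k_3, k_4)\, W_N^{\left(\frac{N}{64} n_5 + n_6\right)(k_1 + 2k_2 + 4k_3 + 8k_4)} \right] W_{N/16}^{\left(\frac{N}{64} n_5 + n_6\right)(2k_5 + k_6)} \qquad (3)$$

[Equations (4) and (5) are not legible in the source.]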
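The index map of Eq. (2) can be sanity-checked numerically. The sketch below assumes that n5 and k5 are the 2-bit digits (range 0 to 3) and that all other digits are 1-bit; this assignment is an assumption, since the text only states that the fifth dimension splits into a 2-bit and a 1-bit index. Under that assumption, both maps should visit every index from 0 to N-1 exactly once.

```python
from itertools import product

N = 128


def n_index(n1, n2, n3, n4, n5, n6):
    """Time index of Eq. (2), reduced modulo N."""
    return (N // 2 * n1 + N // 4 * n2 + N // 8 * n3
            + N // 16 * n4 + N // 64 * n5 + n6) % N


def k_index(k1, k2, k3, k4, k5, k6):
    """Frequency index of Eq. (2), reduced modulo N."""
    return (k1 + 2 * k2 + 4 * k3 + 8 * k4 + 32 * k5 + 16 * k6) % N


# Assumed digit ranges: digits 1-4 and 6 are 1-bit, digit 5 is 2-bit.
DIGIT_RANGES = [range(2)] * 4 + [range(4), range(2)]

n_values = sorted(n_index(*digits) for digits in product(*DIGIT_RANGES))
k_values = sorted(k_index(*digits) for digits in product(*DIGIT_RANGES))

print("n map is a bijection:", n_values == list(range(N)))
print("k map is a bijection:", k_values == list(range(N)))
```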

where H(n) denotes the second butterfly unit,

$$H(n) = H(n, k_1, k_2) = B(n, k_1) + (-j)^{(k_1 + 2k_2)}\, B\!\left(n + \frac{N}{4}, k_1\right)$$

and B(n, k_1) denotes the first butterfly unit:

$$B(n, k_1) = x(n) + (-1)^{k_1}\, x\!\left(n + \frac{N}{2}\right)$$
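To illustrate the two butterfly definitions, the sketch below evaluates B and H directly for a 4-point input, where H(0, k1, k2) should coincide with the DFT bin X(k1 + 2k2). The function names bf1, bf2 and dft are illustrative, and the forward-transform convention W_N = exp(-j2π/N) is assumed.

```python
import cmath


def bf1(x, n, k1):
    """First butterfly: B(n, k1) = x(n) + (-1)^k1 * x(n + N/2)."""
    half = len(x) // 2
    return x[n] + (-1) ** k1 * x[n + half]


def bf2(x, n, k1, k2):
    """Second butterfly: H = B(n, k1) + (-j)^(k1 + 2*k2) * B(n + N/4, k1)."""
    quarter = len(x) // 4
    return bf1(x, n, k1) + (-1j) ** (k1 + 2 * k2) * bf1(x, n + quarter, k1)


def dft(x):
    """Direct evaluation of Eq. (1), used here only as a reference."""
    size = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / size)
                for n in range(size))
            for k in range(size)]


samples = [1 + 2j, -3 + 0.5j, 2 - 1j, 0.5 + 4j]
reference = dft(samples)
for k1 in (0, 1):
    for k2 in (0, 1):
        error = abs(bf2(samples, 0, k1, k2) - reference[k1 + 2 * k2])
        print(f"k = {k1 + 2 * k2}: |H - X(k)| = {error:.2e}")
```

For longer transforms the same two butterflies are applied per stage pair, with twiddle multiplications in between; only the 4-point case reduces to the butterflies alone.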

The algorithm can use a complex constant multiplier instead of a programmable complex multiplier. The Canonic Signed Digit (CSD) constant multiplier contains the fewest non-zero bits, so it can be used to reduce area and power consumption [7]. Fig. 2 shows the signal flow graph (SFG) of the 128-point radix-2^4 SDF FFT algorithm.

Fig 2. Signal flow graph of the proposed R2^4SDF algorithm

IV. PROPOSED FFT ARCHITECTURE FOR THE MB-OFDM UWB SYSTEM

A block diagram of the proposed single-data-path 128-point R2^4SDF FFT/IFFT processor is shown in Fig. 3. The proposed architecture consists of a memory block, butterfly units (BF1, BF2), programmable complex multipliers, CSD complex constant multipliers, register files, and some multiplexers. The FFT processor can be transformed into an IFFT block by performing the operation shown in Fig. 4. The outputs of the butterfly units are the complex sum and complex difference of two input data x[n] and x[N/2+n], where N = 128.

Owing to the spatial regularity of the radix-2^4 algorithm, the synchronization control of the processor is very simple. A (log2 N)-bit binary counter serves two purposes: synchronization controller and address counter for twiddle factor reading in each stage. For the first N/2 cycles, the 2-to-1 multiplexers in butterfly module 1 (shown in Fig. 5.i) switch to position "0", and the butterfly is idle. The input data from the left is directed to the shift registers until they are filled. During the next N/2 cycles, the multiplexers turn to position "1", and the butterfly computes a 2-point DFT with the incoming data and the data stored in the shift registers.

$$Z_1(n) = x(n) + x(n + N/2), \quad 0 \le n < N/2$$

$$Z_1(n + N/2) = x(n) - x(n + N/2) \qquad (6)$$
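The behaviour of one SDF butterfly stage described above can be modelled in software. The following is a behavioural sketch only, not the authors' RTL: the function name sdf_stage and the use of a Python deque for the shift register are illustrative choices, and the twiddle multiplication that follows the butterfly is omitted.

```python
from collections import deque


def sdf_stage(samples, delay):
    """Stream samples through one radix-2 SDF butterfly.

    `delay` is the depth of the feedback shift register (N/2 for stage 1).
    """
    shift_register = deque([0] * delay)
    outputs = []
    for cycle, x in enumerate(samples):
        stored = shift_register.popleft()
        if (cycle // delay) % 2 == 0:
            # Multiplexers at "0": butterfly idle, fill the shift register
            # and let its previous contents stream out.
            shift_register.append(x)
            outputs.append(stored)
        else:
            # Multiplexers at "1": 2-point DFT of stored and incoming data.
            outputs.append(stored + x)         # Z1(n), sent onward
            shift_register.append(stored - x)  # Z1(n + N/2), fed back
    return outputs


frame = [1, 2, 3, 4, 5, 6, 7, 8]
# Four trailing zeros stand in for the first half of the next frame.
result = sdf_stage(frame + [0, 0, 0, 0], delay=4)
print("sums        Z1(0..3):", result[4:8])
print("differences Z1(4..7):", result[8:12])
```

With an 8-sample frame and a 4-deep register, the sums emerge after a latency of four cycles, and the differences follow while the next frame is loading, which is why the stage can accept frames back to back.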

The butterfly output Z1(n) is sent on to apply the twiddle factor, and Z1(n + N/2) is sent back to the shift registers to be "multiplied" during the following N/2 cycles, when the first half of the next frame of the time sequence is loaded in. The operation of the second butterfly is similar to that of the first one, except that the "distance" of the butterfly input sequence is just N/4 and the trivial twiddle factor multiplication is implemented by real-imaginary swapping with a commutator and controlled add/subtract operations, as in Fig. 5.ii, which requires a two-bit control signal from the synchronizing counter. The data then goes through a full complex multiplier, working at 75% utilization, which produces the result of the first level of the radix-4 DFT word by word. Further processing repeats this pattern, with the distance of the input data decreasing by half at each consecutive butterfly stage. After N-1 clock cycles, the complete DFT result streams out to the right in bit-reversed order. The next frame can be transformed without pausing owing to the pipelined processing of each stage.

Fig 3. Block diagram of FFT/IFFT processor (legend: programmable complex multiplier, CSD complex multiplier, data path)

Fig 4. Block diagram of the proposed 128-point R2^4SDF FFT/IFFT processor

Fig 5.i Structure of BF1

Radix-2^4 FFT algorithm based single-data-path architectures have fewer multipliers than those of lower radix FFT algorithms. For example, the radix-2^4 algorithm has the same number of multipliers as the radix-2^2 algorithm but reduces the multiplicative complexity by replacing half of the full complex multipliers with trivial constant multipliers [8]. In the CSD complex constant multiplier, the multiplication by the twiddle factors is processed according to their scheduling in the signal flow graph. The output data generated by the BF in the sixth stage are multiplied by a trivial twiddle factor, -j, W(16) or W(48), before they are fed to the last stage.

The Simplification of the Complex Multiplication

Complex multiplication is the key design issue in the FFT algorithm. Consider a complex multiplication whose two inputs are the data x_r + j x_i and the coefficient W = exp(j2π/N) = cos a + j sin a. The result can be expressed as Y = y_r + j y_i, where

$$y_r = x_r \cos a - x_i \sin a = x_i(\cos a - \sin a) - (x_i - x_r)\cos a$$

$$y_i = x_i \cos a + x_r \sin a = x_r(\cos a + \sin a) + (x_i - x_r)\cos a \qquad (7)$$

Fig 5.ii Structure of BF2

After the transformation in Eq. (7), the complex multiplication needs only 3 real multiplications, 1 addition and 2 subtractions when the sum and the difference of the real and imaginary parts of the coefficient are precomputed and stored in the ROM. This algorithm is used for the programmable complex multiplier to reduce hardware complexity and increase speed.
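A three-multiplication complex product of this kind can be sketched as follows. The grouping used here shares the term (xi - xr)·cos a between the real and imaginary outputs, which is one arrangement that satisfies yr = xr cos a - xi sin a and yi = xi cos a + xr sin a. The function names and the tuple layout of the ROM entry are illustrative assumptions, and floating-point arithmetic stands in for the fixed-point hardware.

```python
import math


def twiddle_rom_entry(angle):
    """Precompute the three constants stored per twiddle factor."""
    c, s = math.cos(angle), math.sin(angle)
    return c, c + s, c - s


def complex_multiply_3m(xr, xi, rom_entry):
    """(xr + j*xi) * (cos a + j*sin a) using three real multiplications."""
    c, c_plus_s, c_minus_s = rom_entry
    shared = (xi - xr) * c        # multiplication 1, subtraction 1
    yr = xi * c_minus_s - shared  # multiplication 2, subtraction 2
    yi = xr * c_plus_s + shared   # multiplication 3, addition 1
    return yr, yi


angle = 2 * math.pi * 5 / 128
yr, yi = complex_multiply_3m(3.0, -2.0, twiddle_rom_entry(angle))
direct = complex(3.0, -2.0) * complex(math.cos(angle), math.sin(angle))
print("three-multiplier result:", (yr, yi))
print("direct product         :", (direct.real, direct.imag))
```

The saving comes from trading one multiplier for adders, which are considerably cheaper in FPGA fabric; the cost is a ROM that holds three words per coefficient instead of two.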

CSD Multiplier

Since the twiddle factors in the FFT processor are known in advance, we propose the use of a multiplier-less architecture to perform the multiplication by the twiddle factors using shift-and-add operations. The canonical signed digit (CSD) algorithm has been applied to this architecture to further reduce the number of shift-and-add operations required. In this architecture, trivial multiplications are implemented without any multipliers by passing the data through, swapping the real and imaginary parts of the complex data, or changing a sign. The design presented in this paper takes advantage of the symmetries of the twiddle factors in the complex plane.
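The shift-and-add idea can be illustrated with a small software model. This is a generic CSD recoding sketch, not the authors' multiplier: the helper names to_csd and csd_multiply are hypothetical, and the constants are plain integers rather than the quantized twiddle factors used in the design.

```python
def to_csd(value):
    """CSD digits (-1, 0, +1) of a non-negative integer, LSB first."""
    digits = []
    while value != 0:
        if value & 1:
            # +1 when value % 4 == 1, -1 when value % 4 == 3, so that the
            # next digit is guaranteed to be zero.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(x, digits):
    """Multiply integer x by a CSD-coded constant with shifts and adds only."""
    accumulator = 0
    for shift, digit in enumerate(digits):
        if digit == 1:
            accumulator += x << shift
        elif digit == -1:
            accumulator -= x << shift
    return accumulator


seven = to_csd(7)                 # 7 = 8 - 1
print("CSD digits of 7 (LSB first):", seven)
print("13 * 7 via shift-and-add   :", csd_multiply(13, seven))
```

In plain binary, 7 needs three partial products (4 + 2 + 1); its CSD form 8 - 1 needs two, which is the kind of reduction in adders that a CSD constant multiplier exploits.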

When the real and imaginary values of a twiddle factor are the same, two CSD constant multipliers and two adder/subtractors are used to generate the output. When the real and imaginary values are not the same, three CSD constant multipliers are used. If an input does not need to be multiplied by a twiddle factor, the output is generated from the input directly.
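The equal-magnitude case can be made concrete for the twiddle factors W(16) and W(48) mentioned earlier. The sketch below assumes the forward-transform convention W_128 = exp(-j2π/128), under which W(16) = C(1 - j) and W(48) = -C(1 + j) with C = cos(π/4); the function names are illustrative, and floating-point products stand in for the two CSD constant multipliers.

```python
import cmath
import math

C = math.sqrt(0.5)  # common magnitude of the real and imaginary parts


def mul_w16(xr, xi):
    """(xr + j*xi) * W(16): two adders/subtractors, two multiplications by C."""
    return C * (xr + xi), C * (xi - xr)


def mul_w48(xr, xi):
    """(xr + j*xi) * W(48): same hardware plus a sign change."""
    return C * (xi - xr), -(C * (xr + xi))


x = complex(5.0, -3.0)
for name, func, exponent in (("W(16)", mul_w16, 16), ("W(48)", mul_w48, 48)):
    expected = x * cmath.exp(-2j * cmath.pi * exponent / 128)
    yr, yi = func(x.real, x.imag)
    print(f"{name}: error = {abs(complex(yr, yi) - expected):.2e}")
```

Because both outputs are scaled by the same constant, the sum and difference are formed first and each is multiplied once, which is consistent with the two-multiplier, two-adder structure described above.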

Pipelining

The radix-2^4 architecture was thoroughly analyzed to find possible areas to be pipelined, based on the design and the critical path delays between the various implemented blocks. The processor was extensively pipelined to achieve the high working frequency needed to meet the UWB specification. Shimming registers are also needed for the control signals to comply with the revised timing.

V. IMPLEMENTATION AND PERFORMANCE

The word length of the proposed FFT/IFFT is 6 bits for external FFT data [9], for both the real and imaginary parts. The 2's complement representation of numbers is used in the processor. Because of overflow in each adder of the butterfly unit, 13-bit internal FFT precision is maintained. The chosen word length not only keeps the quantization noise to a minimum but also limits the hardware complexity. After the word length was chosen, the architecture of the processor was modeled in Verilog for an ALTERA Stratix III FPGA. Some of the modules were generated with the ALTERA MegaWizard Plug-In Manager and others were written at the RTL level, including the top-level wrapper file, which contains all the instantiated modules and the connectivity information in RTL (Verilog HDL). The TimeQuest timing analyzer and Chip Planner (Floorplan and Chip Editor) of QUARTUS II 8.0 were used to analyze timing, hardware expenditure and so on. Vector waveforms associated with the RTL description were created, with the stimulus provided in an external file. Using the vector waveform file, simulations were carried out to validate the behavioral description. The results were obtained incrementally, first for a sub-block comprising one module of the FFT and finally for the whole design comprising seven such sub-blocks, a global clock and dual-port RAMs. The output of the Verilog-coded architecture agreed with the output data of MATLAB and of the FFT/IFFT in our UWB platform, which was designed in an EXCEL worksheet that clearly depicts the outputs alongside the signal flow graph.

TABLE 2 IMPLEMENTATION RESULTS OF THE PROPOSED PROCESSOR

Family                               ALTERA Stratix III     ALTERA Stratix II
Device                               EP3SL50F484C2          EP2S60F1020C4
ALUTs                                7972/38000 (3%)        7822/48352 (16%)
ALMs                                 3986/19000 (3%)        4375/19000 (3%)
DSP block elements                   6/216 (<3%)            6/288 (2%)
Total memory bits                    3328/1880064 (<1%)     8192/2544192 (<1%)
Word length                          I: 6 bits, Q: 6 bits   I: 6 bits, Q: 6 bits
Number of registers                  7580/38000 (20%)       7697/38000 (20%)
Programmable complex multipliers #   1                      1
Constant complex multipliers #       2                      2
Number of complex adders             28                     28
Clock rate                           528 MHz                350 MHz
Throughput rate                      528 Msamples/s         350 Msamples/s
Critical path delay                  1.87 ns                2.87 ns
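The 13-bit internal word length can be reproduced with a short worst-case calculation. The sketch below is a simplified model that counts only adder growth, as the text does: it assumes each of the seven butterfly stages can at most double the magnitude of a 6-bit two's complement input and ignores any growth from twiddle factor multiplication. The helper name signed_bits_needed is illustrative.

```python
INPUT_BITS = 6  # external word length per real/imaginary part
STAGES = 7      # log2(128) butterfly stages


def signed_bits_needed(lowest, highest):
    """Smallest two's complement width holding every integer in the range."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lowest
               and highest <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


lowest = -(1 << (INPUT_BITS - 1))      # -32
highest = (1 << (INPUT_BITS - 1)) - 1  # +31
for stage in range(1, STAGES + 1):
    # Worst case of a + b and a - b when a and b both lie in the range.
    lowest, highest = min(2 * lowest, lowest - highest), \
        max(2 * highest, highest - lowest)
    print(f"after stage {stage}: [{lowest}, {highest}] -> "
          f"{signed_bits_needed(lowest, highest)} bits")

internal_bits = signed_bits_needed(lowest, highest)
print("internal word length:", internal_bits, "bits")
```

Each stage adds one bit of headroom, so seven stages take the 6-bit input to 13 bits, matching the figure quoted for the design.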

The proposed FFT/IFFT processor was implemented on a Stratix II EP2S60F1020C4 device and simulated for the ALTERA Stratix III EP3SL50F484C2. The input data is supplied through a dual-port RAM, and a PLL unit provides the required clock frequency. The output is checked using a dual-port RAM and the In-System Memory Content Editor. Table 2 shows the performance and resource usage of the implemented processor. It shows that the processor is area efficient, so the entire MB-OFDM receiver/transmitter, together with the other modules, can be accommodated in a single chip. The design has a significantly reduced number of complex multiplications and complex additions. The critical path delay occurs between the input RAM and the first butterfly unit, so the processor is capable of running at UWB speeds if implemented within a larger system.

All previous implementations were on ASICs [9], so comparison with them is not meaningful. Table 3 compares the performance of different FFT processors implemented on FPGAs. The validity and efficiency of the proposed architecture have been verified by extensive simulation and implementation. Fig. 6 shows the implementation results of the proposed FFT/IFFT processor.

TABLE 3 COMPARISON OF THE PERFORMANCE OF DIFFERENT PROCESSORS

Family                                                   Frequency max
Altera FFT MegaCore function on Stratix III [10]         456 MHz
Proposed processor on ALTERA Stratix II EP2S60F1020C4    350 MHz
Proposed processor on ALTERA Stratix III EP3SL50F484C2   528 MHz

VI. CONCLUSION

An OFDM module, implemented as a 128-point FFT/IFFT processor for an FPGA-based MB-OFDM UWB system using the proposed modified radix-2^4 SDF pipelined algorithm, has been successfully implemented on ALTERA STRATIX III and STRATIX II FPGAs without using parallel processing architectures. The high speed is achieved by extensive pipelining and the use of Altera's LPMs. The hardware cost of memory and complex multipliers is reduced by adopting delay feedback and data scheduling approaches. In addition, the number of complex multiplications is reduced effectively by using a higher radix algorithm and CSD complex multipliers. To improve accuracy, the internal word length in the proposed scheme is maintained at 13 bits, which is 7 bits more than the input, to account for overflow at each of the 7 stages of the OFDM module. The implementation results show that the throughput rate is 350 MSamples/s at 350 MHz on the ALTERA STRATIX II device and 528 MSamples/s at 528 MHz on the ALTERA STRATIX III device. The high throughput of the OFDM module, with the internal word length increased from 6 bits to 13 bits to improve accuracy, meets the MB-OFDM UWB system's specifications.

Fig 6. Results of the implemented processor

VII. REFERENCES

[1] Time Domain, "UWB Applications, Demonstration & Regulatory Update," Sept 2001 workshop, March 20, 2001.

[2] A. Batra et al., "Multi-band OFDM Physical Layer Proposal for IEEE 802.15 Task Group 3a," IEEE P802.15-03/268r3, March 2004.

[3] A. Batra, J. Balakrishnan, G. R. Aiello, J. R. Foerster, A. Dabak, "Design of Multiband OFDM System for Realistic UWB Channel Environment," IEEE Trans. on Microwave Theory and Techniques, vol. 52, no. 9, pp. 2123-2138, Sept. 2004.

[4] Y-W. Lin, H-Y. Liu, and C-Y. Lee, "A 1-GS/s FFT/IFFT processor for UWB applications," IEEE Journal of Solid-State Circuits, vol. 40, no. 8, pp. 1726-1735, August 2005.

[5] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," in Proc. URSI Int. Symp. Signals, Systems, and Electronics, vol. 29, Oct. 1998, pp. 257-262.

[6] J. Lee, H. Lee, S-I. Cho, S-S. Choi, "A High-Speed, Low-Complexity Radix-2^4 FFT Processor for MB-OFDM UWB Systems," IEEE Inter. Symp. on Circuits and Systems, pp. 4719-4722,

[7] S-M. Kim, J-G. Chung, and K. K. Parhi, "Low Error Fixed-width CSD Multiplier with Efficient Sign Extension," IEEE Transactions on Circuits and Systems-II, vol. 50, no. 12, Dec. 2003.

[8] H. Lee, M. Shin, "A High-Speed Low-Complexity Two-Parallel Radix-2^4 FFT/IFFT Processor for UWB Applications," IEEE Asian Solid-State Circuits Conference, November 2007.

[9] R. S. Sherratt, S. Makino, "Numerical Precision Requirements on the Multiband Ultra-Wideband System for Practical Consumer Electronic Devices," IEEE Transactions on Consumer Electronics, vol. 51, no. 2, May 2005.

[10] FFT MegaCore Function User Guide, MegaCore Version 7.2, www.altera.com

