128 Point IFFT Processor Designed

A 128/256-Point Pipeline FFT/IFFT Processor for MIMO OFDM System IEEE 802.16e

Simeng Li, Huxiong Xu, Wenhua Fan, Yun Chen, Xiaoyang Zeng State Key Lab. of ASIC and System, Fudan University.

Shanghai, P.R.China Email:{082052001,082052056,09212020009,chenyun,xyzeng}@fudan.edu.cn

Abstract—In this paper, we present a novel 128/256-point FFT/IFFT processor for IEEE 802.16e applications based on MIMO-OFDM. A pipeline FFT architecture is proposed to process one to four data sequences efficiently and to increase the throughput. Furthermore, the design requires less hardware than the conventional approach of using an individual processor for each parallel sequence. The signal-to-quantization noise ratio (SQNR) is 42.7 dB. The proposed FFT has been designed in 0.13 µm technology with a core size of 1.470×1.469 mm².

I. INTRODUCTION

The multiple-input multiple-output (MIMO) technique has been combined with OFDM technology in wireless communication systems to enhance the link throughput as well as the robustness of transmission over frequency-selective fading channels. This technology has been employed in the physical layer specification of the emerging IEEE 802.16e standard to provide broadband wireless access services [1]. According to an optional specification, the Alamouti-scheme space-time block code (STBC) is adopted for the 2×1 MISO transmission mode. In addition, a receiver that handles 1-4 sequences can be designed to further improve the system performance. Consequently, a 4×4 MIMO-OFDM system for IEEE 802.16e WMAN is considered in this paper.

The receiver of MIMO-OFDM system contains four RFs, four analog-to-digital converters (ADCs), four FFTs, a MIMO equalizer, four De-QAM and de-interleaver, a de-spatial parser, a de-puncturer, a channel decoder, a synchronization block, and a channel estimation block [5]. However, the hardware cost is also increased significantly, because more memory and complex multipliers are needed to allow multiple data to be operated simultaneously.

Fast Fourier transform (FFT) and inverse fast Fourier transform (IFFT) are the crucial computational blocks for baseband multicarrier demodulation and modulation in an OFDM system, respectively. Various FFT architectures have been proposed. Among them, the pipeline structure is suitable for a short-length FFT processor whose size is smaller than 512, since it can provide high throughput with moderate hardware cost. Multipath delay commutator (MDC) and single-path delay feedback (SDF) are two common realizations of the pipeline architecture [2]. Typically, MDC achieves higher throughput at higher hardware cost, whereas SDF offers lower throughput at lower hardware cost. Recently, several pipeline-based works have been proposed to deal with multiple data sequences, such as mixed-radix multipath delay feedback (MRMDF) [5]. However, multiple butterfly units (BUs) are required within each pipeline stage.

In this paper, a 128/256-point pipeline FFT/IFFT is proposed to efficiently deal with 4 parallel sequences. The input reordering forms the data from the same channel into the same frame, so that the processor works for 1-4 channels simultaneously and also supports several throughput rates.

This paper is organized as follows. Section II describes the 128/256-point FFT algorithm and the IFFT algorithm. Section III focuses on describing the proposed FFT/IFFT architecture. Section IV compares its hardware cost and throughput rate with some existing FFT architectures in 128/256-point FFT. Conclusions are shown in Section V.

II. ALGORITHM

The N-point discrete Fourier transform (DFT) of an N-point sequence {x(n)} is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi nk/N} = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \quad 0 \le k \le N-1, \tag{1}$$

where x(n) and X(k) are complex numbers. The twiddle factor is

$$W_N^{kn} = e^{-j 2\pi kn/N}. \tag{2}$$

The computational complexity of (1) is O(N²) when the required computations are executed directly. It can be reduced to O(N log_r N) using the Cooley-Tukey FFT algorithm [3], where r denotes the radix of the FFT algorithm. The computational complexity therefore decreases as the radix increases for a DFT of constant length.
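To make the complexity comparison concrete, the following minimal Python sketch evaluates N² for the direct DFT and N·log_r N for a radix-r FFT. The function names are illustrative only and are not part of the design; constant factors and the differing cost of a radix-r butterfly are ignored, so the figures indicate growth rates rather than exact operation counts.

```python
def stage_count(n, radix):
    """Number of radix-`radix` stages in an n-point FFT (n must be a power of radix)."""
    stages = 0
    while n > 1:
        if n % radix != 0:
            raise ValueError("n must be a power of radix")
        n //= radix
        stages += 1
    return stages


def direct_dft_ops(n):
    """Growth-rate estimate for the direct evaluation of (1): N^2."""
    return n * n


def fft_ops(n, radix):
    """Growth-rate estimate for a radix-r FFT: N * log_r(N)."""
    return n * stage_count(n, radix)


print(direct_dft_ops(256))  # 65536
print(fft_ops(256, 2))      # 2048
print(fft_ops(256, 4))      # 1024
```

For N = 256, moving from radix 2 to radix 4 halves the number of stages, which is the effect described above.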

Based on the Cooley-Tukey algorithm, (1) can be reformulated for the 128-point FFT as

This work was supported by Shanghai Scientific and Technological Commission under Grant No. 08700741100.

978-1-4244-5309-2/10/$26.00 ©2010 IEEE

$$X(k_1 + 2k_2 + 16k_3) = \sum_{n_3=0}^{7} \left\{ W_{64}^{k_2 n_3} \sum_{n_2=0}^{7} W_{128}^{k_1 (n_3 + 8 n_2)} \left[ \sum_{n_1=0}^{1} x(64 n_1 + 8 n_2 + n_3)\, W_{2}^{n_1 k_1} \right] W_{8}^{n_2 k_2} \right\} W_{8}^{n_3 k_3}, \tag{3}$$

for k_1 = 0...1; k_2 = 0...7; k_3 = 0...7.
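The index mapping behind (3) can be checked numerically. The sketch below assumes that the input index is split as n = 64n1 + 8n2 + n3 and the output index as k = k1 + 2k2 + 16k3, evaluates the nested sums with the twiddle factors W_2, W_128, W_8, and W_64, and compares the result with a direct evaluation of (1). The helper names are hypothetical, and the code is a verification aid rather than an efficient implementation.

```python
import cmath


def w(n, e):
    """Twiddle factor W_n^e = exp(-j*2*pi*e/n), as defined in (2)."""
    return cmath.exp(-2j * cmath.pi * e / n)


def dft_direct(x):
    """Direct O(N^2) evaluation of (1)."""
    n = len(x)
    return [sum(x[i] * w(n, i * k) for i in range(n)) for k in range(n)]


def dft128_index_mapped(x):
    """128-point DFT via n = 64*n1 + 8*n2 + n3 and k = k1 + 2*k2 + 16*k3."""
    assert len(x) == 128
    out = [0j] * 128
    for k1 in range(2):
        for k2 in range(8):
            for k3 in range(8):
                acc = 0j
                for n3 in range(8):
                    inner = 0j
                    for n2 in range(8):
                        # Innermost sum over n1: a radix-2 butterfly.
                        bf2 = sum(x[64 * n1 + 8 * n2 + n3] * w(2, n1 * k1)
                                  for n1 in range(2))
                        inner += bf2 * w(128, k1 * (n3 + 8 * n2)) * w(8, n2 * k2)
                    acc += inner * w(64, k2 * n3) * w(8, n3 * k3)
                out[k1 + 2 * k2 + 16 * k3] = acc
    return out


# Deterministic test input (arbitrary values chosen for illustration).
x = [complex((7 * i) % 13 - 6, (5 * i) % 11 - 5) for i in range(128)]
max_err = max(abs(a - b) for a, b in zip(dft128_index_mapped(x), dft_direct(x)))
print(max_err < 1e-8)
```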

In 256-point FFT calculation, (1) can be reformulated as

$$X(k_1 + 8k_2 + 64k_3) = \sum_{n_3=0}^{7} \left\{ W_{64}^{k_2 n_3} \sum_{n_2=0}^{7} W_{256}^{k_1 (n_3 + 8 n_2)} \left[ \sum_{n_1=0}^{7} x(64 n_1 + 8 n_2 + n_3)\, W_{8}^{n_1 k_1} \right] W_{8}^{n_2 k_2} \right\} W_{8}^{n_3 k_3}, \tag{4}$$

for k_1 = 0...7; k_2 = 0...7; k_3 = 0...7.

The 128-point FFT can thus be decomposed into one stage of radix-2 butterfly calculation followed by three stages of radix-4 butterfly calculation, and the 256-point FFT into four stages of radix-4 butterfly calculation.
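As an illustration of this stage decomposition, the following sketch implements a generic mixed-radix FFT in Python and applies it with stage radices (2, 4, 4, 4) for 128 points and (4, 4, 4, 4) for 256 points. It assumes a decimation-in-frequency ordering, in which each stage combines the samples x[i], x[i + N/r], ..., and it folds the output reordering into the recursion. It is a software model only and does not describe the pipeline hardware; all function names are illustrative.

```python
import cmath


def w(n, e):
    """Twiddle factor W_n^e = exp(-j*2*pi*e/n), as defined in (2)."""
    return cmath.exp(-2j * cmath.pi * e / n)


def dft_direct(x):
    """Direct O(N^2) evaluation of (1), used as the reference."""
    n = len(x)
    return [sum(x[i] * w(n, i * k) for i in range(n)) for k in range(n)]


def mixed_radix_fft(x, radices):
    """Decimation-in-frequency FFT whose stage radices are given in order."""
    n = len(x)
    if not radices:
        if n != 1:
            raise ValueError("product of radices must equal len(x)")
        return [complex(x[0])]
    r = radices[0]
    if n % r != 0:
        raise ValueError("length is not divisible by the stage radix")
    m = n // r
    out = [0j] * n
    for s in range(r):
        # Radix-r butterfly on x[i], x[i+m], ..., followed by the twiddle W_n^(i*s).
        y = [sum(x[i + m * p] * w(r, p * s) for p in range(r)) * w(n, i * s)
             for i in range(m)]
        # The m-point transform of y yields the outputs X[r*k' + s].
        out[s::r] = mixed_radix_fft(y, radices[1:])
    return out


x128 = [complex((7 * i) % 13 - 6, (5 * i) % 11 - 5) for i in range(128)]
x256 = [complex((3 * i) % 17 - 8, (9 * i) % 19 - 9) for i in range(256)]
err128 = max(abs(a - b) for a, b in zip(mixed_radix_fft(x128, (2, 4, 4, 4)), dft_direct(x128)))
err256 = max(abs(a - b) for a, b in zip(mixed_radix_fft(x256, (4, 4, 4, 4)), dft_direct(x256)))
print(err128 < 1e-8, err256 < 1e-8)
```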

The IFFT of an N-point sequence X(k), k = 0, 1, ..., N−1, is defined as

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}. \tag{5}$$

In order to implement the IDFT efficiently, (5) can be rewritten as

$$x(n) = \frac{1}{N} \left\{ \sum_{k=0}^{N-1} X^{*}(k)\, W_N^{nk} \right\}^{*}, \tag{6}$$

where * denotes the conjugate of the data. IFFT can be realized by using FFT algorithm with additional interchange operations and normalization [4].
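The identity in (6) can be sketched in a few lines of Python: conjugate the input, apply the forward transform, then conjugate the result and divide by N. The forward transform here is the direct form of (1), used only for illustration; any FFT routine could take its place. The function names are assumptions and not part of the design.

```python
import cmath


def fft_forward(x):
    """Forward DFT per (1), in direct O(N^2) form for illustration."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


def ifft_via_fft(X):
    """Inverse transform per (6): conjugate, forward transform, conjugate, scale by 1/N."""
    n = len(X)
    y = fft_forward([complex(v).conjugate() for v in X])
    return [v.conjugate() / n for v in y]


x = [complex(1, 2), complex(-3, 0.5), complex(0, -1), complex(4, 4)]
round_trip = ifft_via_fft(fft_forward(x))
print(max(abs(a - b) for a, b in zip(round_trip, x)) < 1e-12)
```

Because only conjugation and a scaling by 1/N are added, the same butterfly datapath can serve both directions.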

III. PROPOSED FFT PROCESSOR FOR MIMO OFDM SYSTEM

A. Proposed Architecture

The architecture of the proposed 128/256-point pipeline FFT/IFFT processor is shown in Fig. 1. The choice of 128- or 256-point operation is controlled by the radix mode, and the choice of FFT or IFFT by a control signal. When an IFFT is performed, the input sequences are conjugated and processed as in the FFT, and the output is then conjugated and divided by N. By taking advantage of the pipeline FFT architecture, the processor can perform 128- and 256-point FFT/IFFT on 1-4 parallel data sequences in one stage of reconfigurable radix-2/2² FFT operation followed by three stages of radix-4 FFT operation.

The proposed architecture can also support several channels. For 1, 2, 3, or 4 data paths, four data from the same channel are formed into the same frame, and the FFT operation is then performed in 4 parallel paths. In one-channel mode, the 4 parallel data all belong to the same channel, so the throughput rate of that channel is 4 times the per-channel rate of the 4-channel mode at the same clock rate.

The pipeline architecture is introduced to increase the throughput. The optimized memory blocks in each stage are used for computational storage and I/O buffers.

In data reordering, an interleaver is introduced to reorder the input data of the 4 paths [5]. As shown in Fig. 2, the 4 individual data paths are reordered into 4×4 blocks, each of which contains the frames of the same index from input channels A, B, C, and D. The following 4 stages of butterfly processing can therefore receive one frame of 4 data from the same channel at a time and perform the FFT operation on 4 parallel data sequences more efficiently. Moreover, the interleaver can also work in the 1-, 2-, or 3-sequence modes.
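The reordering can be pictured as a 4×4 block transpose. The sketch below is a behavioral model under two assumptions that the text does not spell out: each clock cycle delivers one sample from each of the four channels, and each frame holds four consecutive samples of a single channel. The function name and the data layout are hypothetical.

```python
def reorder_4x4(channels):
    """Group four parallel channel streams into frames of four samples per channel.

    `channels` is a list of four equal-length sample lists (channels A, B, C, D).
    Returns a list of (channel_index, frame) tuples in output order.
    """
    if len(channels) != 4:
        raise ValueError("expected exactly four channels")
    length = len(channels[0])
    if any(len(c) != length for c in channels) or length % 4 != 0:
        raise ValueError("channels must have equal length, a multiple of 4")
    frames = []
    for base in range(0, length, 4):
        # Row t of the block holds the four samples that arrive in cycle base + t.
        block = [[channels[ch][base + t] for ch in range(4)] for t in range(4)]
        # Reading the block column-wise gives one frame per channel.
        for ch in range(4):
            frames.append((ch, [block[t][ch] for t in range(4)]))
    return frames


streams = [[f"{name}{i}" for i in range(8)] for name in "ABCD"]
for ch, frame in reorder_4x4(streams)[:5]:
    print(ch, frame)
```

In this model each block yields frames A, B, C, and D of the same frame index, after which the next block starts again with channel A.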

B. Pipeline Radix-2/2² Butterfly Unit in Stage 1

As shown in Fig. 3, stage 1 contains a control-signal and address generator, 3 register files, each of which can store 256 complex data, 4 radix-2/2² butterfly units, and 4 complex multipliers that handle the multiplications by the twiddle factors. Owing to the periodicity of the twiddle factors, only 1/8 of the cosine and sine sample points are stored in the ROMs [6].

Figure 1. Block diagram of proposed reconfigurable pipeline FFT. (Figure label: Data Reordering.)

Figure 2. Block diagram of data reordering.

Figure 3. Block diagram of stage 1 Radix-2/2² FFT. (Figure labels: three 256-entry register files, mapping, twiddle factor ROM, MUX, commutator, radix-2/2² units, control signal, radix mode, address generator; single data path and four data path inputs; data out.)

The function of the state controller is to manage the whole computation process and to generate the read/write addresses and control signals for the register files and the 4 radix-2/2² butterfly units. The 4 parallel input data sequences, reordered into 4×4 blocks that each contain data frames A, B, C, and D of the same index, are stored in the register files. Computed data are written back to the memory according to the pipeline radix-4 schedule and then output to the next stage.

The radix-2/2² butterfly is the kernel computation unit in stage 1, as shown in Fig. 4. Designed from the signal flow graph (SFG) of the radix-4 operation shown in Fig. 5, it is constructed from modified complex adders, multiplexers, and 3 data paths fed back to the registers to fetch the input data and store temporary values. A mode signal selects the radix by bypassing adders, so that the radix-4 butterfly unit can be configured into radix-2 mode. With this structure, a butterfly using approximately the hardware resources of a radix-4 unit supports reconfigurable radix-2/2² computation.

The four data in each frame are processed by the 4 radix-2/2² butterfly units, one datum per unit. All four butterfly units (BU4) can therefore receive their inputs at each radix-2/2² calculation, as shown in Fig. 4.
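A behavioral sketch of such a reconfigurable butterfly is given below. It builds the radix-4 butterfly from two levels of radix-2 additions, with the trivial multiplication by −j between them, and models the radix-2 mode by bypassing the second level. The exact bypass wiring and the output ordering of the actual unit are not specified in the text, so both are assumptions here, as is the function name.

```python
def radix2_2sq_butterfly(a, b, c, d, radix4_mode=True):
    """Radix-4 butterfly built from two radix-2 levels, with an optional bypass.

    Inputs a, b, c, d correspond to x[n], x[n+N/4], x[n+N/2], x[n+3N/4].
    In radix-4 mode the outputs are the 4-point DFT of (a, b, c, d).
    In radix-2 mode (assumed behavior) the second adder level is bypassed and
    the outputs are two independent radix-2 butterflies on (a, c) and (b, d).
    """
    # First level: radix-2 butterflies on (a, c) and (b, d).
    s0, s1 = a + c, a - c
    s2, s3 = b + d, b - d
    if not radix4_mode:
        return [s0, s1, s2, s3]
    # Trivial twiddle: multiply the difference branch by -j.
    t3 = s3 * (-1j)
    # Second level: radix-2 butterflies on (s0, s2) and (s1, t3).
    return [s0 + s2, s1 + t3, s0 - s2, s1 - t3]


print(radix2_2sq_butterfly(1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j))
print(radix2_2sq_butterfly(1 + 0j, 2 + 0j, 3 + 0j, 4 + 0j, radix4_mode=False))
```

Sharing the first adder level between both modes is what keeps the hardware cost close to that of a plain radix-4 unit in this model.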

For instance, the 1st radix-4 butterfly unit deals with the radix-4 operation of frame A, index 0. During the first 192×4 clock cycles, the first 3N/4 data of each of the 4 sequences are stored in the registers. For data frame A, the operation does not start until the 0th (x[0]), the 64th (x[0 + 256/4]), the 128th (x[0 + 256×2/4]), and the 192nd (x[0 + 256×3/4]) points are ready. As soon as the 192nd input datum arrives, the radix-4 operation is triggered. Once the radix-4 computation is done, the result is multiplied by the twiddle factors stored in the ROM and delivered to the next pipeline stage, and the data of the same index in the next data frame, B, are collected and computed. After the BU4 calculation of the 4 frames of the same index is finished, the next operation on data frame A is triggered as soon as the 196th (x[0 + 256×3/4 + 4]) point is collected.

The other three BU4s act in the same way. Obviously, the operation is fully pipelined. There is no waiting latency or idle cycle. The throughput of the pipeline structure is maximized.

C. Pipeline Butterfly Unit in Stage 2 and Stage 3

As shown in Fig. 6, this module consists of 2 stages of radix-4 butterfly units and complex multipliers for the nontrivial twiddle factor multiplications in the four parallel data paths. The radix-4 butterfly unit (BU4 in Fig. 6) is designed directly from the radix-4 butterfly SFG in Fig. 5, without the bypass for radix-2 mode. Every stage in the pipeline structure needs 3 register banks and one ROM for twiddle factor storage. In stages 2 and 3, the registers are used as delay buffers for the 4 data paths in blocks.

D. Radix-4 Butterfly Unit in Stage 4

As shown in Fig. 7, owing to the 4 parallel operations in the previous stages, the radix-4 butterfly in stage 4 receives its 4 input data simultaneously from the 4 output sequences of stage 3. Thus, a delay buffer is no longer needed in this stage. The radix-4 butterfly unit in this stage is mapped from the SFG in Fig. 5.

E. Memory Consideration

To obtain adequate precision with a simple hardware implementation, a fixed-point data format is used in this design. The wordlength is 10 bits for the input data and 13 bits for the output data.

Figure 4. Block diagram of the Radix-2/2² Butterfly unit in the stage 1.

Figure 5. SFG of the Radix-2/2² Butterfly unit in the stage 1. (Figure labels: W_N^0, W_N^1, W_N^2, W_N^3.)

Figure 6. Block diagram of pipeline stage 2 and stage 3 Radix-4 FFT: (a) block diagram of the pipeline radix-4 butterfly unit and (b) input and output data in 4 frames of each block.

According to (2), (3), and the radix-4 SFG in Fig. 5, every stage in the pipeline structure needs 3 banks of registers and a ROM for twiddle factor storage. In practice, large registers are implemented as register files. Taking the first radix-2/2² stage as an example, the radix-4 operation requires x[n], x[n+N/4], x[n+N/2], and x[n+3N/4], which are stored in three 256×26-bit register files, since each complex symbol has 13 bits each for I and Q.
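The storage figures quoted for stage 1 can be worked through with a short calculation. The sketch below uses only numbers given in this paper (3 register files of 256 words, 13 bits each for I and Q) together with the delay-element expression 4N−16 listed for the proposed design in Table I; the function names are illustrative.

```python
def register_file_bits(banks=3, depth=256, bits_per_component=13):
    """Total stage-1 register-file storage in bits (I and Q per complex word)."""
    bits_per_word = 2 * bits_per_component  # 13-bit I + 13-bit Q = 26 bits
    return banks * depth * bits_per_word


def delay_elements(n_points):
    """Delay-element count 4N - 16, the expression listed in Table I."""
    return 4 * n_points - 16


print(register_file_bits())  # 19968
print(delay_elements(256))   # 1008
```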

F. Modified Multiplier

The twiddle factor multipliers (C. Multiplier in Table I) in stages 2, 3, and 4 are complex multipliers implemented as constant multipliers with address-mapping multiplexers. Owing to the periodicity of the twiddle factors, only 1/8 of the cosine and sine sample points are stored [6]. The constant multiplications can be carried out using only these nine sets of constants by appropriately swapping their real and imaginary parts and choosing the appropriate sign [10]. The area of the multiplier is thereby reduced.
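The one-eighth storage scheme can be sketched as follows. The table holds cosine and sine values only for angles in the first octant (0 to π/4), and any twiddle factor W_N^k is recovered by selecting an entry, optionally swapping cosine and sine, and adjusting signs. This is a floating-point illustration of the general technique with hypothetical function names; the address mapping used in the actual design is not given in the text. For N = 64 the table has nine entries, which is consistent with the nine sets of constants mentioned above, although the correspondence is an assumption.

```python
import cmath
import math


def build_octant_rom(n):
    """Table of (cos, sin) for angles 2*pi*m/n, m = 0 .. n/8 (first octant only)."""
    if n % 8 != 0:
        raise ValueError("n must be a multiple of 8")
    q = n // 8
    return [(math.cos(2 * math.pi * m / n), math.sin(2 * math.pi * m / n))
            for m in range(q + 1)]


def twiddle_from_rom(rom, n, k):
    """Recover W_n^k = cos(2*pi*k/n) - j*sin(2*pi*k/n) from the first-octant table."""
    q = n // 8
    octant, r = divmod(k % n, q)
    c, s = rom[r]         # cos/sin of the residual angle
    cm, sm = rom[q - r]   # cos/sin of the mirrored angle (pi/4 minus residual)
    # (cos, sin) of the full angle for each of the eight octants.
    cos_phi, sin_phi = [
        (c, s), (sm, cm), (-s, c), (-cm, sm),
        (-c, -s), (-sm, -cm), (s, -c), (cm, -sm),
    ][octant]
    return complex(cos_phi, -sin_phi)


rom64 = build_octant_rom(64)
print(len(rom64))  # 9
worst = max(abs(twiddle_from_rom(rom64, 64, k) - cmath.exp(-2j * cmath.pi * k / 64))
            for k in range(64))
print(worst < 1e-12)
```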

IV. PERFORMANCE AND COMPARISON

The design is synthesized in SMIC 0.13 µm 1P8M technology with a 1.2 V power supply. The maximum operating frequency is 125 MHz. The core size is 1470×1469 µm², including memory. With a wordlength of 10 bits for the input data and 13 bits for the output data, the SQNR of the FFT processor is 42.7 dB. The latency is 260 clock cycles.

IEEE 802.16e has a subcarrier spacing of 10.94 kHz and 128, 512, 1024, or 2048 subcarriers. Working at the required frequency, the 128/256-point pipeline FFT/IFFT processor can meet the throughput requirement of 802.16e and supports FFT calculation for 1 to 4 channels in MIMO-OFDM applications.
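A rough check of these figures is sketched below. It uses the numbers reported in this section (125 MHz clock, 4 parallel paths, 260-cycle latency, 10.94 kHz subcarrier spacing) and assumes that the baseband sample rate of an N-point OFDM symbol is N times the subcarrier spacing, ignoring the cyclic prefix and any oversampling. The names are illustrative, and the result is an estimate rather than a statement of the standard's requirements.

```python
CLOCK_HZ = 125e6                 # maximum operating frequency reported above
PARALLEL_PATHS = 4               # samples processed per clock cycle
LATENCY_CYCLES = 260             # reported latency
SUBCARRIER_SPACING_HZ = 10.94e3  # subcarrier spacing quoted above


def peak_throughput(clock_hz, paths):
    """Peak throughput in samples per second."""
    return clock_hz * paths


def latency_seconds(cycles, clock_hz):
    """Latency in seconds at the given clock rate."""
    return cycles / clock_hz


def required_sample_rate(n_points, spacing_hz, channels=1):
    """Assumed baseband sample rate: N * spacing per channel (no cyclic prefix)."""
    return n_points * spacing_hz * channels


print(peak_throughput(CLOCK_HZ, PARALLEL_PATHS))            # 500000000.0
print(latency_seconds(LATENCY_CYCLES, CLOCK_HZ))            # about 2.08e-06 s
print(required_sample_rate(256, SUBCARRIER_SPACING_HZ, 4))  # about 1.12e7 samples/s
```

Under these assumptions, the peak throughput at the maximum clock rate exceeds the aggregate sample rate of four 256-point channels by a wide margin.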

V. CONCLUSION

In this paper, a novel 128/256-point FFT/IFFT processor for IEEE 802.16e MIMO-OFDM systems is designed by taking advantage of the pipeline architecture and a reconfigurable butterfly unit, so as to achieve low power dissipation and small area. To process 1-4 simultaneous data sequences, data reordering and grouping are introduced, and the processor can provide different throughput rates efficiently. The design is synthesized in SMIC 0.13 µm 1P8M technology with a 1.2 V power supply, with a core size of 1470×1469 µm².

REFERENCES

[1] WiMAX Forum, Mobile WiMAX-Part I: A technical overview and performance evaluations, Feb. 21, 2006.

[2] Shousheng He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," URSI International Symposium on Signals, Systems, and Electronics, vol. 29, pp. 257-262, Oct. 1998.

[3] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, no. 90, pp. 297-301, 1965.

[4] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.

[5] Yu-Wei Lin and Chen-Yi Lee, "Design of an FFT/IFFT Processor for MIMO OFDM Systems," IEEE Trans. Circuits and Systems I, vol. 54, no. 4, pp. 807-815, Apr. 2007.

[6] Yu-Wei Lin, Hsuan-Yu Liu, and Chen-Yi Lee, "A dynamic scaling FFT processor for DVB-T applications," IEEE J. Solid-State Circuits, vol. 39, no. 11, pp. 2005-2013, Nov. 2004.

[7] Fang-Li Yuan, Yi-Hsien Lin, Chih-Feng Wu, Muh-Tian Shiue, and Chorng-Kuang Wang, "A 256-Point Dataflow Scheduling 2×2 MIMO FFT/IFFT Processor for IEEE 802.16 WMAN," in Proc. IEEE ASSCC, Nov. 2008, pp. 309-312.

[8] Ludwig Schwoerer and Ernst Zielinski, "Optimized FFT Architecture for MIMO Applications," in Proc. European Signal Processing Conference (EUSIPCO), Sep. 2005.

[9] Koushik Maharatna, Eckhard Grass, and Ulrich Jagdhold, "A 64-Point Fourier Transform Chip for High-Speed Wireless LAN Application Using OFDM," IEEE J. Solid-State Circuits, vol. 39, no. 3, pp. 484-493, Mar. 2004.

TABLE I. COMPARISON OF DIFFERENT DESIGNS

Design               Proposed                 [7]                  [5]                  [8]
Architecture         4-parallel pipeline      2-parallel MRDS      MRMDF                Folding Processing
Technology           0.13 µm                  0.18 µm              0.13 µm              -
FFT Size             128/256                  256                  128/64               256
Algorithm            Radix-2/2², Radix-4      Radix-4, Radix-8     Radix-2, Radix-8     Radix-2²
Sequence             1-4                      2                    1-4                  2
Delay Element        1008 (4N-16)             510×2 (2N-2)         512 (4N-4)           510×2 (2N-2)
C. Multiplier        12                       2                    6                    3
Wordlength (bits)    I:10/O:13                I:11/O:15            12                   -
Clock Rate           R                        2R                   R                    2R
Throughput Rate      4R                       2R                   4R                   2R
Area (µm²)           1470×1469                -                    660×2142             -

Figure 7. Block diagram of radix-4 in stage 4, n=0…64 from each data channel. (Figure labels: Sequence 1 to Sequence 4, BU4, Out 1 to Out 4, Stage 4; frames D, C, B, A along the time axis.)
