
Twiddle-Factor-Based FFT Algorithm with Reduced Memory Access

Yingtao Jiang
Department of Electrical & Computer Engineering
University of Nevada, Las Vegas
Las Vegas, NV 89119, USA
[email protected]

Ting Zhou
ASIC Design Division, Gennum Corporation
Kanata, Ontario, Canada
[email protected]

Yiyan Tang and Yuke Wang
Department of Computer Science
University of Texas at Dallas
Richardson, TX 75083, USA
{yiyan, yuke}@utdallas.edu

Abstract

In microprocessor-based systems, memory access is expensive due to longer latency and higher power consumption. In this paper, we present a novel FFT algorithm that reduces the frequency of memory access as well as the number of multiplication operations. For an N-point FFT, we design the FFT with two distinct sections: (1) The first section of the FFT structure computes the butterflies involving twiddle factors W_N^j (j ≠ 0) through a computation/partitioning scheme similar to Huffman coding. In this section, all the butterflies sharing the same twiddle factor are clustered and computed together. In this way, redundant memory access to load twiddle factors is avoided. (2) In the second section, the remaining (N − 1) butterflies, which involve only the twiddle factor W_N^0, are computed with a register-based breadth-first tree-traversal algorithm. This novel twiddle-factor-based FFT is tested on the TI TMS320C62x digital signal processor. The results show that, for a 32-point FFT, the new algorithm leads to as much as a 20% reduction in clock cycles and an average 30% reduction in memory accesses compared with the conventional DIF FFT.

1. Introduction

In the field of digital signal processing, the Discrete Fourier Transform (DFT) plays an important role in the analysis, design, and implementation of discrete-time signal-processing algorithms and systems [1]-[10][12]-[20]. For instance, the DFT can be used to calculate a signal's frequency response, find a system's frequency response from the system's impulse response, and serve as an intermediate step in more elaborate signal-processing techniques. The Fast Fourier Transform (FFT) is an efficient class of computational algorithms for the DFT. FFT algorithms are based on the fundamental principle of decomposing the computation of the DFT of a sequence of length N into successively smaller DFTs, with corresponding improvements in computational speed.

The study of FFT algorithms not only has a long history and a large bibliography; it is also still an exciting research field in which new results find practical applications. Efficient FFT algorithms were first discovered by Gauss [7], and later by Runge and Konig [13]. The importance of FFT algorithms was not fully recognized until their rediscovery by Cooley and Tukey [4] in the 1960s. Since then, research in FFTs has proliferated; to name a few directions: higher-radix algorithms [2], mixed-radix [15], prime-factor [8], Winograd (WFTA) [20], split-radix Fourier transform algorithms [16][17], recursive FFT algorithms [19], and the combination of decimation-in-time and decimation-in-frequency FFT algorithms [14]. The structures of these FFT computations are all organized in the same way defined in [4].

There are many ways to measure the complexity and efficiency of FFT algorithms, and a final assessment depends on both the available technology and the intended applications. By careful analysis, however, we can see that there is a memory-access problem with previously proposed approaches. For instance, unless the processor on which the FFT runs provides a large number of registers, repeated memory accesses to load some twiddle factors are unavoidable under existing FFT algorithms. It has been recognized that memory access is expensive due to long latency and high power consumption. In this paper, we propose an algorithm that removes this redundant memory access in the calculation of the DFT.

For an N-point FFT, we consider two distinct cases: W_N^0 and W_N^j (j ≠ 0). The FFT structure is, therefore, organized as two concatenated sections. The first section computes those butterflies involving twiddle factors W_N^j (j ≠ 0). In this section, once a twiddle factor W_N^j is loaded, it is used until there is no further need for its value in the subsequent computation. In this way, we show that for an N-point Radix-2 FFT, only (N/2 − 1) memory accesses are needed, whereas classical approaches may require (N − 1) memory accesses to load twiddle factors for computation. The power saving can be quite significant when N is a

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'02) 1530-2075/02 $17.00 © 2002 IEEE


very large number. In the second section, the remaining butterflies, which involve the twiddle factor W_N^0 and account for a total of (N − 1) butterflies, are computed; the main concern there is to construct a tree structure that minimizes the frequency of the read/write operations used to store intermediate results. To this end, we propose a breadth-first traversal algorithm. Since W_N^0 = 1, no multiplication operation is needed in computing these (N − 1) butterflies.

It is fair to say that this novel twiddle-factor-based algorithm leads to efficient implementations and a wide range of applications, such as low-power, high-performance ASIC designs. We test the proposed algorithm on the TI TMS320C62x fixed-point digital signal processor (DSP). The experimental results show that the new algorithm requires fewer clock cycles to compute the N-point FFT than conventional FFT approaches. Furthermore, we can expect the power consumption of the new approach to be significantly less than that of conventional FFT schemes, owing to the reduction of power-hungry memory accesses and multiplication operations.

The rest of this paper is organized as follows. In Section 2, the conventional Radix-2 FFT algorithm is briefly reviewed. The new twiddle-factor-based FFT algorithm is described in Section 3. Some practical issues are addressed in Section 4. Experimental results are presented in Section 5. The conclusions are summarized in Section 6.

2. Discrete Fourier Transform and FFT

The Discrete Fourier Transform (DFT) of a discrete signal x(n) can be directly computed as

    X(k) = sum_{n=0}^{N-1} x(n) W_N^{nk},   k = 0, 1, ..., N - 1      (3)

where W_N = e^{-j2π/N} = cos(2π/N) − j sin(2π/N) is known as the phase or twiddle factor, and j = sqrt(−1). Here x(n) and X(k) are sequences of complex numbers.

An efficient method of computing the DFT that significantly reduces the number of required arithmetic operations is called the FFT [1]-[10][12]-[20]. An FFT algorithm divides the DFT calculation into many short-length DFTs and results in huge savings of computation. If the length of the DFT is N = R^v, i.e., the product of identical factors, the corresponding FFT algorithms are called Radix-R algorithms. Assume the FFT length is 2^M, where M is the number of stages. The radix-2 DIF FFT divides an N-point DFT into 2 N/2-point DFTs, then into 4 N/4-point DFTs, and so on. That is, the radix-2 DIF FFT expresses the DFT equation as two summations, then divides it into two equations, each of which computes every other output sample. To arrive at a two-point DFT

decomposition, let W_N^{2nr} = W_{N/2}^{nr}, and the following equations are derived:

    X(2k) = sum_{n=0}^{N/2-1} [x(n) + x(n + N/2)] W_{N/2}^{nk},   k = 0, 1, ..., N/2 - 1      (4)

    X(2k+1) = sum_{n=0}^{N/2-1} [x(n) - x(n + N/2)] W_N^n W_{N/2}^{nk},   k = 0, 1, ..., N/2 - 1      (5)

The above equations are frequently represented in butterfly form. The butterfly of a Radix-2 algorithm is shown in Fig. 1.a. The complete flow graph of an N-point Radix-2 FFT can be constructed by applying the basic butterfly structure (Fig. 1.a) recursively, where N = 2, 4, 8, .... An N-point Radix-2 FFT has log2 N stages. Within stage s, for s = 1, 2, ..., log2 N, there are N/2^s groups of butterflies, with 2^{s-1} butterflies per group. The computation of the 8-point DFT, for instance, can be accomplished by the algorithm depicted in Fig. 1.b.

3. The New Twiddle-Factor-Based FFT Algorithm

It can be seen from Fig. 1.b that, unless a sufficiently large number of registers is available, in most practical situations the twiddle factor W_8^2 will be loaded from memory to the CPU multiple times across Stages 1 and 2. Such redundant memory access recurs when loading other twiddle factors and therefore becomes a serious problem when computing a large FFT. In this section, we present the twiddle-factor-based FFT algorithm, which reduces the number of memory accesses as well as the number of multiplication operations.

[Figure] Fig. 1 Flow graph of the FFT: (a) the basic Radix-2 butterfly, producing y(n) = x(n) + x(n + N/2) and y(n + N/2) = [x(n) − x(n + N/2)] W_N^0 from inputs x(n) and x(n + N/2), with the −1 branch marking the subtraction path; (b) the complete flow graph of an 8-point Radix-2 DIF FFT over inputs x(0)-x(7), using twiddle factors W_8^0, W_8^1, W_8^2, and W_8^3 and producing outputs X(0)-X(7) in bit-reversed order.


Theorem 1: The total number of butterflies involving the twiddle factor W_N^0 is N − 1 for an N-point Radix-2 FFT.

Proof: The proof is carried out on an N-point DIF (Decimation-In-Frequency) FFT.
At stage 1, there is only one butterfly requiring W_N^0.
At stage 2, there are two butterflies requiring W_N^0.
At stage k, there are 2^{k-1} butterflies requiring W_N^0.
There are in total log2 N stages in the FFT structure. Therefore, the total number of butterflies that require W_N^0 is 1 + 2 + ... + 2^{log2 N - 1} = N − 1.

Theorem 2: The total number of butterflies involving the same twiddle factor W_N^j (j ≠ 0) is 2^{k+1} − 1 for an N-point Radix-2 FFT, where k satisfies (j / 2^k) mod 2 = 1, for k = 0, 1, ..., log2 N − 1 (that is, 2^k is the largest power of two that divides j).

Proof: (1) If j = 1, 3, 5, 7, ..., N/2 − 1, then W_N^j appears only in the first stage. Similarly, from Eqs. (4) and (5), we can see that when (j / 2^k) mod 2 = 1, for k = 0, 1, 2, ..., the appearances of W_N^j (j ≠ 0) span from Stage 1 up to Stage k + 1.

(2) A twiddle factor W_N^j (j ≠ 0) appears in the first stage once; it appears in the second stage twice, as there are two butterflies requiring this twiddle factor. This continues up to the (k + 1)-th stage, where 2^k butterflies share the same twiddle factor. Therefore, the total number of butterflies that require W_N^j (j ≠ 0) is 1 + 2 + ... + 2^k = 2^{k+1} − 1.

The C-like pseudo code of the proposed twiddle-factor-based algorithm is shown in Fig. 2. For an N-point FFT, this algorithm consists of two concatenated sections that deal with two distinct cases: W_N^0 and W_N^j (j ≠ 0).
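Theorems 1 and 2 are easy to sanity-check by walking the standard DIF structure and counting, for each exponent j, how many butterflies multiply by W_N^j. The counting sketch below is ours, not the paper's (it uses the fact that the stage with subproblem length len steps its exponents by N/len):

```c
/* For each exponent j (0 <= j < n/2), count how many butterflies of an
 * n-point radix-2 DIF FFT multiply by W_n^j.  At the stage with
 * subproblem length `len`, each subproblem holds len/2 butterflies whose
 * twiddle exponents step by n/len. */
static void count_twiddle_uses(int n, int counts[])
{
    for (int j = 0; j < n / 2; j++)
        counts[j] = 0;
    for (int len = n; len >= 2; len /= 2) {      /* one DIF stage   */
        int step = n / len;                      /* exponent stride */
        for (int sub = 0; sub < n / len; sub++)  /* subproblems     */
            for (int t = 0; t < len / 2; t++)    /* butterflies     */
                counts[t * step]++;
    }
}
```

For N = 16 this yields counts[0] = 15 = N − 1 (Theorem 1) and, e.g., counts[4] = 7 = 2^{2+1} − 1 since 2^2 is the largest power of two dividing 4 (Theorem 2).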

Section 1: In the first section of the FFT structure, the butterflies involving twiddle factors W_N^j (j ≠ 0) are computed. Here, the major concern is minimizing the number of memory accesses needed to load the twiddle factors. That is, once a twiddle factor W_N^j is loaded, it is repeatedly used until there is no further need for its value in the subsequent computation.

For an N-point FFT, the binary index of a data sample has the form (B_k B_{k-1} ... B_1 B_0), where k = log2 N − 1. All butterflies involving the twiddle factor W_N^j (j ≠ 0) are computed at super-stage (m + 1), where m satisfies (j / 2^m) mod 2 = 1, for m = 0, 1, 2, ..., log2 N − 1 (that is, 2^m is the largest power of two dividing j).

The proposed twiddle-factor-based algorithm can be viewed as a skewed version of the popular DIF FFT structure. We therefore use the term "super-stage" to reflect the fact that at super-stage ss, the butterflies to be computed span stages 1, 2, ..., ss of the classical DIF FFT. Section 1 consists of (log2 N − 1) super-stages.

(1) At the first super-stage, all the data samples with binary indices (B_k B_{k-1} ... B_2 B_1 1) are computed. Among these N/2 data samples, any two whose indices are (B_k B_{k-1} ... B_1 1) and (~B_k B_{k-1} ... B_1 1), where ~B denotes the complement of bit B, pair together to form a butterfly. The twiddle factor involved in this butterfly is W_N^j, where j corresponds to the decimal value of the binary index (0 B_{k-1} ... B_2 B_1 1). In total, N/4 butterflies are computed in this super-stage.

// n: n-point FFT
// x: input data samples
//    x[2k]     -- real part of the kth sample
//    x[2k + 1] -- imaginary part of the kth sample
// w: pre-computed twiddle factors
//    w[2k]     -- real part of the kth twiddle factor
//    w[2k + 1] -- imaginary part of the kth twiddle factor
void radix2_fft(int n, float* x, float* w)
{
    int n2 = 0;
    int start = 1;
    int step = 2;
    int n3, n4;

    // Section 1: Compute butterflies with twiddle factors
    // W(N, j), j <> 0.
    for (proc = n; proc > 2; proc /= 2) {           // super-stage
        n2++;
        for (twiddle = start; twiddle < n/2; twiddle += step) {
            // load one twiddle factor and repeatedly use it
            co = w[twiddle*2];                      // real part (cosine)
            si = w[twiddle*2 + 1];                  // imaginary part (sine)
            n3 = n4 = n;
            for (stage = 0; stage < n2; stage++) {  // stage
                n4 /= 2;
                for (i0 = twiddle >> stage; i0 < n; i0 += n3) {
                    // butterfly computation
                    i1 = i0 + n4;
                    re0 = x[2*i0] + x[2*i1];
                    im0 = x[2*i0 + 1] + x[2*i1 + 1];
                    re1 = x[2*i0] - x[2*i1];
                    im1 = x[2*i0 + 1] - x[2*i1 + 1];
                    x[2*i0] = re0;
                    x[2*i0 + 1] = im0;
                    x[2*i1] = re1*co - im1*si;
                    x[2*i1 + 1] = re1*si + im1*co;
                }
                n3 = n4;
            }
        }
        start *= 2;
        step *= 2;
    }
    n2++;

    // Section 2: Compute the butterflies with
    // twiddle factor W(N, 0).
    n3 = n4 = n;
    for (stage = 0; stage < n2; stage++) {
        n4 /= 2;
        for (i0 = 0; i0 < n; i0 += n3) {
            i1 = i0 + n4;
            re0 = x[2*i0] + x[2*i1];
            im0 = x[2*i0 + 1] + x[2*i1 + 1];
            re1 = x[2*i0] - x[2*i1];
            im1 = x[2*i0 + 1] - x[2*i1 + 1];
            x[2*i0] = re0;
            x[2*i0 + 1] = im0;
            x[2*i1] = re1;
            x[2*i1 + 1] = im1;
        }
        n3 = n4;
    }
}
/* Note that in Section 2, no multiplication operation is needed.
   Furthermore, the algorithm used in this section is not optimized in
   terms of memory access. If a small number of registers are allocated
   to save temporary values, the algorithm in Fig. 4 can partition the
   computation so that intermediate read/write memory accesses are
   significantly reduced. */

(2) At the second super-stage, all the data samples with binary indices (B_k B_{k-1} ... B_2 1 0) and (B_k B_{k-1} ... B_1 1) are computed. There are N/4 + N/2 data samples falling into this category. Any two data samples with binary indices (B_k ... B_2 1 0) and (~B_k ... B_2 1 0), or (B_k B_{k-1} ... B_1 1) and (B_k ~B_{k-1} ... B_1 1), can pair together to form a butterfly (~B again denotes the complement of bit B). The twiddle factor involved in such a butterfly is W_N^j, where j is the decimal value of the binary index (0 B_{k-1} ... B_2 1 0). In total, N/8 + N/4 butterflies are computed in this super-stage. That is, a quarter of the butterflies of the first stage and a half of the butterflies of the second stage of the original DIF FFT are computed.

(3) Within the ss-th super-stage, for ss = 1, 2, ..., log2 N − 1, all the data samples with binary indices (B_k ... B_{ss+1} 1 0 ... 0), (B_k ... B_ss 1 0 ... 0), ..., (B_k B_{k-1} ... B_2 1 0), and (B_k B_{k-1} ... B_1 1) are computed. In this case, the butterflies originate from Stage 1 all the way up to Stage ss of the corresponding DIF FFT. There are N/2^{ss+1} different twiddle factors involved in this super-stage.

Under this algorithm, according to Theorem 1, only (N/2 − 1) non-redundant memory accesses are needed. The classical approach, the DIF FFT shown in Fig. 1, however, may require as many as (N − 1) memory accesses to load twiddle factors for computation, unless the size of the register file is comparable to the size of the input samples. Very large register files, however, are rarely seen in current microprocessor designs.
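The two counts can be made concrete with a pair of load-counting helpers. These are our own bookkeeping sketches, not the paper's code: the conventional figure charges one load per distinct twiddle per stage (no twiddle registers retained across stages), while the clustered figure charges one load per nonzero twiddle factor overall.

```c
/* Conventional DIF FFT with no twiddle registers retained across
 * stages: each stage reloads its distinct twiddle factors once,
 * giving n/2 + n/4 + ... + 1 = n - 1 loads in total. */
static int twiddle_loads_conventional(int n)
{
    int loads = 0;
    for (int len = n; len >= 2; len /= 2)
        loads += len / 2;   /* distinct twiddles used in this stage */
    return loads;
}

/* Clustered (twiddle-factor-based) schedule: every W_n^j with j != 0
 * is loaded exactly once; W_n^0 = 1 needs no load at all. */
static int twiddle_loads_clustered(int n)
{
    return n / 2 - 1;
}
```

For a 32-point FFT this gives 31 versus 15 loads; for 1024 points, 1023 versus 511, which is where the power saving for large N comes from.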

As an illustrative example, Table 1 lists the computation order of the 16-point FFT, with indexing and pairing information presented in binary format. Altogether, there are 3 super-stages and 17 butterflies. The overall computation structure of this 16-point FFT based on the proposed algorithm is shown in Fig. 3.

Table 1. The computation order of a 16-point twiddle-factor-based FFT: indexing and pairing (~B denotes the complement of bit B)

Super-Stage 1 (Stage 1 of the original FFT):
    (B3 B2 B1 1) and (~B3 B2 B1 1)
Super-Stage 2 (Stages 1-2 of the original FFT):
    (B3 B2 1 0) and (~B3 B2 1 0)
    (B3 B2 B1 1) and (B3 ~B2 B1 1)
Super-Stage 3 (Stages 1-3 of the original FFT):
    (B3 1 0 0) and (~B3 1 0 0)
    (B3 B2 1 0) and (B3 ~B2 1 0)
    (B3 B2 B1 1) and (B3 B2 ~B1 1)

Fig. 2 Pseudo code of the proposed twiddle-factor-based FFT algorithm


From the above discussion, we can see that the FFT structure can be computed in a way similar to Huffman coding. This resolves the data dependence, and the verification of this algorithm can be viewed as "decoding" the Huffman codes.

It can be seen that four nested loops are required in this section of the computation, whereas traditional approaches may require three. This loop overhead, however, can easily be absorbed by current processors with multiple data paths, such as the TI TMS320C62x DSP [18].

Section 2: In the second section, the remaining butterflies, which involve the twiddle factor W_N^0, are computed. Note that no multiplication is needed in computing these (N − 1) butterflies (Theorem 1), since W_N^0 = 1. All these butterflies are organized as a binary tree with log2 N stages. The memory accesses of this section can be significantly reduced if a few user-visible data registers are available. Depending on the number of register pairs (M) available to hold intermediate results, we can traverse the binary tree with the algorithm shown in Fig. 4, where a visit to a node corresponds to a 2-point butterfly computation.

// N: N-point FFT
// x: input samples
//    x[2k]     -- real part of the kth data sample, k = 0, 1, ..., (N-1)
//    x[2k + 1] -- imaginary part of the kth data sample
// r: number of register pairs
// j: r = 2^j
void section2(int N, float* x, int n1)   // n1: N = 2^n1
{
    // The number of stages to be calculated in the prolog
    int n5 = n1 % j;
    int n2 = N;

    // The prolog of the tree structure
    for (proc = 0; proc < n5; proc++) {
        n3 = n2;
        n2 >>= 1;
        for (bu = 0; bu < N; bu += n3) {
            // butterfly_cal(re0, im0, re1, im1): compute the 2-point
            // butterfly on the two specified points; re0, re1 are the
            // real parts and im0, im1 the imaginary parts.
            butterfly_cal(x[2*bu], x[2*bu + 1],
                          x[2*(bu + n2)], x[2*(bu + n2) + 1]);
        }
    }

    // The kernel of the tree structure.
    // With r pairs of registers, Reg_real[0:(r-1)] and Reg_im[0:(r-1)],
    // (r - 1) butterflies can be computed per node; intermediate
    // results stay in registers rather than being written back to
    // memory.
    for (proc = N >> n5; proc > 1; proc >>= j) {
        int n4 = proc >> j;
        int index = 0;
        for (group = 0; group < (1 << (n1 - n3)); group++) {
            int base = index;
            // Fetch the points from memory into registers
            for (i = 0; i < r; i++) {
                Reg_real[i] = x[2*index];
                Reg_im[i] = x[2*index + 1];
                index = index + n4;
            }
            m = r;
            // Compute the butterflies: (r - 1) in total
            for (i = 1; i <= j; i++) {   // r = 2^j levels
                p = m;
                m = m / 2;
                for (q = 0; q < r; q = q + p) {
                    butterfly_cal(Reg_real[q], Reg_im[q],
                                  Reg_real[q + m], Reg_im[q + m]);
                }
            }
            // Store the points back to the same memory locations
            index = base;
            for (i = 0; i < r; i++) {
                x[2*index] = Reg_real[i];
                x[2*index + 1] = Reg_im[i];
                index = index + n4;
            }
        }
    }
}

[Figure] Fig. 3 Structure of a 16-point FFT based on the proposed algorithm, over inputs x(0)-x(15). Section 1 spans the super-stages using twiddle factors W[1], W[3], W[5], W[7], then W[2], W[6], then W[4]; Section 2 contains the remaining butterflies, all with twiddle factor W[0]. Each butterfly is annotated with its super-stage, stage, and twiddle factor; bold lines denote multiplication by −1.

Fig. 4 Tree traversal algorithm for section 2 computation in the algorithm shown in Fig. 2


Here we assume that M is a power of 2 (i.e., M = 2^k), and there are two cases to consider: (1) (log2 N) mod k = 0, and (2) (log2 N) mod k ≠ 0. In the first case, the partition algorithm (Fig. 4) transforms the original binary tree into a complete tree in which each parent node has M immediate children. In the second case, all nodes except those in the top (log2 N) mod k levels of the binary tree are transformed to construct a reduced tree in which each parent node has M immediate children. The reduced tree is traversed in a breadth-first manner. Although a depth-first traversal algorithm could also be designed for this section, owing to the weak data dependence among butterflies, we feel a breadth-first approach is more viable for easy parallel computation and fewer memory accesses.

The second section of the 16-point FFT is shown in Fig. 3.b and redrawn in Fig. 5.a, where each node represents a butterfly. If four pairs of registers (M = 4) are allocated to store the intermediate data, the binary tree is transformed into a quad-tree, corresponding to Case 1 above. This quad-tree consists of 5 nodes, as opposed to 15 nodes in the original binary tree. Each node of the quad-tree comprises three butterflies, and the intermediate results within these three butterflies are saved in the dedicated register file, not in main memory. For a 32-point FFT, the merged tree consists of 3 levels and 11 nodes, as demonstrated in Fig. 5.b.
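The node counts quoted above (5 for 16 points, 11 for 32 points) follow from a simple recurrence, sketched below under our own names: with M = 2^j register pairs, groups of j binary levels collapse into one merged node, and when (log2 N) mod j ≠ 0 the top (log2 N) mod j levels remain binary (the prolog).

```c
static int ilog2(int n)          /* n assumed to be a power of two */
{
    int k = 0;
    while (n > 1) { n /= 2; k++; }
    return k;
}

/* Node count of the Section-2 butterfly tree after merging j binary
 * levels per node (M = 2^j register pairs). */
static int merged_node_count(int n, int j)
{
    int levels = ilog2(n);       /* the binary tree has log2(n) levels */
    int prolog = levels % j;     /* top levels left unmerged           */
    int nodes = 0, width = 1;    /* width = node count on this level   */
    for (int s = 0; s < prolog; s++) {
        nodes += width;          /* one binary prolog level            */
        width *= 2;
    }
    for (int s = prolog; s < levels; s += j) {
        nodes += width;          /* one merged level of j binary levels */
        width *= 1 << j;         /* each merged node has 2^j children  */
    }
    return nodes;
}
```

With j = 1 (no merging) the formula degenerates to the N − 1 nodes of the original binary butterfly tree.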

4. Practical Considerations

In this section, we present a variety of details that should be considered when implementing the proposed algorithm on various platforms.

4.1. Memory Interleaving and Addressing Patterns

There are two main ways in which memory systems are usually designed to match a high-performance processor. The first is to reduce the effective memory access time by reducing the number of accesses that actually reach the memory; caches, register sets, or some form of buffering are designed for this purpose. The second approach, memory interleaving, replaces a single memory unit with several memory units (banks), organized in such a way that independent and concurrent access to several units is possible. The technique works best with predictable memory-access patterns [11].

For interleaving to work, it is necessary that every M temporally adjacent words accessed by the processor reside in different banks. In our proposed FFT algorithm, it is desirable to design a memory architecture of N/2 words (the twiddle factors) interleaved across M banks. We assign memory address j to memory bank (j/2) mod M to achieve maximum memory bandwidth for concurrent access with the least amount of conflict. For instance, at super-stage 1, the twiddle factors W_N^j, j = 1, 3, 5, ..., N/2 − 1, are loaded for computation (Figs. 2 and 3), and it is preferable that they be stored in different memory banks. Fig. 6 illustrates the storage pattern of the twiddle factors used in a 16-point FFT with 4 memory banks. These twiddle factors can then be fetched in parallel at each super-stage.

To access a memory location, the memory system interprets each processor-generated address as a pair <bank_address, displacement_in_bank>. That is, in our case, for a memory of N/2 words interleaved across M banks, the high-order (log2(N/2) − log2 M) bits, together with the least significant bit, select a word within the selected bank, and the remaining log2 M bits select the bank.
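The mapping can be sketched as a one-line helper plus a check that the odd-indexed twiddles needed at super-stage 1 land in distinct banks. The names are ours, and M is assumed to be a power of two:

```c
/* Bank assignment described above: address j -> bank (j/2) mod M.
 * For N = 16, M = 4 this places W_16^{2b} and W_16^{2b+1} in bank b. */
static int bank_of(int j, int m)
{
    return (j / 2) % m;
}
```

Because consecutive odd addresses 1, 3, 5, 7 map to banks 0, 1, 2, 3, the super-stage 1 loads of a 16-point FFT are conflict-free across the 4 banks.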

4.2. In-Order Computation

The idea behind our proposed twiddle-factor-based FFT algorithm can, in fact, be borrowed to modify many existing FFT algorithms to squeeze out the redundant

[Figure] Fig. 5 An example of the merging of the binary tree: (a) the 16-point case, merged into Groups 1-5; (b) the 32-point case, merged into Groups 1-11.

Fig. 6 Assignment of addresses in memory interleaving (16-point FFT, 4 banks):

    Word address    Bank 0    Bank 1    Bank 2    Bank 3
    0:              W_16^0    W_16^2    W_16^4    W_16^6
    1:              W_16^1    W_16^3    W_16^5    W_16^7


memory access. For instance, the FFT structure shown in Fig. 1 can easily be modified to meet an in-order computation requirement, at the cost of more memory usage during the computation. The data dependence can still be viewed and checked using the timing diagram shown in Table 1.

4.3. Scaling

In practice, the arithmetic operations involved in FFTs are sometimes carried out using fixed-point or block floating-point arithmetic. Although fixed-point arithmetic leads to a fast and inexpensive implementation, it is limited in the range of numbers that can be represented and is susceptible to overflow, which may occur when the result of an addition exceeds the permissible number range. To deal with this problem, scaling has to be performed to prevent overflow. Although the new algorithm introduces the concept of the super-stage, scaling in our algorithm still has to be performed at the end of each butterfly computation, not at the super-stage level (Figs. 2 and 3). This simple scaling scheme can easily be embedded into the main flow of the algorithms shown in Figs. 2 and 4.
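In fixed-point arithmetic, per-butterfly scaling is typically a one-bit right shift of the add/subtract results, so the 17-bit intermediate always fits back into 16 bits. A minimal sketch follows, with our own names and an assumed 16-bit sample format (the paper does not specify a format or rounding mode):

```c
#include <stdint.h>

/* One scaled W_N^0 butterfly: the sum and difference of two 16-bit
 * samples need 17 bits, so each result is shifted right by one bit
 * before being stored, preventing overflow. */
static void scaled_butterfly(int16_t *a, int16_t *b)
{
    int32_t sum  = (int32_t)*a + (int32_t)*b;   /* 17-bit intermediate */
    int32_t diff = (int32_t)*a - (int32_t)*b;
    *a = (int16_t)(sum >> 1);                   /* scale by 1/2        */
    *b = (int16_t)(diff >> 1);
}
```

The accumulated scaling of 1/2 per butterfly is accounted for once at the end, which is why it can be applied per butterfly rather than per super-stage.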

5. The Experimental Results

In this section, we conduct experiments to evaluate the performance of our twiddle-factor-based FFT algorithm against a classical DIF FFT algorithm. The test platform is the TI TMS320C6211 fixed-point digital signal processor [18], which has an enhanced VLIW (Very Long Instruction Word) architecture.

Note that our algorithm depends on the number of register pairs, denoted M, available for temporary storage (Fig. 4). We chose three sizes in the test: M = 4, M = 8, and M = 16. Altogether, we designed four FFT programs in C: three based on our approach and one based on the DIF FFT. The sizes of the FFTs under test range from 8, 16, and 32 points all the way up to 1024 points. Since the trend of the results is similar across FFT sizes, for the sake of space we report only the results collected from the 32-point FFTs (Table 2). Four compiler optimization options, plus compilation without optimization, are considered. It can be seen that the memory access count of our approach is around 30% less than that of the conventional DIF FFT. Across the different optimization options, we observed clock-cycle reductions of as much as 20%. Furthermore, we have seen even more dramatic performance improvements and memory-access reductions for larger FFTs.

Table 2. Results of 32-point FFTs: twiddle-factor-based approach vs. DIF

                              Total Cycle   Cycle          Load/     Load/Store
    FFT                       Count         Reduction (%)  Store     Reduction (%)

    No optimization by compiler
    Traditional DIF FFT       88825         0              480/320   0
    Novel FFT (4-pair reg.)   88247         1              312/280   26
    Novel FFT (8-pair reg.)   86025         3              294/272   30
    Novel FFT (16-pair reg.)  84012         5              286/264   31

    Optimization option -o0 (register)
    Traditional DIF FFT       29860         0              480/320   0
    Novel FFT (4-pair reg.)   24257         19             312/280   26
    Novel FFT (8-pair reg.)   24058         20             294/272   30
    Novel FFT (16-pair reg.)  28112         6              286/264   31

    Optimization option -o1 (local)
    Traditional DIF FFT       21745         0              480/320   0
    Novel FFT (4-pair reg.)   18068         17             312/280   26
    Novel FFT (8-pair reg.)   18788         14             294/272   30
    Novel FFT (16-pair reg.)  29620         -36            286/264   31

    Optimization option -o2 (function)
    Traditional DIF FFT       26081         0              480/320   0
    Novel FFT (4-pair reg.)   24172         7              312/280   26
    Novel FFT (8-pair reg.)   23477         10             294/272   30
    Novel FFT (16-pair reg.)  26669         -2             286/264   31

    Optimization option -o3 (file)
    Traditional DIF FFT       26081         0              480/320   0
    Novel FFT (4-pair reg.)   24172         7              312/280   26
    Novel FFT (8-pair reg.)   23477         10             294/272   30
    Novel FFT (16-pair reg.)  26669         -2             286/264   31

6. Conclusions

In this paper, we have presented a novel twiddle-factor-based FFT algorithm with redundant memory access removed. The first section of the new FFT structure computes those butterflies with twiddle factors W_N^j (j != 0). In this section, once a twiddle factor W_N^j is loaded, it is reused repeatedly until its value is no longer needed in the subsequent computation. In this way, we show that in an N-point radix-2 FFT, only (N/2 - 1) memory accesses are needed to load twiddle factors, whereas the classical approach may require as many as (N - 1). This new FFT structure can be viewed as similar in spirit to Huffman coding. However, we also show that 4 nested loops are needed, as compared to 3 in the classical method. With careful design, or with the help of an efficient compiler, this has little impact on computing speed. In the section that computes the remaining butterflies involving the twiddle factor W_N^0, which account for a total of (N - 1) butterflies, the main concern is to construct a tree structure that minimizes the number of accesses to load and store data in the intermediate arrays of the FFT. As shown in this paper, different optimized computation structures are necessary depending on the given size of the temporary registers and the input samples. This novel algorithm structure should lead to efficient implementations and a wide range of applications, as demonstrated in an implementation based on the TI TMS320C62x DSP. The results show that the new algorithm requires significantly fewer clock cycles to compute an N-point FFT. Thanks to

the substantial reduction of memory access, considerable power savings can be expected.

According to the transposition theorem [12], we can obtain a transposed structure, in which Sections 1 and 2, as well as the directions of all branches in the network (Figs. 2-4), are reversed. In more general terms, the idea behind our proposed algorithm can be borrowed to modify many existing FFT algorithms to squeeze out redundant memory accesses and arithmetic operations.
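The twiddle-factor load counts claimed above can be checked with a small counting sketch. This is an illustration of the bookkeeping, not code from the paper, and the function names are hypothetical: a classical radix-2 DIF FFT loads each stage's block of twiddle factors separately (stage s uses the factors W_N^(j*2^s)), while the clustered scheme loads each distinct non-trivial factor W_N^j, j = 1..N/2-1, exactly once.

```c
/* Count twiddle-factor loads for an N-point radix-2 FFT (N a power of 2).
 * Classical DIF: stage s loads its own N/2^(s+1) twiddle factors, even
 * though later stages only need values that earlier stages already used.
 * Total = N/2 + N/4 + ... + 1 = N - 1. */
static unsigned classical_loads(unsigned n) {
    unsigned loads = 0;
    for (unsigned span = n / 2; span >= 1; span /= 2)
        loads += span;          /* this stage's distinct twiddle factors */
    return loads;               /* = N - 1 */
}

/* Clustered scheme: each distinct non-trivial factor W_N^1 .. W_N^(N/2-1)
 * is loaded once and reused across all butterflies that share it; the
 * trivial factor W_N^0 requires no load at all. */
static unsigned clustered_loads(unsigned n) {
    return n / 2 - 1;           /* = N/2 - 1 */
}
```

For a 32-point FFT this gives 31 classical loads versus 15 clustered loads, matching the (N - 1) versus (N/2 - 1) counts stated above.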

7. References

[1] D. H. Bailey, "FFTs in External or Hierarchical Memory," NASA Tech. Report RNR-89-004, 1989.

[2] G. D. Bergland, "A Radix-Eight Fast-Fourier Transform Subroutine for Real-Valued Series," IEEE Trans. Audio Electroacoust., vol. 17, no. 2, pp. 138-144, June 1969.

[3] C. S. Burrus and T. W. Parks, DFT/FFT and Convolution Algorithms and Implementation. NY: John Wiley & Sons, 1985.

[4] J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comput., vol. 19, pp. 297-301, 1965.

[5] P. Duhamel and H. Hollmann, "Split Radix FFT Algorithm," Electronics Letters, vol. 20, pp. 14-16, Jan. 5, 1984.

[6] P. Duhamel, "Implementation of 'Split-Radix' FFT Algorithms for Complex, Real, and Real-Symmetric Data," IEEE Acoustics, Speech, and Signal Processing Magazine, vol. 34, pp. 285-295, Apr. 1986.

[7] M. Frigo and S. G. Johnson, "The Fastest Fourier Transform in the West," Tech. Rep. MIT-LCS-TR-728, Laboratory for Computer Science, MIT, Cambridge, MA, Sept. 1997.

[8] M. T. Heideman, D. H. Johnson, and C. S. Burrus, "Gauss and the History of the FFT," IEEE Acoustics, Speech, and Signal Processing Magazine, vol. 1, pp. 14-21, Oct. 1984.

[9] D. P. Kolba and T. W. Parks, "A Prime Factor FFT Algorithm Using High-Speed Convolution," IEEE Trans. Acoust., Speech, Signal Processing, vol. 25, no. 4, pp. 281-294, Aug. 1977.

[10] K.-S. Lin, ed., Digital Signal Processing Applications with the TMS320 Family, vol. 1. Englewood Cliffs, NJ: Prentice Hall, 1987.

[11] A. R. Omondi, The Microarchitecture of Pipelined and Superscalar Computers. Boston: Kluwer Academic Publishers, 1999.

[12] A. V. Oppenheim and C. M. Rader, Discrete-Time Signal Processing, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1989.

[13] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes: The Art of Scientific Computing. Cambridge: Cambridge University Press, 1986.

[14] A. Saidi, "Decimation-in-Time-Frequency FFT Algorithm," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. III:453-456, April 19-22, 1994.

[15] R. C. Singleton, "An Algorithm for Computing the Mixed Radix Fast Fourier Transform," IEEE Trans. Audio Electroacoust., vol. 1, no. 2, pp. 93-103, June 1969.

[16] H. V. Sorensen and C. S. Burrus, "A New Efficient Algorithm for Computing a Few DFT Points," IEEE Trans. Acoust., Speech, Signal Processing, vol. 35, no. 6, pp. 849-863, June 1987.

[17] D. Takahashi, "An Extended Split-Radix FFT Algorithm," IEEE Signal Processing Letters, vol. 8, no. 5, pp. 145-147, May 2001.

[18] Texas Instruments, TMS320C62x DSP Library Programmer's Reference, SPRU402.

[19] A. R. Varkonyi-Koczy, "A Recursive Fast Fourier Transform Algorithm," IEEE Trans. Circuits and Systems II, vol. 42, pp. 614-616, Sep. 1995.

[20] S. Winograd, "On Computing the Discrete Fourier Transform," Math. Comput., vol. 32, no. 141, pp. 175-199, Jan. 1978.

Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'02) 1530-2075/02 $17.00 © 2002 IEEE