A Pipelined FFT Architecture for Real-Valued Signals

2634 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 12, DECEMBER 2009

A Pipelined FFT Architecture for Real-Valued SignalsMario Garrido, Keshab. K. Parhi, Fellow, IEEE, and J. Grajal

Abstract—This paper presents a new pipelined hardware archi-tecture for the computation of the real-valued fast Fourier trans-form (RFFT). The proposed architecture takes advantage of the re-duced number of operations of the RFFT with respect to the com-plex fast Fourier transform (CFFT), and requires less area whileachieving higher throughput and lower latency.

The architecture is based on a novel algorithm for the computa-tion of the RFFT, which, contrary to previous approaches, presentsa regular geometry suitable for the implementation of hardwarestructures. Moreover, the algorithm can be used for both the deci-mation in time (DIT) and decimation in frequency (DIF) decompo-sitions of the RFFT and requires the lowest number of operationsreported for radix 2.

Finally, as in previous works, when calculating the RFFT theoutput samples are obtained in a scrambled order. The problemof reordering these samples is solved in this paper and a pipelinedcircuit that performs this reordering is proposed.

Index Terms—Decimation-in-frequency, decimation-in-time,fast Fourier transform (FFT), memory reduction, pipelined archi-tecture, real-valued signals, reordering circuit.

I. INTRODUCTION

T HE FAST Fourier transform (FFT) is one of the mostimportant algorithms in the field of digital signal pro-

cessing, used to efficiently compute the discrete Fourier trans-form (DFT). For the computation of the FFT, pipelined hard-ware architectures [1]–[9] are widely used because they offerhigh throughput and low latency, as well as a reasonably lowarea and power consumption. This makes them attractive for alarge variety of applications, specially when they present real-time requirements. Thus, in order to provide solutions to presentand future applications, hardware designers keep improving thesignal processing capabilities of pipelined architectures of theFFT.

The FFT internally operates over complex numbers and pre-vious works offer efficient designs for the computation of theFFT of complex input samples (CFFT). However, they are notoptimized for the computation of the FFT of real input samples(RFFT). Indeed, when the input samples are real the spectrumof the FFT is symmetric [10] and approximately half of the op-erations are redundant [11].

The RFFT plays an important role in different real-time ap-plications. In medical applications such as electrocardiography

Manuscript received June 06, 2008; revised November 14, 2008. First pub-lished March 10, 2009; current version published December 31, 2009. Thispaper was recommended by Associate Editor V. Öwall.

M. Garrido and J. Grajal are with the Department of Signal, Systems and Ra-diocommunications, Unversidad Politécnica de Madrid, 28040 Madrid, Spain(e-mail: [email protected]; [email protected]).

K. K. Parhi is with the Department of Electrical and Computer Engineering,University of Minnesota, Minneapolis, MN 55455 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCSI.2009.2017125

(ECG) [12] or electroencephalography (EEG) [13], the powerspectral density (PSD) of various real-valued signals has to beestimated. This requires calculation of the RFFT repetitively onmany overlapping windows of the signals, where a specializedhardware implementation can make use of a higher-speedclock to meet real-time constraints. Moreover, in implantableor portable devices [12], [13], a dedicated hardware can savepower consumption. On the other hand, the RFFT is a keyelement in technologies based on discrete multitone (DMT)modulation, such as asymetric digital subscriber line (ADSL)or very high bit-rate digital subscriber line (VDSL) [14], [15].Nowadays, high signal processing capabilities are required forsecond-generation standards (ADSL2/2 [16] and VDSL2[17]). Moreover, DMT has been used at rates of 24 Gb/s inlocal area networks (LAN) [18]. Finally, very high performanceis also necessary in digital wideband receivers [19], [20].Currently, the RFFT has to be computed in real time over adataflow of 2 GSamples/s [20].

Thus, in order to meet the increasing demand on real-timecapabilities of new applications, much research has been car-ried out on pipelined architectures of the CFFT. On the otherhand, for those applications with real input signals, a dedicatedpipelined RFFT architecture can lead to savings in area andpower consumption, while offering high signal processingcapabilities. However, to the best of our knowledge no hard-ware-efficient pipelined architecture for the computation ofthe RFFT based on the Cooley–Tukey algorithm [21] has beenproposed yet. There exist only some pipelined architectures[22], [23] based on the Bruun algorithm [24]. However, theBruun algorithm has not been widely adopted since it wasdemonstrated [25] that the noise is significantly higher thanthat in the Cooley–Tukey algorithm [21].

The lack of specific pipelined architectures for the RFFT isdue to the fact that the specific algorithms proposed for thecomputation of the RFFT [11], [26]–[30], do not lead to reg-ular geometries, which are necessary for designing pipelined ar-chitectures. These specific approaches describe programs basedon removing the redundancies of the CFFT when the input isreal, and can be used to efficiently compute the RFFT in a dig-ital signal processor (DSP) or in in-place architectures [31]. Amemory-based or in-place architecture consists of a memoryand a processing unit. The data are loaded, processed and storedagain in the memory until all the operations of the algorithmsare performed. This kind of architecture allows the design ofcircuits with low area and power consumption, but it is not suit-able for real-time applications.

There exist other techniques described in [32], [33], whichtake advantage of the CFFT to calculate the RFFT. On onehand, the doubling algorithm uses the CFFT to compute twoRFFTs simultaneously. On the other hand, the packing algo-rithm forms a complex sequence of length taking the even

1549-8328/$26.00 © 2009 IEEE

Authorized licensed use limited to: K.S. Institute of Technology. Downloaded on January 30, 2010 at 23:09 from IEEE Xplore. Restrictions apply.

GARRIDO et al.: A PIPELINED FFT ARCHITECTURE FOR REAL-VALUED SIGNALS 2635

and odd indexed samples of the real input sequence of length ,and calculates the -point CFFT of the complex sequence. Inthese CFFT-based techniques, additional operations are neces-sary to obtain the final results from the outputs of the CFFT. Thepacking algorithm has been used to implement some in-place ar-chitectures [15], [34]. In these architectures the hardware of theCFFT can be reused for computing the post-processing stage.However, no pipelined architecture has been proposed, not onlybecause some of the adders and multipliers saved in the CFFTneed to be used for post-processing, but mainly because the sam-ples need to be reordered before the post-processing stage [15],increasing the memory and complicating the control.

Both in the specific algorithms for the computation of theRFFT and in the CFFT-based ones, the output samples are pro-vided in different orders [27], which are different from the bit-reversal one of the CFFT [35]. The sorting of the outputs of theRFFT is also a problem not solved in the literature so far.

In this work we propose the first pipelined architecture forthe computation of the RFFT based on the Cooley–Tukey al-gorithm. It combines the advantages of the pipelined architec-tures with the reduction of operations achieved by the specificalgorithms. This is possible due to the proposed algorithm forthe computation of the RFFT which solves the irregularities ofthe RFFT. Moreover, this approach is valid for both decimationin time (DIT) and decimation in frequency (DIF) decomposi-tions and is generalizable for any number of points , whichis power of 2. Furthermore, the problem of the output order ofthe samples is solved and a pipelined circuit that performs thereordering is proposed.

In the next section we briefly review the RFFT and theexisting techniques to compute it. In Section III we developthe algorithm that allows the design of regular hardware ar-chitectures for the computation of the RFFT, and the novelproposed pipelined architecture is presented Section IV. Next,in Section V the architecture is evaluated and compared toprevious approaches, and some conclusions are drawn inSection VI. Finally, the reordering of the output samples andthe proposed solution are discussed in Appendix A.

II. THE RFFT

The -point DFT of a sequence is defined as

(1)

where .The RFFT considers the input sequence, , to be a real

sequence, i.e., . It is easy to demonstrate [10] thatif is real, then the output is complex conjugate of

. Consequently, an RFFT can be considered a conventionalFFT with the additional conditions

(2)

(3)

These two conditions that distinguish the CFFT and the RFFTdo not lead, however, to a direct simplification of (1) for theRFFT, and the specific algorithms for the computation of theRFFT require complicated developments. Thus, other simpler

techniques that use the CFFT to compute the RFFT are some-times preferred. Next, these approaches are reviewed.

A. Computation of the RFFT Using the CFFT

1) Direct Use of the CFFT: The first idea when it is neces-sary to compute an FFT over a real input signal is to use theCFFT. As the real numbers are a subset of the complex ones,the trivial solution is to set the imaginary part of the input tozero. Although this procedure does not make an efficient use ofthe resources, it is very simple and it is not necessary to modifythe CFFT. Indeed, it is the solution adopted for many if not allreal-time applications.

2) Doubling Algorithm: Another alternative is to take ad-vantage of the CFFT to simultaneously compute the FFT of tworeal signals and as it is ex-plained in [32], [33]. This process requires to form the signal

and use an N-point CFFT to obtain. It is important to notice that both

and are complex, so they cannot be obtained di-rectly from . Therefore, additions are required toseparate the outputs, in addition to the operations of the CFFT.

3) Packing Algorithm: Given a real signalit is also possible to compute the RFFT using

an -point CFFT [32], [33]. This technique is sometimescalled packing algorithm because it takes the odd and evenindexed samples of the signal and form the complex signal

and. Then the -point CFFT is applied to obtain

. As in the doubling algorithm,additions are required to separate the outputs of

the CFFT. Moreover, in the packing algorithm it is necessaryto include an additional stage to compute the outputs of theRFFT, which requires extra additions andmultiplications.

B. Specific Algorithms for the Computation of the RFFT

The greater reduction in the number of operations is obtainedby using the specific algorithms for the computation of theRFFT. Most of them are obtained from the CFFT by applyingthe properties of the RFFT in order to remove the redundantoperations. The first proposed algorithms were defined for thedecimation in time (DIT) decomposition of the FFT. The DITFFT has the property that the samples at each intermediatestage can be computed using the conventional FFT [10]. Con-sequently, (3) can be applied at each stage, and only one halfof the intermediate outputs must be calculated, whereas the restcan be obtained by conjugating those intermediate values.

This is the basic idea of algorithms proposed for split-radix[11], [26], radix-2 [27], [30] and high radices [28]. These algo-rithms are, however, not valid for the decimation in frequency(DIF) decomposition of the FFT because it is not possible toapply the property (3) at each stage. On the other hand, it hasbeen also demonstrated that it is possible to obtain the same sav-ings for the DIF decomposition using an alternative algorithm[29] that makes use of linear-phase sequences.

In general, the number of multiplications in all of these algo-rithms is reduced to half of that required for the CFFT, and thenumber of additions is less than half the additions of the



CFFT. Likewise, only half of the memory is needed. Thus, thereare only slight differences among all of them in the number ofoperations and in the order in which the computations are per-formed.

III. PROPOSED ALGORITHM FOR THE COMPUTATION OF THE

RFFT

A. Basis of the Algorithm

Fig. 1 shows the flow graph of an -point FFT for the caseof , radix , and decomposed according to thedecimation in frequency (DIF) [10]. The graph is divided into

stages and each of them consists of a set of butter-flies and rotators. The numbers at the input and the output of thegraph represent respectively the index of the input and outputsamples, whereas each number, , in between the stages indi-cates a rotation by

(4)

If we consider that the inputs are complex, all the internalnodes and outputs of the graph are needed for the computationof the CFFT, and the regularity of the flow graph leads to ef-ficient pipelined architectures [2]. On the other hand, if the in-puts are real, it is possible to simplify the graph according to theproperties of the RFFT, as explained next.

1) The first simplification is to consider that in the real FFT. According to this, outputs

of the FFT are redundant and can be removed. Most ap-proaches [11], [27], [31] obtain either the frequencies withindexes orand, if necessary, calculate the rest of the frequencies byconjugating these results. However, considering that

for the indexes andusing the property (3)

(5)

leads to a more efficient architecture.Consequently, the set of frequencies can be ob-tained by conjugating frequencies , as mentionedin [26] for the split-radix RFFT. According to Fig. 1, sam-ples are the last quarter of the outputs; so all thebutterflies used exclusively to compute these samples maybe removed, which are represented by the lower darkenedregion.The same concept can be applied to samples ,which can be computed by conjugating samples .Generalizing this idea, samples can becomputed by conjugating the samples ,where , for all

. Thus, all the darkened regions inFig. 1 can be removed and only outputs of theFFT need to be computed.

2) The second simplification refers to the fact that. According to this, every piece of data is real until it is ro-

tated for the first time. In Fig. 1, all the additions performedbefore the data reach the boxed components are real and,

Fig. 1. Flow graph of a 16-point DIF FFT. The darkened regions and the boxedcomponents are considered in the simplifications of the new algorithm for theRFFT.

thus, the number of additions in that area is halved with re-spect to the CFFT. Once the data reach the first rotations,the data are necessarily complex until the end of the FFT.

3) One further simplification of the boxed components maybe carried out. Let’s assume that the first stage of the FFTis and the last one is , and arethe outputs of stage . According to this and Fig. 1, in thesecond stage it is not necessary to compute the data

, and samples arecalculated as

(6)

This calculation requires 2 rotations and 1 addition. How-ever, expanding the expression and taking into account thatboth and are real samples, wecan obtain

(7)

The resulting expression only requires one rotation and noadditions because and are real androtations by 1, and are trivial, taking into account thatthey can be calculated by interchanging the real and imaginaryparts and/or changing the sign of the data. This simplificationcan be extended to all the boxed components, which reduceseven more the number of rotators and adders of the FFT.



Fig. 2. Simplified flow graph of a 16-point FFT for real input samples, beforea regular structure is obtained.

Fig. 2 represents the flow graph of the RFFT once the ex-plained simplifications have been carried out. All the data pre-vious to the boxed components are real, and only real butterfliesare necessary for this region. On the other hand, according to(7) the boxed components do not require any operation but thetrivial rotation , since the additions that appear in the graphdo not have to be performed because one of the inputs is real andthe other one is purely imaginary. Finally, the complex rotationsappear after the boxed components and the butterflies need to becomplex. Consequently, it is easy to see from Fig. 2 that the totalnumber of real additions is or, consideringthat

(8)

Likewise, the number of nontrivial complex multiplicationscan be obtained from the flow graph as

(9)

With regard to the output samples, all the frequencies tocan be computed from the obtained data by conjugating

those frequencies with index greater than .The explained simplifications can also be applied to the DIT

FFT. In this case it is first necessary to redraw the typical flowgraph of the DIT FFT so that the inputs are in natural order andthe outputs in bit-reversal [36]. The result only differs from thatof Fig. 2 in the rotations performed at the different stages. Inthis case, the number of nontrivial complex multiplications canbe calculated as

(10)

Fig. 3. Proposed flow graph of a 16-point DIF RFFT. All edges are real andthe boxed numbers represent rotations of the RFFT according to (4). The signof the data is changed in edges where �� is depicted.

Fig. 4. Proposed flow graph of a 16-point DIT RFFT. All edges are real andthe boxed numbers represent rotations of the RFFT according to (4). The signof the data is changed in edges where �� is depicted.

Finally, as in the other specific designs of RFFT, the ob-tained structures are irregular because the symmetries of theCFFT have been used to reduce the number of operations. Al-though the flow graph in Fig. 2 could be used to implement anin-place architecture, it is not suitable for the implementation ofa pipelined architecture.

B. Obtaining Regularity

The last step of the algorithm consists of transforming theflow graph in Fig. 2 to obtain a flow graph with regular geom-etry. The obtained structure is depicted in Fig. 3 for the DIFFFT, whereas the result for the DIT decomposition is shown inFig. 4. Contrary to Figs. 1 and 2, all the edges in Figs. 3 and4 are real, i.e., the data have been separated into their real and



TABLE INUMBER OF OPERATIONS OF THE ALGORITHMS FOR THE COMPUTATION OF THE RFFT

imaginary components. Consequently, in the flow graph the ro-tations have two inputs and two outputs. They are representedby the boxes with a number inside, which indicate the rotationangles according to (4). The upper inputs and outputs are usedfor the real part of the samples and the lower ones for the imag-inary part. On the other hand, those trivial rotations by inthe boxed components of Fig. 2 only operate over a real sample,which is equivalent to multiply it by and input it in the imag-inary part of the rotator, as it is done in Figs. 3 and 4. Finally,the output samples of the flow graph include a letter to indicateif the value corresponds to the real part or the imaginarypart of the output. Moreover, the complex butterflies afterthe boxed components in Fig. 2 have been unfolded in the newstructure.

Contrary to the flow graph in Fig. 2, in the structures of Figs. 3and 4 every stage has inputs and outputs and all of them arereal samples. Moreover, at every stage butterflies operate oversamples whose indexes differ the same quantity, and nontrivialrotations are placed in particular positions. These regularitiesallow the development of efficient pipelined hardware architec-tures, as explained in Section IV. On the other hand, it can be no-ticed that, as in the other specific algorithms of the RFFT and theCFFT-based ones, the outputs are not available in bit-reversedorder. However, it is possible to sort them out as explained inAppendix A.

C. Comparison to Other Algorithms for the Computation ofthe RFFT

Table I compares the number of operations of the differentapproaches for the computation of the RFFT and, consequently,their efficiency when they are implemented in DSPs. Thenumber of real multiplications has been calculated accordingto the criterion used in previous approaches [11]: Rotations by45 , 135 , 225 and 315 require 2 real multiplications andthe rest of nontrivial rotations are calculated with three realmultiplications. On the other hand, the additions in Table Iinclude those of the butterflies and those required for thepost-processing in the CFFT-based algorithms.

As it is shown, the number of operations in all the approacheshas the same order of magnitude. Among them, the specific al-gorithms based on split-radix require less operations than thosebased on radix-2. The same is true for CFFT-based algorithms.

On the other hand, for the same radix, the specific algorithms arebetter than the packing algorithm both in the number of multi-plications and additions. Likewise, they are also better than thedoubling algorithm. This is not only because for the doublingalgorithm two signals need to be computed in parallel, but alsobecause it needs twice the number of multiplications and morethan twice the number of additions.

Comparing the specific algorithms for the computation of theRFFT, the proposed approach is the only one that can be usedfor both the DIT and DIF decompositions of the FFT. Moreover,the DIF version of the algorithm has the lowest number of oper-ations reported for radix-2. The reason why the DIF version hasless operations than the DIT one is due to the fact that the sim-plification of (7) can only be applied to the DIF decomposition.

However, the most relevant advantage of the proposed algo-rithm over the previous approaches is that it offers a regularstructure suitable for the implementation of hardware architec-tures. It may be noted that a regular structure for RFFT withfewest number of operations has not been presented so far.

IV. PIPELINED ARCHITECTURE OF THE RFFT

Fig. 5 shows the proposed pipelined architecture for a16-point DIF RFFT, obtained from the graph of Fig. 3. It is aradix-2 feedforward pipelined structure that maximizes the useof the multipliers, as well as achieves a throughput of 4 samplesper clock cycle. As in the flow graph, all the edges carry realsamples. Therefore, the radix-2 butterflies process tworeal inputs and, thus, they only consist of a real adder and a realsubtractor. Likewise, the rotators have two inputs and twooutputs, as in the flow graph. The upper input receives the realpart of the sample to be rotated and the lower one the imaginarypart. Finally, the rest of circuits are shuffling structures usedfor reordering the samples according to the dataflow. They arecomposed of buffers and multiplexors. In a general case ofan -point RFFT, with power of two, the circuit requires

real butterflies, rotators, andbuffers or memories of a total size .

Considering both Figs. 3 and 5, the first stage computes onlyreal butterflies. According to the input order of the data, theupper butterfly of the structure computes the pairs of samples(0, 8), (1, 9), (2, 10) and (3, 11), whereas the lower butterflyoperates samples (4, 12), (5, 13), (6, 14) and (7, 15).



Fig. 5. Proposed pipelined architecture for the computation of the 16-point DIF RFFT. The architecture is composed of radix-2 real butterflies ��, rotators��, and shuffling structures, which include buffers and multiplexors.

On the other hand, both in the architecture and in the flowgraph, the upper outputs of the butterflies of the first stage areconnected to a butterfly of the second stage, whereas the loweroutputs are rotated in the second one. Previous to the rotator, itis necessary to compute a trivial rotation of . These rotationsin the flow graph can be embedded either in the rotator or in thebutterfly of the same stage.

At the second stage, the butterflies operate over each pair ofsamples and , and the rotations are calculatedover and , for . Conse-quently, the indexes of two samples operated together alwaysdiffer by and the architecture only requires a butterfly inthis stage. On the other hand, at the third stage, the indexes oftwo samples operated together differ by . Those data in theflow graph for which a butterfly does not have to be calculated,pass through the corresponding butterfly in the architecture.

The shuffling structure depicted in Fig. 6, which correspondsto the second stage of the architecture of Fig. 5, shows how theorder of the data required at the second stage of the 16-pointRFFT is transformed to the order of the third stage. The structureconsists of buffers and multiplexors. In the Figure, the length ofthe buffers is , and the numbers of the inputs and the out-puts represent the index of each piece of data. Thus, the struc-ture receives in parallel samples from the second stage, whoseindexes differ by , and provides in parallel samplesadapted to the third stage, where the indexes differ by .

The shuffling is performed in two steps. The first one inter-changes the intermediate inputs and the second step interleavesthe data. Considering the upper circuit of the second step, in-dexed samples (0, 1, 2, 3) and (8, 9, 10, 11) are received re-spectively at each of the inputs. Initially the control of the mul-tiplexors is set to “0” and, thus, samples with indexes (0, 1) arestored in the output buffer and (8, 9) in the input one. Next, thecontrol signal switches to “1” and indexed samples (0, 1) areprovided at the output in parallel with (2, 3), whereas (8, 9) passto the output buffer and (10, 11) are stored in the input one.These samples will be provided in parallel at the output whenthe multiplexor switches again to “0”.

In a general case of an -point RFFT, the shuffling structureat a stage requires buffers of length ,and, as can be observed in Fig. 5, for the first stage of the RFFT,

, only the first step of the shuffling structure is required.

Fig. 6. Shuffling structure of the pipelined RFFT. This examples shows theshuffling of the second stage of a 16-point RFFT.

On the other hand, the third stage of the architecture in Fig. 5includes a switch before the rotator. This switch is necessary inevery stage . As it is shown in Fig. 3 the rotationsrequired by the boxed components of Fig. 1 operate over sam-ples whose indexes differ the same quantity as the butterflies atthe same stage. For these rotations, the switch does not swap theinputs. However, for the rest of rotations the switch is activatedand the samples that come from the lower output of the upperbutterfly are routed through the rotator.

Finally, Fig. 7 represents the hardware architecture for theDIT RFFT. It only differs from the DIF structure in Fig. 5 onthe placement of the rotator, due to the fact that the positions ofthe rotations in the flow graphs of Figs. 3 and 4 are different.

V. COMPARISON AND ANALYSIS

Table II compares the proposed structure to other efficientpipelined architectures for the case of computing an -pointRFFT. The proposed design is the only specific approach forthe computation of the RFFT and, thus, it takes advantage ofthe reduced number of operations required by the RFFT withrespect to the CFFT. The other approaches are not specific forthe RFFT and can be used to calculate the CFFT.

The table shows the trade off between area and performance.The area is measured in terms of the number of complex rota-tors, adders and memory addresses, whereas the performance



Fig. 7. Proposed pipelined architecture for the computation of the 16-point DIT RFFT. The architecture is composed of radix-2 real butterflies ��, rotators��, and shuffling structures, which include buffers and multiplexors.

TABLE IICOMPARISON OF PIPELINED HARDWARE ARCHITECTURES FOR THE COMPUTATION OF AN N-POINT RFFT

is represented by the throughput and the latency. In all cases,the throughput is that for which each architecture has been de-signed, and the number of components and the latency are mea-sured from the butterflies of the first stage to those of the laststage, i.e., circuits for reordering the input and output sampleshave not been considered. This criterion shows the lowest areaand the highest performance that each architecture can obtain.As it can be observed in the table, feedback (FB) architecturesand some feedforward (FF) ones [5], [7] require less hardwarecomponents and achieve a throughput of one sample per clockcycle. On the other hand, the parallelization of feedforward ar-chitectures has the advantage of a higher throughput and a lowerlatency due to an increase in area.

The proposed architecture is in the group of the feedforwardones. It uses radix-2 but can process 4 samples in parallel,achieving a higher performance than feedback designs, both interms of latency and throughput. On the other hand, comparedto other radix-2 feedforward architectures, the proposed designdoubles the throughput and halves the latency and the datamemory, while keeping the same number of rotators and adders.Therefore, although the new approach uses radix-2, it achievesthe same throughput as other radix-4 feedforward architectures.Compared to these high-throughput architectures, the proposedone obtains a significant reduction in the number of rotators,adders and memory addresses.

The reduction in memory is an important advantage of theproposed architecture. It only needs a total of real memory

addresses for the samples or, equivalently, complex ones.This involves a great reduction in area, taking into accountthat, except in case of computing a low number of samples, thememory takes up most of the area of the circuit. Moreover, thememory used to store the coefficients for the rotations has beenreduced to . Thus, the low area makes the architecturevery suitable for the implementation of long-length transforms.

Finally, it is interesting to analyze the case that input samplesarrive in natural order. Under these circumstances feedforwardarchitectures require input memories for reordering the sam-ples, whereas feedback structures do not need any additionalhardware. For the proposed design it can be used a memory of

real samples or, equivalently, complex samples. As thethroughput of the architecture is 4 samples per clock cycle, thememory will be filled in cycles, leading to a total latency ofthe circuit of cycles. Consequently, using the same memoryas a feedback architecture, the proposed design can process 4times more samples in half the time.

As a conclusion, the high performance and the low area of thenew architecture make it very attractive for the computation ofthe RFFT in real-time applications.

VI. CONCLUSION

The previous approaches on the RFFT had demonstrated thatit requires half the operations of the CFFT. This paper showsthat this reduction of operations is not only theoretical but alsocan be applied to the design of efficient hardware architectures



Fig. 8. Problem of the reordering of the output samples of the RFFT.

for the computation of the RFFT. This is possible due to thenovel algorithm proposed in this paper for the computation ofthe RFFT, which obtains a regular geometry similar to that of theCFFT, and is valid for both the DIF and DIT decompositions ofthe RFFT. Based on this algorithm, a new pipelined architecturefor the computation of the RFFT is presented. It processes 4samples in parallel and requires significantly less memory thanother efficient pipelined structures. The low area and latency,and the high throughput of the circuit make it attractive for thecomputation of the FFT in any application that processes realsamples. Moreover, the circuit is very efficient for long-lengthtransforms due to the reduction in memory, and it is also verysuitable for real-time applications because of its high throughputcapabilities.

APPENDIX AREORDERING OF THE OUTPUT SAMPLES

The scrambled order of the output samples is an inherentproblem of the FFT. In the CFFT the outputs are obtained inthe so called bit-reversal order [35]. In in-place architectures itis possible to reuse the memory to sort out the output samples,which increases the latency and reduces the throughput of thesystem.

On the other hand, in pipelined architectures an extra memoryof addresses is necessary to perform the bit-reversal, whichincreases the area and the latency. Samples are stored in thememory in natural order using a counter for the addresses andthen they are read in bit-reversal by reversing the bits of thecounter. Moreover, it is possible to use this strategy for calcu-lating the bit-reversal of a series of FFTs. In this case, the firstFFT is stored in natural order. Next, the count is in bit-reversaland, thus, the frequencies are provided in the correct order, whilethe outputs of the second FFT are stored in the memory in bit-re-versal order. Since the bit-reversal is an inversion operation, i.e.,

, the outputs of the second FFT must be readin natural order to sort out the frequencies. At the same time, thethird FFT is stored in natural order, which completes the cycle.

Nevertheless, for reordering the outputs of the RFFT a morecomplicated algorithm is required. Next the algorithm is ex-plained to be performed in-place and then it is shown that it is

Fig. 9. Solution to the reordering of the output samples of the RFFT.

possible to perform it in pipeline and the corresponding circuitis presented.

Fig. 8 shows how to sort out the outputs of the RFFT to obtainthem in bit-reversal order for the case of a 32-point RFFT, whereonly frequencies 0 to 16 are necessary.

The first column (index) represents the order of arrival of theoutput samples. The second column (output frequencies) indi-cates the indexes of the output frequencies of the RFFT. Fre-quencies and are both real and they aregrouped together at the output of the RFFT. The goal is to ob-tain the frequencies in the order of the first column provided theorder of the second column.

First of all, the data at frequencies greater than mustbe conjugated to obtain all the frequencies in the interval

, which appear in the third column (output order).Next, the output order given in the second column must be

converted into bit-reversal order. To achieve this goal, it is firstnecessary to realize that the output order and the samples in bit-reversal have certain similarities: the order of the first four sam-ples (indexes ) is the same, the following four

can be obtained by shuffling the samples in the sameposition, and the last half of the samples canalso be obtained by shuffling. It can be generalized for highernumber of points: the bit-reversal order of all samples in posi-tions can be ob-tained by shuffling the samples in those positions according tothe output order.

Given any of those sets of samples, the sorting can be per-formed in two stages. First, it is necessary to notice that the firstand the second half of the samples in the sets are interleaved.According to this, the intermediate order shown in the fourthcolumn of Fig. 8 is obtained by deinterleaving these groups ofdata. Next, the bit-reversal order is obtained by reversing theorder of the last half of the samples of each set.

Fig. 9 shows the deinterleaving and reversing proceduresfor the case under study. Assuming that arethe bits of the index , the deinterleaving moves a samplein position to po-sition , where

. This is performed by successively inter-changing the bits from to . Note that



Fig. 10. Basic circuit for the reordering of the output data.

Fig. 11. Structure for the reordering of the output data of a 32-point RFFT.

in this figure for and for , soeach group of samples is treated differently.

On the other hand, the reversing moves the data in posi-tions to position

. The reversingcan be performed by negating the position bit by bit and inter-changing the corresponding data.

This procedure can be easily performed in place, but it is alsopossible to implement a pipelined hardware structure for the re-ordering of the samples. It is necessary to realize that every stageof the deinterleaving and the reversing interchanges samples inpositions and , where is a constant. The simplecircuit in Fig. 10 performs this exchange. If the multiplexer isset to “1” the samples will be provided at the output in the sameorder as in the input, whereas setting it to “0” the input samplein position is forwarded to position , while the data in po-sition is fed back to the buffer and will appear at the outputin position .

Joining the stages of the deinterleaving and the reversing,the circuit in Fig. 11 performs the reorder of the outputs to ob-tain them in bit-reversed order. As it can be seen, it only usessix complex registers for a 32-point RFFT. In a general case, a

-point RFFT requires buffers or a memory ofcomplex data to obtain the outputs in bit-reversal, and a memoryof complex data to obtain them in natural order.

REFERENCES

[1] L. Yang, K. Zhang, H. Liu, J. Huang, and S. Huang, “An efficientlocally pipelined FFT processor,” IEEE Trans. Circuits Syst. II, Exp.Briefs, vol. 53, no. 7, pp. 585–589, Jul. 2006.

[2] S. He and M. Torkelson, “Design and implementation of a 1024-pointpipeline FFT processor,” in Proc. Custom Integr. Circuits Conf., SantaClara, CA, May 1998, pp. 131–134.

[3] M. A. Sánchez, M. Garrido, M. L. López, and J. Grajal, “Implementingthe FFT algorithm on FPGA platforms: A comparative study of parallelarchitectures,” presented at the XIX Int. Conf. Design Circuits Integr.Syst. (DCSI 2004), Bourdeaux, France, Nov. 2004.

[4] W.-C. Yeh and C.-W. Jen, “High-speed and low-power split-radixFFT,” IEEE Trans. Signal Process., vol. 51, no. 3, pp. 864–874, Mar.2003.

[5] Y.-N. Chang, “An efficient VLSI architecture for normal I/O orderpipeline FFT design,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol.55, no. 12, pp. 1234–1238, Dec. 2008.

[6] C. Cheng and K. K. Parhi, “High-throughput VLSI architecture for FFTcomputation,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 10,pp. 863–867, Oct. 2007.

[7] G. Bi and E. V. Jones, “A pipelined FFT processor for world-sequentialdata,” IEEE Tran. Acoust., Speech Signal Process., vol. 37, no. 12, pp.1982–1985, Dec. 1989.

[8] E. E. Swartzlander, W. K. W. Young, and S. J. Joseph, “A radix 4delay commutator for fast Fourier transform processor implementa-tion,” IEEE J. Solid-State Circuits, vol. 19, no. 5, pp. 702–709, Oct.1984.

[9] Y.-W. Lin and C.-Y. Lee, “Design of an FFT/IFFT processor forMIMO OFDM systems,” IEEE Trans. Circuits Syst. I, Reg. Papers,vol. 54, no. 4, pp. 807–815, Apr. 2007.

[10] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Pro-cessing. Englewood Cliffs, NJ: Prentice Hall, 1989.

[11] H. V. Sorensen, D. L. Jones, M. T. Heideman, and C. S. Burrus, “Real-valued fast Fourier transform algorithms,” IEEE Trans. Acoust., SpeechSignal Process., vol. 35, no. 6, pp. 849–863, Jun. 1987.

[12] M.-H. Cheng, L.-C. Chen, Y.-C. Hung, and C. M. Yang, “A real-timemaximum-likelihood heart-rate estimator for wearable textile sensors,”in Proc. IEEE 30th Ann. Int. Conf. EMBS, Aug. 2008, pp. 254–257.

[13] R. F. Yazicioglu, P. Merken, R. Puers, and C. Van Hoof, “Low-powerlow-noise 8-channel EEG front-end ASIC for ambulatory acquisitionsystems,” in Proc. European Solid-State Circuits Conf., 2006, pp.247–250.

[14] O. W. Ibraheem and N. N. Khamiss, “Design and simulation of asym-metric digital subscriber line (ADSL) modem,” Proc. ICTTA, pp. 1–6,Apr. 2008.

[15] H.-F. Chi and Z.-H. Lai, “A cost-effective memory-based real-valuedFFT and Hermitian symmetric IFFT processor for DMT-based wire-line transmission systems,” in Proc. IEEE Int. Symp. Circuits Syst.,May 2005, vol. 6, pp. 6006–6009.

[16] Asymmetric Digital Subscriber Line (ADSL) Transceivers ExtendedBandwidth ADSL2 (ADSL2plus), ITU-T Rec. G.992.5, 2005.

[17] Very High Speed Digital Subscriber Line Transceivers 2 (VDSL2),ITU-T Rec. G.993.2, 2006.

[18] S. C. J. Lee, F. Breyer, S. Randel, M. Schuster, J. Zeng, F. Huijskens,H. P. A. van den Boom, A. M. J. Koonen, and N. Hanik, “24-Gb/stransmission over 730 m of multimode fiber by direct modulation of an850-nm VCSEL using discrete multi-tone modulation,,” in Proc. Opt.Fiber Commun. Conf, 2007, Paper PDP 6.

[19] G. Lopez-Risueno, J. Grajal, and A. Sanz-Osorio, “Digital channel-ized receiver based on time-frequency analysis for signal interception,”IEEE Trans. Aerosp. Electron. Syst., vol. 41, no. 3, pp. 879–898, Jul.2005.

[20] Y.-H. G. Lee and C.-I. H. Chen, “Dual thresholding for digital wide-band receivers with variable truncation scheme,” in Proc. IEEE ISCAS,May 2009, pp. 920–923.

[21] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calcula-tion of complex Fourier series,” Math. Comput., vol. 19, pp. 297–301,1965.

[22] R. Storn, “A novel radix-2 pipeline architecture for the computation ofthe DFT,” in Proc. IEEE ISCAS, Jun. 1988, vol. 2, pp. 1899–1902.

[23] Y. Wu, “New FFT structures based on the Bruun algorithm,” IEEETrans. Acoust., Speech Signal Process., vol. 38, no. 1, pp. 188–191,Jan. 1990.

[24] G. Bruun, “z-Transform DFT filters and FFT’s,” IEEE Trans. Acoust.,Speech Signal Process., vol. 26, no. 1, pp. 56–63, Feb. 1978.

[25] R. Storn, “Some results in fixed point error analysis of the Bruun-FTTalgorithm,” IEEE Trans. Signal Process., vol. 41, no. 7, pp. 2371–2375,Jul. 1993.

[26] P. Duhamel, “Implementation of split-radix FFT algorithms forcomplex, real, and real-symmetric data,” IEEE Trans. Acoust., SpeechSignal Process., vol. 34, no. 2, pp. 285–295, Apr. 1986.

[27] G. D. Bergland, “A fast Fourier transform algorithm for real-valuedseries,” Commun. ACM, vol. 11, no. 10, pp. 703–710, Oct. 1968.

[28] G. D. Bergland, “A radix-eight fast Fourier transform subroutine forreal-valued series,” IEEE Trans. Audio Electroacoust., vol. AU-17, no.2, pp. 138–144, Jun. 1969.

[29] B. R. Sekhar and K. M. M. Prabhu, “Radix-2 decimation-in-frequencyalgorithm for the computation of the real-valued FFT,” IEEE Trans.Signal Process., vol. 47, no. 4, pp. 1181–1184, Apr. 1999.

[30] D. Sundararajan, M. O. Ahmad, and M. N. S. Swamy, “Some resultsin fixed point error analysis of the Bruun-FTT algorithm,” IEEE Trans.Signal Process., vol. 41, no. 7, pp. 2371–2375, Jul. 1997.

[31] J. A. Hidalgo-López, J. C. Tejero, J. Fernández, E. Herruzo, and A.Gago, “New architecture for RFFT calculation,” Electron. Lett., vol.30, no. 22, pp. 1836–1838, Oct. 1994.

[32] E. O. Brigham, The Fast Fourier Transform and Its Applications.Englewood Cliffs, NJ: Prentice-Hall, 1988.



[33] W. W. Smith and J. M. Smith, Handbook of Real-Time Fast FourierTransforms. : Wiley-IEEE Press, 1995.

[34] A. Wang and A. P. Chandrakasan, “Energy-aware architectures for areal-valued FFT implementation,” in Proc. Int. Symp. Low Power Elec-tron. Des., Aug. 2003, pp. 360–365.

[35] L. R. Rabiner and B. Gold, Theory and Application of Digital SignalProcessing. Englewood Cliffs, NJ: Prentice-Hall, 1975.

[36] M. G. Gálvez, “Desarrollo de algoritmos para detección de señales entiempo real mediante FPGAs,” M.S. thesis, ETSI de Telecomunicación,UPM, Madrid, Spain, Dec. 2004.

Mario Garrido received the M.S. degree in electricalengineering and the Ph.D. degree from the TechnicalUniversity of Madrid (UPM), Madrid, Spain, in 2004and 2009, respectively.

His research interests include the design and opti-mization of VLSI architectures for signal processingapplications.

Keshab K. Parhi (S’85–M’88–SM’91–F’96)received the B.Tech. degree from the Indian Instituteof Technology, Kharagpur, the M.S.E.E. degree fromthe University of Pennsylvania, Philadelphia, andthe Ph.D. degree from the University of California atBerkeley, in 1982, 1984, and 1988, respectively.

He has been with the University of Minnesota,Minneapolis, since 1988, where he is currentlyDistinguished McKnight University Professor in theDepartment of Electrical and Computer Engineering.His research addresses VLSI architecture design

and implementation of physical layer aspects of broadband communications

systems, error control coders and cryptography architectures, high-speed trans-ceivers, and ultra wideband systems. He is also currently working on intelligentclassification of biomedical signals and images, for applications such as seizureprediction, lung sound analysis, and diabetic retinopathy screening. He haspublished over 450 papers, has authored the text book VLSI Digital SignalProcessing Systems (Wiley, 1999) and coedited the reference book DigitalSignal Processing for Multimedia Systems (Marcel Dekker, 1999).

Dr. Parhi is the recipient of numerous awards including the 2004 F.E. Termanaward by the American Society of Engineering Education, the 2003 IEEE KiyoTomiyasu Technical Field Award, the 2001 IEEE W.R.G. Baker prize paperaward, and a Golden Jubilee award from the IEEE Circuits and Systems Societyin 1999. He has served on the editorial boards of the IEEE TRANSACTIONS ON

CIRCUITS AND SYSTEMS—I: REGULAR PAPERS,TRANSACTIONS ON CIRCUITS

AND SYSTEMS—II: EXPRESS BRIEFS, TRANSACTIONS ON VERY LARGE SCALE

INTEGRATED (VLSI) SYSTEMS, Signal Processing, Signal Processing Letters,and Signal Processing Magazine, and served as the Editor-in-Chief of the IEEETRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS (2004–2005term), and currently serves on the Editorial Board of the Journal of VLSI SignalProcessing. He has served as technical program cochair of the 1995 IEEEVLSI Signal Processing workshop and the 1996 ASAP conference, and as thegeneral chair of the 2002 IEEE Workshop on Signal Processing Systems. Hewas a distinguished lecturer for the IEEE Circuits and Systems society during1996–1998. He was an elected member of the Board of Governors of the IEEECircuits and Systems society from 2005 to 2007.

J. Grajal was born in Toral de los Guzmanes (León),Spain, in 1967. He received the M.S. and Ph.D.degrees in electrical engineering degrees from theTechnical University of Madrid (UPM), Madrid,Spain, in 1992 and 1998, respectively.

Since 2001 he has been an Associate Professorat the Department of Signals, Systems, and Ra-diocommunications of the Technical School ofTelecommunication Engineering, UPM. His re-search activities are in the area of hardware-designfor radar systems, radar signal processing and

broadband digital receivers for radar and spectrum surveillance applications.


A Pipelined FFT Architecture for Real-Valued Signals

Documents

Transcript of A Pipelined FFT Architecture for Real-Valued Signals