Approach to the hardware implementation of digital signal processors using Mersenne number transforms

W.C. Siu, AP(HK), M.Phil., C.Eng., M.I.E.R.E., Mem.I.E.E.E., and A.G. Constantinides, B.Sc.(Eng.), Ph.D., C.Eng., M.I.E.E.

Indexing terms: Signal processing, Mersenne number transforms

Abstract: In this paper Mersenne number transforms are converted into cyclic convolutions, in which form they are amenable to simple hardware interpretation. Realisation structures are proposed that can make the computation of Mersenne number transforms very fast. This new approach can be extended to the implementation of other number theoretic transforms, in particular to Fermat number transforms, and is also applicable to the fast implementation of discrete Fourier transforms.

1 Introduction

The number theoretic transform (NTT) is defined on finite fields or rings of integers [1-4]. As such, the transform involves no complex-number multiplications and has properties similar to those of the discrete Fourier transform. Thus it can be used as a method for fast and error-free calculation of cyclic convolutions, applicable for example to the implementation of digital signal processors.

Of the various versions of the NTT proposed, the Fermat number transform (FNT), which has been thoroughly investigated by Agarwal and Burrus [3], and the Mersenne number transform (MNT), which was introduced by Rader [2], are the most promising cases. Since the transform length N of the Fermat number transform is a power of two, a fast radix-two implementation similar to the fast Fourier transform (FFT) [5-7] is possible. For the computation of the FNT, arithmetic operations must be carried out modulo $(2^b + 1)$. Usually $\alpha$, the primitive root of unity of order N, is taken as two or a power of two. However, multiplication of a number in a register by $\alpha^n$ involves not only a left shift of the register contents by n bits, but also the subtraction of the n overflow bits. In order to simplify the computation, Agarwal and Burrus [3] made their implementation of the FNT using a b-bit representation of integers, which can introduce some errors into the computation. McClellan [8] defined a new code representation for the integers modulo a Fermat number $F_t$, using normal binary weighting with digits ±1. McClellan's approach was simplified by Leibowitz [9], who proposed the diminished-one representation for the number system used in the FNT. Both of these methods effectively convert the arithmetic to be similar to, and of the same complexity as, ordinary 1's complement arithmetic. Although the code conversion between the binary representation of data and the new code representation is relatively simple, extra steps or extra hardware are still necessary to carry out the conversion.

Since the transform lengths of the Mersenne number transforms are not highly composite, there is no FFT-type algorithm for the computation, and for this reason little attention has been given to them. Mersenne number transforms have the advantage of very efficient arithmetic (which could be in one's-complement form). For example, the required multiplications by powers of two are performed by simple bit rotations. One possible method, introduced by Reed and Truong [10-11], of retaining arithmetic modulo a Mersenne number while utilising an FFT-type algorithm, is to define the number theoretic transform in a Galois field. Unfortunately, not all the roots of unity of these complex MNTs are simple, and some general multiplications [10-12] are required for the computation of the transform. Nussbaumer [13] suggests an alternative solution by introducing the pseudo Mersenne transform. Two-step modulo arithmetic (on two moduli), in addition to a larger word length, has to be used to carry out the computation. Furthermore the resultant transform length is not always highly composite. Recently, Thomas, Larsom and Keller [14] have defined a generalised number theoretic transform which allows for independent transform length and modulus base. However, this method sacrifices the very important circular-convolution property of the transform.

Paper 2915E, received 11th May 1983. The authors are with the Department of Electrical Engineering, Imperial College of Science & Technology, London SW7 2BT, England. Mr. Siu is on leave from the Department of Electronic Engineering, Hong Kong Polytechnic, Hung Hom, Hong Kong.

Previously, the main concern has been to look for highly composite transform lengths, and most of the implementations have relied heavily on FFT-type algorithms [8-17]. These FFT-type algorithms, no doubt, significantly reduce the number of operations in the computation of the number theoretic transform. However, due to the recent advances in the fabrication technology of electronic devices, the arithmetic units are fast enough for most implementations. The control section is probably the largest and the most expensive section. If FFT-type algorithms are used for the implementation of the number theoretic transform, the control of the process flow is unavoidably more complicated. As shown in the timing diagram of Reference 8, at least half of the total time spent on the FNT butterfly is used for 'read' and 'write' operations on the data.

In this paper, we convert the Mersenne number transform into the cyclic-convolution form and propose simple hardware structures to carry out its fast implementation. This new approach can be extended to other number theoretic transforms, for instance to Fermat number transforms, and is also very suitable for the fast computation of discrete Fourier transforms.

IEE PROCEEDINGS, Vol. 131, Pt. E, No. 1, JANUARY 1984

Authorized licensed use limited to: Hong Kong Polytechnic University. Downloaded on May 19, 2009 at 22:30 from IEEE Xplore. Restrictions apply.


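2 Mersenne number transform and cyclic convolution

The Mersenne number theoretic transform and its inverse are defined as

$$X(k) = \left\langle \sum_{n=0}^{P-1} x(n)\,\alpha^{nk} \right\rangle_M, \qquad k = 0, 1, \ldots, P-1 \tag{1}$$

$$x(n) = \left\langle P^{-1} \sum_{k=0}^{P-1} X(k)\,\alpha^{-nk} \right\rangle_M, \qquad n = 0, 1, \ldots, P-1 \tag{2}$$

where
$M = 2^P - 1$, a Mersenne number
$\alpha$ = root of unity of order P, chosen to be 2 for our discussion
P = prime number

The expression $\langle C \rangle_M$ is the residue of the number C modulo M. Eqns. 1 and 2 can be defined for both prime and composite Mersenne numbers.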
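As a minimal sketch of eqns. 1 and 2, the transform pair can be written directly from the definitions. The function names `mnt` and `imnt` are illustrative and not taken from the paper, and the modular inverse via `pow(a, -1, M)` assumes Python 3.8 or later.

```python
def mnt(x, P, alpha=2):
    """Eqn. 1: X(k) = sum_n x(n) * alpha^(n*k), reduced modulo M = 2^P - 1."""
    M = (1 << P) - 1
    return [sum(x[n] * pow(alpha, n * k, M) for n in range(P)) % M
            for k in range(P)]


def imnt(X, P, alpha=2):
    """Eqn. 2: x(n) = P^-1 * sum_k X(k) * alpha^(-n*k), reduced modulo M."""
    M = (1 << P) - 1
    P_inv = pow(P, -1, M)          # inverse of the transform length modulo M
    alpha_inv = pow(alpha, -1, M)  # alpha^-1 modulo M
    return [P_inv * sum(X[k] * pow(alpha_inv, n * k, M) for k in range(P)) % M
            for n in range(P)]


x = [1, 2, 3, 4, 5]                # P = 5, M = 31
X = mnt(x, 5)
assert imnt(X, 5) == x             # integer arithmetic, so the round trip is exact
```

Because every operation is an integer operation modulo M, the inverse recovers the input exactly, which is the error-free property mentioned in the Introduction.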

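Since the transform length P is a prime number, it is possible to find a primitive root g to generate all nonzero elements inside the field modulo P. Using the mappings $k \to \langle g^{k} \rangle_P$ and $n \to \langle g^{-n} \rangle_P$, eqn. 1 can be reordered as

$$X(0) = \left\langle \sum_{n=0}^{P-1} x(n) \right\rangle_M \tag{3}$$

and

$$X(\langle g^{k} \rangle_P) = \left\langle x(0) + \sum_{n=1}^{P-1} x(\langle g^{-n} \rangle_P)\,\alpha^{\langle g^{k-n} \rangle_P} \right\rangle_M \tag{4}$$

for k = 1, 2, ..., P − 1. We can write eqn. 4 as

$$X(\langle g^{k} \rangle_P) = \left\langle x(0) + X'(\langle g^{k} \rangle_P) \right\rangle_M \tag{5}$$

where

$$X'(\langle g^{k} \rangle_P) = \left\langle \sum_{n=1}^{P-1} x(\langle g^{-n} \rangle_P)\,\alpha^{\langle g^{k-n} \rangle_P} \right\rangle_M \tag{6}$$

for k = 1, 2, ..., P − 1. Eqn. 6 represents a backward circular convolution of length (P − 1). That is the convolution

$$\{x(\langle g^{-n} \rangle_P)\} \circledast \{\alpha^{\langle g^{n} \rangle_P}\} \tag{7}$$

where $\circledast$ means circular convolution, and the indices and subscripts are modulo P. In terms of matrix notation, we have

$$\begin{bmatrix} X'(\langle g^{1} \rangle_P) \\ X'(\langle g^{2} \rangle_P) \\ \vdots \\ X'(\langle g^{P-1} \rangle_P) \end{bmatrix} = \begin{bmatrix} \alpha^{\langle g^{0} \rangle_P} & \alpha^{\langle g^{-1} \rangle_P} & \cdots & \alpha^{\langle g^{-P+2} \rangle_P} \\ \alpha^{\langle g^{1} \rangle_P} & \alpha^{\langle g^{0} \rangle_P} & \cdots & \alpha^{\langle g^{-P+3} \rangle_P} \\ \vdots & \vdots & & \vdots \\ \alpha^{\langle g^{P-2} \rangle_P} & \alpha^{\langle g^{P-3} \rangle_P} & \cdots & \alpha^{\langle g^{0} \rangle_P} \end{bmatrix} \begin{bmatrix} x(\langle g^{-1} \rangle_P) \\ x(\langle g^{-2} \rangle_P) \\ \vdots \\ x(\langle g^{-P+1} \rangle_P) \end{bmatrix} \tag{8}$$

By making the substitution of −t = k − n, eqn. 6 can be written as

$$X'(\langle g^{k} \rangle_P) = \left\langle \sum_{t=1}^{P-1} x(\langle g^{-(k+t)} \rangle_P)\,\alpha^{\langle g^{-t} \rangle_P} \right\rangle_M \tag{9}$$

which becomes, for MNT coefficients,

$$\begin{bmatrix} X'(\langle g^{1} \rangle_P) \\ X'(\langle g^{2} \rangle_P) \\ \vdots \\ X'(\langle g^{P-1} \rangle_P) \end{bmatrix} = \begin{bmatrix} x(\langle g^{-2} \rangle_P) & x(\langle g^{-3} \rangle_P) & \cdots & x(\langle g^{-P} \rangle_P) \\ x(\langle g^{-3} \rangle_P) & x(\langle g^{-4} \rangle_P) & \cdots & x(\langle g^{-P-1} \rangle_P) \\ \vdots & \vdots & & \vdots \\ x(\langle g^{-P} \rangle_P) & x(\langle g^{-P-1} \rangle_P) & \cdots & x(\langle g^{-2P+2} \rangle_P) \end{bmatrix} \begin{bmatrix} \alpha^{\langle g^{-1} \rangle_P} \\ \alpha^{\langle g^{-2} \rangle_P} \\ \vdots \\ \alpha^{\langle g^{-P+1} \rangle_P} \end{bmatrix} \tag{10}$$

Note that in this form the square matrix is circulant.

A simple example will illustrate the above steps. Consider the sequence {x(0), x(1), x(2), x(3), x(4)} where P = 5. In this case two is a primitive root which can be used to generate the nonzero elements inside the field modulo five. Hence the sequence in eqn. 7 can be written as

$$\{x(\langle g^{-n} \rangle_P): n = 1, 2, 3, 4\} = \{x(3), x(4), x(2), x(1)\}$$

and

$$\{\alpha^{\langle g^{n} \rangle_P}: n = 0, 1, 2, 3\} = \{\alpha^{1}, \alpha^{2}, \alpha^{4}, \alpha^{3}\}$$

This implies that the convolution is now written

$$\{x(3), x(4), x(2), x(1)\} \circledast \{\alpha^{1}, \alpha^{2}, \alpha^{4}, \alpha^{3}\}$$

As stated earlier, we restrict $\alpha = 2$, hence eqns. 8 and 10 become

$$\begin{bmatrix} X'(2) \\ X'(4) \\ X'(3) \\ X'(1) \end{bmatrix} = \begin{bmatrix} 2^1 & 2^3 & 2^4 & 2^2 \\ 2^2 & 2^1 & 2^3 & 2^4 \\ 2^4 & 2^2 & 2^1 & 2^3 \\ 2^3 & 2^4 & 2^2 & 2^1 \end{bmatrix} \begin{bmatrix} x(3) \\ x(4) \\ x(2) \\ x(1) \end{bmatrix} = \begin{bmatrix} x(4) & x(2) & x(1) & x(3) \\ x(2) & x(1) & x(3) & x(4) \\ x(1) & x(3) & x(4) & x(2) \\ x(3) & x(4) & x(2) & x(1) \end{bmatrix} \begin{bmatrix} 2^3 \\ 2^4 \\ 2^2 \\ 2^1 \end{bmatrix} \tag{11}$$

It is evident from the above equation that the square data matrix is circulant.

Now let us consider the convolution of two sequences, {x(n): n = 0, 1, ..., P − 1} and {h(n): n = 0, 1, ..., P − 1},

$$y(m) = \sum_{n=0}^{P-1} x(n)\,h(\langle m-n \rangle_P) \tag{12}$$

for m = 0, 1, ..., P − 1. For the computation of eqn. 12 using MNT, y(m) can be found from the inverse Mersenne number transform (IMNT),

$$y(m) = \left\langle P^{-1} \sum_{k=0}^{P-1} Y(k)\,\alpha^{-mk} \right\rangle_M \tag{13}$$

for m = 0, 1, ..., P − 1. The coefficients Y(k) are the products of the term-by-term multiplications of corresponding X(k) and H(k), and H(k) is the MNT of the sequence {h(n): n = 0, 1, ..., P − 1}.

Now, let us make the following mapping to eqn. 13:

$$m \to \langle -g^{-m} \rangle_P, \qquad k \to \langle g^{k} \rangle_P$$

from which we obtain

$$y(\langle -g^{-m} \rangle_P) = \left\langle P^{-1} \left[ Y(0) + \sum_{k=1}^{P-1} Y(\langle g^{k} \rangle_P)\,\alpha^{\langle g^{k-m} \rangle_P} \right] \right\rangle_M \tag{14}$$

or, writing $y(\cdot) = \langle P^{-1} [Y(0) + y'(\cdot)] \rangle_M$,

$$y'(\langle -g^{-m} \rangle_P) = \left\langle \sum_{k=1}^{P-1} Y(\langle g^{k} \rangle_P)\,\alpha^{\langle g^{k-m} \rangle_P} \right\rangle_M \tag{15}$$

for m = 1, 2, ..., P − 1, or

$$\begin{bmatrix} y'(\langle -g^{-1} \rangle_P) \\ y'(\langle -g^{-2} \rangle_P) \\ \vdots \\ y'(\langle -g^{-P+1} \rangle_P) \end{bmatrix} = \begin{bmatrix} \alpha^{\langle g^{0} \rangle_P} & \alpha^{\langle g^{1} \rangle_P} & \cdots & \alpha^{\langle g^{P-2} \rangle_P} \\ \alpha^{\langle g^{-1} \rangle_P} & \alpha^{\langle g^{0} \rangle_P} & \cdots & \alpha^{\langle g^{P-3} \rangle_P} \\ \vdots & \vdots & & \vdots \\ \alpha^{\langle g^{2-P} \rangle_P} & \alpha^{\langle g^{3-P} \rangle_P} & \cdots & \alpha^{\langle g^{0} \rangle_P} \end{bmatrix} \begin{bmatrix} Y(\langle g^{1} \rangle_P) \\ Y(\langle g^{2} \rangle_P) \\ \vdots \\ Y(\langle g^{P-1} \rangle_P) \end{bmatrix} \tag{16}$$

If P = 5, we get

$$\begin{bmatrix} y'(2) \\ y'(1) \\ y'(3) \\ y'(4) \end{bmatrix} = \begin{bmatrix} 2^1 & 2^2 & 2^4 & 2^3 \\ 2^3 & 2^1 & 2^2 & 2^4 \\ 2^4 & 2^3 & 2^1 & 2^2 \\ 2^2 & 2^4 & 2^3 & 2^1 \end{bmatrix} \begin{bmatrix} Y(2) \\ Y(4) \\ Y(3) \\ Y(1) \end{bmatrix}$$

Eqns. 8, 10, 11 and 16 are four essential expressions that are used for the hardware design in the next Section.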
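The reordering of eqns. 3-10 can be checked numerically. The following is a minimal sketch, not the authors' implementation: the helper names are hypothetical, the primitive root is found by brute force, and the result is compared with the direct definition of eqn. 1.

```python
def primitive_root(P):
    """Smallest primitive root of the prime P (brute force, adequate for small P)."""
    for g in range(2, P):
        if len({pow(g, k, P) for k in range(1, P)}) == P - 1:
            return g
    raise ValueError("no primitive root found")


def mnt_direct(x, P, alpha=2):
    """Eqn. 1 evaluated directly."""
    M = (1 << P) - 1
    return [sum(x[n] * pow(alpha, n * k, M) for n in range(P)) % M
            for k in range(P)]


def mnt_via_convolution(x, P, alpha=2):
    """Eqns. 3, 5 and 9: the MNT as a length-(P-1) circular convolution."""
    M = (1 << P) - 1
    g = primitive_root(P)
    g_inv = pow(g, -1, P)
    # data[i] = x(g^-(i+1) mod P), i.e. {x(3), x(4), x(2), x(1)} for P = 5
    data = [x[pow(g_inv, n, P)] for n in range(1, P)]
    # kernel[t-1] = alpha^(g^-t mod P), i.e. {2^3, 2^4, 2^2, 2^1} for P = 5
    kernel = [pow(alpha, pow(g_inv, t, P), M) for t in range(1, P)]
    X = [0] * P
    X[0] = sum(x) % M                                  # eqn. 3
    for k in range(1, P):
        acc = sum(data[(k + t - 1) % (P - 1)] * kernel[t - 1]
                  for t in range(1, P))                # eqn. 9
        X[pow(g, k, P)] = (x[0] + acc) % M             # eqn. 5
    return X


x = [3, 1, 4, 1, 5]
assert mnt_via_convolution(x, 5) == mnt_direct(x, 5)
```

Each output needs the same kernel and a data sequence rotated by one place, which is the circulant structure that the hardware in the next Section exploits.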

3 Hardware description

Our hardware design for the computation of Mersenne number transforms and inverse Mersenne number transforms relies mainly on the fact that the basic equations of the MNT and IMNT can be converted into convolution forms. Fig. 1 shows the register arrangement of a five-point Mersenne number transform unit. The registers used in this circuit are D-type registers (or flip-flops) with three-state outputs and data enable input controls such that all outputs and inputs can be tied to the same bus.

Fig. 1 Register arrangement for a 5-point Mersenne number transform unit (registers 4, 3, 2 and 1, labelled $2^1$, $2^2$, $2^4$ and $2^3$ respectively, and a buffer, all tied to a common bus)

Multiplication by a power of two in MNT arithmetic means a circular shift only. However, circular shifts may still be time consuming for very fast implementations of digital signal processors. In our design, a circular shift is done at the same time as the data is added from the registers into the accumulator (not shown in Fig. 1). For example, if register 3 is activated (by its output control lines) to send data out to be added into the accumulator, the input data enable line of register 2 should also be activated. The contents of registers 3 and 2 should be equal to the original data multiplied by $2^2$ and $2^4$, respectively. The circular shift of two bits ($2^{4-2} = 2^2$) when the contents of register 3 are loaded into register 2 is done simply by rearranging (shifting circularly) the pin connections of register 2, as shown in Fig. 1. The shifting time is thus entirely saved.
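The equivalence between multiplication by a power of two modulo $2^P - 1$ and a circular shift of a P-bit word can be illustrated with a short sketch. The function name `rotl` is an illustrative choice, and the check is limited to words in the range 0 to M − 1 (the all-ones word is the second representation of zero in one's-complement form).

```python
def rotl(value, shift, P):
    """Rotate a P-bit word left by `shift` bit positions."""
    shift %= P
    mask = (1 << P) - 1
    return ((value << shift) | (value >> (P - shift))) & mask


P = 5
M = (1 << P) - 1
# Multiplying by 2^k modulo 2^P - 1 gives the same word as rotating left by k bits.
assert all(rotl(v, k, P) == (v * 2 ** k) % M
           for v in range(M) for k in range(P))
```

In hardware the rotation costs nothing, since it is only a permutation of the wires between two registers.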

Suppose that the contents of registers 1-4 are x(4), x(2), x(1) and x(3), respectively, and the accumulator has been loaded with x(0) initially. The addition of the contents of register 1 to the accumulator will transfer x(4) to the buffer, the addition of the contents of register 2 will transfer x(2) to register 1, and so on. Hence after four additions, X(2) will be stored inside the accumulator. Exactly the same procedure (addition starting from register 1, register 2, ..., register 4) is required for finding the other X(k)s, by virtue of the circulant form of the data matrix. The correctness of the operation may be seen by considering eqn. 11. The total number of additions for the whole transform is $P^2$ for a P-point MNT, if the initial loading of x(0) into the accumulator is taken to be equivalent to one addition. Notice that the control of data flow is very simple and no scrambling of data is needed at any intermediate step of the transformation.
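The register procedure can be mimicked in software as a rough behavioural sketch. It is a simplification under stated assumptions: the registers here hold unweighted data and the fixed weights $2^3, 2^4, 2^2, 2^1$ are applied at the adder, whereas in the circuit the weighting is produced by the wiring between registers. The handling of X(0) as a plain sum of P terms is also an assumption, and `mnt_register_unit` is a hypothetical name.

```python
def mnt_register_unit(x, P=5, g=2):
    """Behavioural model of the MNT unit; g must be a primitive root of P."""
    M = (1 << P) - 1
    g_inv = pow(g, -1, P)
    # Weights seen by registers 1..P-1: 2^(g^-t mod P) -> 2^3, 2^4, 2^2, 2^1 for P = 5
    weights = [pow(2, pow(g_inv, t, P), M) for t in range(1, P)]
    # Initial register contents: x(4), x(2), x(1), x(3) for P = 5
    regs = [x[pow(g_inv, t + 1, P)] for t in range(1, P)]
    X = [0] * P
    additions = 0
    acc = 0
    for value in x:                      # X(0): sum of all inputs (assumed handling)
        acc = (acc + value) % M
        additions += 1
    X[0] = acc
    for k in range(1, P):
        acc = x[0] % M                   # initial load, counted as one addition
        additions += 1
        for r, w in zip(regs, weights):
            acc = (acc + r * w) % M
            additions += 1
        X[pow(g, k, P)] = acc            # outputs appear in the order X(2), X(4), X(3), X(1)
        regs = regs[1:] + regs[:1]       # register 2 -> register 1, ..., buffer -> register P-1
    return X, additions


X, additions = mnt_register_unit([1, 2, 3, 4, 5])
assert additions == 5 ** 2
```

The same inner loop is repeated for every coefficient; only the register contents move, which is why no data scrambling is needed.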

The initial setting of the registers can be done by loading registers 1-4 with x(4), x(2), x(1) and x(3), respectively, and then performing one self-read/write (load and store from the same register) operation on each register individually.

Fig. 2 shows a pipeline arrangement of a five-point convolver using a Mersenne number transform. The upper register set and adder 1 are used for the Mersenne number transform. The lower register set and adder 3 are used for the inverse Mersenne number transform. Accumulator 2, register B and adder 2 are used for the computation of the product of H(k) and X(k) by the shift-add technique. After the first X(k) has been obtained in accumulator 1, it is shifted down into shift register A, which is used to control the addition of H(k) into accumulator 2. If there is a one at the rightmost bit of shift register A, H(k) is added into accumulator 2, which is then shifted to the left circularly. If there is a zero at the rightmost bit of shift register A, accumulator 2 shifts to the left circularly without addition. These shift-add operations are done simultaneously with the operation of the MNT unit.

Fig. 2 Circuit diagram of a convolver using Mersenne number transform (upper registers A1-A4 with buffer, adder 1 and accumulator 1; shift register A, ROM or RAM holding H(k), register B, adder 2 and accumulator 2; control; lower register C, adder 3 and registers C1-C4 with buffer)
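The shift-add multiplication can be sketched as follows. One detail is an assumption, since the text does not state it: shift register A is taken to rotate left circularly after each step, so that its bits reach the rightmost position in the order $b_0, b_{P-1}, \ldots, b_1$. With that assumption, P steps leave the product modulo M in the accumulator. The function names are illustrative.

```python
def rotl1(v, P):
    """Rotate a P-bit word left by one bit."""
    mask = (1 << P) - 1
    return ((v << 1) | (v >> (P - 1))) & mask


def shift_add_multiply(X, H, P):
    """X * H modulo 2^P - 1 using only additions and circular shifts."""
    M = (1 << P) - 1
    a = X                # shift register A, controls the additions
    acc = 0              # accumulator 2
    for _ in range(P):
        if a & 1:        # a one at the rightmost bit: add H(k)
            acc += H
            acc = (acc & M) + (acc >> P)   # end-around carry (one's-complement add)
        acc = rotl1(acc, P)                # accumulator shifts left circularly
        a = rotl1(a, P)                    # assumed: register A also rotates left
    return acc % M       # map the all-ones word to zero


assert shift_add_multiply(3, 5, 3) == (3 * 5) % 7
```

After P rotations the accumulator has gone round exactly once, so the bit examined at step i ends up weighted by the correct power of two.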

Similarly, immediately after the first product X(k)H(k) has been obtained, the inverse Mersenne number transform can be started. The circuit for the IMNT is constructed mainly according to eqn. 16. Instead of rotating the input words as in the MNT unit, the result registers (of the y'(m)s) are rotated to the left after each addition. For example, when the contents of register C3 are added to register C, the sum is entered into register C4. Since rotations are done on the result registers, the multiplication of the input data by $2^m$ is implemented by the multiplication of the result registers by $2^{P-m}$. As in the MNT, the multiplication by powers of two is done simply by hardware, and the initial setting of the result registers can be done by one self-read/write operation on each register individually. After the last X(k) has been calculated, registers A1-A4 are released from their original use. At the same time the IMNT unit should also be at the last cycle of its operation. The data in register C are added to the contents of registers C1-C4; the results can then be entered into registers A1-A4 in such a way that non-rotated results are obtained.

Hence it is seen that this pipeline convolver requires approximately $P^2 T$ ns for the computation of a P-point cyclic convolution using the Mersenne number transform, where T is the time for one addition. Using 17-bit adders similar to those used in Reference 8, for which the time for one addition is 21 ns, the total convolution time is approximately $17^2 \times 21$ ns or 7 µs for a 17-point convolution. Notice that the whole design is very simple. It involves only a very simple control unit, 3 adders, 41 registers or accumulators, 17 words of ROM or RAM and some buffers.
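The three stages of the convolver (MNT, term-by-term product, IMNT) can be put together in a purely arithmetic sketch that ignores the pipelining. The names are illustrative, and the result equals the true cyclic convolution only when every output value is smaller than M.

```python
def mnt(x, P):
    """Forward transform with alpha = 2, modulo M = 2^P - 1."""
    M = (1 << P) - 1
    return [sum(x[n] * pow(2, n * k, M) for n in range(P)) % M for k in range(P)]


def imnt(Y, P):
    """Inverse transform with alpha = 2, modulo M = 2^P - 1."""
    M = (1 << P) - 1
    P_inv, two_inv = pow(P, -1, M), pow(2, -1, M)
    return [P_inv * sum(Y[k] * pow(two_inv, m * k, M) for k in range(P)) % M
            for m in range(P)]


def mnt_cyclic_convolution(x, h, P):
    """Cyclic convolution y = x (*) h computed as IMNT(MNT(x) . MNT(h))."""
    M = (1 << P) - 1
    X, H = mnt(x, P), mnt(h, P)
    Y = [X[k] * H[k] % M for k in range(P)]   # one multiplication per point
    return imnt(Y, P)


x = [1, 2, 3, 4, 5]
h = [1, 0, 2, 0, 1]
direct = [sum(x[n] * h[(m - n) % 5] for n in range(5)) for m in range(5)]
assert mnt_cyclic_convolution(x, h, 5) == direct
```

In the pipelined circuit the same three stages overlap in time, so the total time is dominated by the $P^2$ additions of the transform.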

4 Speed optimised convolver using Mersenne number transform

In the previous Section, the Mersenne number transform convolver was built on a pipeline and serial-data movement basis. Compared with multipliers, adders are much less expensive; hence we may use more adders to construct a parallel processor to speed up the computation. Fig. 3 shows a 17-point fast Mersenne number convolver using parallel adders. The principle of operation is very similar to that described in the previous Section. Register A holds the input data, which is added into registers B1-B17 in parallel. The results of each parallel addition are stored into the respective right-hand-side registers. This arrangement is equivalent to shifting these sum registers to line up properly for the next addition. After the last x(n) has been added, the sums are shifted down to registers C1-C17 and then sent serially to the multiplier for multiplication by the H(m)s. After the first product has been obtained, it is sent to register A for computation of the inverse Mersenne number transform. Comparing eqn. 8 with eqn. 16, it is clear that the IMNT and MNT essentially have the same structure. The only difference is that the directions of data flow (or equivalently of $\alpha^n$ flow) are opposite. Hence for the IMNT, the results of each parallel addition are stored into the respective left-hand-side registers (not shown on the diagram).

Fig. 3 Fast Mersenne number convolver using parallel adders (registers B1-B17 and C1-C17, with ROM or RAM holding H(k))

The convolution time for this design is approximately equal to the time for 17 additions and 17 multiplications. A fast Mersenne number multiplier can be constructed by (i) modifying a commercially available multiplier for MNT arithmetic or (ii) using MSI or LSI to construct a dedicated MNT multiplier [18, 19] or, indeed, (iii) fabricating a dedicated IC for this purpose. Fig. 4 shows a possible construction of a 17-point MNT multiplier using the quarter-square multiplier

$$X_m H_m = \tfrac{1}{4}\left[(X_m + H_m)^2 - (X_m - H_m)^2\right]$$

This multiplier requires 86 ns for each multiplication (no attempt has been made to optimise the speed of the design). Hence the total execution time for a 17-point convolution using this multiplier is 17 × (86 + 21) ≈ 1.8 µs. Notice that the execution time of the present design depends mainly on the speed of the multiplier and adders.
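The quarter-square identity carries over to arithmetic modulo $2^P - 1$, where the division by four becomes a multiplication by $2^{P-2}$, that is, a circular shift. The sketch below is only illustrative: a single table of squares stands in for the ROMs of Fig. 4, whose actual partitioning is not reproduced here, and the function names are hypothetical.

```python
def build_square_table(P):
    """Table of v^2 modulo M = 2^P - 1, standing in for the look-up ROMs."""
    M = (1 << P) - 1
    return [v * v % M for v in range(M)]


def quarter_square_multiply(X, H, P, squares):
    """X * H modulo M from two table look-ups, one subtraction and one shift."""
    M = (1 << P) - 1
    s = squares[(X + H) % M]          # (X + H)^2
    d = squares[(X - H) % M]          # (X - H)^2
    quarter = pow(2, P - 2, M)        # 1/4 modulo M, since 4 * 2^(P-2) = 2^P = 1
    return (s - d) * quarter % M


P = 5
table = build_square_table(P)
assert all(quarter_square_multiply(a, b, P, table) == (a * b) % 31
           for a in range(31) for b in range(31))
```

The only general operation left is the table look-up, which is why the multiplier can be built from read-only memories and adders.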

Fig. 4 17-bit Mersenne number multiplier (ROM 1, 256 words; ROM 2, 512 words; ROM 3, 256 words; ROM 4, 512 words; 17-bit data paths)


5 Long convolution from short convolution

Possible word lengths for the Mersenne number transform implementation of convolvers using the present techniques are {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67}. These are also the possible sequence lengths for the MNTs. However, the sequence length of the MNT can be extended to 2P when −2 is chosen as the root of unity. In this case

$$X(k) = \left\langle \sum_{n=0}^{2P-1} x(n)\,\alpha^{nk} \right\rangle_M \tag{17}$$

where
$\alpha = -2$
$M = 2^P - 1$
for k = 0, 1, ..., 2P − 1. Eqn. 17 can be mapped into a two-dimensional form [20] through the following mapping:

$$n = \langle 2 n_1 + P n_2 \rangle_{2P}$$

$$k = \langle 2 \cdot 2^{-1} k_1 + P P^{-1} k_2 \rangle_{2P}$$

where
$P P^{-1} \equiv 1 \bmod 2$ and $\equiv 0 \bmod P$
$2 \cdot 2^{-1} \equiv 1 \bmod P$ and $\equiv 0 \bmod 2$
for
$n_1, k_1 = 0, 1, \ldots, P - 1$
$n_2, k_2 = 0, 1$

Hence the two-dimensional form becomes

$$X(k_1, k_2) = \left\langle \sum_{n_2=0}^{2-1} \sum_{n_1=0}^{P-1} x(n_1, n_2)\,\alpha_1^{n_1 k_1}\,\alpha_2^{n_2 k_2} \right\rangle_M \tag{18}$$

where $\alpha_1, \alpha_2 = \alpha^2, \alpha^P$, respectively.

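Rewriting eqn. 18 in vector form,

$$\mathbf{X}_{k_2} = \left\langle \sum_{n_2=0}^{1} (A\,\mathbf{x}_{n_2})\,\alpha_2^{n_2 k_2} \right\rangle_M \tag{19}$$

for $k_2 = 0, 1$, where

$$\mathbf{X}_{k_2} = \begin{bmatrix} X(0, k_2) \\ X(1, k_2) \\ \vdots \\ X(P-1, k_2) \end{bmatrix}, \qquad \mathbf{x}_{n_2} = \begin{bmatrix} x(0, n_2) \\ x(1, n_2) \\ \vdots \\ x(P-1, n_2) \end{bmatrix}, \qquad A = \begin{bmatrix} \alpha_1^{0} & \alpha_1^{0} & \cdots & \alpha_1^{0} \\ \alpha_1^{0} & \alpha_1^{1} & \cdots & \alpha_1^{P-1} \\ \vdots & \vdots & & \vdots \\ \alpha_1^{0} & \alpha_1^{P-1} & \cdots & \alpha_1^{(P-1)^2} \end{bmatrix}$$

We now have

$$\mathbf{X}_0 = \langle A\,\mathbf{x}_0 + A\,\mathbf{x}_1 \rangle_M \tag{20}$$

$$\mathbf{X}_1 = \langle A\,\mathbf{x}_0 - A\,\mathbf{x}_1 \rangle_M \tag{21}$$

for $\alpha_2^0 = 1$, $\alpha_2^1 = -1$.

$A\mathbf{x}_0$ and $A\mathbf{x}_1$ represent simply two P-point Mersenne number transforms. Hence the circuits discussed above are suitable for the implementation of a 2P-length convolution, and the total computation time is approximately twice that of a P-length convolution.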
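The decomposition of the 2P-point transform into two P-point transforms can be verified with a small sketch. The index maps used are $n = \langle 2n_1 + Pn_2 \rangle_{2P}$ and $k = \langle (P+1)k_1 + Pk_2 \rangle_{2P}$, the latter being the explicit form of the mapping for k when P is odd; the function names are illustrative.

```python
def mnt_2p_direct(x, P):
    """Eqn. 17 evaluated directly: length 2P, alpha = -2, modulo M = 2^P - 1."""
    M = (1 << P) - 1
    alpha = M - 2                     # -2 modulo M
    N = 2 * P
    return [sum(x[n] * pow(alpha, n * k, M) for n in range(N)) % M
            for k in range(N)]


def mnt_2p_split(x, P):
    """Eqns. 20 and 21: two P-point MNTs with alpha_1 = 4, then a sum and a difference."""
    M = (1 << P) - 1
    N = 2 * P

    def mnt_p(v):                     # P-point transform with root alpha^2 = 4
        return [sum(v[n1] * pow(4, n1 * k1, M) for n1 in range(P)) % M
                for k1 in range(P)]

    x0 = [x[(2 * n1) % N] for n1 in range(P)]        # n2 = 0
    x1 = [x[(2 * n1 + P) % N] for n1 in range(P)]    # n2 = 1
    A0, A1 = mnt_p(x0), mnt_p(x1)
    X = [0] * N
    for k1 in range(P):
        X[((P + 1) * k1) % N] = (A0[k1] + A1[k1]) % M        # k2 = 0, eqn. 20
        X[((P + 1) * k1 + P) % N] = (A0[k1] - A1[k1]) % M    # k2 = 1, eqn. 21
    return X


x = list(range(1, 11))                # P = 5, length 2P = 10
assert mnt_2p_split(x, 5) == mnt_2p_direct(x, 5)
```

The factor $\alpha_2 = \alpha^P = -1$ needs no multiplication at all, so the second dimension costs only one addition and one subtraction per point.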

For longer and composite lengths, one may transform the convolution into a two-dimensional cyclic convolution form

$$y(m_2, m_1) = \sum_{n_2=0}^{P_2-1} \sum_{n_1=0}^{P_1-1} x(n_2, n_1)\,h(\langle m_2 - n_2 \rangle_{P_2}, \langle m_1 - n_1 \rangle_{P_1})$$

where
$m_2 = 0, 1, \ldots, P_2 - 1$
$m_1 = 0, 1, \ldots, P_1 - 1$
and use two different Mersenne number transforms to carry out the computation [21]. This method needs one multiplication per point; however, the price to be paid for using this approach is that arithmetic with double word lengths has to be used on the second dimension of the convolution. On the other hand, one may also use the MNT for the computation in one dimension and use Winograd's eight short-convolution algorithms [22-24] for the computation in the other dimension. This method requires a total of $P_1 M_2$ multiplications for the whole computation, where $M_2$ is the number of multiplications for a $P_2$-point cyclic convolution using Winograd's short-convolution algorithms.
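A short sketch may help to show why a one-dimensional cyclic convolution of composite length $P_1 P_2$ can be recast in the two-dimensional form. The index map used here, $n \to (n \bmod P_2, n \bmod P_1)$ for coprime $P_1$ and $P_2$, is the standard Chinese-remainder mapping and is an assumption, since the text does not spell the mapping out. Plain integer arithmetic is used in place of the transforms.

```python
def cyclic_convolution(x, h):
    """One-dimensional cyclic convolution of two equal-length sequences."""
    N = len(x)
    return [sum(x[n] * h[(m - n) % N] for n in range(N)) for m in range(N)]


def cyclic_convolution_2d(x2, h2, P2, P1):
    """Two-dimensional cyclic convolution of P2-by-P1 arrays."""
    return [[sum(x2[n2][n1] * h2[(m2 - n2) % P2][(m1 - n1) % P1]
                 for n2 in range(P2) for n1 in range(P1))
             for m1 in range(P1)]
            for m2 in range(P2)]


def to_2d(v, P2, P1):
    """Map index n to (n mod P2, n mod P1); a bijection when gcd(P1, P2) = 1."""
    out = [[0] * P1 for _ in range(P2)]
    for n, value in enumerate(v):
        out[n % P2][n % P1] = value
    return out


P2, P1 = 3, 5
x = list(range(1, 16))
h = [(7 * n + 2) % 11 for n in range(15)]
y = cyclic_convolution(x, h)
assert to_2d(y, P2, P1) == cyclic_convolution_2d(to_2d(x, P2, P1), to_2d(h, P2, P1), P2, P1)
```

Each dimension is then a short cyclic convolution in its own right and can be handled by a separate transform or short-convolution algorithm.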

6 Extension to Fermat number transforms and other number theoretic transforms

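Since

$$nk = \tfrac{1}{2}\left[n^2 + k^2 - (n - k)^2\right]$$

the Mersenne number transform of eqn. 1 can be expressed in the following form:

$$X(k) = \left\langle \alpha^{k^2/2} \sum_{n=0}^{P-1} \left[x(n)\,\alpha^{n^2/2}\right] \alpha^{-(n-k)^2/2} \right\rangle_M \tag{22}$$

For eqn. 22 to be usable, the square root of $\alpha$ must exist in order to carry out the computation. For the Mersenne number transform with $\alpha = 2$, $\sqrt{2}$ always exists and is given by

$$\sqrt{2} \equiv 2^{(P+1)/2} \pmod{M}$$

which is in the desirable form of a power of 2. Similarly, the inverse Mersenne number transform, eqn. 13, can also be written as

$$y(m) = \left\langle P^{-1} \alpha^{-m^2/2} \sum_{k=0}^{P-1} \left[Y(k)\,\alpha^{-k^2/2}\right] \alpha^{(m-k)^2/2} \right\rangle_M \tag{23}$$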
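The chirp-like form of eqn. 22 can be checked against the direct definition. This is a minimal sketch for odd prime P with $\alpha = 2$, in which $\alpha^{n^2/2}$ is computed as $(\sqrt{2})^{n^2}$ with $\sqrt{2} = 2^{(P+1)/2}$; the function names are illustrative.

```python
def mnt_direct(x, P):
    """Eqn. 1 with alpha = 2."""
    M = (1 << P) - 1
    return [sum(x[n] * pow(2, n * k, M) for n in range(P)) % M for k in range(P)]


def mnt_chirp(x, P):
    """Eqn. 22: pre-multiply, convolve with alpha^(-(n-k)^2/2), post-multiply."""
    M = (1 << P) - 1
    root = pow(2, (P + 1) // 2, M)          # sqrt(2) modulo M, itself a power of two
    root_inv = pow(root, -1, M)
    pre = [x[n] * pow(root, n * n, M) % M for n in range(P)]      # x(n) alpha^(n^2/2)
    X = []
    for k in range(P):
        acc = sum(pre[n] * pow(root_inv, (n - k) ** 2, M) for n in range(P))
        X.append(pow(root, k * k, M) * acc % M)                   # alpha^(k^2/2) factor
    return X


P = 7
M = (1 << P) - 1
assert pow(2, (P + 1) // 2, M) ** 2 % M == 2     # 2^((P+1)/2) squares to 2
x = [3, 1, 4, 1, 5, 9, 2]
assert mnt_chirp(x, P) == mnt_direct(x, P)
```

Every factor in the sketch is a power of two modulo M, so in hardware each multiplication reduces to a circular shift.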

Similar to eqns. 6 and 15, eqns. 22 and 23 represent the Mersenne number transform and the inverse Mersenne number transform in circular convolution forms. Unlike the multiplications by $W^{n^2/2}$ and $W^{k^2/2}$ in the chirp z-transform [25, 26], for which complex multiplications are required, the multiplications by $\alpha^{n^2/2}$ and $\alpha^{k^2/2}$ required in the above form can be implemented simply by shifts. It is interesting to note that for actual implementation of the MNT pair, eqns. 22 and 23 for example, the terms $\alpha^{k^2/2}$ and $\alpha^{-k^2/2}$ cancel out. This leaves the terms $\alpha^{n^2/2}$ and $\alpha^{-m^2/2}$ to be implemented. Besides the additional shifts that have to be used to implement $\alpha^{n^2/2}$ and $\alpha^{-m^2/2}$, another disadvantage of this formulation is that the sequence $\{\alpha^{-(n-k)^2/2}: k, n = 0, 1, \ldots, P-1\}$ in eqn. 22 is different in structure from the sequence $\{\alpha^{(m-k)^2/2}: k, m = 0, 1, \ldots, P-1\}$ in eqn. 23. Hence, slightly different circuits have to be used for the MNT and the IMNT in the present form. This disadvantage may be removed by rearranging eqn. 23 as follows:

$$y(m) = \left\langle P^{-1} \alpha^{m^2/2} \sum_{k=0}^{P-1} \left[Y(k)\,\alpha^{k^2/2}\right] \alpha^{-(m+k)^2/2} \right\rangle_M \tag{24}$$

Hence the sequences $\{\alpha^{-(n-k)^2/2}: k, n = 0, 1, \ldots, P-1\}$ and $\{\alpha^{-(m+k)^2/2}: k, m = 0, 1, \ldots, P-1\}$ now have the same structure and can be generated from the same original sequence by rotating in opposite directions. However, the product of $\alpha^{k^2/2}$ in eqn. 22 and $\alpha^{k^2/2}$ in eqn. 24 involves no cancellation effect in this case, and hence some extra shifts have to be used to compute this part.

The most desirable and attractive feature of this formulation is that the input sequence and the output sequence are handled in their natural order, which simplifies the design of the control unit. Notice also that eqns. 22-24 do not restrict the sequence length to being a prime number, implying that this formulation is also suitable for other number theoretic transforms.
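The rearranged inverse of eqn. 24 rests on the identity $-mk = \tfrac{1}{2}[m^2 + k^2 - (m+k)^2]$. A brief sketch, with illustrative names and odd prime P, confirms that it recovers the input of a forward transform with the data in natural order.

```python
def mnt_direct(x, P):
    """Eqn. 1 with alpha = 2."""
    M = (1 << P) - 1
    return [sum(x[n] * pow(2, n * k, M) for n in range(P)) % M for k in range(P)]


def imnt_chirp(Y, P):
    """Eqn. 24: inverse transform using the kernel alpha^(-(m+k)^2/2)."""
    M = (1 << P) - 1
    root = pow(2, (P + 1) // 2, M)          # sqrt(2) modulo M
    root_inv = pow(root, -1, M)
    P_inv = pow(P, -1, M)
    pre = [Y[k] * pow(root, k * k, M) % M for k in range(P)]      # Y(k) alpha^(k^2/2)
    y = []
    for m in range(P):
        acc = sum(pre[k] * pow(root_inv, (m + k) ** 2, M) for k in range(P))
        y.append(P_inv * pow(root, m * m, M) * acc % M)           # P^-1 alpha^(m^2/2)
    return y


x = [1, 2, 3, 4, 5]
assert imnt_chirp(mnt_direct(x, 5), 5) == x
```

The kernel depends on m + k here and on n − k in eqn. 22, so the same stored sequence serves both transforms when it is read in opposite directions.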

For example, a Fermat number transform pair can be written as

$$X(k) = \left\langle \alpha^{k^2/2} \sum_{n=0}^{N-1} \left[x(n)\,\alpha^{n^2/2}\right] \alpha^{-(n-k)^2/2} \right\rangle_{F_t} \tag{25}$$

$$x(n) = \left\langle N^{-1} \alpha^{n^2/2} \sum_{k=0}^{N-1} \left[X(k)\,\alpha^{k^2/2}\right] \alpha^{-(n+k)^2/2} \right\rangle_{F_t} \tag{26}$$

for
n, k = 0, 1, ..., N − 1
$F_t = 2^{2^t} + 1$, a Fermat number, and t = 1, 2, ...

If $\alpha$ is chosen as 2, $\sqrt{2}$ can be represented by [3, 27]

$$\alpha^{1/2} = \sqrt{2} = 2^{2^{t-2}}\left(2^{2^{t-1}} - 1\right)$$

This is not the simplest representation. It would be better to choose $\alpha$ to be four or a power of four, but this will unavoidably shorten the sequence lengths available for Fermat number transforms. It should be further remembered that the arithmetic used for Fermat number transforms is more complicated than the arithmetic used in Mersenne number transforms. Hence special arithmetic [8, 9] may have to be used in the computation of the Fermat number transform, thereby making the FNT less attractive than the MNT.
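The expression for $\sqrt{2}$ modulo a Fermat number is easy to confirm numerically. The sketch below assumes $t \ge 2$ so that the exponent $2^{t-2}$ is an integer; `fermat_sqrt2` is a hypothetical helper name.

```python
def fermat_sqrt2(t):
    """Return (sqrt(2) modulo F_t, F_t) for F_t = 2^(2^t) + 1, t >= 2."""
    F = (1 << (1 << t)) + 1
    root = pow(2, 1 << (t - 2), F) * (pow(2, 1 << (t - 1), F) - 1) % F
    return root, F


for t in range(2, 6):
    root, F = fermat_sqrt2(t)
    assert root * root % F == 2       # the expression squares to 2 modulo F_t
```

Unlike the Mersenne case, the root is a difference of two powers of two, so multiplying by it takes two shifts and a subtraction rather than a single shift.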

7 Discrete Fourier transform

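Recently Siu and Constantinides [28] have shown that number theoretic transforms can be used to calculate the discrete Fourier transform (DFT) very effectively. If q is a prime number, the q-point DFT is defined as

$$Y(k) = \sum_{n=0}^{q-1} x(n)\,W_0^{nk} \tag{27}$$

where
$k = 0, 1, \ldots, q - 1$
$W_0 = e^{-j(2\pi/q)}$

Eqn. 27 can also be written as

$$Y(0) = \sum_{n=0}^{q-1} x(n) \tag{28}$$

$$Y(\langle g^{k} \rangle_q) = x(0) + Y'(\langle g^{k} \rangle_q) \tag{29}$$

where

$$Y'(\langle g^{k} \rangle_q) = \sum_{n=1}^{q-1} x(\langle g^{-n} \rangle_q)\,W_0^{\langle g^{k-n} \rangle_q} \tag{30}$$

g = primitive root used to generate all nonzero elements inside the field modulo q.

The backward circular convolution in eqn. 30 can also be expressed as

$$\{x_0, x_1, \ldots, x_{q-2}\} \circledast \{W_0, W_1, \ldots, W_{(q-1)/2-1}, W_0^*, W_1^*, \ldots, W_{(q-1)/2-1}^*\} \tag{31}$$

where

$$x_n = x(\langle g^{-(n+1)} \rangle_q) \tag{32}$$

$$W_n = W_0^{\langle g^{n} \rangle_q} \tag{33}$$

for n = 0, 1, ..., q − 2, and * and $\circledast$ mean complex conjugate and cyclic convolution, respectively.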

This length-(q − 1) convolution sum can be computed by any number theoretic transform. The real and imaginary parts of the number theoretic transformed results of $\{W_0, W_1, \ldots, W_{(q-1)/2-1}, W_0^*, W_1^*, \ldots, W_{(q-1)/2-1}^*\}$ are alternately zero [28], and hence the total number of real multiplications required is only (q − 1) for a q-point DFT.
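The conversion of a prime-length DFT into a length-(q − 1) cyclic convolution can be illustrated in floating-point arithmetic. This sketch evaluates the convolution directly rather than through a number theoretic transform, and only demonstrates the reordering and the conjugate symmetry of the kernel; the function names are illustrative.

```python
import cmath


def dft_direct(x):
    """q-point DFT from the definition."""
    q = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / q) for n in range(q))
            for k in range(q)]


def reordered_indices(q, g):
    """Indices g^-(n+1) mod q for n = 0..q-2, the order of the data sequence."""
    g_inv = pow(g, -1, q)
    return [pow(g_inv, n + 1, q) for n in range(q - 1)]


def kernel(q, g):
    """W_n = W0^(g^n mod q) for n = 0..q-2."""
    W0 = cmath.exp(-2j * cmath.pi / q)
    return [W0 ** pow(g, n, q) for n in range(q - 1)]


def dft_by_cyclic_convolution(x, g):
    """DFT of prime length q via the backward circular convolution; g is a primitive root of q."""
    q = len(x)
    xs = [x[i] for i in reordered_indices(q, g)]
    Ws = kernel(q, g)
    Y = [0j] * q
    Y[0] = complex(sum(x))
    for k in range(1, q):
        acc = sum(xs[n - 1] * Ws[(k - n) % (q - 1)] for n in range(1, q))
        Y[pow(g, k, q)] = x[0] + acc
    return Y


q, g = 23, 5
x = [float((3 * n * n + 1) % 7) for n in range(q)]
Y1, Y2 = dft_direct(x), dft_by_cyclic_convolution(x, g)
assert all(abs(a - b) < 1e-9 for a, b in zip(Y1, Y2))
```

For q = 23 and g = 5 the second half of the kernel is the complex conjugate of the first half, which is the symmetry that halves the number of real multiplications.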

If Mersenne number transforms are used to carry out the computation of eqn. 31, a transform length of $2p$ must be used because $(q-1)$ is an even number. The root of unity of order $2p$ is $-2$. Table 1 shows some possible DFT lengths and the corresponding Mersenne numbers that can be used for the computation. Eqns. 20 and 21 can be used to calculate the MNT of the sequence $\{x_0, x_1, \ldots, x_{q-2}\}$. The method employed to calculate the MNT of $\{W_0, W_1, \ldots, W_{q-2}\}$ is unimportant, since this result should be calculated before the construction of the DFT processor. However, it is interesting to point out that if a two-dimensional formulation similar to eqn. 18 is used, the transformed results of the Ws corresponding to eqn. 20 are purely real and the results corresponding to eqn. 21 are purely imaginary. Note also that, due to the symmetry property of the DFT, only the first half of the length-$(q-1)$ inverse transform need be computed. The other half can be obtained by taking the conjugate of the first-half inverse transform. Hence a total of two length-$p$ inverse Mersenne number transforms are enough for the computation of a length-$(2p+1)$ DFT.

Table 1: Possible basic DFT lengths computed by MNT

DFT length $q$    MNT length $2p$    Modulo $2^p - 1$
7                 6                  $2^{3} - 1$
11                10                 $2^{5} - 1$
23                22                 $2^{11} - 1$
47                46                 $2^{23} - 1$
59                58                 $2^{29} - 1$
83                82                 $2^{41} - 1$
107               106                $2^{53} - 1$

Let us consider, for example, the DFT of the sequence $\{x(0), x(1), \ldots, x(22)\}$, i.e. $q = 23$. In this case five is a primitive root which can be used to generate all nonzero elements inside the field modulo 23. Hence eqn. 31 becomes

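$$\{x(14), x(12), x(7), x(6), x(15), x(3), x(19), x(13), x(21), x(18), x(22), x(9), x(11), x(16), x(17), x(8), x(20), x(4), x(10), x(2), x(5), x(1)\}$$

$$\circledast\ \{W_0^{1}, W_0^{5}, W_0^{2}, W_0^{10}, W_0^{4}, W_0^{20}, W_0^{8}, W_0^{17}, W_0^{16}, W_0^{11}, W_0^{9}, W_0^{1*}, W_0^{5*}, W_0^{2*}, W_0^{10*}, W_0^{4*}, W_0^{20*}, W_0^{8*}, W_0^{17*}, W_0^{16*}, W_0^{11*}, W_0^{9*}\}$$

From eqns. 19-21, we obtain

$$\mathbf{X}_0 = \mathbf{A}\mathbf{x}_0 + \mathbf{A}\mathbf{x}_1$$

$$\mathbf{X}_1 = \mathbf{A}\mathbf{x}_0 - \mathbf{A}\mathbf{x}_1$$

where

$$\mathbf{A} = \left[\begin{array}{ccccccccccc}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 2^{2} & 2^{4} & 2^{6} & 2^{8} & 2^{10} & 2^{1} & 2^{3} & 2^{5} & 2^{7} & 2^{9} \\
1 & 2^{4} & 2^{8} & 2^{1} & 2^{5} & 2^{9} & 2^{2} & 2^{6} & 2^{10} & 2^{3} & 2^{7} \\
1 & 2^{6} & 2^{1} & 2^{7} & 2^{2} & 2^{8} & 2^{3} & 2^{9} & 2^{4} & 2^{10} & 2^{5} \\
1 & 2^{8} & 2^{5} & 2^{2} & 2^{10} & 2^{7} & 2^{4} & 2^{1} & 2^{9} & 2^{6} & 2^{3} \\
1 & 2^{10} & 2^{9} & 2^{8} & 2^{7} & 2^{6} & 2^{5} & 2^{4} & 2^{3} & 2^{2} & 2^{1} \\
1 & 2^{1} & 2^{2} & 2^{3} & 2^{4} & 2^{5} & 2^{6} & 2^{7} & 2^{8} & 2^{9} & 2^{10} \\
1 & 2^{3} & 2^{6} & 2^{9} & 2^{1} & 2^{4} & 2^{7} & 2^{10} & 2^{2} & 2^{5} & 2^{8} \\
1 & 2^{5} & 2^{10} & 2^{4} & 2^{9} & 2^{3} & 2^{8} & 2^{2} & 2^{7} & 2^{1} & 2^{6} \\
1 & 2^{7} & 2^{3} & 2^{10} & 2^{6} & 2^{2} & 2^{9} & 2^{5} & 2^{1} & 2^{8} & 2^{4} \\
1 & 2^{9} & 2^{7} & 2^{5} & 2^{3} & 2^{1} & 2^{10} & 2^{8} & 2^{6} & 2^{4} & 2^{2}
\end{array}\right]$$

$$\mathbf{x}_0 = [x(14), x(7), x(15), x(19), x(21), x(22), x(11), x(17), x(20), x(10), x(5)]^T$$

and

$$\mathbf{x}_1 = [x(9), x(16), x(8), x(4), x(2), x(1), x(12), x(6), x(3), x(13), x(18)]^T$$

Similarly

$$\mathbf{w}_0 = [W_0^{1}, W_0^{2}, W_0^{4}, W_0^{8}, W_0^{16}, W_0^{9}, W_0^{5*}, W_0^{10*}, W_0^{20*}, W_0^{17*}, W_0^{11*}]^T$$

$$\mathbf{w}_1 = [W_0^{1*}, W_0^{2*}, W_0^{4*}, W_0^{8*}, W_0^{16*}, W_0^{9*}, W_0^{5}, W_0^{10}, W_0^{20}, W_0^{17}, W_0^{11}]^T$$

and

$$\mathbf{W}_0 = \mathbf{A}\mathbf{w}_0 + \mathbf{A}\mathbf{w}_1$$

$$\mathbf{W}_1 = \mathbf{A}\mathbf{w}_0 - \mathbf{A}\mathbf{w}_1$$

In order to use modulo arithmetic, the $W_0^n$ terms have to be normalised to integers. Multiplying these terms by 10 and rounding off the results to integers, we obtain (modulo $2^{11} - 1$)

$$\mathbf{W}_0 = [2039, 38, 990, 264, 1569, 189, 1941, 2, 716, 571, 89]^T$$

$$\mathbf{W}_1 = j\,[1999, 1364, 542, 403, 1500, 1075, 441, 982, 1742, 952, 1216]^T$$

It can be seen that the vector $\mathbf{W}_0$ is purely real and the vector $\mathbf{W}_1$ is purely imaginary. Making term-by-term multiplications of $\mathbf{X}_0$, $\mathbf{X}_1$ and $\mathbf{W}_0$, $\mathbf{W}_1$, respectively, we obtain $\mathbf{Y}_0$ and $\mathbf{Y}_1$; hence $\mathbf{Y}_0$ should be purely real and $\mathbf{Y}_1$ should be purely imaginary. In a manner similar to the forward transform, the inverse transform can be computed by just two 11-point inverse Mersenne transforms.

It is well known that the Mersenne number transforms are limited by word-length and sequence-length constraints. A possible method to resolve this problem is to choose a suitable word length first and then use suitable multidimensional techniques to obtain higher sequence lengths. Consider, once again, the discrete Fourier transform length to be a prime number q as before; then $(q-1)$ must be a composite number, and let $(q-1) = N_2(2p)$. A q-point DFT can always be converted into a cyclic convolution of $(q-1)$ points. This $(q-1)$-point cyclic convolution can also be converted into a two-dimensional cyclic convolution of lengths $N_2$ and $2p$, respectively, if $N_2$ and $2p$ are relatively prime. The Mersenne number transform can then be applied to the dimension of length $2p$. Notice that the conjugate property is still maintained in this dimension for the Mersenne number transformed results of the Ws [29]. The other dimension can be computed by some efficient short-convolution algorithms such as those by Winograd [24] or Agarwal and Cooley [23].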
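Returning to the $q = 23$ example, the integer kernel vectors can be recomputed with a few lines of Python. The sketch below assumes the index map $n = 11 n_1 + 2 n_2 \pmod{22}$ for splitting the 22-point kernel into two 11-point vectors, an 11-point transform matrix with root $4 = (-2)^2$ modulo $2^{11} - 1 = 2047$, and a scale factor of 10 before rounding. All names (`kernel`, `scaled`, `mnt`, `combine`) are illustrative and not taken from the paper.

```python
import cmath

Q, G, P = 23, 5, 11        # DFT length, primitive root, Mersenne exponent
M = 2 ** P - 1             # modulus 2047
SCALE = 10                 # scaling applied before rounding to integers

# -2 has order 2p = 22 modulo 2^11 - 1.
assert pow(-2, 22, M) == 1 and pow(-2, 11, M) != 1 and pow(-2, 2, M) != 1

w_root = cmath.exp(-2j * cmath.pi / Q)
# Kernel sequence W_n = W0^(g^n), n = 0 .. 21.
kernel = [w_root ** pow(G, n, Q) for n in range(Q - 1)]


def scaled(z):
    """Scale and round a complex number to integer (real, imaginary) parts."""
    return round(SCALE * z.real), round(SCALE * z.imag)


# n1 = 0 selects w0, n1 = 1 selects w1 under n = 11*n1 + 2*n2 (mod 22).
w0 = [scaled(kernel[(2 * n2) % 22]) for n2 in range(P)]
w1 = [scaled(kernel[(P + 2 * n2) % 22]) for n2 in range(P)]

# 11-point transform matrix: A[k][n] = 4^(n*k) mod 2047.
A = [[pow(4, n * k, M) for n in range(P)] for k in range(P)]


def mnt(vec):
    """Multiply the matrix A by an integer vector, modulo M."""
    return [sum(a * v for a, v in zip(row, vec)) % M for row in A]


def combine(sign):
    """Return real and imaginary parts of A*w0 + sign * A*w1, modulo M."""
    re = mnt([a[0] + sign * b[0] for a, b in zip(w0, w1)])
    im = mnt([a[1] + sign * b[1] for a, b in zip(w0, w1)])
    return re, im


W0_re, W0_im = combine(+1)
W1_re, W1_im = combine(-1)
print(W0_re, W0_im)
print(W1_re, W1_im)
```

Under these assumptions the real part of the first vector evaluates to 2039, 38, 990, 264, 1569, 189, 1941, 2, 716, 571, 89 with a zero imaginary part, and the second vector has a zero real part with an imaginary part beginning 1999, 1364, 542, 403, 1500. This is the purely real and purely imaginary structure described in the text.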

For example, a length-67 DFT can be computed by a 66-point cyclic convolution which can be converted into a 3 × 22 two-dimensional cyclic convolution. The 22-point cyclic convolution can be computed by MNT and the three-point cyclic convolution can be computed by one of the short cyclic convolution algorithms [23, 24]. For the computation of a three-point cyclic convolution, four multiplications are required; hence 1.33 real multiplications/point are required for the computation of a 67-point DFT.
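The 66-point to 3 × 22 conversion can be illustrated with a short Python sketch. It assumes the Chinese-remainder index map $n \to (n \bmod 3,\ n \bmod 22)$, which is one standard way of realising such a mapping, and uses arbitrary integer test data. The helper names are hypothetical, and both convolutions are evaluated directly rather than by an MNT or a short-convolution algorithm.

```python
from math import gcd


def cyclic_conv(a, b):
    """One-dimensional cyclic convolution of equal-length sequences."""
    n = len(a)
    return [sum(a[i] * b[(m - i) % n] for i in range(n)) for m in range(n)]


def cyclic_conv_2d(a, b, n1, n2):
    """Two-dimensional cyclic convolution of n1 x n2 arrays (lists of rows)."""
    out = [[0] * n2 for _ in range(n1)]
    for m1 in range(n1):
        for m2 in range(n2):
            out[m1][m2] = sum(
                a[i1][i2] * b[(m1 - i1) % n1][(m2 - i2) % n2]
                for i1 in range(n1) for i2 in range(n2))
    return out


def to_2d(seq, n1, n2):
    """Index map n -> (n mod n1, n mod n2); requires gcd(n1, n2) = 1."""
    assert gcd(n1, n2) == 1 and len(seq) == n1 * n2
    grid = [[0] * n2 for _ in range(n1)]
    for n, v in enumerate(seq):
        grid[n % n1][n % n2] = v
    return grid


N1, N2 = 3, 22
x = [(7 * n + 3) % 11 for n in range(N1 * N2)]       # arbitrary test data
h = [(5 * n * n + 1) % 13 for n in range(N1 * N2)]
y_1d = cyclic_conv(x, h)
y_2d = cyclic_conv_2d(to_2d(x, N1, N2), to_2d(h, N1, N2), N1, N2)
# The mapped 1-D result equals the 2-D cyclic convolution of the mapped inputs.
assert to_2d(y_1d, N1, N2) == y_2d
# Four multiplications for each of the 22 three-point convolutions, over 66 points.
print("multiplications per point:", round(4 * N2 / (N1 * N2), 2))
```

The map works because 3 and 22 are relatively prime, so index differences modulo 66 correspond to componentwise differences modulo 3 and modulo 22.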

Since q is a prime number, it is possible to combine the eight efficient short DFT algorithms, namely those of lengths 2, 3, 4, 5, 7, 8, 9 and 16, to carry out the computation of the discrete Fourier transform using multidimensional formulations [20, 30]. The computation can then be simplified by using Winograd's Fourier-transform algorithm [24] or the prime-factor algorithm [31]. On the other hand, two MNT-based DFT modules can also be combined to form a long DFT. In general any number of short DFTs and MNT-based DFT modules with mutually prime sequence lengths can be combined to facilitate the computation of long discrete Fourier transforms.

8 Conclusion

Due to the lack of an FFT-type algorithm for its computation, the Mersenne number transform has received much less attention than the Fermat number transform over the last few years. However, the hardware design presented in this paper shows that MNT convolvers are simpler in structure than the corresponding FNT convolvers. This is mainly due to the simplicity of MNT arithmetic and also to the fact that the MNT can easily be converted into convolution form. This feature leads to our design for the very fast implementation of Mersenne number transforms.

The nature of the shift-adds in our design has some similarities to the basic operations in distributed arithmetic [32-36]. However, in distributed arithmetic a relatively large ROM size has to be used to form tables for the calculation. In using multi-dimensional techniques to evaluate convolutions (or DFTs), different sets of impulse response coefficients (or Ws in DFTs) are used for the computation of short convolutions (or DFTs), and hence the ROM size may be excessively large for practical implementation. However, for the approach presented in this paper, the ROM or RAM size required to store Hs (or Ws in DFTs) is approximately equal to the sequence length of the convolution (or DFT) and this is small enough for all practical cases.

Although MSI ICs were used in this paper to illustrate the speed achievable under the present design, this new implementation approach is recommended as a basis for architectures for VLSI digital signal processors. In such a realisation, no doubt, the speed of the computation can be further increased.

Mersenne number transforms can be used to implement digital filters and the discrete Fourier transform very effectively. This is due to (i) the fact that the MNT provides a means for constructing digital signal processors without roundoff error, and (ii) the present design, which gives a fast and efficient approach to computing the MNT/IMNT and allows a drastic reduction in processing work-load compared with the FFT-type approach.

9 References

1 POLLARD, J.M.: 'The fast Fourier transform in a finite field', Math. Comput., 1971, 25, pp. 365-374

2 RADER, C.M.: 'Discrete convolution via Mersenne transform', IEEE Trans., 1972, C-21, pp. 1269-1273

3 AGARWAL, R.C., and BURRUS, C.S.: 'Fast convolution using Fermat number transform with application to digital filtering', ibid., 1974, ASSP-22, pp. 87-97

4 SIU, W.C.: 'Number theoretic transform and its applications to digital signal processing'. Proceedings of IERE HK Section Workshop on Adv. Micro. and DSP, Hong Kong, Sept. 1982, pp. 76-101

5 COOLEY, J.W., and TUKEY, J.W.: 'An algorithm for the machine calculation of complex Fourier series', Math. Comput., 1965, 19, pp. 297-301

6 BOGNER, R.E., and CONSTANTINIDES, A.G.: 'Introduction to digital filtering' (John Wiley & Sons, 1975)

7 CAPPELLINI, V., CONSTANTINIDES, A.G., and EMILIANI, P.: 'Digital filters and their applications' (Academic Press, 1978)

8 McCLELLAN, J.H.: 'Hardware realisation of a Fermat number transform', IEEE Trans., 1976, ASSP-24, pp. 189-198

9 LEIBOWITZ, L.M.: 'A simplified binary arithmetic for Fermat number transform', ibid., 1976, ASSP-24, pp. 199-202

10 REED, I.S., and TRUONG, T.K.: 'The use of finite fields to compute convolutions', ibid., 1975, IT-21, pp. 203-213

11 REED, I.S., and TRUONG, T.K.: 'Complex integer convolutions over a direct sum of Galois fields', ibid., 1975, IT-21, pp. 657-661

12 NEVIN, R.L.: 'Application of Rader-Brenner FFT algorithm to number-theoretic transforms', ibid., 1977, 25, pp. 196-198

13 NUSSBAUMER, H.J.: 'Digital filtering using complex Mersenne transforms', IBM J. Res. Dev., 1976, pp. 498-504

14 THOMAS, J.J., LARSON, G.N., and KELLER, J.M.: 'Number theoretic transforms with independent length and moduli', IEEE Trans., 1983, ASSP-31, pp. 215-217

15 AGARWAL, R.C., and BURRUS, C.S.: 'Number theoretic transforms to implement fast digital convolution', Proc. IEEE, 1975, 63, pp. 550-560

16 NUSSBAUMER, H.J.: 'Relative evaluation of various number theoretic transforms for digital filtering applications', IEEE Trans., 1978, ASSP-26, pp. 88-93

17 KRAATS, R.H.V., and VENETSANOPOULOS, A.N.: 'Hardware for two-dimensional digital filtering using Fermat number transforms', ibid., 1982, ASSP-30, pp. 155-162

18 TAYLOR, F.J.: 'Large moduli multipliers for signal processing', ibid., 1981, CAS-28, pp. 731-736

19 JULLIEN, G.A.: 'Implementation of multiplication modulo a prime number, with application to number theoretic transforms', ibid., 1980, C-29, pp. 899-905

20 BURRUS, C.S.: 'Index mapping for multi-dimensional formulation of the DFT and convolution', ibid., 1977, ASSP-25, pp. 239-242

21 SIU, W.C., and CONSTANTINIDES, A.G.: 'On the cyclic convolution of long sequences using number theoretic transform', IEE Proc. G, Electron. Circuits & Syst. (to be published)

22 WINOGRAD, S.: 'Some bilinear forms whose multiplicative complexity depends on the field of constants'. IBM T.J. Watson Res. Ctr., 1975, NY, IBM Res. Rep., RC5669

23 AGARWAL, R.C., and COOLEY, J.W.: 'New algorithms for digital convolution', IEEE Trans., 1977, ASSP-25, pp. 106-124

24 WINOGRAD, S.: 'On computing the discrete Fourier transform', Math. Comput., 1978, 32, pp. 175-199

25 BLUESTEIN, L.I.: 'A linear filtering approach to the computation of discrete Fourier transform', IEEE Trans., 1970, AU-18, pp. 451-455

26 RABINER, L.R., SCHAFER, R.W., and RADER, C.M.: 'The chirp z-transform algorithm', ibid., 1969, AU-17, pp. 86-92

27 DICKSON, L.E.: 'History of the theory of numbers' (Carnegie Institute, Washington, 1919)

28 SIU, W.C., and CONSTANTINIDES, A.G.: 'Very fast discrete Fourier transform'. Paper presented at the IEE seminar on digital filters, London, 7th-8th April, 1983, pp. 1-29

29 SIU, W.C.: 'On the computation of discrete Fourier transform using Fermat number transform'. Research Report, Digital Signal Processing Section, Imperial College of Science and Technology, London, March 1983

30 GOOD, I.J.: 'The relationship between two fast Fourier transforms', IEEE Trans., 1971, C-20, pp. 310-317

31 KOLBA, D.P., and PARKS, T.W.: 'A prime factor FFT algorithm using high-speed convolution', ibid., 1977, ASSP-25, pp. 91-103

32 BURRUS, C.S.: 'Digital filter structures described by distributed arithmetic', ibid., 1977, CAS-24, pp. 674-680

33 CHU, S., and BURRUS, C.S.: 'A prime factor FFT algorithm using distributed arithmetic', ibid., 1982, ASSP-30, pp. 217-226

34 TAN, B.S., and HAWKINS, G.J.: 'Speed-optimised microprocessor implementation of a digital filter', IEE Proc. E, Comput. & Digital Tech., 1981, 128, (3), pp. 85-93

35 SIU, W.C.: 'Microprocessor-based implementation of digital signal processors using distributed arithmetic'. Proceedings, IERE HK Section Workshop on Adv. Micro. & DSP, Hong Kong, Sept. 1982, pp. 148-157

36 PELED, A., and LIU, B.: 'A new hardware realisation of digital filters', IEEE Trans., 1974, ASSP-22, pp. 456-462

Wan-chi Siu received the associateship of Hong Kong Polytechnic in electronic engineering in 1975 and the M.Phil. degree in electronics from the Chinese University of Hong Kong in 1977. From 1975 to 1980, he taught and subsequently became an electronic engineer in the Department of Electronics of the Chinese University of Hong Kong. Since 1980, he has been with the Department of Electronic Engineering of the Hong Kong Polytechnic as a lecturer. He is now on leave from Hong Kong Polytechnic and is with the Department of Electrical Engineering, Imperial College of Science and Technology, England. His research interests are in transform techniques, hardware and software implementations of digital signal processors, microprocessor architectures and fabrication technology. He is a chartered engineer and a member of the IERE and the IEEE.


A.G. Constantinides received the B.Sc.(Eng.) degree with first-class honours in 1965 and the Ph.D. degree from the University of London in 1968 for his research in digital filter design.

In 1969 he was an STL-sponsored research fellow at the City University, London, and later he became a Senior Research Fellow with the British Post Office Research Department. In 1971 he joined the Department of Electrical Engineering of the Imperial College of Science and Technology, London, where he is currently a Reader. He is a co-editor of the books 'Introduction to digital filtering' and 'Digital signal processing', and co-author of the book 'Digital filters and their applications'.

Dr. Constantinides has been an active member of the IEEE, has served as vice-chairman of the Circuit Theory Chapter of the UK and Republic of Ireland, and on its Executive Committee. He has served as the first President of the European Association for Signal Processing (EURASIP) and is the co-chairman of the triennial International conference on digital signal processing (held in Florence, Italy). He is a chartered engineer and a member of the IEE.

Book review

High-speed memory systems
A.V. Pohm and O.P. Agrawal
Prentice-Hall Int., 1983, 244 pp., £20.65
ISBN: 0-8359-2835-7

The authors' stated intention is that this book should serve not only as 'a useful teaching instrument for advanced undergraduates and first year graduate students', but also as a practical guide for 'computer professionals who are involved in the design of memory systems'. However, the range of material covered is not as wide as the title suggests, and some of the topics which are covered are dealt with in unnecessary detail. The principal area of concern is actually cache memory systems, a subject on which the authors have previously published a number of papers. The emphasis is on the basic (and relatively static) principles involved in the design of multilevel stores, rather than on storage devices, the nature and characteristics of which change much more rapidly.

The book starts with a brief history and overview of the design of high-speed storage hierarchies, both of the cache variety, first introduced in the IBM 360 Model 85, and the virtual store variety, first introduced in the Ferranti Atlas. To their credit the authors are well aware of the important contributions which early Manchester University computers made to developments in computer design, and in Chapter 7 a significant amount of space is devoted to the Atlas paging system. More recent Manchester contributions in the area of high-speed memory design, particularly the selective buffering of operands in the MU5 name store, are neglected however.

The basic principles of cache memory design and a number of important terms and parameters are introduced in Chapter 2, while Chapters 3 and 4 deal in detail with various aspects of cache design and factors which affect performance. Chapter 3 is mainly concerned with swapping algorithms (the policy adopted by the system when a write order is encountered), and while the theoretical results presented give valuable insight into system performance, some correlation with practical systems and experimental results would have enhanced this Chapter. Theoretical and practical results are related to good effect in Chapter 4, which deals with hit ratios and factors which affect them.

Chapter 5 deals with a number of aspects of memory hierarchy organisation including both multicache and multiprocessor arrangements, three-level hierarchies, and paged virtual memory systems; it is not obvious from the text that there are any real systems of this type actually in operation.

Chapter 6 is concerned with error-correcting codes and reliability, and while this is an important aspect of memory system design, well presented in the text, no attempt is made to relate the use of error-correcting circuitry to performance. Error correcting takes time, and since this increases memory access times it is important to consider this effect in high-speed systems.

The final chapter presents a more detailed history of the use of storage hierarchies and gives details of systems in current use in a variety of mini, micro and (IBM) mainframe computers. Thus there is no mention of really high performance systems such as the CRAY X-MP or CYBER 205, and what is missing throughout is some correlation between the largely theoretical discussion in earlier chapters and the actual techniques used in practical systems. Although the book contains a lot of useful material and makes a worthwhile contribution to the literature, it lacks coherence generally and omits a number of topics which one might have expected to find under this title.

R.N. IBBETT
