Introduction to Distributed Arithmetic · The direct method involves using dedicated multipliers...

Introduction to Distributed Arithmetic

K. Sridharan, IIT Madras

Distributed Arithmetic (DA)

An efficient technique for calculation of inner product or multiply and accumulate (MAC)The MAC operation is common in Digital Signal Processing Algorithms

What is the direct method of implementing inner products or MAC ?

The direct method involves using dedicated multipliersMultipliers are fast but they consume considerable hardware

An Illustration of MAC OperationThe following expression represents a multiply and accumulate operation

A numerical example

KK xAxAxAy ×++×+×= L2211

∑=

=K

kkk xAyei

1..

20691541171690013446723)22(7820454232

=+−+=×+−×+×+×=

yy

[ ] [ ] )4(67,22,20,4223,45,42,32 =−== KxA

How does distributed arithmetic work?Distributed Arithmetic (DA) is a technique that is bit-serial in nature. It can therefore appear to be slowIt turns out that when the number of elements in a vector is nearly the same as the wordsize, DA is quite fastDA `replaces’ the explicit multiplications by ROM look-ups an efficient technique to implement on Field Programmable Gate Arrays (FPGAs)

What does DA achieve ?

In DA, multiplications are reordered and mixed such that the arithmetic becomes “distributed” through the structure rather than being “lumped”Area savings from using DA can be up to 80% in DSP hardware designsWhile distributed arithmetic technique itself has been around for more than 30 years (Peled and Liu, 1974), interest in this has been revived by the use of Field Programmable Gate Arrays (FPGAs) for DSPWhen DA is implemented in FPGAs, one can take advantage of memory in FPGAs to implement the MAC operation

The Formulation for DAConsider

a. Let xk be an N-bit scaled two’s complement number. In other words,

| xk | < 1xk : {bk0, bk1, bk2……, bk(N-1) }

where bk0 is the sign bitb. We can express xk as c. Substituting (2) in (1),

∑=

=K

kkk xAy

1

∑−

=

−+−=1

10 2

N

n

nknkk bbx

…(1)

…(2)

∑ ∑=

−

=

−⎥⎦

⎤⎢⎣

⎡+−=

K

k

N

n

nknkk bbAy

1

1

10 2

( ) ( )∑∑∑=

−

=

−

=

•+•−=K

k

N

n

nknk

K

kkk bAAby

1

1

110 2 …(3)

Continuing the DA formulation …

[ ]( ) ( ) ( )( ) ( )[ ]( ) ( ) ( )( ) ( )[ ]

( ) ( ) ( )( ) ( )[ ]11

22

11

1212

2222

1221

1111

2112

1111

0220110

222

222

222

−−−

−−

−−−

−−

−−−

−−

•++•+•+

•++•+•+

•++•+•+

•++•+•−=

NKNKKKKK

NN

NN

KK

AbAbAb

AbAbAb

AbAbAb

AbAbAby

L

M

L

L

L

( ) ( )∑ ∑∑=

−

=

−

=⎥⎦

⎤⎢⎣

⎡•+•−=

K

k

N

n

nkkn

K

kkk AbAby

1

1

110 2

( ) ( ) ( ) ( )[ ]∑∑=

−−−

−−

=

•++•+•+•−=K

k

NNkkkkkk

K

kkk bAbAbAAby

1

)1()1(

22

11

10 222 L

…(3)

Expanding this part

Further simplification leads to [ ]( ) ( ) ( )( ) ( )[ ]( ) ( ) ( )( ) ( )[ ]

( ) ( ) ( )( ) ( )[ ]11

22

11

1212

2222

1221

1111

2112

1111

0220110

222

222

222

−−−

−−

−−−

−−

−−−

−−

•++•+•+

•++•+•+

•++•+•+

•++•+•−=

NKNKKKKK

NN

NN

KK

AbAbAb

AbAbAb

AbAbAb

AbAbAby

L

M

L

L

L

[ ]( ) ( ) ( )[ ]( ) ( ) ( )[ ]

( )( ) ( )( ) ( )( )[ ] ( )11212111

22222112

11221111

0220110

2

2

2

−−−−−

−

−

•++•+•+

•++•+•+

•++•+•+

•++•+•−=

NKNKNN

KK

KK

KK

AbAbAb

AbAbAb

AbAbAb

AbAbAby

L

M

L

L

L

The rewrite showing interchange of sum .. [ ]( ) ( ) ( )[ ]( ) ( ) ( )[ ]

( )( ) ( )( ) ( )( )[ ] ( )11212111

22222112

11221111

0220110

2

2

2

−−−−−

−

−

•++•+•+

•++•+•+

•++•+•+

•++•+•−=

NKNKNN

KK

KK

KK

AbAbAb

AbAbAb

AbAbAb

AbAbAby

L

M

L

L

L

[ ]∑∑−

=

−

=

•++•+•+•−=1

1221

10 2)(

N

n

nKKnnkn

K

kkk AbAbAbAby L

∑ ∑∑−

=

−

==⎥⎦

⎤⎢⎣

⎡•+•−=

1

1 110 2)(

N

n

nK

kknk

K

kkk bAbAy …(4)

How is the hardware realization ?Consider the equation (4) rewritten as:

has only 2K possible values

has only 2K possible values

With the sign bit as an input, we can store it in a ROM of size=2*2K

∑ ∑∑−

= =

−

=

−+⎥⎦

⎤⎢⎣

⎡=

1

1 10

1

)(2N

n

K

kkk

nK

kknk bAbAy

⎥⎦

⎤⎢⎣

⎡∑=

K

kknkbA

1

∑=

−K

kkk bA

10)(

ExampleLet number of taps K be 4The fixed coefficients are A1 =0.72, A2= -0.3, A3 = 0.95, A4 = 0.11

We need 2K = 24 = 16-words ROM

∑ ∑∑−

= =

−

=

−+⎥⎦

⎤⎢⎣

⎡=

1

1 10

1

)(2N

n

K

kkk

nK

kknk bAbAy …(4)

ROM: Address and Contentsb1n b2n b3n b4n Contents0 0 0 0 00 0 0 1 A4=0.110 0 1 0 A3=0.950 0 1 1 A3+ A4=1.060 1 0 0 A2=-0.300 1 0 1 A2+ A4= -0.190 1 1 0 A2+ A3=0.650 1 1 1 A2+ A3 + A4=0.751 0 0 0 A1=0.721 0 0 1 A1+ A4=0.831 0 1 0 A1+ A3=1.671 0 1 1 A1+ A3 + A4=1.781 1 0 0 A1+ A2=0.421 1 0 1 A1+ A2 + A4=0.531 1 1 0 A1+ A2 + A3=1.371 1 1 1 A1+ A2 + A3 + A4=1.48

nnnnk

knk bAbAbAbAbA 44332211

4

1+++=⎥

⎦

⎤⎢⎣

⎡∑=

Plus and Minus PointsThe architecture has accomplished MAC without an explicit multiplierThe size of ROM, however,grows exponentially with each added input address lineFor each element in a vector, we have an address line. So we’ll have K address lines If K is 16, this implies 216 (i.e., 64K) of ROM

Offset binary coding to reduce ROM size

)]([21

kkk xxx −−=

∑−

=

−−− ++−=−1

1

)1(0 22

N

n

Nnknkk bbx

∑−

=

−+−=1

10 2

N

n

nknkk bbx

2‘s-complement

( ) ( ) ⎥⎦

⎤⎢⎣

⎡−−+−−= ∑

−

=

−−−1

1

)1(00 22

21 N

n

Nnknknkkk bbbbx

Rewriting xk differently, we have

Define: Offset Code

Finally

( ) ( ) ⎥⎦

⎤⎢⎣

⎡ −−+−−= ∑−

=

−−−1

1

)1(00 22

21 N

n

Nnknknkkk bbbbx

⎥⎦

⎤⎢⎣

⎡ −= ∑−

=

−−−1

0

)1(2221 N

n

Nnknk cx

{ }1,1{0,0,

)(−∈

=≠

−−−

= knknkn

knknkn cwhere

nn

bbbb

c

Using the new xk we have

Substitute the new xk in ∑=

=K

kkk xAy

1

∑ ∑=

−−−−

=⎥⎦

⎤⎢⎣

⎡−=

K

k

Nnkn

N

nk cAy

1

)1(1

022

21

⎥⎦

⎤⎢⎣

⎡ −= ∑−

=

−−−1

0

)1(2221 N

n

Nnknk cx

)1(

11

1

02

212

21 −−

==

−

=

− ∑∑∑ −= NK

kk

K

k

N

n

nknk AcAy

)1(

1

1

0 12

212

21 −−

=

−

= =

− ∑∑ ∑ −= NK

kk

N

n

K

k

nknk AcAy …(9)

The New Formulation in Offset Code)1(

1

1

0 12

212

21 −−

=

−

= =

− ∑∑ ∑ −= NK

kk

N

n

K

k

nknk AcAy

and( ) ∑=

=K

kknkKnnn cAcccQ

121 2

1L ∑

=

−=K

kkAQ

121)0(

If we let

Constant

( ) ( )∑−

=

−−− +=1

0

)1(21 022

N

n

NnKnnn QcccQy L

The Gain: have reduced storage to 8 rowsb1n b2n b3n b4n c1n c2n c3n c4n Contents0 0 0 0

101010101010101

-1/2 (A1+ A2 + A3 + A4) = -0.740 0 0

-1-1-1-1-1-1-1-1-1-1-11111

-1/2 (A1+ A2 + A3 - A4) = - 0.630 0 1

1

1

1

-1

1

-1-1-11111

-1-1-1

-1/2 (A1+ A2 - A3 + A4) = 0.210 0 1

-11

-1

11

11

-1-111

-1-111

-1/2 (A1+ A2 - A3 - A4) = 0.320 1 0

-1-1

1

1-11

-11

-11

-11

-1

11

-1/2 (A1 - A2 + A3 + A4) = -1.040 1 0 -1/2 (A1 - A2 + A3 - A4) = - 0.930 1 1 -1/2 (A1 - A2 - A3 + A4) = - 0.090 1 1 -1/2 (A1 - A2 - A3 - A4) = 0.021 0 0 -1/2 (-A1+ A2 + A3 + A4) = -0.021 0 0 -1/2 (-A1+ A2 + A3 - A4) = 0.091 0 1 -1/2 (-A1+ A2 - A3 + A4) = 0.931 0 1 -1/2 (-A1+ A2 - A3 - A4) = 1.041 1 0

1

1-11

-1/2 (-A1 - A2 + A3 + A4) = - 0.321 1 0 -1/2 (-A1 - A2 + A3 - A4) = - 0.211 1 1 -1/2 (-A1 - A2 - A3 + A4) = 0.631 1 1 -1/2 (-A1 - A2 - A3 - A4) = 0.74

Inverse symm

etry

DA Hardware with Offset Binary Coding

x1 selects between the two symmetric halves

Ts indicates when the sign bit arrives

How to reduce ROM size further ?

One approach to reduce the ROM size is by decomposing the ROM

In particular, can divide the N address bits of the ROM can be divided into (N/K) groups of K bits

So a ROM of size 2N can be divided into N/K ROMs of size 2K

Will need an adder to add the outputs of these ROMs and a multi-input accumulator

This is an active area of research .

Comparisons in initial years of DATraditional comparisons are with multiplier-based solutions for problems pertaining to filtersWhen DA was devised (in the 1970s), the comparisons given were in terms of number of TTL ICs required for mechanization of a certain type of filterIn particular, for an eighth order digital filter operating at a word rate close to 1 MHz, 72 ICs with a total power consumption of about 30 W was stated with the DA approach while 240 ICs with a power dissipation of 96 W was indicated for a multiplier-based solution

Current Status: DA vs Mplr on FPGAA study of computation of Y = aX1 + bX2 + cX3 was performed with code developed in Verilog. Elements were chosen to have 8-bit sizeA Spartan-XC3S500E based synthesis was carried out in Xilinx ISE 10.1When built-in multipliers were used, the resource consumption was 35 slices and 3 multipliers. The combinational path delay was 15.17 ns. For the DA-based solution, 47 slices were used and the max frequency of operation was 142.572 MHz.Power consumption (obtained using XPower) for DA-based approach was slightly less than that for the multiplier-based solution

ReferencesA. Peled and B. Liu, A new hardware realization of digital filters, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-22, No. 6, pp. 456-462, Dec. 1974S. A. White, Applications of distributed arithmetic to digital signal processing: A tutorial review,” IEEE ASSP Magazine, July, 1989URL http://staff.kfupm.edu.sa/ITC/miali/DistributedArithmetic.pptXilinx Application Note, The role of distributed arithmetic in FPGA-based signal processing, www.xilinx.com/appnotes/theory1.pdf

Introduction to Distributed Arithmetic · The direct method involves using dedicated multipliers...

Documents

Transcript of Introduction to Distributed Arithmetic · The direct method involves using dedicated multipliers...