Introduction to Distributed Arithmetic
Transcript of Introduction to Distributed Arithmetic
Distributed Arithmetic (DA)
An efficient technique for calculation of inner product or multiply and accumulate (MAC)The MAC operation is common in Digital Signal Processing Algorithms
What is the direct method of implementing inner products or MAC ?
The direct method involves using dedicated multipliersMultipliers are fast but they consume considerable hardware
An Illustration of MAC OperationThe following expression represents a multiply and accumulate operation
A numerical example
KK xAxAxAy ×++×+×= L2211
∑=
=K
kkk xAyei
1..
20691541171690013446723)22(7820454232
=+−+=×+−×+×+×=
yy
[ ] [ ] )4(67,22,20,4223,45,42,32 =−== KxA
How does distributed arithmetic work?Distributed Arithmetic (DA) is a technique that is bit-serial in nature. It can therefore appear to be slowIt turns out that when the number of elements in a vector is nearly the same as the wordsize, DA is quite fastDA `replaces’ the explicit multiplications by ROM look-ups an efficient technique to implement on Field Programmable Gate Arrays (FPGAs)
What does DA achieve ?
In DA, multiplications are reordered and mixed such that the arithmetic becomes “distributed” through the structure rather than being “lumped”Area savings from using DA can be up to 80% in DSP hardware designsWhile distributed arithmetic technique itself has been around for more than 30 years (Peled and Liu, 1974), interest in this has been revived by the use of Field Programmable Gate Arrays (FPGAs) for DSPWhen DA is implemented in FPGAs, one can take advantage of memory in FPGAs to implement the MAC operation
The Formulation for DAConsider
a. Let xk be an N-bit scaled two’s complement number. In other words,
| xk | < 1xk : {bk0, bk1, bk2……, bk(N-1) }
where bk0 is the sign bitb. We can express xk as c. Substituting (2) in (1),
∑=
=K
kkk xAy
1
∑−
=
−+−=1
10 2
N
n
nknkk bbx
…(1)
…(2)
∑ ∑=
−
=
−⎥⎦
⎤⎢⎣
⎡+−=
K
k
N
n
nknkk bbAy
1
1
10 2
( ) ( )∑∑∑=
−
=
−
=
•+•−=K
k
N
n
nknk
K
kkk bAAby
1
1
110 2 …(3)
Continuing the DA formulation …
[ ]( ) ( ) ( )( ) ( )[ ]( ) ( ) ( )( ) ( )[ ]
( ) ( ) ( )( ) ( )[ ]11
22
11
1212
2222
1221
1111
2112
1111
0220110
222
222
222
−−−
−−
−−−
−−
−−−
−−
•++•+•+
•++•+•+
•++•+•+
•++•+•−=
NKNKKKKK
NN
NN
KK
AbAbAb
AbAbAb
AbAbAb
AbAbAby
L
M
L
L
L
( ) ( )∑ ∑∑=
−
=
−
=⎥⎦
⎤⎢⎣
⎡•+•−=
K
k
N
n
nkkn
K
kkk AbAby
1
1
110 2
( ) ( ) ( ) ( )[ ]∑∑=
−−−
−−
=
•++•+•+•−=K
k
NNkkkkkk
K
kkk bAbAbAAby
1
)1()1(
22
11
10 222 L
…(3)
Expanding this part
Further simplification leads to [ ]( ) ( ) ( )( ) ( )[ ]( ) ( ) ( )( ) ( )[ ]
( ) ( ) ( )( ) ( )[ ]11
22
11
1212
2222
1221
1111
2112
1111
0220110
222
222
222
−−−
−−
−−−
−−
−−−
−−
•++•+•+
•++•+•+
•++•+•+
•++•+•−=
NKNKKKKK
NN
NN
KK
AbAbAb
AbAbAb
AbAbAb
AbAbAby
L
M
L
L
L
[ ]( ) ( ) ( )[ ]( ) ( ) ( )[ ]
( )( ) ( )( ) ( )( )[ ] ( )11212111
22222112
11221111
0220110
2
2
2
−−−−−
−
−
•++•+•+
•++•+•+
•++•+•+
•++•+•−=
NKNKNN
KK
KK
KK
AbAbAb
AbAbAb
AbAbAb
AbAbAby
L
M
L
L
L
The rewrite showing interchange of sum .. [ ]( ) ( ) ( )[ ]( ) ( ) ( )[ ]
( )( ) ( )( ) ( )( )[ ] ( )11212111
22222112
11221111
0220110
2
2
2
−−−−−
−
−
•++•+•+
•++•+•+
•++•+•+
•++•+•−=
NKNKNN
KK
KK
KK
AbAbAb
AbAbAb
AbAbAb
AbAbAby
L
M
L
L
L
[ ]∑∑−
=
−
=
•++•+•+•−=1
1221
10 2)(
N
n
nKKnnkn
K
kkk AbAbAbAby L
∑ ∑∑−
=
−
==⎥⎦
⎤⎢⎣
⎡•+•−=
1
1 110 2)(
N
n
nK
kknk
K
kkk bAbAy …(4)
How is the hardware realization ?Consider the equation (4) rewritten as:
has only 2K possible values
has only 2K possible values
With the sign bit as an input, we can store it in a ROM of size=2*2K
∑ ∑∑−
= =
−
=
−+⎥⎦
⎤⎢⎣
⎡=
1
1 10
1
)(2N
n
K
kkk
nK
kknk bAbAy
⎥⎦
⎤⎢⎣
⎡∑=
K
kknkbA
1
∑=
−K
kkk bA
10)(
ExampleLet number of taps K be 4The fixed coefficients are A1 =0.72, A2= -0.3, A3 = 0.95, A4 = 0.11
We need 2K = 24 = 16-words ROM
∑ ∑∑−
= =
−
=
−+⎥⎦
⎤⎢⎣
⎡=
1
1 10
1
)(2N
n
K
kkk
nK
kknk bAbAy …(4)
ROM: Address and Contentsb1n b2n b3n b4n Contents0 0 0 0 00 0 0 1 A4=0.110 0 1 0 A3=0.950 0 1 1 A3+ A4=1.060 1 0 0 A2=-0.300 1 0 1 A2+ A4= -0.190 1 1 0 A2+ A3=0.650 1 1 1 A2+ A3 + A4=0.751 0 0 0 A1=0.721 0 0 1 A1+ A4=0.831 0 1 0 A1+ A3=1.671 0 1 1 A1+ A3 + A4=1.781 1 0 0 A1+ A2=0.421 1 0 1 A1+ A2 + A4=0.531 1 1 0 A1+ A2 + A3=1.371 1 1 1 A1+ A2 + A3 + A4=1.48
nnnnk
knk bAbAbAbAbA 44332211
4
1+++=⎥
⎦
⎤⎢⎣
⎡∑=
Plus and Minus PointsThe architecture has accomplished MAC without an explicit multiplierThe size of ROM, however,grows exponentially with each added input address lineFor each element in a vector, we have an address line. So we’ll have K address lines If K is 16, this implies 216 (i.e., 64K) of ROM
Offset binary coding to reduce ROM size
)]([21
kkk xxx −−=
∑−
=
−−− ++−=−1
1
)1(0 22
N
n
Nnknkk bbx
∑−
=
−+−=1
10 2
N
n
nknkk bbx
2‘s-complement
( ) ( ) ⎥⎦
⎤⎢⎣
⎡−−+−−= ∑
−
=
−−−1
1
)1(00 22
21 N
n
Nnknknkkk bbbbx
Rewriting xk differently, we have
Define: Offset Code
Finally
( ) ( ) ⎥⎦
⎤⎢⎣
⎡ −−+−−= ∑−
=
−−−1
1
)1(00 22
21 N
n
Nnknknkkk bbbbx
⎥⎦
⎤⎢⎣
⎡ −= ∑−
=
−−−1
0
)1(2221 N
n
Nnknk cx
{ }1,1{0,0,
)(−∈
=≠
−−−
= knknkn
knknkn cwhere
nn
bbbb
c
Using the new xk we have
Substitute the new xk in ∑=
=K
kkk xAy
1
∑ ∑=
−−−−
=⎥⎦
⎤⎢⎣
⎡−=
K
k
Nnkn
N
nk cAy
1
)1(1
022
21
⎥⎦
⎤⎢⎣
⎡ −= ∑−
=
−−−1
0
)1(2221 N
n
Nnknk cx
)1(
11
1
02
212
21 −−
==
−
=
− ∑∑∑ −= NK
kk
K
k
N
n
nknk AcAy
)1(
1
1
0 12
212
21 −−
=
−
= =
− ∑∑ ∑ −= NK
kk
N
n
K
k
nknk AcAy …(9)
The New Formulation in Offset Code)1(
1
1
0 12
212
21 −−
=
−
= =
− ∑∑ ∑ −= NK
kk
N
n
K
k
nknk AcAy
and( ) ∑=
=K
kknkKnnn cAcccQ
121 2
1L ∑
=
−=K
kkAQ
121)0(
If we let
Constant
( ) ( )∑−
=
−−− +=1
0
)1(21 022
N
n
NnKnnn QcccQy L
The Gain: have reduced storage to 8 rowsb1n b2n b3n b4n c1n c2n c3n c4n Contents0 0 0 0
101010101010101
-1/2 (A1+ A2 + A3 + A4) = -0.740 0 0
-1-1-1-1-1-1-1-1-1-1-11111
-1/2 (A1+ A2 + A3 - A4) = - 0.630 0 1
1
1
1
-1
1
-1-1-11111
-1-1-1
-1/2 (A1+ A2 - A3 + A4) = 0.210 0 1
-11
-1
11
11
-1-111
-1-111
-1/2 (A1+ A2 - A3 - A4) = 0.320 1 0
-1-1
1
1-11
-11
-11
-11
-1
11
-1/2 (A1 - A2 + A3 + A4) = -1.040 1 0 -1/2 (A1 - A2 + A3 - A4) = - 0.930 1 1 -1/2 (A1 - A2 - A3 + A4) = - 0.090 1 1 -1/2 (A1 - A2 - A3 - A4) = 0.021 0 0 -1/2 (-A1+ A2 + A3 + A4) = -0.021 0 0 -1/2 (-A1+ A2 + A3 - A4) = 0.091 0 1 -1/2 (-A1+ A2 - A3 + A4) = 0.931 0 1 -1/2 (-A1+ A2 - A3 - A4) = 1.041 1 0
1
1-11
-1/2 (-A1 - A2 + A3 + A4) = - 0.321 1 0 -1/2 (-A1 - A2 + A3 - A4) = - 0.211 1 1 -1/2 (-A1 - A2 - A3 + A4) = 0.631 1 1 -1/2 (-A1 - A2 - A3 - A4) = 0.74
Inverse symm
etry
DA Hardware with Offset Binary Coding
x1 selects between the two symmetric halves
Ts indicates when the sign bit arrives
How to reduce ROM size further ?
One approach to reduce the ROM size is by decomposing the ROM
In particular, can divide the N address bits of the ROM can be divided into (N/K) groups of K bits
So a ROM of size 2N can be divided into N/K ROMs of size 2K
Will need an adder to add the outputs of these ROMs and a multi-input accumulator
This is an active area of research .
Comparisons in initial years of DATraditional comparisons are with multiplier-based solutions for problems pertaining to filtersWhen DA was devised (in the 1970s), the comparisons given were in terms of number of TTL ICs required for mechanization of a certain type of filterIn particular, for an eighth order digital filter operating at a word rate close to 1 MHz, 72 ICs with a total power consumption of about 30 W was stated with the DA approach while 240 ICs with a power dissipation of 96 W was indicated for a multiplier-based solution
Current Status: DA vs Mplr on FPGAA study of computation of Y = aX1 + bX2 + cX3 was performed with code developed in Verilog. Elements were chosen to have 8-bit sizeA Spartan-XC3S500E based synthesis was carried out in Xilinx ISE 10.1When built-in multipliers were used, the resource consumption was 35 slices and 3 multipliers. The combinational path delay was 15.17 ns. For the DA-based solution, 47 slices were used and the max frequency of operation was 142.572 MHz.Power consumption (obtained using XPower) for DA-based approach was slightly less than that for the multiplier-based solution
ReferencesA. Peled and B. Liu, A new hardware realization of digital filters, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-22, No. 6, pp. 456-462, Dec. 1974S. A. White, Applications of distributed arithmetic to digital signal processing: A tutorial review,” IEEE ASSP Magazine, July, 1989URL http://staff.kfupm.edu.sa/ITC/miali/DistributedArithmetic.pptXilinx Application Note, The role of distributed arithmetic in FPGA-based signal processing, www.xilinx.com/appnotes/theory1.pdf