Distributed Arithmetic: Implementations and Applications A Tutorial.
![Page 1: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/1.jpg)
Distributed Arithmetic: Implementations and Applications
A Tutorial
![Page 2: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/2.jpg)
Distributed Arithmetic (DA) [Peled and Liu, 1974]
An efficient technique for calculating a sum of products, also known as a vector dot product, inner product, or multiply-and-accumulate (MAC) operation.
The MAC operation is common to almost all Digital Signal Processing algorithms.
![Page 3: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/3.jpg)
So Why Use DA?
The advantages of DA are best exploited in data-path circuit design.
Area savings from using DA can be up to 80%, and are seldom less than 50%, in digital signal processing hardware designs.
DA is an old technique that has been revived by the widespread use of Field Programmable Gate Arrays (FPGAs) for Digital Signal Processing (DSP).
DA implements the MAC efficiently using the basic building blocks of FPGAs, the Look-Up Tables (LUTs).
![Page 4: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/4.jpg)
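An Illustration of MAC Operation
The following expression represents a multiply and accumulate operation:

$$y = A_1 x_1 + A_2 x_2 + \cdots + A_K x_K, \quad \text{i.e.} \quad y = \sum_{k=1}^{K} A_k x_k$$

A numerical example with $K = 4$, using the operands that appear in the worked arithmetic, $A = [32, 45, 78, 23]$ and $x = [42, 20, -22, 67]$:

$$y = 32 \times 42 + 45 \times 20 + 78 \times (-22) + 23 \times 67$$

$$y = 1344 + 900 - 1716 + 1541 = 2069$$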
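The numerical example can be checked with a few lines of Python. This is only a sketch of the plain MAC loop (not DA yet); the function name `mac` is an illustrative choice, and the operands are the ones read off the worked arithmetic above.

```python
def mac(coeffs, inputs):
    """Multiply-and-accumulate: y = sum over k of A_k * x_k."""
    if len(coeffs) != len(inputs):
        raise ValueError("A and x must have the same length K")
    acc = 0
    for a_k, x_k in zip(coeffs, inputs):
        acc += a_k * x_k          # one multiply and one add per element
    return acc

A = [32, 45, 78, 23]              # constants
x = [42, 20, -22, 67]             # input variables
y = mac(A, x)
print(y)                          # 2069
```

A direct implementation like this needs one multiplier per term (or K multiplier cycles), which is the cost DA sets out to avoid.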
![Page 5: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/5.jpg)
A Few Points about the MAC
Consider this:

$$y = \sum_{k=1}^{K} A_k x_k$$

Note a few points:
$A = [A_1, A_2, \ldots, A_K]$ is a vector of "constant" values.
$x = [x_1, x_2, \ldots, x_K]$ is a vector of input "variables".
Each $A_k$ is M bits wide.
Each $x_k$ is N bits wide.
$y$ should be wide enough to accommodate the result.
![Page 6: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/6.jpg)
A Possible Hardware (NOT DA Yet!!!)
Let $A = [C_1, C_2, C_3, C_4]$ and $x = [A, B, C, D]$ ($K = 4$).
Each scaling accumulator calculates $A_i \times x_i$.
Figure labels: multi-bit AND gate; registers to hold the sum of partial products; shift registers; shift right; adder/subtractor.
![Page 7: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/7.jpg)
How does DA work?
The "basic" DA technique is bit-serial in nature.
DA is basically a bit-level rearrangement of the multiply and accumulate operation.
DA hides the explicit multiplications behind ROM look-ups, which makes it an efficient technique to implement on Field Programmable Gate Arrays (FPGAs).
![Page 8: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/8.jpg)
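Moving Closer to Distributed Arithmetic
Consider once again

$$y = \sum_{k=1}^{K} A_k x_k \tag{1}$$

a. Let $x_k$ be an N-bit scaled two's complement number, i.e. $|x_k| < 1$, with bits
$x_k : \{b_{k0}, b_{k1}, b_{k2}, \ldots, b_{k(N-1)}\}$
where $b_{k0}$ is the sign bit.
b. We can express $x_k$ as

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \tag{2}$$

c. Substituting (2) in (1),

$$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right]$$

$$y = -\sum_{k=1}^{K} b_{k0} A_k + \sum_{k=1}^{K} \sum_{n=1}^{N-1} A_k b_{kn} 2^{-n} \tag{3}$$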
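Equation (2) can be made concrete with a small sketch. The helper names `to_bits` and `from_bits` are illustrative, and the input is assumed to be exactly representable in N bits.

```python
def to_bits(x, n_bits):
    """Bits [b0, b1, ..., b_(N-1)] of a scaled two's complement fraction.

    b0 is the sign bit; x is assumed to satisfy -1 <= x < 1 and to be
    exactly representable with n_bits bits.
    """
    scaled = round(x * 2 ** (n_bits - 1))
    if not -(2 ** (n_bits - 1)) <= scaled < 2 ** (n_bits - 1):
        raise ValueError("x does not fit in the requested word length")
    word = scaled & ((1 << n_bits) - 1)          # two's complement wrap
    return [(word >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

def from_bits(bits):
    """Evaluate equation (2): x = -b0 + sum_{n>=1} b_n * 2^-n."""
    return -bits[0] + sum(b * 2.0 ** -n for n, b in enumerate(bits) if n >= 1)

bits = to_bits(-0.375, 4)
print(bits, from_bits(bits))      # [1, 1, 0, 1] -0.375
```

The sign bit carries weight -1 and every other bit carries a positive weight $2^{-n}$, which is why the sign-bit term is separated out in equation (3).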
![Page 9: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/9.jpg)
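Moving Even Closer to DA

$$y = -\sum_{k=1}^{K} b_{k0} A_k + \sum_{k=1}^{K} \sum_{n=1}^{N-1} b_{kn} A_k 2^{-n} \tag{3}$$

Expanding the inner sum of the second part:

$$y = -\sum_{k=1}^{K} b_{k0} A_k + \sum_{k=1}^{K} \left[ b_{k1} A_k 2^{-1} + b_{k2} A_k 2^{-2} + \cdots + b_{k(N-1)} A_k 2^{-(N-1)} \right]$$

Writing out every term, one row per input $x_k$:

$$\begin{aligned}
y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\
& + \left[ b_{11} A_1 2^{-1} + b_{12} A_1 2^{-2} + \cdots + b_{1(N-1)} A_1 2^{-(N-1)} \right] \\
& + \left[ b_{21} A_2 2^{-1} + b_{22} A_2 2^{-2} + \cdots + b_{2(N-1)} A_2 2^{-(N-1)} \right] \\
& \;\;\vdots \\
& + \left[ b_{K1} A_K 2^{-1} + b_{K2} A_K 2^{-2} + \cdots + b_{K(N-1)} A_K 2^{-(N-1)} \right]
\end{aligned}$$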
![Page 10: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/10.jpg)
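Moving Still Closer to DA

Grouped by input (one row per $k$):

$$\begin{aligned}
y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\
& + \left[ b_{11} A_1 2^{-1} + b_{12} A_1 2^{-2} + \cdots + b_{1(N-1)} A_1 2^{-(N-1)} \right] \\
& + \left[ b_{21} A_2 2^{-1} + b_{22} A_2 2^{-2} + \cdots + b_{2(N-1)} A_2 2^{-(N-1)} \right] \\
& \;\;\vdots \\
& + \left[ b_{K1} A_K 2^{-1} + b_{K2} A_K 2^{-2} + \cdots + b_{K(N-1)} A_K 2^{-(N-1)} \right]
\end{aligned}$$

Regrouped by bit position (one row per $n$):

$$\begin{aligned}
y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\
& + \left[ b_{11} A_1 + b_{21} A_2 + \cdots + b_{K1} A_K \right] 2^{-1} \\
& + \left[ b_{12} A_1 + b_{22} A_2 + \cdots + b_{K2} A_K \right] 2^{-2} \\
& \;\;\vdots \\
& + \left[ b_{1(N-1)} A_1 + b_{2(N-1)} A_2 + \cdots + b_{K(N-1)} A_K \right] 2^{-(N-1)}
\end{aligned}$$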
![Page 11: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/11.jpg)
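Almost there!

$$\begin{aligned}
y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\
& + \left[ b_{11} A_1 + b_{21} A_2 + \cdots + b_{K1} A_K \right] 2^{-1} \\
& + \left[ b_{12} A_1 + b_{22} A_2 + \cdots + b_{K2} A_K \right] 2^{-2} \\
& \;\;\vdots \\
& + \left[ b_{1(N-1)} A_1 + b_{2(N-1)} A_2 + \cdots + b_{K(N-1)} A_K \right] 2^{-(N-1)}
\end{aligned}$$

$$y = -\sum_{k=1}^{K} b_{k0} A_k + \sum_{n=1}^{N-1} \left[ b_{1n} A_1 + b_{2n} A_2 + \cdots + b_{Kn} A_K \right] 2^{-n}$$

The Final Reformulation

$$y = \sum_{k=1}^{K} A_k (-b_{k0}) + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} \tag{4}$$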
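The rearrangement in (4) can be checked numerically. The sketch below evaluates (4) one bit position at a time and compares the result with the direct sum of products. The function names and the test inputs are illustrative assumptions, and the inputs are chosen to be exactly representable in 8 bits.

```python
def to_bits(x, n_bits):
    """Two's complement bits [b0 (sign), b1, ..., b_(N-1)] of -1 <= x < 1."""
    scaled = round(x * 2 ** (n_bits - 1))
    word = scaled & ((1 << n_bits) - 1)
    return [(word >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

def da_bitwise(coeffs, inputs, n_bits):
    """Evaluate equation (4): the outer loop runs over bit positions n."""
    words = [to_bits(v, n_bits) for v in inputs]
    # Sign-bit term: sum_k A_k * (-b_k0)
    y = -sum(a * w[0] for a, w in zip(coeffs, words))
    for n in range(1, n_bits):
        # Inner sum over k for this bit position: only additions of A_k,
        # because every b_kn is 0 or 1.
        partial = sum(a for a, w in zip(coeffs, words) if w[n])
        y += partial * 2.0 ** -n
    return y

A = [0.72, -0.30, 0.95, 0.11]
x = [0.5, -0.25, 0.375, -0.75]
direct = sum(a * v for a, v in zip(A, x))
print(da_bitwise(A, x, 8), direct)   # the two values agree up to rounding
```

No general multiplication of $A_k$ by $x_k$ remains: each bit position only selects which coefficients are added, and the factor $2^{-n}$ is a shift in hardware.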
![Page 12: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/12.jpg)
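Let's See the Change in Hardware

Our original equation:

$$y = -\sum_{k=1}^{K} b_{k0} A_k + \sum_{k=1}^{K} \sum_{n=1}^{N-1} b_{kn} A_k 2^{-n}$$

Bit-level rearrangement:

$$y = \sum_{k=1}^{K} A_k (-b_{k0}) + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n}$$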
![Page 13: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/13.jpg)
So where does the ROM come in?
Note this portion (highlighted in the figure). It can be treated as a function of the serial input bits of {A, B, C, D}.
![Page 14: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/14.jpg)
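The ROM Construction

$$y = \sum_{k=1}^{K} A_k (-b_{k0}) + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} \tag{4}$$

The term

$$\sum_{k=1}^{K} A_k b_{kn}$$

has only $2^K$ possible values, i.e.

$$\sum_{k=1}^{K} A_k b_{kn} = f(b_{1n}, b_{2n}, \ldots, b_{Kn}) \tag{5}$$

(5) can be pre-calculated for all possible values of $b_{1n} b_{2n} \ldots b_{Kn}$.
We can store these in a look-up table of $2^K$ words addressed by K bits, i.e. $b_{1n} b_{2n} \ldots b_{Kn}$.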
![Page 15: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/15.jpg)
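Let's See an Example
Let the number of taps be K = 4.
The fixed coefficients are $A_1 = 0.72$, $A_2 = -0.3$, $A_3 = 0.95$, $A_4 = 0.11$.
We need a $2^K = 2^4 = 16$-word ROM.

$$y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} A_k (-b_{k0}) \tag{4}$$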
![Page 16: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/16.jpg)
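ROM: Address and Contents

$$\sum_{k=1}^{4} A_k b_{kn} = A_1 b_{1n} + A_2 b_{2n} + A_3 b_{3n} + A_4 b_{4n}$$

| b1n | b2n | b3n | b4n | Contents |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | A4 = 0.11 |
| 0 | 0 | 1 | 0 | A3 = 0.95 |
| 0 | 0 | 1 | 1 | A3 + A4 = 1.06 |
| 0 | 1 | 0 | 0 | A2 = -0.30 |
| 0 | 1 | 0 | 1 | A2 + A4 = -0.19 |
| 0 | 1 | 1 | 0 | A2 + A3 = 0.65 |
| 0 | 1 | 1 | 1 | A2 + A3 + A4 = 0.76 |
| 1 | 0 | 0 | 0 | A1 = 0.72 |
| 1 | 0 | 0 | 1 | A1 + A4 = 0.83 |
| 1 | 0 | 1 | 0 | A1 + A3 = 1.67 |
| 1 | 0 | 1 | 1 | A1 + A3 + A4 = 1.78 |
| 1 | 1 | 0 | 0 | A1 + A2 = 0.42 |
| 1 | 1 | 0 | 1 | A1 + A2 + A4 = 0.53 |
| 1 | 1 | 1 | 0 | A1 + A2 + A3 = 1.37 |
| 1 | 1 | 1 | 1 | A1 + A2 + A3 + A4 = 1.48 |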
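The table can be generated programmatically and then used in place of the inner sum of (4). The following is a minimal software sketch, not a hardware description: `build_rom` and `da_rom` are illustrative names, $b_{1n}$ is assumed to be the most significant address bit as in the table, and the inputs are assumed to be exactly representable in 8 bits.

```python
def to_bits(x, n_bits):
    """Two's complement bits [b0 (sign), b1, ..., b_(N-1)] of -1 <= x < 1."""
    scaled = round(x * 2 ** (n_bits - 1))
    word = scaled & ((1 << n_bits) - 1)
    return [(word >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

def build_rom(coeffs):
    """2^K words; word at address b1 b2 ... bK holds sum_k A_k * b_k."""
    k = len(coeffs)
    rom = []
    for addr in range(1 << k):
        total = 0.0
        for i, a in enumerate(coeffs):
            if (addr >> (k - 1 - i)) & 1:    # b1 is the address MSB
                total += a
        rom.append(total)
    return rom

def da_rom(coeffs, inputs, n_bits):
    """Dot product via one ROM look-up per bit position."""
    rom = build_rom(coeffs)
    words = [to_bits(v, n_bits) for v in inputs]
    y = 0.0
    for n in range(n_bits):
        addr = 0
        for w in words:                      # bit n of every input forms the address
            addr = (addr << 1) | w[n]
        term = rom[addr] * 2.0 ** -n
        y += -term if n == 0 else term       # the sign-bit look-up is subtracted
    return y

A = [0.72, -0.30, 0.95, 0.11]
rom = build_rom(A)
for addr, word in enumerate(rom):
    print(f"{addr:04b}  {word:+.2f}")

x = [0.5, -0.25, 0.375, -0.75]
print(da_rom(A, x, 8), sum(a * v for a, v in zip(A, x)))
```

The same ROM serves every bit position, including the sign bit; only the add/subtract control of the accumulator changes for the sign-bit cycle.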
![Page 17: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/17.jpg)
Key Issue: ROM Size
The size of the ROM is very important for high-speed implementation as well as area efficiency.
ROM size grows exponentially with the number of address lines: it doubles with each added input address line.
The number of address lines is equal to the number of elements in the vector, i.e. K.
Vectors of 16 or more elements are common, which gives $2^{16}$ = 64K words of ROM!
We have to reduce the size of the ROM.
![Page 18: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/18.jpg)
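A Very Neat Trick:

$$x_k = \frac{1}{2} \left[ x_k - (-x_k) \right] \tag{6}$$

With

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n}$$

and its 2's complement

$$-x_k = -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn} 2^{-n} + 2^{-(N-1)}$$

equation (6) becomes

$$x_k = \frac{1}{2} \left[ -(b_{k0} - \bar{b}_{k0}) + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn}) 2^{-n} - 2^{-(N-1)} \right] \tag{7}$$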
![Page 19: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/19.jpg)
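Re-Writing $x_k$ in a Different Code

$$x_k = \frac{1}{2} \left[ -(b_{k0} - \bar{b}_{k0}) + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn}) 2^{-n} - 2^{-(N-1)} \right] \tag{7}$$

Define: Offset Code

$$c_{kn} = \begin{cases} b_{kn} - \bar{b}_{kn}, & n \neq 0 \\ -(b_{k0} - \bar{b}_{k0}), & n = 0 \end{cases} \qquad \text{where } c_{kn} \in \{-1, 1\}$$

Finally

$$x_k = \frac{1}{2} \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right] \tag{8}$$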
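Equation (8) can be verified exhaustively for a short word length. In this sketch the names `offset_code` and `from_offset_code` are illustrative, and $b - \bar{b}$ is computed as $2b - 1$.

```python
def from_bits(bits):
    """Equation (2): x = -b0 + sum_{n>=1} b_n * 2^-n."""
    return -bits[0] + sum(b * 2.0 ** -n for n, b in enumerate(bits) if n >= 1)

def offset_code(bits):
    """Map bits b_n in {0, 1} to c_n in {-1, +1}; the sign position is negated."""
    c = [2 * b - 1 for b in bits]     # b - not(b) equals 2b - 1
    c[0] = -c[0]                      # n = 0: c_0 = -(b_0 - not(b_0))
    return c

def from_offset_code(c):
    """Equation (8): x = 1/2 * [sum_{n>=0} c_n * 2^-n - 2^-(N-1)]."""
    n_bits = len(c)
    return 0.5 * (sum(c_n * 2.0 ** -n for n, c_n in enumerate(c))
                  - 2.0 ** -(n_bits - 1))

N = 4
for word in range(1 << N):
    bits = [(word >> (N - 1 - n)) & 1 for n in range(N)]
    assert from_offset_code(offset_code(bits)) == from_bits(bits)
print("equation (8) matches equation (2) for all", 1 << N, "4-bit words")
```

All quantities are sums of powers of two, so the floating-point comparison is exact here.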
![Page 20: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/20.jpg)
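Using the New $x_k$
Substitute the new $x_k$

$$x_k = \frac{1}{2} \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right]$$

in here:

$$y = \sum_{k=1}^{K} A_k x_k$$

$$y = \sum_{k=1}^{K} A_k \cdot \frac{1}{2} \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right]$$

$$y = \frac{1}{2} \sum_{k=1}^{K} \sum_{n=0}^{N-1} A_k c_{kn} 2^{-n} - \frac{1}{2} \sum_{k=1}^{K} A_k 2^{-(N-1)}$$

$$y = \sum_{n=0}^{N-1} \left[ \frac{1}{2} \sum_{k=1}^{K} A_k c_{kn} \right] 2^{-n} - \left[ \frac{1}{2} \sum_{k=1}^{K} A_k \right] 2^{-(N-1)} \tag{9}$$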
![Page 21: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/21.jpg)
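The New Formulation in Offset Code

$$y = \sum_{n=0}^{N-1} \left[ \frac{1}{2} \sum_{k=1}^{K} A_k c_{kn} \right] 2^{-n} - \left[ \frac{1}{2} \sum_{k=1}^{K} A_k \right] 2^{-(N-1)}$$

Let

$$Q(c_{1n}, c_{2n}, \ldots, c_{Kn}) = \sum_{k=1}^{K} \frac{A_k}{2} c_{kn}$$

and

$$Q(0) = -\frac{1}{2} \sum_{k=1}^{K} A_k \quad \text{(a constant)}$$

Then

$$y = \sum_{n=0}^{N-1} Q(c_{1n}, c_{2n}, \ldots, c_{Kn}) \, 2^{-n} + 2^{-(N-1)} Q(0)$$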
![Page 22: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/22.jpg)
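The Benefit: Only Half the Values to Store

| b1n | b2n | b3n | b4n | c1n | c2n | c3n | c4n | Contents |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | -1 | -1 | -1 | -1 | -1/2 (A1 + A2 + A3 + A4) = -0.74 |
| 0 | 0 | 0 | 1 | -1 | -1 | -1 | 1 | -1/2 (A1 + A2 + A3 - A4) = -0.63 |
| 0 | 0 | 1 | 0 | -1 | -1 | 1 | -1 | -1/2 (A1 + A2 - A3 + A4) = 0.21 |
| 0 | 0 | 1 | 1 | -1 | -1 | 1 | 1 | -1/2 (A1 + A2 - A3 - A4) = 0.32 |
| 0 | 1 | 0 | 0 | -1 | 1 | -1 | -1 | -1/2 (A1 - A2 + A3 + A4) = -1.04 |
| 0 | 1 | 0 | 1 | -1 | 1 | -1 | 1 | -1/2 (A1 - A2 + A3 - A4) = -0.93 |
| 0 | 1 | 1 | 0 | -1 | 1 | 1 | -1 | -1/2 (A1 - A2 - A3 + A4) = -0.09 |
| 0 | 1 | 1 | 1 | -1 | 1 | 1 | 1 | -1/2 (A1 - A2 - A3 - A4) = 0.02 |
| 1 | 0 | 0 | 0 | 1 | -1 | -1 | -1 | -1/2 (-A1 + A2 + A3 + A4) = -0.02 |
| 1 | 0 | 0 | 1 | 1 | -1 | -1 | 1 | -1/2 (-A1 + A2 + A3 - A4) = 0.09 |
| 1 | 0 | 1 | 0 | 1 | -1 | 1 | -1 | -1/2 (-A1 + A2 - A3 + A4) = 0.93 |
| 1 | 0 | 1 | 1 | 1 | -1 | 1 | 1 | -1/2 (-A1 + A2 - A3 - A4) = 1.04 |
| 1 | 1 | 0 | 0 | 1 | 1 | -1 | -1 | -1/2 (-A1 - A2 + A3 + A4) = -0.32 |
| 1 | 1 | 0 | 1 | 1 | 1 | -1 | 1 | -1/2 (-A1 - A2 + A3 - A4) = -0.21 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | -1 | -1/2 (-A1 - A2 - A3 + A4) = 0.63 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -1/2 (-A1 - A2 - A3 - A4) = 0.74 |

Inverse symmetry: the word at each address is the negative of the word at the bit-complemented address (for example 0000 holds -0.74 and 1111 holds 0.74), so only one half of the table needs to be stored.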
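A software sketch of the halved table, under stated assumptions: only the eight words with $b_{1n} = 0$ are stored, and when $b_{1n} = 1$ the remaining address bits are complemented and the word is negated. The names `build_half_rom`, `q_lookup` and `da_offset` are illustrative, and the inputs are assumed to be exactly representable in 8 bits. The sign-bit look-up is negated because $c_{k0}$ has the opposite sign convention.

```python
def to_bits(x, n_bits):
    """Two's complement bits [b0 (sign), b1, ..., b_(N-1)] of -1 <= x < 1."""
    scaled = round(x * 2 ** (n_bits - 1))
    word = scaled & ((1 << n_bits) - 1)
    return [(word >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

def build_half_rom(coeffs):
    """Store Q = 1/2 * sum_k A_k * (2*b_k - 1) only for addresses with b1 = 0."""
    k = len(coeffs)
    rom = []
    for addr in range(1 << (k - 1)):
        bits = [0] + [(addr >> (k - 2 - i)) & 1 for i in range(k - 1)]
        rom.append(0.5 * sum(a * (2 * b - 1) for a, b in zip(coeffs, bits)))
    return rom

def q_lookup(half_rom, bits):
    """Look up Q for address bits [b1, ..., bK] using inverse symmetry."""
    if bits[0]:
        rest, sign = [1 - b for b in bits[1:]], -1   # complement and negate
    else:
        rest, sign = bits[1:], 1
    addr = 0
    for b in rest:
        addr = (addr << 1) | b
    return sign * half_rom[addr]

def da_offset(coeffs, inputs, n_bits):
    """y = sum_{n=0}^{N-1} Q(c_n) * 2^-n + 2^-(N-1) * Q(0)."""
    half_rom = build_half_rom(coeffs)
    words = [to_bits(v, n_bits) for v in inputs]
    q0 = -0.5 * sum(coeffs)                          # the constant Q(0)
    y = q0 * 2.0 ** -(n_bits - 1)
    for n in range(n_bits):
        q = q_lookup(half_rom, [w[n] for w in words])
        if n == 0:
            q = -q                                   # sign bit: c_k0 is negated
        y += q * 2.0 ** -n
    return y

A = [0.72, -0.30, 0.95, 0.11]
x = [0.5, -0.25, 0.375, -0.75]
half = build_half_rom(A)
print([round(w, 2) for w in half])
print(da_offset(A, x, 8), sum(a * v for a, v in zip(A, x)))
```

In hardware the complement is a row of XOR gates on the address lines and the negation is the add/subtract control of the accumulator, so the ROM shrinks from $2^K$ to $2^{K-1}$ words.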
![Page 23: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/23.jpg)
Hardware Using Offset Coding
x1 selects between the two symmetric halves
Ts indicates when the sign bit arrives
![Page 24: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/24.jpg)
Alternate Technique: Decomposing the ROM
Requires an additional adder to sum the partial outputs
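To illustrate the decomposition, the sketch below splits the K address bits into groups, gives each group its own small ROM, and adds the partial outputs. The group size, the integer coefficients and the function names are hypothetical choices made only for this example.

```python
def build_rom(coeffs):
    """2^k words; word at address b1 ... bk holds sum of A_i * b_i."""
    k = len(coeffs)
    return [sum(a for i, a in enumerate(coeffs) if (addr >> (k - 1 - i)) & 1)
            for addr in range(1 << k)]

def build_split_roms(coeffs, group_size):
    """One small ROM per group of consecutive coefficients."""
    groups = [coeffs[i:i + group_size]
              for i in range(0, len(coeffs), group_size)]
    return [build_rom(g) for g in groups], [len(g) for g in groups]

def split_lookup(roms, widths, bits):
    """Sum of the partial look-ups for address bits [b1, ..., bK]."""
    total, pos = 0, 0
    for rom, width in zip(roms, widths):
        addr = 0
        for b in bits[pos:pos + width]:
            addr = (addr << 1) | b
        total += rom[addr]            # this addition is the extra adder
        pos += width
    return total

coeffs = list(range(1, 17))           # K = 16 hypothetical integer coefficients
roms, widths = build_split_roms(coeffs, 8)
bits = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
full_value = sum(a * b for a, b in zip(coeffs, bits))
print(split_lookup(roms, widths, bits), full_value)
print(sum(len(r) for r in roms), "words instead of", 1 << len(coeffs))
```

For K = 16, two 256-word ROMs replace one 65,536-word ROM, at the cost of one adder for the partial outputs.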
![Page 25: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/25.jpg)
Speed Concerns
We considered One Bit At A Time (1 BAAT).
Number of clock cycles required = N.
If K = N, then we are effectively taking one cycle per product (K multiply-accumulates in N cycles). Not bad!
An opportunity for parallelism exists, but at the cost of more hardware.
We could have 2 BAAT, or up to N BAAT in the extreme case.
N BAAT gives one complete result per cycle.
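The cycle count trade-off can be modelled in software. This is a behavioural sketch only: the inner loop stands in for look-ups that would happen in parallel in separate ROM copies, the name `da_baat` is illustrative, and the inputs are assumed to be exactly representable in 8 bits.

```python
def to_bits(x, n_bits):
    """Two's complement bits [b0 (sign), b1, ..., b_(N-1)] of -1 <= x < 1."""
    scaled = round(x * 2 ** (n_bits - 1))
    word = scaled & ((1 << n_bits) - 1)
    return [(word >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]

def build_rom(coeffs):
    """2^K words; word at address b1 ... bK holds sum_k A_k * b_k."""
    k = len(coeffs)
    return [sum(a for i, a in enumerate(coeffs) if (addr >> (k - 1 - i)) & 1)
            for addr in range(1 << k)]

def da_baat(coeffs, inputs, n_bits, bits_per_cycle):
    """Return (y, cycles) when bits_per_cycle bit positions are handled per cycle."""
    rom = build_rom(coeffs)
    words = [to_bits(v, n_bits) for v in inputs]
    y, cycles = 0.0, 0
    for start in range(0, n_bits, bits_per_cycle):
        # One clock cycle: each bit position in this slice needs its own
        # ROM copy and adder input in hardware.
        for n in range(start, min(start + bits_per_cycle, n_bits)):
            addr = 0
            for w in words:
                addr = (addr << 1) | w[n]
            term = rom[addr] * 2.0 ** -n
            y += -term if n == 0 else term   # sign-bit look-up is subtracted
        cycles += 1
    return y, cycles

A = [0.72, -0.30, 0.95, 0.11]
x = [0.5, -0.25, 0.375, -0.75]
for l in (1, 2, 8):
    print(l, "BAAT:", da_baat(A, x, 8, l))
```

With N = 8 the model reports 8 cycles for 1 BAAT, 4 cycles for 2 BAAT and 1 cycle for N BAAT, matching the counts stated above.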
![Page 26: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/26.jpg)
Illustration of 2 BAAT
![Page 27: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/27.jpg)
Illustration of N BAAT
![Page 28: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/28.jpg)
The Speed Limit: Carry Propagation
The speed of the critical path is limited by the width of the carry propagation.
Speed can be improved by using techniques that limit the carry propagation.
![Page 29: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/29.jpg)
Speeding Up Further: Using RNS+DA
By using a Residue Number System (RNS), the computations can be broken down into smaller elements which can be executed in parallel.
Since we are operating on smaller arguments, the carry propagation is naturally limited.
So by using RNS+DA, greater speed benefits can be attained, especially for higher-precision calculations.
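The RNS idea on its own can be sketched as follows. This shows only the residue decomposition and the reconstruction by the Chinese Remainder Theorem, not a per-channel DA unit. The moduli set is a hypothetical choice (pairwise coprime, product 240240), and the operands are small integers chosen so the result fits the dynamic range.

```python
from math import prod

MODULI = (7, 11, 13, 15, 16)          # hypothetical, pairwise coprime

def rns_dot(coeffs, inputs):
    """Dot product computed independently in each small residue channel."""
    residues = []
    for m in MODULI:
        acc = 0
        for a, v in zip(coeffs, inputs):
            acc = (acc + (a % m) * (v % m)) % m   # narrow arithmetic only
        residues.append(acc)
    return tuple(residues)

def from_rns(residues):
    """Chinese Remainder Theorem reconstruction, mapped to a signed range."""
    big_m = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)        # modular inverse (Python 3.8+)
    total %= big_m
    return total - big_m if total >= big_m // 2 else total

A = [32, 45, 78, 23]
x = [42, 20, -22, 67]
res = rns_dot(A, x)
print(res, from_rns(res))             # reconstructed value: 2069
```

Each channel works on values no wider than 4 bits, so its adders have short carry chains; in an RNS+DA design each channel would presumably hold its own small look-up table.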
![Page 30: Distributed Arithmetic: Implementations and Applications A Tutorial.](https://reader030.fdocuments.in/reader030/viewer/2022032704/56649d6d5503460f94a4cc70/html5/thumbnails/30.jpg)
Conclusion
Ref: Stanley A. White, "Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review," IEEE ASSP Magazine, July 1989.
Ref: Xilinx App Note, "The Role of Distributed Arithmetic in FPGA-Based Signal Processing."