CSE 567 - Autumn 1998 - Misc. Topics - 1

Multiplication
Example
    multiplicand:          1 1 0 0    (12)
    multiplier:            0 1 0 1    (5)
                           1 1 0 0
                         0 0 0 0
                       1 1 0 0
                     0 0 0 0
    product:         0 1 1 1 1 0 0    (60)
4 partial products
repeat n times: compute partial product; shift; add
note: each bit of the partial products is just an AND operation
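
The worked example can be checked with a short Python sketch. The function names and the MSB-first bit-list convention are assumptions made for illustration; each partial-product bit is formed with a single AND, as the note says.

```python
def partial_products(multiplicand_bits, multiplier_bits):
    """Return one row per multiplier bit (LSB row first); bit lists are MSB first."""
    rows = []
    for m in reversed(multiplier_bits):
        # every bit of a partial product is just multiplicand_bit AND multiplier_bit
        rows.append([a & m for a in multiplicand_bits])
    return rows


def bits_to_int(bits):
    """Convert an MSB-first bit list to an integer."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def multiply(multiplicand_bits, multiplier_bits):
    """Shift each partial product by its row index and add."""
    total = 0
    for shift, row in enumerate(partial_products(multiplicand_bits, multiplier_bits)):
        total += bits_to_int(row) << shift
    return total


rows = partial_products([1, 1, 0, 0], [0, 1, 0, 1])
product = multiply([1, 1, 0, 0], [0, 1, 0, 1])  # 12 * 5
```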
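
Sequential Multiplier
[Figure: multiplier register x, multiplicand register y, and result register z (initially 0) connected through a 2n-bit adder]
One bit of the multiplier is applied each cycle; the adder is 2n bits wide.
    z = 0;
    repeat n
        if (x[0]) z = z + y;
        x = x >> 1; y = y << 1;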
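
The pseudocode for this multiplier translates almost line for line into Python. This is a minimal sketch for unsigned operands; the function name and the explicit bit-width parameter n are assumptions.

```python
def sequential_multiply(x, y, n):
    """Multiply unsigned n-bit x (multiplier) by y (multiplicand), one multiplier bit per cycle."""
    z = 0
    for _ in range(n):
        if x & 1:       # x[0]: current low-order multiplier bit
            z = z + y   # 2n-bit addition in hardware
        x >>= 1         # expose the next multiplier bit
        y <<= 1         # align the multiplicand with that bit
    return z
```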
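
Sequential Multiplier (cont’d)
[Figure: multiplier register x, multiplicand register y, and result register z connected through an n-bit adder]
One bit of the multiplier is applied each cycle; the multiplicand is added into the upper half of z, and z is shifted right, so an n-bit adder is enough.
    z = 0;
    repeat n
        if (x[0]) z = z + y * 2^n;
        x = x >> 1; z = z >> 1;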
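
A minimal Python sketch of this second variant, assuming unsigned operands and reading the scaling factor in the pseudocode as 2 to the power n. The function name is hypothetical.

```python
def sequential_multiply_shift_right(x, y, n):
    """Multiply unsigned n-bit x by n-bit y, shifting the result right instead of y left."""
    z = 0
    for _ in range(n):
        if x & 1:
            z = z + (y << n)  # add the multiplicand into the upper half of z
        x >>= 1
        z >>= 1               # the low-order result bits settle into the lower half
    return z
```

No result bits are lost by the right shifts: a term added in iteration i has n trailing zeros and is shifted right only n - i more times.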

Parallelism in hardware
Fine-grained - bit level
    e.g., carry-select, carry-lookahead adder
Pipelining
    same number of functional units
    different latency, but increased throughput
    less work per clock cycle
Coarse-grained - data-path level
    e.g., multiple arithmetic units
    multi-port register files (read/write from different sources/destinations)
Processor level
    difficult to take advantage of many levels of parallelism in fixed general-purpose processors
    much easier when the processors are special-purpose, e.g., systolic computations

Bit level parallelism
Exploit ability to do necessary bit-level computations directly
    exploit redundant logic
    goal - keep all circuits busy, reduce critical path
Examples
    carry-lookahead adder
    carry-select adder
    multipliers

Combinational Multipliers
Use AND gates to generate all partial products in parallel
[Figure: array of AND gates combining the multiplicand and multiplier bits into partial-product rows 1 1 0, 1 1 0, 0 0 0; LSBs marked]

Combinational Multipliers (cont'd)
Skew array to send partial products along diagonal and make it square
[Figure: the same partial-product array (rows 1 1 0, 1 1 0, 0 0 0), skewed into a square; LSBs marked]

Combinational Multipliers (cont'd)
Ripple-carry adder in each row (carries ripple right to left)
Sums ripple down (shifted one to right)
Worst-case delay is 3n
[Figure: array of full adders (inputs A, B, Cin; outputs Cout, S) with 0 on the unused inputs; LSBs marked]

Using Carry-Save
Forward carries to next row of adders
CLA at the end to add last partial product and forwarded carries
No need to optimize carry more than sum
Using CLA for final stage makes this faster than previous multiplier (worst-case is 2n)
[Figure: array of full adders (inputs A, B, Cin; outputs Cout, S) with carries forwarded to the next row, 0 on the unused inputs, and a CLA as the final stage; LSBs marked]

Combinational Multipliers (cont'd)
Carry-save adder is a 3-2 adder:
[Figure: partial products entering a chain of carry-save adders, each with outputs labeled x1 and x2]
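
A 3-2 adder takes three input words and produces two output words with the same total, with no carry propagation between bit positions. The sketch below models this on Python integers; the function name is an assumption.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative words to a (sum, carry) pair with a + b + c == sum + carry."""
    sum_word = a ^ b ^ c                             # per-bit sum, weight x1
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, weight x2
    return sum_word, carry_word


s, c = carry_save_add(12, 5, 9)
```

Each bit position is an independent full adder, so the delay of one carry-save stage does not depend on the word length.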

Wallace Tree Multiplier
Use tree structure to reduce number of additions in critical path to O(log n) rather than O(n)
Difficult structure to lay out and integrate with partial product crossbar
Wiring constraints make it unattractive in many technologies
[Figure: partial products PP0 through PP8 reduced by a tree of adders (+), with a CLA producing the result]
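
The logarithmic depth can be illustrated with a small Python model that reduces the partial products in layers of 3-2 adders and counts the layers. This is a word-level sketch with hypothetical function names, not a gate-level Wallace tree; the final ordinary addition stands in for the CLA.

```python
def carry_save_add(a, b, c):
    """3-2 adder: returns (sum, carry) with a + b + c == sum + carry."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def wallace_reduce(operands):
    """Reduce a list of words to at most two; return (words, number_of_layers)."""
    words = list(operands)
    levels = 0
    while len(words) > 2:
        next_words = []
        full_groups = len(words) // 3
        for g in range(full_groups):
            s, c = carry_save_add(*words[3 * g:3 * g + 3])
            next_words += [s, c]
        next_words += words[3 * full_groups:]  # leftovers pass through unchanged
        words = next_words
        levels += 1
    return words, levels


def wallace_multiply(x, y, n):
    """Multiply unsigned n-bit x by y; returns (product, carry-save layers used)."""
    partial_products = [(y << i) if (x >> i) & 1 else 0 for i in range(n)]
    words, levels = wallace_reduce(partial_products)
    return sum(words), levels  # final carry-propagate add (the CLA in the figure)
```

With nine partial products the word count goes 9, 6, 4, 3, 2, so four carry-save layers precede the final adder.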
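
Binary Tree Multipliers
Problem with Wallace tree is 3:2 column reduction
    need 2:1 reduction for binary tree
One solution: signed-digit binary trees
    represent digits as 0, 1, -1
    similar to Booth's encoding
Signed-digit addition table (each result is a carry digit followed by a sum digit; x, y are the two digits in the next lower-order position):
     1 + -1  ->   0  0
     1 +  1  ->   1  0
     0 +  0  ->   0  0
    -1 + -1  ->  -1  0
     1 +  0  ->   1 -1   if x>=0 and y>=0
                  0  1   otherwise
    -1 +  0  ->   0 -1   if x>=0 and y>=0
                 -1  1   otherwise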
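
The addition table can be exercised with a short Python sketch. It assumes that x and y denote the two operand digits one position to the right, that each table entry is a carry digit followed by an interim sum digit, and that digit lists are stored least significant first; the function names are hypothetical.

```python
def sd_add(a, b):
    """Add two signed-digit numbers (equal-length lists of -1/0/1, least significant first)."""
    n = len(a)
    carry = [0] * (n + 1)  # carry[i] is the digit carried into position i
    interim = [0] * n
    for i in range(n):
        total = a[i] + b[i]
        # x, y: digits of the next lower-order position (treated as 0 below position 0)
        lower_nonneg = i == 0 or (a[i - 1] >= 0 and b[i - 1] >= 0)
        if total == 2:
            carry[i + 1], interim[i] = 1, 0
        elif total == -2:
            carry[i + 1], interim[i] = -1, 0
        elif total == 0:
            carry[i + 1], interim[i] = 0, 0
        elif total == 1:
            carry[i + 1], interim[i] = (1, -1) if lower_nonneg else (0, 1)
        else:  # total == -1
            carry[i + 1], interim[i] = (0, -1) if lower_nonneg else (-1, 1)
    # each result digit depends only on its own and the neighbouring position
    return [interim[i] + carry[i] for i in range(n)] + [carry[n]]


def sd_value(digits):
    """Integer value of a least-significant-first signed-digit list."""
    return sum(d << i for i, d in enumerate(digits))
```

The condition on the lower digits is what keeps every result digit within -1, 0, 1 without a carry chain, which is what allows a 2:1 reduction per tree level.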

Booth's Algorithm
Take care of (retire) more than one bit per shift operation
Example: shift two bits at a time
Must be able to add multiplicand times 0, -1, -2, 1, and 2
Booth recoding table (multiplier bits i+1, i, i-1; M = multiplicand):
    i+1  i  i-1    add
     0   0   0     0*M
     0   0   1     1*M
     0   1   0     1*M
     0   1   1     2*M
     1   0   0    -2*M
     1   0   1    -1*M
     1   1   0    -1*M
     1   1   1     0*M
Example:
    0 0 1 1 0 1                  13
    1 1 1 0 1 0                  -6
    Booth recoding steps:
    0 0 -1 1 -1 0                (one bit at a time)
    0 -1 -2                      (two bits at a time)
    1 1 1 1 1 1 1 0 0 1 1 0      -2 * 13
    1 1 1 1 1 1 0 0 1 1           -1 * 13, shifted left two bits
    0 0 0 0 0 0 0 0                0 * 13, shifted left four bits
    1 1 1 1 1 0 1 1 0 0 1 0      -78
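
The two-bits-at-a-time recoding can be sketched in Python as follows. The table is the one from the slide; the function names, the even bit-width parameter n, and the use of Python's unbounded integers in place of sign-extended partial products are assumptions.

```python
# (bit i+1, bit i, bit i-1) -> multiple of the multiplicand to add
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_digits(multiplier, n):
    """Recode an n-bit two's complement multiplier (n even); least significant digit first."""
    bits = multiplier & ((1 << n) - 1)
    digits = []
    prev = 0  # the bit to the right of bit 0 is taken as 0
    for i in range(0, n, 2):
        b_i = (bits >> i) & 1
        b_next = (bits >> (i + 1)) & 1
        digits.append(BOOTH_TABLE[(b_next, b_i, prev)])
        prev = b_next
    return digits


def booth_multiply(multiplicand, multiplier, n):
    """Add one multiple of the multiplicand per pair of multiplier bits."""
    product = 0
    for k, d in enumerate(booth_digits(multiplier, n)):
        product += (d * multiplicand) << (2 * k)
    return product


digits = booth_digits(-6, 6)          # [-2, -1, 0], as in the slide's example
result = booth_multiply(13, -6, 6)    # -78
```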

Register Transfer
Registers have input and output
    output can be fanned out to many destinations
    input can come from many sources
    multiplexer needed on input to select which
[Figure: register whose input multiplexer selects among inputs from other registers, with control signals to choose the input source and the output fanned out to other registers]
Connecting Registers
Multiplexers: lots of control signals but full parallelism of transfers
Busses

Pipelining
Adding registers along a path
    split combinational logic into multiple cycles
    each cycle smaller than previously
    T_old C_old > T_new C_new
    increase throughput

Pipelining
Delay, d, of slowest combinational stage determines performance
Throughput = 1/d - rate at which outputs are produced
Latency = n•d - number of stages * clock period
Pipelining increases circuit utilization
Registers slow down data, synchronize data paths
Wave-pipelining
    no pipeline registers - waves of data flow through circuit
    relies on equal-delay circuit paths - no short paths
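
The throughput and latency formulas can be tried on concrete numbers. The stage delays below are hypothetical, and the sketch ignores register overhead (propagation delay and setup time).

```python
def pipeline_metrics(stage_delays):
    """Clock period, throughput and latency of a pipeline given its per-stage delays."""
    d = max(stage_delays)   # the slowest stage sets the clock period
    n = len(stage_delays)
    return {"clock_period": d, "throughput": 1 / d, "latency": n * d}


unpipelined = pipeline_metrics([24])      # one block of combinational logic
pipelined = pipeline_metrics([10, 8, 6])  # the same logic split into three stages
```

In this made-up case the clock period drops from 24 to 10, so throughput rises, while latency grows from 24 to 30 because every stage is clocked at the pace of the slowest one.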

When and How to Pipeline?
Where is the best place to add registers?
    splitting combinational logic
    overhead of registers (propagation delay and setup time requirements)
What about cycles in data path?
Example: 16-bit adder, add 8-bits in each of two cycles
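
One way to picture the 16-bit adder example is a two-stage model in which stage 1 adds the low bytes and stage 2 adds the high bytes plus the saved carry. The pipeline-register layout and the function name here are assumptions, not taken from the slide.

```python
def pipelined_add16(pairs):
    """Add a stream of 16-bit (a, b) pairs, 8 bits per stage; returns the 17-bit sums in order."""
    results = []
    stage1 = None  # pipeline register: (low_sum, carry, a_high, b_high)
    for pair in list(pairs) + [None]:  # one extra cycle to drain the pipeline
        # Stage 2 works on what stage 1 latched in the previous cycle.
        if stage1 is not None:
            low_sum, carry, a_high, b_high = stage1
            high_sum = a_high + b_high + carry
            results.append((high_sum << 8) | low_sum)
        # Stage 1 works on the new operands in the same cycle.
        if pair is not None:
            a, b = pair
            low = (a & 0xFF) + (b & 0xFF)
            stage1 = (low & 0xFF, low >> 8, (a >> 8) & 0xFF, (b >> 8) & 0xFF)
        else:
            stage1 = None
    return results


sums = pipelined_add16([(0x00FF, 0x0001), (0xFFFF, 0xFFFF), (1234, 4321)])
```

Each result takes two cycles, but a new pair can enter every cycle, and each stage only has to wait for an 8-bit carry chain.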

Retiming
Process of optimally distributing registers throughout a circuit
    minimize the clock period
    minimize the number of registers

Retiming (cont’d)
Fast optimal algorithm (Leiserson & Saxe 1983)
Retiming rules:
    remove one register from each input and add one to each output
    remove one register from each output and add one to each input
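
Both rules can be expressed as one update on a graph whose edges carry register counts: giving a block a lag r changes each edge u -> v to w + r(v) - r(u). The sketch below uses that formulation; the dictionary representation, function name and block names are hypothetical.

```python
def retime(edges, lag):
    """Apply lags to a circuit graph.

    edges: {(u, v): number of registers on the wire from u to v}
    lag:   {block: r}; r = -1 moves one register from every input of the block
           to every output, r = +1 moves one from every output to every input.
    """
    new_edges = {}
    for (u, v), w in edges.items():
        w_new = w + lag.get(v, 0) - lag.get(u, 0)
        if w_new < 0:
            raise ValueError(f"illegal retiming: edge {u}->{v} would have {w_new} registers")
        new_edges[(u, v)] = w_new
    return new_edges


# Block "v" has a register on each input and none on its output.
circuit = {("a", "v"): 1, ("b", "v"): 1, ("v", "c"): 0}
moved = retime(circuit, {"v": -1})
```

A retiming is legal only if no edge ends up with a negative register count, which is why the sketch raises an error in that case.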

Optimal Pipelining
Add registers - use retiming to find optimal location
[Figure: example circuit annotated with block delays, shown twice]
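
Example - Digital Correlator
$y_t = \delta(x_t, a_0) + \delta(x_{t-1}, a_1) + \delta(x_{t-2}, a_2) + \delta(x_{t-3}, a_3)$
$\delta(x, a) = 0$ if $x \neq a$, 1 otherwise (and passes x along to the right)
[Figure: host feeding x_t into a row of comparators holding a0, a1, a2, a3; three adders (+) sum the comparator outputs into y_t, which returns to the host]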
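
A behavioral Python sketch of the correlator follows. It assumes the comparison function returns 1 on a match and 0 otherwise, and that samples before the start of the stream never match; the function name is hypothetical.

```python
def correlator(xs, coeffs):
    """y_t = number of positions k where x_(t-k) equals a_k."""
    ys = []
    for t in range(len(xs)):
        y = 0
        for k, a in enumerate(coeffs):
            # comparator k sees the sample that entered k steps earlier
            if t - k >= 0 and xs[t - k] == a:
                y += 1
        ys.append(y)
    return ys


ys = correlator([1, 0, 1, 1], [1, 1, 0, 1])
```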
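
Example - Digital Correlator (cont’d)
Delays: adder, 7; comparator, 3; host, 0
[Figure: two versions of the correlator circuit (host, comparators, adders) with different register placement; cycle time = 24 for the first, cycle time = 13 for the second]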

Pipelined Multipliers
Pipelining can be applied to any of the combinational multipliers
FF at every intersection of pipe stage and wire
[Figure: multiplier adder trees (+) with a final CLA, cut by pipeline stages]

Example - Sorting
Comparator
[Figure: comparator with inputs A, B and outputs H, L]
Parallel Sorter
[Figure: parallel sorter built from comparators]
Example - Sorting (cont’d)
Pipelined
Pipelined Sorter (cont’d)
Better Sorter
Sequential Sorter

Systolic Arrays
Analogy: data flowing through the system in a rhythmic fashion - from main memory through a series of processing elements and back to main memory
Set of identical processing elements
    specialized or programmable
Efficient nearest-neighbor interconnections (in 1-D, 2-D, other)
SIMD-like
Multiple data flows, converging to engage in computation
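
Example - Convolution
$y_j = x_j w_1 + x_{j+1} w_2 + \dots + x_{j+n-1} w_n$
$y_1 = x_1 w_1 + x_2 w_2 + x_3 w_3 + x_4 w_4$
$y_2 = x_2 w_1 + x_3 w_2 + x_4 w_3 + x_5 w_4$
$y_3 = x_3 w_1 + x_4 w_2 + x_5 w_3 + x_6 w_4$
. . . .
[Figure: linear array of cells holding w4 w3 w2 w1, with the x stream (- x3 - x2 - x1) and the y stream (y1 - y2 - y3) entering the array, elements separated by gaps]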
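
As a reference for what the array has to compute, the formula can be evaluated directly. This sketch is not a model of the systolic data flow; it uses 0-based Python lists and a hypothetical function name.

```python
def convolve(x, w):
    """y_j = x_j*w_1 + x_(j+1)*w_2 + ... + x_(j+n-1)*w_n for every j with a full window."""
    n = len(w)
    return [sum(x[j + k] * w[k] for k in range(n)) for j in range(len(x) - n + 1)]


y = convolve([1, 2, 3, 4, 5, 6], [1, 0, -1, 2])  # y1, y2, y3
```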

Example - Convolution (cont’d)
[Figure: successive snapshots of the array holding w4 w3 w2 w1, with the x stream (x6 - x5 - x4 - x3 - x2 - x1) and the y stream (y1 - y2 - y3) each advancing one position per step, elements separated by one-cell gaps]

Example: Matrix Multiplication
$C = A B$, $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$
    c11 c12 c13 c14
    c21 c22 c23 c24
    c31 c32 c33 c34
    c41 c42 c43 c44
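
Example: Matrix Multiplication
Rows of A enter the array skewed by one step per row:
    -   -   -   a14 a13 a12 a11
    -   -   a24 a23 a22 a21 -
    -   a34 a33 a32 a31 -   -
    a44 a43 a42 a41 -   -   -
Array of cells, one per element of C:
    c11 c12 c13 c14
    c21 c22 c23 c24
    c31 c32 c33 c34
    c41 c42 c43 c44
Columns of B enter the array skewed by one step per column:
    |   |   |   b44
    |   |   b43 b34
    |   b42 b33 b24
    b41 b32 b23 b14
    b31 b22 b13 |
    b21 b12 |   |
    b11 |   |   |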
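
The skewed data flow can be simulated cycle by cycle. The sketch below assumes each cell (i, j) keeps its own c value, that A values move one cell to the right per step and B values one cell down, and that 0 is fed in where the figure shows padding; these directions and the function name are assumptions.

```python
def systolic_matmul(a, b):
    """Multiply two n x n matrices on a simulated n x n array of multiply-accumulate cells."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]  # A value currently held in each cell
    b_reg = [[0] * n for _ in range(n)]  # B value currently held in each cell
    for t in range(3 * n - 2):
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]          # A moves right
            k = t - i                                   # row i is delayed by i steps
            a_reg[i][0] = a[i][k] if 0 <= k < n else 0
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]          # B moves down
            k = t - j                                   # column j is delayed by j steps
            b_reg[0][j] = b[k][j] if 0 <= k < n else 0
        for i in range(n):
            for j in range(n):
                c[i][j] += a_reg[i][j] * b_reg[i][j]   # every cell works in parallel
    return c
```

Because of the skew, a[i][k] and b[k][j] meet in cell (i, j) at step i + j + k, so all products for c[i][j] arrive at the same cell over 3n - 2 steps.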

Systolic Computers
Warp (CMU) - 1987
    linear array of 10 or more processing cells
    optimized inter-cell communication for low-latency
    pipelined cells and communication
    conditional execution
    compiler partitions problem into cells and generates microcode
i-Warp (Intel) - 1990
    successor to Warp
    two-dimensional array
    time-multiplexing of physical busses between cells
    32x32 array has 20 Gflops peak performance