
CSE 567 - Autumn 1998 - Misc. Topics - 1

Multiplication Example

multiplicand:         1 1 0 0    12
multiplier:           0 1 0 1     5

4 partial products:

                      1 1 0 0
                    0 0 0 0
                  1 1 0 0
                0 0 0 0
                -------------
                0 1 1 1 1 0 0    60

repeat n times: compute partial product; shift; add

note: each bit of a partial product is just an AND operation
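The example above can be reproduced with a short Python sketch. The function name `partial_products` and its arguments are illustrative, not from the slides; it forms each row bit by bit with an AND, as the note describes, and shifts the row into position.

```python
def partial_products(multiplicand: int, multiplier: int, n: int) -> list[int]:
    """Return the n partial products of two n-bit unsigned operands, shifted into place."""
    rows = []
    for i in range(n):
        m_bit = (multiplier >> i) & 1
        # Each bit of the row is (multiplicand bit) AND (multiplier bit i).
        row = 0
        for j in range(n):
            row |= (((multiplicand >> j) & 1) & m_bit) << j
        rows.append(row << i)  # row i carries weight 2**i
    return rows


rows = partial_products(0b1100, 0b0101, 4)
product = sum(rows)  # adding the shifted rows gives the product
```

With the slide's operands (12 and 5), the four rows are 12, 0, 48 and 0, which add up to 60.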


Sequential Multiplier

one bit of multiplier applied each cycle

2n-bit adder

[Figure: datapath with multiplier register x, multiplicand register y, an adder, and result register z (initially 0)]

z = 0;
repeat n
    if (x[0]) z = z + y;
    x = x >> 1; y = y << 1;
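A minimal Python sketch of the loop above, assuming unsigned n-bit operands. The function name is illustrative; Python integers are unbounded, so the 2n-bit width of y and z is implicit rather than enforced.

```python
def sequential_multiply(x: int, y: int, n: int) -> int:
    """Shift-and-add multiply: x is the multiplier, y the multiplicand."""
    z = 0
    for _ in range(n):
        if x & 1:      # x[0]: lowest remaining multiplier bit
            z = z + y  # this add needs the full 2n-bit width
        x >>= 1        # expose the next multiplier bit
        y <<= 1        # multiplicand moves up one position
    return z
```

For example, `sequential_multiply(0b0101, 0b1100, 4)` walks through the same four steps as the multiplication example and returns 60.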


Sequential Multiplier (cont’d)

one bit of multiplier applied each cycle

n-bit adder

[Figure: datapath with multiplier register x, multiplicand register y, an adder, and result register z]

z = 0;
repeat n
    if (x[0]) z = z + y * 2^n;
    x = x >> 1; z = z >> 1;
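The variant above can be sketched the same way, again assuming unsigned n-bit operands and an illustrative function name. Because y * 2^n has n zero low-order bits, only the upper half of z takes part in each addition, which is why an n-bit adder (plus carry out) suffices.

```python
def sequential_multiply_shift_result(x: int, y: int, n: int) -> int:
    """Shift-and-add multiply that shifts the accumulator instead of the multiplicand."""
    z = 0
    for _ in range(n):
        if x & 1:
            z = z + (y << n)  # add y into the upper half of z
        x >>= 1
        z >>= 1               # low bit shifted out here is always 0 before the last n shifts finish
    return z
```

With x = 5, y = 12 and n = 4, z takes the values 96, 48, 120 and 60 after the four iterations.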


Parallelism in hardware

Fine-grained - bit level
    e.g., carry-select, carry-lookahead adder

Pipelining
    same number of functional units
    different latency, but increased throughput
    less work per clock cycle

Coarse-grained - data-path level
    e.g., multiple arithmetic units
    multi-port register files (read/write from different sources/destinations)

Processor level
    difficult to take advantage of many levels of parallelism in fixed general-purpose processors
    much easier when the processors are special-purpose, e.g., systolic computations


Bit level parallelism

Exploit ability to do necessary bit-level computations directly
    exploit redundant logic
    goal - keep all circuits busy, reduce critical path

Examples
    carry-lookahead adder
    carry-select adder
    multipliers


Combinational Multipliers

Use AND gates to generate all partial products in parallel

[Figure: AND-gate array fed by the multiplier and multiplicand (LSBs marked), producing partial-product rows 1 1 0, 1 1 0, 0 0 0]


Combinational Multipliers (cont'd)

Skew array to send partial products along diagonal and make it square

[Figure: the partial-product array (rows 1 1 0, 1 1 0, 0 0 0) skewed into a square; LSBs marked]


Combinational Multipliers (cont'd)

Ripple-carry adder in each row (carries ripple right to left)

Sums ripple down (shifted one to right)

worst-case delay is 3n

[Figure: array of full adders (inputs A, B, Cin; outputs Cout, S) with a ripple-carry adder in each row; LSBs marked]


Using Carry-Save

Forward carries to next row of adders

CLA at the end to add last partial product and forwarded carries

no need to optimize carry more than sum

using CLA for final stage makes this faster than previous multiplier (worst-case is 2n)

[Figure: array of full adders (inputs A, B, Cin; outputs Cout, S) with carries forwarded to the next row and a CLA as the final stage; LSBs marked]


Combinational Multipliers (cont'd)

[Figure: carry-save adder stages with inputs labeled x1, x2 and partial products]

Carry-save adder is a 3-2 adder:
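As a rough word-level illustration of a 3-2 adder, the sketch below reduces three operands to a sum word and a carry word without propagating any carry between bit positions. The function name is an assumption made for this example.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """Reduce three operands to (sum_word, carry_word) with a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c                             # per-bit sum output of a full adder
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # per-bit carry output, moved up one position
    return sum_word, carry_word


s, carry = carry_save_add(0b1100, 0b0101, 0b0110)
```

Every bit position is handled by an independent full adder, so the delay does not grow with the word width; a conventional adder is still needed once at the end to combine the two output words.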


Wallace Tree Multiplier

Use tree structure to reduce number of additions in critical path to O(log n) rather than O(n)

Difficult structure to lay out and integrate with partial product crossbar

Wiring constraints make it unattractive in many technologies

[Figure: partial products PP0 through PP8 combined by a tree of adders, followed by a CLA that produces the Result]
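The logarithmic depth can be seen in a small word-level sketch. This is a simplification that assumes whole rows are reduced three at a time with 3-2 adders; it is not a gate-level Wallace tree, and the function names are illustrative. The final `sum` stands in for the CLA.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3-2 adder on whole words: returns (sum_word, carry_word)."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def tree_multiply(x: int, y: int, n: int) -> tuple[int, int]:
    """Multiply unsigned n-bit x by y; return (product, number of 3-2 adder levels)."""
    rows = [(y << i) if (x >> i) & 1 else 0 for i in range(n)]  # partial products
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3   # rows that form complete groups of three
        reduced = []
        for k in range(0, full, 3):
            reduced.extend(carry_save_add(rows[k], rows[k + 1], rows[k + 2]))
        reduced.extend(rows[full:])        # leftover rows pass to the next level
        rows = reduced
        levels += 1
    return sum(rows), levels               # final carry-propagate addition
```

With nine partial products, as in the figure, the row count goes 9, 6, 4, 3, 2, so `tree_multiply(300, 411, 9)` needs four levels of 3-2 adders before the final addition.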


Binary Tree Multipliers

Problem with Wallace tree is 3:2 column reduction
    need 2:1 reduction for binary tree

One solution: signed-digit binary trees
    represent digits as 0, 1, -1
    similar to Booth's encoding

    digits added      result digits
     1 + -1            0  0
     1 +  1            1  0
     0 +  0            0  0
    -1 + -1           -1  0
     1 +  0            1 -1    if x >= 0 and y >= 0
                       0  1    otherwise
    -1 +  0            0 -1    if x >= 0 and y >= 0
                      -1  1    otherwise


Booth's Algorithm

Take care of (retire) more than one bit per shift operation

Example: shift two bits at a time

Booth recoding table:

    i+1   i   i-1     add
     0    0    0      0*M
     0    0    1      1*M
     0    1    0      1*M
     0    1    1      2*M
     1    0    0     -2*M
     1    0    1     -1*M
     1    1    0     -1*M
     1    1    1      0*M

must be able to add multiplicand times 0, -1, -2, 1, and 2

Booth recoding steps:

    0  0  1  1  0  1        13
    1  1  1  0  1  0        -6
    0  0 -1  1 -1  0
       0    -1    -2

    1 1 1 1 1 1 1 0 0 1 1 0
    1 1 1 1 1 1 0 0 1 1
    0 0 0 0 0 0 0 0
    -----------------------
    1 1 1 1 1 0 1 1 0 0 1 0    -78
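The recoding table can be exercised with a short sketch, assuming an n-bit two's-complement multiplier with n even. The function names are illustrative; the dictionary is the table from this slide, indexed by bits (i+1, i, i-1).

```python
BOOTH_TABLE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_digits(multiplier: int, n: int) -> list[int]:
    """Recode an n-bit two's-complement multiplier into digits in {-2..2}, least significant first."""
    bits = (multiplier & ((1 << n) - 1)) << 1  # append the implicit 0 below the LSB
    return [BOOTH_TABLE[(bits >> i) & 0b111] for i in range(0, n, 2)]


def booth_multiply(multiplicand: int, multiplier: int, n: int) -> int:
    """Add one multiple of the multiplicand per recoded digit; each digit retires two bits."""
    return sum(d * multiplicand * 4 ** k
               for k, d in enumerate(booth_digits(multiplier, n)))
```

For the multiplier -6 in six bits, `booth_digits(-6, 6)` gives -2, -1, 0 (least significant digit first), so `booth_multiply(13, -6, 6)` adds -2 * 13 and -1 * 13 * 4 to reach -78 in three steps rather than six.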


Register Transfer

Registers have input and output
    output can be fanned out to many destinations
    input can come from many sources
    multiplexer needed on input to select which

[Figure: register with a multiplexer on its input; inputs from other registers, outputs to other registers, control signals to choose input source]


Connecting Registers

Multiplexers: lots of control signals but full parallelism of transfers

Busses


Pipelining

Adding registers along a path
    split combinational logic into multiple cycles
    each cycle smaller than previously
    T_old C_old > T_new C_new

increase throughput


Pipelining

Delay, d, of slowest combinational stage determines performance

Throughput = 1/d – rate at which outputs are produced

Latency = n•d – number of stages * clock period

Pipelining increases circuit utilization

Registers slow down data, synchronize data paths

Wave-pipelining
    no pipeline registers - waves of data flow through circuit
    relies on equal-delay circuit paths - no short paths
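The throughput and latency formulas on this slide can be checked with a few lines of Python. The stage delays in the example are made-up numbers, and register overhead (propagation delay and setup time) is ignored in this sketch.

```python
def pipeline_metrics(stage_delays: list[float]) -> tuple[float, float]:
    """Return (throughput, latency) for a pipeline with the given combinational stage delays."""
    d = max(stage_delays)  # slowest stage sets the clock period
    n = len(stage_delays)  # number of stages
    return 1 / d, n * d


throughput, latency = pipeline_metrics([3, 7, 7, 7])  # hypothetical stage delays
```

Here the clock period is 7, so one result emerges every 7 time units while each input takes 28 time units to pass through. The unpipelined logic would take 24 per result, which shows the trade: latency rises slightly while throughput improves.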


When and How to Pipeline?

Where is the best place to add registers?
    splitting combinational logic
    overhead of registers (propagation delay and setup time requirements)

What about cycles in data path?

Example: 16-bit adder, add 8-bits in each of two cycles
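One way to picture the 16-bit adder example is the sketch below, which assumes the low byte is added in the first cycle and the high byte in the second, with the carry held in a register in between. The function name and return convention are illustrative. The saved carry is the cycle in the data path: the second step cannot start until the first has produced it.

```python
def add16_in_two_cycles(a: int, b: int) -> tuple[int, int]:
    """Add two 16-bit values with an 8-bit adder used twice; return (16-bit sum, carry out)."""
    mask8 = 0xFF
    # Cycle 1: low bytes; the carry out is stored for the next cycle.
    low = (a & mask8) + (b & mask8)
    carry_reg = low >> 8
    # Cycle 2: high bytes plus the stored carry.
    high = ((a >> 8) & mask8) + ((b >> 8) & mask8) + carry_reg
    return ((high & mask8) << 8) | (low & mask8), high >> 8
```

For instance, `add16_in_two_cycles(0x00FF, 0x0001)` produces a carry in the first cycle and returns `(0x0100, 0)`.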


Retiming

Process of optimally distributing registers throughout a circuit
    minimize the clock period
    minimize the number of registers


Retiming (cont’d)

Fast optimal algorithm (Leiserson & Saxe 1983)

Retiming rules:
    remove one register from each input and add one to each output
    remove one register from each output and add one to each input
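The two rules can be written as one operation on a graph whose edges carry register counts. The sketch assumes a representation chosen for this example (a dictionary from `(source, destination)` pairs to register counts) and a hypothetical `retime` helper; a lag of -1 applies the first rule to a node and a lag of +1 applies the second.

```python
def retime(edges: dict[tuple[str, str], int], node: str, lag: int) -> dict[tuple[str, str], int]:
    """Move registers across `node`: edges into it gain `lag` registers, edges out of it lose `lag`."""
    result = {}
    for (u, v), regs in edges.items():
        new_regs = regs + (lag if v == node else 0) - (lag if u == node else 0)
        if new_regs < 0:
            raise ValueError("retiming would leave a negative register count")
        result[(u, v)] = new_regs
    return result


# Hypothetical circuit: block f has registers on both inputs and none on its output.
circuit = {("a", "f"): 1, ("b", "f"): 1, ("f", "c"): 0}
moved = retime(circuit, "f", -1)  # one register leaves each input and appears on the output
```

The number of registers on any path that passes through the node is unchanged, which is why the circuit's behavior at its boundaries is preserved while the combinational delay between registers shifts.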


Optimal Pipelining

Add registers - use retiming to find optimal location

[Figure: two circuit diagrams; extracted labels: 871310, 56]


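Example - Digital Correlator

$y_t = \delta(x_t, a_0) + \delta(x_{t-1}, a_1) + \delta(x_{t-2}, a_2) + \delta(x_{t-3}, a_3)$

$\delta(x_t, a_0) = 0$ if $x_t \neq a_0$, 1 otherwise (and passes x along to the right)

[Figure: host, comparators holding a0 a1 a2 a3 fed by x_t, and three adders producing y_t]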
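A behavioral sketch of the correlator, assuming the comparison counts matches (each comparator contributes 1 when its input equals its stored value and 0 otherwise). The function name and the sample data are made up; indices follow the formula, with `xs[t - k]` compared against `a[k]`.

```python
def correlate(xs: list[int], a: list[int]) -> list[int]:
    """Return y_t for every t at which a full window of inputs is available."""
    taps = len(a)
    ys = []
    for t in range(taps - 1, len(xs)):
        # One comparator per tap; the adders sum the comparator outputs.
        ys.append(sum(1 if xs[t - k] == a[k] else 0 for k in range(taps)))
    return ys


ys = correlate([1, 0, 1, 1, 0], [1, 1, 0, 1])
```

With these inputs the first window matches all four stored values and the second matches only one, so `ys` is `[4, 1]`.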


Example - Digital Correlator (cont’d)

Delays: adder, 7; comparator, 3; host, 0

[Figure: correlator circuit, cycle time = 24]

[Figure: correlator circuit with registers repositioned, cycle time = 13]


Pipelined Multipliers

Pipelining can be applied to any of the combinational multipliers

FF at every intersection of pipe stage and wire

[Figure: two adder-tree multipliers, each ending in a CLA]


Example - Sorting

Comparator: inputs A, B; outputs H, L

[Figure: Parallel Sorter]
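The sorter's wiring is not recoverable from the extracted text, so the sketch below uses odd-even transposition sort as a stand-in network built from the same two-input comparator. It assumes H and L denote the higher and lower of the two inputs; all names are illustrative.

```python
def comparator(a: int, b: int) -> tuple[int, int]:
    """Two-input comparator: returns (H, L), the larger and the smaller input."""
    return (a, b) if a >= b else (b, a)


def transposition_sort(values: list[int]) -> list[int]:
    """Sort ascending with n rounds of comparators on alternating neighbor pairs."""
    v = list(values)
    n = len(v)
    for step in range(n):
        # All comparators in one round touch disjoint pairs, so in hardware they run in parallel.
        for i in range(step % 2, n - 1, 2):
            high, low = comparator(v[i], v[i + 1])
            v[i], v[i + 1] = low, high
    return v
```

Each round corresponds to one column of comparators in a network drawing; placing registers between rounds would give a pipelined version of this particular network.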


Example - Sorting (cont’d)

Pipelined


Pipelined Sorter (cont’d)


Better Sorter


Sequential Sorter


Systolic Arrays

Set of identical processing elements
    specialized or programmable

Efficient nearest-neighbor interconnections (in 1-D, 2-D, other)

SIMD-like

Multiple data flows, converging to engage in computation

Analogy: data flowing through the system in a rhythmic fashion – from main memory through a series of processing elements and back to main memory


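Example - Convolution

$y_j = x_j w_1 + x_{j+1} w_2 + \dots + x_{j+n-1} w_n$

$y_1 = x_1 w_1 + x_2 w_2 + x_3 w_3 + x_4 w_4$

$y_2 = x_2 w_1 + x_3 w_2 + x_4 w_3 + x_5 w_4$

$y_3 = x_3 w_1 + x_4 w_2 + x_5 w_3 + x_6 w_4$

. . . .

[Figure: linear array holding weights w4 w3 w2 w1, with input stream "- x3 - x2 - x1" and output stream "- - - y1 - y2 - y3 -"]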
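For reference, the formula can be evaluated directly. This sketch computes what the array is meant to produce, not how the data moves through it; the function name and sample values are made up, and the lists are 0-indexed, so `xs[0]` plays the role of x1 and `ws[0]` the role of w1.

```python
def convolve(xs: list[int], ws: list[int]) -> list[int]:
    """Return every y_j for which all n inputs x_j .. x_{j+n-1} are available."""
    n = len(ws)
    return [sum(xs[j + i] * ws[i] for i in range(n))
            for j in range(len(xs) - n + 1)]


ys = convolve([1, 2, 3, 4, 5, 6], [1, 0, 2, 1])
```

Six inputs and four weights give exactly the three outputs y1, y2 and y3 listed on the slide; with these sample values they are 11, 15 and 19.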


Example - Convolution (cont’d)

[Figure: successive snapshots of the x stream (x6 – x5 – x4 – x3 – x2 – x1) and the y stream (– – – y1 – y2 – y3) advancing through the array holding weights w4 w3 w2 w1]


Example: Matrix Multiplication

$C = A B$, $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$

c11 c12 c13 c14
c21 c22 c23 c24
c31 c32 c33 c34
c41 c42 c43 c44


Example: Matrix Multiplication

A input streams (skewed):

–   –   –   a14 a13 a12 a11
–   –   a24 a23 a22 a21 –
–   a34 a33 a32 a31 –   –
a44 a43 a42 a41 –   –   –

Array of cells:

c11 c12 c13 c14
c21 c22 c23 c24
c31 c32 c33 c34
c41 c42 c43 c44

B input streams (skewed):

|   |   |   b44
|   |   b43 b34
|   b42 b33 b24
b41 b32 b23 b14
b31 b22 b13 |
b21 b12 |   |
b11 |   |   |
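The effect of the skewed input streams can be simulated in a few lines. The sketch assumes each cell keeps its own c_ij and that row i of A and column j of B are each delayed by i and j steps, as the staggered streams suggest; under that assumption a_ik and b_kj meet in cell (i, j) at step i + j + k. The function name is illustrative and indices are 0-based.

```python
def systolic_matmul(A: list[list[int]], B: list[list[int]]) -> list[list[int]]:
    """Simulate an n x n array of multiply-accumulate cells fed with skewed operands."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):          # last operands meet at step 3(n-1)
        for i in range(n):
            for j in range(n):
                k = t - i - j           # which operand pair reaches cell (i, j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C


C = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The whole product is finished after 3n - 2 steps, and within a step every cell works on a different operand pair, which is the point of skewing the inputs.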


Systolic Computers

Warp (CMU) - 1987
    linear array of 10 or more processing cells
    optimized inter-cell communication for low-latency
    pipelined cells and communication
    conditional execution
    compiler partitions problem into cells and generates microcode

i-Warp (Intel) - 1990
    successor to Warp
    two-dimensional array
    time-multiplexing of physical busses between cells
    32x32 array has 20 Gflops peak performance