Systolic Architectures: Why is RC fast? Greg Stitt ECE Department University of Florida.


Systolic Architectures: Why is RC fast?

Greg Stitt

ECE Department

University of Florida

Page 2

Why are microprocessors slow?

Von Neumann architecture
  “Stored-program” machine
  Memory for instructions (and data)

Page 3

Von Neumann architecture

Summary
  1) Fetch instruction
  2) Decode instruction, fetch data
  3) Execute
  4) Store results
  5) Repeat from 1 until end of program

Problem: inherently sequential
  Only executes one instruction at a time
  Does not take into consideration parallelism of application

Page 4

Von Neumann architecture

Problem 2: Von Neumann bottleneck
  Constantly reading/writing data for every instruction requires high memory bandwidth
  Performance limited by bandwidth of memory

[Figure: RAM connected to Control and Datapath. Bandwidth not sufficient - “Von Neumann bottleneck”.]

Page 5

Improvements

Increase resources in datapath to execute multiple instructions in parallel
  VLIW - very long instruction word
    Compiler encodes parallelism into “very-long” instructions
  Superscalar
    Architecture determines parallelism at run time - out-of-order instruction execution

Von Neumann bottleneck is still a problem

[Figure: RAM connected to Control and several Datapaths.]

Page 6

Why is RC fast?

RC (reconfigurable computing) implements custom circuits for an application
  Circuits can exploit massive amount of parallelism

VLIW/Superscalar parallelism
  ~5 ins/cycle in best case (rarely occurs)

RC
  Potentially thousands of ops per cycle
  As many ops as will fit in device
  Also, supports different types of parallelism

Page 7

Types of Parallelism

Bit-level

C code for bit reversal:

    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

Binary after compilation:

    sll $v1[3],$v0[2],0x10
    srl $v0[2],$v0[2],0x10
    or  $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x8
    and $v1[3],$v1[3],$t5[13]
    sll $v0[2],$v0[2],0x8
    and $v0[2],$v0[2],$t4[12]
    or  $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x4
    and $v1[3],$v1[3],$t3[11]
    sll $v0[2],$v0[2],0x4
    and $v0[2],$v0[2],$t2[10]
    ...

Processor: requires between 32 and 128 cycles

[Figure: circuit for bit reversal, mapping the original X value to the bit-reversed X value.]

FPGA: requires only 1 cycle (speedup of 32x to 128x) for same clock
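The contrast on this page can be checked in software. The sketch below is illustrative only: the function names are hypothetical, and it assumes a 32-bit unsigned value (the slide does not give the type of x; an unsigned type is needed so that right shifts bring in zeros). The second function states the mapping directly, output bit i equals input bit 31 - i, which a circuit can express as a fixed connection per bit.

```c
#include <stdint.h>

/* Swap-based reversal as written on the slide, for a 32-bit unsigned x. */
uint32_t reverse_bits_swap(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Reference version: output bit i is input bit 31 - i. */
uint32_t reverse_bits_direct(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r |= ((x >> (31 - i)) & 1u) << i;
    }
    return r;
}
```

Both functions return the same value for any input; the software versions need many instructions, while the mapping itself involves no arithmetic at all.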

Page 8

Types of Parallelism

Arithmetic-level Parallelism

C code:

    for (i=0; i < 128; i++)
        y += c[i] * x[i];
    ...

Processor: 1000’s of instructions, several thousand cycles

[Figure: circuit with a row of multipliers feeding a tree of adders.]

FPGA: ~7 cycles, speedup > 100x for same clock
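As a rough software model of the multiplier row and adder tree, the following sketch reduces the 128 products pairwise, level by level, and counts the adder levels. The name `dot_tree` and the use of int inputs are assumptions made for illustration; the slides do not specify data types.

```c
#include <stddef.h>

#define N 128

/* Dot product computed in the order the circuit would use: all products
   first, then pairwise sums level by level. Stores the number of adder
   levels in *levels and returns the sum. */
long dot_tree(const int c[N], const int x[N], int *levels)
{
    long v[N];
    for (size_t i = 0; i < N; i++) {
        v[i] = (long)c[i] * x[i];          /* 128 independent multiplies */
    }

    int depth = 0;
    for (size_t width = N; width > 1; width /= 2) {
        for (size_t i = 0; i < width / 2; i++) {
            v[i] = v[2 * i] + v[2 * i + 1]; /* adds within a level are independent */
        }
        depth++;
    }
    *levels = depth;
    return v[0];
}
```

For 128 inputs the loop runs through 7 adder levels (64, 32, 16, 8, 4, 2, 1 adders), which is in line with the roughly 7 cycles quoted above if each level takes one cycle.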

Page 9

Types of Parallelism

Pipeline Parallelism

    for (j=0; j < n; j++) {
        y = a[j];
        x = b[j];
        for (i=0; i < 128; i++)
            y += c[i] * x[i];
        // output y
        y = 0;
    }

[Figure: the same row of multipliers and tree of adders, now pipelined.]

Start new inner loop every cycle

After filling up pipeline, performs 128 mults + 127 adds every cycle

Page 10

Types of Parallelism

Task-level Parallelism
  e.g. MPEG-2
  Execute each task in parallel

Page 11

How to exploit parallelism?

General Idea
  Identify tasks
  Create circuit for each task
  Communication between tasks with buffers

How to create circuit for each task?
  Want to exploit bit-level, arithmetic-level, and pipeline-level parallelism
  Solution: Systolic architectures (arrays/computing)

Page 12

Systolic Architectures

Systolic definition
  The rhythmic contraction of the heart, especially of the ventricles, by which blood is driven through the aorta and pulmonary artery after each dilation or diastole.

Analogy with heart pumping blood
  We want architecture that pumps data through efficiently.
  Data flows from memory in a rhythmic fashion, passing through many processing elements before it returns to memory. [Hung]

Page 13

Systolic Architecture

General Idea: Fully pipelined circuit, with I/O at top and bottom level
  Local connections - each element communicates with elements at same level or level below

Inputs arrive each cycle

Outputs depart each cycle, after pipeline is full

Page 14

Systolic Architecture

Simple Example
  Create DFG (data flow graph) for body of loop
  Represent data dependencies of code

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

[Figure: DFG. b[i] and b[i+1] feed the first adder; its result and b[i+2] feed the second adder, which produces a[i].]

Page 15

Simple Example

Add pipeline stages to each level of DFG

[Figure: the same DFG (inputs b[i], b[i+1], b[i+2]; two adders; output a[i]) with a pipeline stage at each level.]

Page 16

Simple Example

Allocate one resource (adder, ALU, etc) for each operation in DFG

Resulting systolic architecture:

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 1
  Inputs: b[0] b[1] b[2]

Page 17

Simple Example

Allocate one resource for each operation in DFG

Resulting systolic architecture:

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 2
  Inputs: b[1] b[2] b[3]
  After first adder: b[0]+b[1], b[2]

Page 18

Simple Example

Allocate one resource for each operation in DFG

Resulting systolic architecture:

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 3
  Inputs: b[2] b[3] b[4]
  After first adder: b[1]+b[2], b[3]
  After second adder: b[0]+b[1]+b[2]

Page 19

Simple Example

Allocate one resource for each operation in DFG

Resulting systolic architecture:

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 4
  Inputs: b[3] b[4] b[5]
  After first adder: b[2]+b[3], b[4]
  After second adder: b[1]+b[2]+b[3]
  Output: a[0]

First output appears, takes 4 cycles to fill pipeline

Page 20

Simple Example

Allocate one resource for each operation in DFG

Resulting systolic architecture:

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 5
  Inputs: b[4] b[5] b[6]
  After first adder: b[3]+b[4], b[5]
  After second adder: b[2]+b[3]+b[4]
  Output: a[1]

One output per cycle at this point, 99 more until completion

Total Cycles => 4 init + 99 = 103
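The cycle count can be reproduced with a small cycle-by-cycle model. This is a sketch under assumptions: the register levels (input, after first adder, after second adder) follow the walkthrough on these pages, while the names `slot_t` and `simulate_pipeline` and the valid flag are illustrative additions.

```c
#include <stdbool.h>

#define ITER 100

/* One register level: a valid flag plus up to three data values. */
typedef struct { bool valid; int x, y, z; } slot_t;

/* Models the two-adder pipeline for a[i] = b[i] + b[i+1] + b[i+2].
   Fills a[0..ITER-1] and returns the number of cycles used. */
int simulate_pipeline(const int b[ITER + 2], int a[ITER])
{
    slot_t in = {0}, s1 = {0}, s2 = {0};
    int issued = 0, written = 0, cycle = 0;

    while (written < ITER) {
        cycle++;
        /* Update back to front so each level sees last cycle's values. */
        if (s2.valid) {
            a[written++] = s2.x;                         /* output level */
        }
        s2 = (slot_t){ s1.valid, s1.x + s1.y, 0, 0 };    /* second adder */
        s1 = (slot_t){ in.valid, in.x + in.y, in.z, 0 }; /* first adder + delay register */
        if (issued < ITER) {
            in = (slot_t){ true, b[issued], b[issued + 1], b[issued + 2] };
            issued++;
        } else {
            in.valid = false;                            /* no more inputs to stream */
        }
    }
    return cycle;
}
```

In this model a[0] is written in cycle 4 and one further output follows each cycle, so the function returns 103 for 100 iterations.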

Page 21

uP Performance Comparison

Assumptions:
  10 instructions for loop body
  CPI = 1.5
  Clk 10x faster than FPGA

Total SW cycles: 100*10*1.5 = 1,500 cycles

RC Speedup: (1500/103)*(1/10) = 1.46x

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Page 22

uP Performance Comparison

What if uP clock is 15x faster?
  e.g. 3 GHz vs. 200 MHz
  RC Speedup: (1500/103)*(1/15) = 0.97x
  RC is slightly slower

But!
  RC requires much less power
    Several Watts vs ~100 Watts
  SW may be practical for embedded uPs => low power
    Clock may be just 2x faster
    (1500/103)*(1/2) = 7.3x faster
  RC may be cheaper
    Depends on area needed
    This example would certainly be cheaper

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];
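The speedup arithmetic on these two pages follows one formula, sketched here in C. The function names are hypothetical; `clock_ratio` stands for the uP clock divided by the FPGA clock.

```c
/* Software cycle estimate: iterations * instructions per iteration * CPI. */
double software_cycles(double iterations, double instr_per_iter, double cpi)
{
    return iterations * instr_per_iter * cpi;
}

/* RC speedup as computed on the slides:
   (software cycles / hardware cycles) * (1 / clock_ratio). */
double rc_speedup(double sw, double hw, double clock_ratio)
{
    return (sw / hw) / clock_ratio;
}
```

With sw = 1500 and hw = 103, a clock ratio of 10 gives about 1.46, a ratio of 15 gives about 0.97, and a ratio of 2 gives about 7.3.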

Page 23

Simple Example, Cont.

Improvement to systolic array
  Why not execute multiple iterations at same time?
    No data dependencies
  Loop unrolling

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

[Figure: unrolled DFG. One copy of the two-adder DFG computes a[i] from b[i], b[i+1], b[i+2]; the next copy computes a[i+1] from b[i+1], b[i+2], b[i+3]; and so on.]

Page 24

Simple Example, Cont.

How much to unroll?
  Limited by memory bandwidth and area
    Must get all inputs once per cycle
    Must write all outputs once per cycle
    Must be sufficient area for all ops in DFG

[Figure: unrolled DFG with copies computing a[i] from b[i], b[i+1], b[i+2] and a[i+1] from b[i+1], b[i+2], b[i+3], and so on.]

Page 25

Unrolling Example

Original circuit

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

[Figure: two-adder circuit with inputs b[0], b[1], b[2] and output a[0].]

1st iteration requires 3 inputs

Page 26

Unrolling Example

Original circuit

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

[Figure: the original two-adder circuit plus a second copy. Inputs b[0], b[1], b[2], b[3]; outputs a[0], a[1].]

Each unrolled iteration requires one additional input

Page 27

Unrolling Example

Original circuit

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

[Figure: the two copies one cycle later. Inputs b[1], b[2], b[3], b[4]; after the first adders the circuit holds b[0]+b[1], b[2] and b[1]+b[2], b[3].]

Each cycle brings in 4 inputs (instead of 6)
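The input sharing is easy to see when the loop is unrolled by two in C. This is only a sketch: the function names and the int element type are assumptions, and b is assumed to hold 102 elements so that b[i+3] stays in range.

```c
#define N 100

/* Original loop: one output per iteration, three loads. */
void sum3(const int b[N + 2], int a[N])
{
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + b[i + 1] + b[i + 2];
    }
}

/* Unrolled by 2: two outputs per iteration from four loads, because
   b[i+1] and b[i+2] are used by both copies of the loop body. */
void sum3_unroll2(const int b[N + 2], int a[N])
{
    for (int i = 0; i < N; i += 2) {
        int b0 = b[i], b1 = b[i + 1], b2 = b[i + 2], b3 = b[i + 3];
        a[i]     = b0 + b1 + b2;
        a[i + 1] = b1 + b2 + b3;
    }
}
```

Both functions produce the same a[]; the unrolled one reads 4 distinct elements per pair of outputs rather than 6.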

Page 28

Performance after unrolling

How much unrolling?
  Assume b[] elements are 8 bits
  First iteration requires 3 elements = 24 bits
  Each unrolled iteration requires 1 element = 8 bits
    Due to overlapping inputs
  Assume memory bandwidth = 64 bits/cycle
    Can perform 6 iterations in parallel
    (24 + 8 + 8 + 8 + 8 + 8) = 64 bits

New performance
  Unrolled systolic architecture requires 4 cycles to fill pipeline, 100/6 iterations
    ~ 21 cycles
  With unrolling, RC is (1500/21)*(1/15) = 4.8x faster than 3 GHz microprocessor!!!

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Page 29

Importance of Memory Bandwidth

Performance with wider memories
  128-bit bus
    14 iterations in parallel
      64 extra bits / 8 bits per iteration = 8 parallel iterations
      + 6 original unrolled iterations = 14 total parallel iterations
    Total cycles = 4 to fill pipeline + 100/14 = ~11
    Speedup: (1500/11)*(1/15) = 9.1x
  Doubling memory width increased speedup from 4.8x to 9.1x!!!

Important Point
  Performance of hardware often limited by memory bandwidth
  More bandwidth => more unrolling => more parallelism => BIG SPEEDUP
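The unrolling limit and the cycle estimates above can be written as two small functions. The names are hypothetical, and the cycle estimate deliberately uses the same approximation as the slides (fill cycles plus iterations divided by the unroll factor, not rounded up to whole cycles).

```c
/* Iterations that fit in one memory word when consecutive iterations
   overlap: the first needs first_bits, each further one needs extra_bits. */
int max_unroll(int bus_bits, int first_bits, int extra_bits)
{
    if (bus_bits < first_bits) {
        return 0;   /* not even one iteration's inputs fit per cycle */
    }
    return 1 + (bus_bits - first_bits) / extra_bits;
}

/* Cycle estimate used on the slides: pipeline fill + iterations / unroll. */
double unrolled_cycles(int fill_cycles, int iterations, int unroll)
{
    return fill_cycles + (double)iterations / unroll;
}
```

For 8-bit elements, `max_unroll(64, 24, 8)` gives 6 and `max_unroll(128, 24, 8)` gives 14; the matching estimates are about 20.7 and 11.1 cycles, which the slides round to ~21 and ~11. A real circuit needs a whole number of cycles, so the exact counts would be slightly higher.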

Page 30

Delay Registers

Common mistake
  Forgetting to add registers for values not used during a cycle
  Values “delayed” or passed on until needed

[Figure: two versions of the two-adder circuit, one labeled Incorrect (no delay register on the value that skips the first adder) and one labeled Correct (with the delay register).]

Page 31

Delay Registers

Illustration of incorrect delays

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 1
  Inputs: b[0] b[1] b[2]

Page 32

Delay Registers

Illustration of incorrect delays

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 2
  Inputs: b[1] b[2] b[3]
  After first adder: b[0]+b[1]
  After second adder: b[2] + ?????

Page 33

Delay Registers

Illustration of incorrect delays

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 3
  Inputs: b[2] b[3] b[4]
  After first adder: b[1]+b[2]
  After second adder: b[0] + b[1] + b[3]
  Output: b[2] + ?????
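The effect of the missing register can be modeled for a single iteration. This sketch assumes, as in the illustration, that without the delay register the second adder takes its operand straight from the input level one cycle late; `pipeline_output` and its parameters are illustrative names.

```c
#include <stdbool.h>

/* Value the pipeline produces for iteration i of
   a[i] = b[i] + b[i+1] + b[i+2], with or without the delay register.
   b must be valid up to index i + 3. */
int pipeline_output(const int *b, int i, bool use_delay)
{
    /* Cycle t: the input level presents b[i], b[i+1], b[i+2]. */
    int stage1_sum = b[i] + b[i + 1];   /* registered after the first adder */
    int delayed    = b[i + 2];          /* held by the delay register */

    /* Cycle t+1: the input level has moved on to iteration i+1,
       so its third input is now b[i+3]. */
    int third_input_now = b[i + 3];

    int operand = use_delay ? delayed : third_input_now;
    return stage1_sum + operand;        /* second adder */
}
```

With b = {1, 2, 4, 8, ...}, iteration 0 yields 7 with the delay register and 11 (b[0] + b[1] + b[3]) without it.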

Page 34

Another Example

Your turn

    short b[1004], a[1000];
    for (i=0; i < 1000; i++)
        a[i] = avg( b[i], b[i+1], b[i+2], b[i+3], b[i+4] );

Steps
  Build DFG for body of loop
  Add pipeline stages
  Map operations to hardware resources
    Assume divide takes one cycle
  Determine maximum amount of unrolling
    Memory bandwidth = 128 bits/cycle
  Determine performance compared to uP
    Assume 15 instructions per iteration, CPI = 1.5, CLK = 15x faster than RC

Page 35

Another Example, Cont.

What if divider takes 20 cycles?
  But, fully pipelined
  Calculate the effect on performance

In systolic architectures, performance usually dominated by throughput of pipeline, not latency

Page 36

Dealing with Dependencies

op2 is dependent on op1 when the input to op2 is an output from op1

Problem: limits arithmetic parallelism, increases latency
  i.e. Can’t execute op2 before op1

Serious Problem: FPGAs rely on parallelism for performance
  Little parallelism = Bad performance

[Figure: op1 feeding op2.]

Page 37

Dealing with Dependencies

Partial solution
  Parallelizing transformations
    e.g. tree height reduction

[Figure: two ways to add a, b, c, d. Left: a chain of three adders, Depth = # of adders. Right: a balanced tree in which two adders feed a third, Depth = ceil( log2( # of inputs ) ).]
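The two depths can be compared for any number of inputs with a short sketch. The function names are hypothetical, and the tree is assumed to be as balanced as possible.

```c
/* Adder levels on the critical path when n values are summed as a chain:
   ((a + b) + c) + d ... */
int chain_depth(int n)
{
    return n - 1;
}

/* Adder levels when the same n values are summed as a balanced tree:
   (a + b) + (c + d) ...  This equals ceil(log2(n)). */
int tree_depth(int n)
{
    int depth = 0;
    for (int width = 1; width < n; width *= 2) {
        depth++;
    }
    return depth;
}
```

For the four inputs in the figure, the chain has depth 3 and the tree has depth 2; for 128 inputs the difference is 127 levels versus 7.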

Page 38

Dealing with Dependencies

Simple example w/ inter-iteration dependency - potential problem for systolic arrays
  Can’t keep pipeline full

    a[0] = 0;
    for (i=1; i < 8; i++)
        a[i] = b[i] + b[i+1] + a[i-1];

[Figure: two copies of the two-adder circuit. The first takes b[1], b[2], a[0]; the second takes b[2], b[3], a[1].]

The second copy can’t execute until 1st iteration completes - limited arithmetic parallelism, increases latency

Page 39

Dealing with Dependencies

But, systolic arrays also have pipeline-level parallelism - latency less of an issue

    a[0] = 0;
    for (i=1; i < 8; i++)
        a[i] = b[i] + b[i+1] + a[i-1];

[Figure: two copies of the two-adder circuit with inputs b[1], b[2], a[0] and b[2], b[3], a[1].]

Page 40

Dealing with Dependencies

But, systolic arrays also have pipeline-level parallelism - latency less of an issue

    a[0] = 0;
    for (i=1; i < 8; i++)
        a[i] = b[i] + b[i+1] + a[i-1];

[Figure: the first copy (inputs b[1], b[2], a[0]) produces a[1], which feeds the second copy (inputs b[2], b[3]).]

Page 41

Dealing with Dependencies

But, systolic arrays also have pipeline-level parallelism - latency less of an issue

    a[0] = 0;
    for (i=1; i < 8; i++)
        a[i] = b[i] + b[i+1] + a[i-1];

[Figure: the chain continues. The first copy (b[1], b[2], a[0]) produces a[1] for the second (b[2], b[3]), which produces a[2] for the third (b[3], b[4]), and so on.]

Page 42

Dealing with Dependencies

But, systolic arrays also have pipeline-level parallelism - latency less of an issue

    a[0] = 0;
    for (i=1; i < 8; i++)
        a[i] = b[i] + b[i+1] + a[i-1];

[Figure: the chained copies (b[1], b[2], a[0]) -> a[1], (b[2], b[3]) -> a[2], (b[3], b[4]) -> ..., outputs not shown.]

Add pipeline stages => systolic array

Only works if loop is fully unrolled!
  Requires sufficient memory bandwidth
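Written out in C, the fully unrolled loop separates into independent sums and a dependent chain. This is a sketch of the structure only; `chain_unrolled` and the int element type are assumptions, and b needs at least 9 elements because b[8] is read.

```c
/* Fully unrolled form of
       a[0] = 0;
       for (i = 1; i < 8; i++) a[i] = b[i] + b[i+1] + a[i-1];  */
void chain_unrolled(const int b[9], int a[8])
{
    /* These seven sums do not depend on any a[], so a circuit can
       compute them all at the same time. */
    int s1 = b[1] + b[2];
    int s2 = b[2] + b[3];
    int s3 = b[3] + b[4];
    int s4 = b[4] + b[5];
    int s5 = b[5] + b[6];
    int s6 = b[6] + b[7];
    int s7 = b[7] + b[8];

    /* The dependent chain: each line needs the result of the one before. */
    a[0] = 0;
    a[1] = s1 + a[0];
    a[2] = s2 + a[1];
    a[3] = s3 + a[2];
    a[4] = s4 + a[3];
    a[5] = s5 + a[4];
    a[6] = s6 + a[5];
    a[7] = s7 + a[6];
}
```

The chain still has a long latency, but once pipeline stages separate its levels, every adder can be busy with a different data set on each cycle, provided all nine b values can be read per cycle.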

Page 43

Dealing with Dependencies

Your turn

    char b[1006];
    for (i=0; i < 1000; i++) {
        acc=0;
        for (j=0; j < 6; j++)
            acc += b[i+j];
    }

Steps
  Build DFG for inner loop (note dependencies)
  Fully unroll inner loop (check to see if memory bandwidth allows)
    Assume bandwidth = 64 bits/cycle
  Add pipeline stages
  Map operations to hardware resources
  Determine performance compared to uP
    Assume 15 instructions per iteration, CPI = 1.5, CLK = 15x faster than RC

Page 44

Dealing with Control

If statements

    char b[1006], a[1000];
    for (i=0; i < 1000; i++) {
        if (i % 2 == 0)
            a[i] = b[i] * b[i+1];
        else
            a[i] = b[i+2] + b[i+3];
    }

Can’t wait for result of condition - stalls pipeline

Convert control into computation - if conversion

[Figure: a multiplier (inputs b[i], b[i+1]) and an adder (inputs b[i+2], b[i+3]) both feed a MUX that produces a[i]; the MUX select is i % 2.]
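If conversion can also be expressed in C: both operations are always computed and a select picks the result. In this sketch the function names are hypothetical and int arrays replace the char arrays of the slide, purely to keep the products from overflowing.

```c
#define N 1000
#define B_LEN 1006

/* Original loop with control flow. */
void with_branch(const int b[B_LEN], int a[N])
{
    for (int i = 0; i < N; i++) {
        if (i % 2 == 0) {
            a[i] = b[i] * b[i + 1];
        } else {
            a[i] = b[i + 2] + b[i + 3];
        }
    }
}

/* If-converted loop: the multiplier and adder both run every iteration,
   and the condition becomes the select input of a MUX. */
void if_converted(const int b[B_LEN], int a[N])
{
    for (int i = 0; i < N; i++) {
        int prod = b[i] * b[i + 1];       /* multiplier */
        int sum  = b[i + 2] + b[i + 3];   /* adder */
        int sel  = (i % 2 == 0);          /* condition computed as data */
        a[i] = sel ? prod : sum;          /* MUX */
    }
}
```

The second form does redundant work in software, but in a pipeline the extra operator costs only area and nothing ever waits on the condition.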

Page 45

Dealing with Control

If conversion, not always so easy

    char b[1006], a[1000], a2[1000];
    for (i=0; i < 1000; i++) {
        if (i % 2 == 0)
            a[i] = b[i] * b[i+1];
        else
            a2[i] = b[i+2] + b[i+3];
    }

[Figure: the multiplier output and a[i] feed one MUX that produces a[i]; the adder output and a2[i] feed a second MUX that produces a2[i]; both selects come from i % 2.]
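One reading of the two-MUX figure, sketched in C under that assumption: each output is always written, and the MUX chooses between the new result and the value already stored. The function name and the int element type are illustrative.

```c
#define N 1000
#define B_LEN 1006

/* If conversion with two destinations: every iteration writes both a[i]
   and a2[i]; the array that the original branch would not have touched
   simply gets its old value written back. */
void if_converted_two_outputs(const int b[B_LEN], int a[N], int a2[N])
{
    for (int i = 0; i < N; i++) {
        int prod = b[i] * b[i + 1];
        int sum  = b[i + 2] + b[i + 3];
        int sel  = (i % 2 == 0);
        a[i]  = sel ? prod  : a[i];    /* first MUX: new product or old a[i] */
        a2[i] = sel ? a2[i] : sum;     /* second MUX: old a2[i] or new sum */
    }
}
```

The price is extra memory traffic: the old a[i] and a2[i] must be read and both outputs written every iteration, which is part of why this case is harder than the single-output one.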

Page 46

Other Challenges

Outputs can also limit unrolling
  Example
    4 outputs, 1 input
    Each output 32 bits
    Total output bandwidth for 1 iteration = 128 bits
    Memory bus = 128 bits
    Can’t unroll, even though inputs only use 32 bits

    long b[1004], a[1000];
    for (i=0, j=0; i < 1000; i+=4, j++) {
        a[i] = b[j] + 10;
        a[i+1] = b[j] * 23;
        a[i+2] = b[j] - 12;
        a[i+3] = b[j] * b[j];
    }

Page 47

Other Challenges

Requires streaming data to work well

    for (i=0; i < 4; i++)
        a[i] = b[i] + b[i+1];

[Figure: systolic array of four adders with input pairs (b[0], b[1]), (b[1], b[2]), (b[2], b[3]), (b[3], b[4]) and outputs a[0], a[1], a[2], a[3].]

But, pipelining is wasted because of the small data stream

Point - systolic arrays work best with repeated computation

Page 48

Other Challenges

Memory bandwidth
  Values so far are “peak” values
  Can only be achieved if all input data stored sequentially in memory
    Often not the case

Example
  Two-dimensional arrays

    long a[100][100], b[100][100];
    for (i=1; i < 99; i++) {
        for (j=1; j < 99; j++) {
            a[i][j] = avg( b[i-1][j], b[i][j-1], b[i+1][j], b[i][j+1] );
        }
    }

Page 49

Other Challenges

Memory bandwidth, cont.

Example 2
  Multiple array inputs
    b[] and c[] stored in different locations
    Memory accesses may jump back and forth

    long a[100], b[100], c[100];
    for (i=0; i < 100; i++) {
        a[i] = b[i] + c[i];
    }

Possible solutions
  Use multiple memories, or multiported memory (high cost)
  Interleave data from b[] and c[] in memory (programming effort)
    If no compiler support, requires manual rewrite
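A manual rewrite for the interleaving option could look like the sketch below. The `pair_t` type and both function names are assumptions made for illustration, not part of the original example.

```c
#define N 100

/* b[i] and c[i] stored side by side, so one sequential stream
   supplies both adder inputs. */
typedef struct { long b, c; } pair_t;

/* One-time layout change from two separate arrays to one interleaved array. */
void interleave(const long b[N], const long c[N], pair_t bc[N])
{
    for (int i = 0; i < N; i++) {
        bc[i].b = b[i];
        bc[i].c = c[i];
    }
}

/* Same computation as a[i] = b[i] + c[i], reading memory strictly in order. */
void add_interleaved(const pair_t bc[N], long a[N])
{
    for (int i = 0; i < N; i++) {
        a[i] = bc[i].b + bc[i].c;
    }
}
```

The interleaving itself costs a pass over the data, so it pays off mainly when the data can be produced in this layout in the first place.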

Page 50

Other Challenges

Dynamic memory access patterns
  Sequence of addresses not known until run time
  Clearly, not sequential

    int f( int val ) {
        long a[100], b[100], c[100];
        for (i=0; i < 100; i++) {
            a[i] = b[rand()%100] + c[i * val];
        }
    }

Possible solution
  Something creative enough for a Ph.D. thesis

Page 51

Other Challenges

Pointer-based data structures
  Even if scanning through list, data could be all over memory
    Very unlikely to be sequential
  Can cause aliasing problems
    Greatly limits optimization potential
  Solutions are another Ph.D.

Pointers ok if used as array

    int f( int val ) {
        long a[100], b[100];
        long *p = b;
        for (i=0; i < 100; i++, p++) {
            a[i] = *p + 1;
        }
    }

equivalent to

    int f( int val ) {
        long a[100], b[100];
        for (i=0; i < 100; i++) {
            a[i] = b[i] + 1;
        }
    }

Page 52

Other Challenges

Not all code is just one loop
  Yet another Ph.D.

Main point to remember
  Systolic arrays are extremely fast, but only certain types of code work

What can we do instead of systolic arrays?

Page 53

Other Options

Try something completely different

Try slight variation
  Example - 3 inputs, but can only read 2 per cycle

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

[Figure: the two-adder circuit with three inputs.]

Not possible - can only read two inputs per cycle

Page 54

Variations

Example, cont.
  Break previous rules - use extra delay registers

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

[Figure: the two-adder circuit with extra delay registers; inputs b[i], b[i+1], b[i+2].]

Page 55

Variations

Example, cont.
  Break previous rules - use extra delay registers

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 1
  Inputs: b[0] b[1] Junk

Page 56

Variations

Example, cont.
  Break previous rules - use extra delay registers

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 2
  Inputs: Junk Junk b[2]
  Register contents shown: b[0] b[1]; Junk; b[2]

Page 57

Variations

Example, cont.
  Break previous rules - use extra delay registers

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 3
  Inputs: b[1] b[2] Junk
  Register contents shown: Junk Junk; b[0]+b[1] b[2]; Junk

Page 58

Variations

Example, cont.
  Break previous rules - use extra delay registers

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 4
  Inputs: Junk Junk b[3]
  Register contents shown: b[1] b[2]; Junk Junk; b[3]; b[0]+b[1]+b[2]; Junk

Page 59

Variations

Example, cont.
  Break previous rules - use extra delay registers

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 5
  Inputs: b[2] b[3] Junk
  Register contents shown: Junk Junk; b[1] + b[2] b[3]; Junk; Junk
  Output: a[0]

First output after 5 cycles

Page 60

Variations

Example, cont.
  Break previous rules - use extra delay registers

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 6
  Inputs: Junk Junk b[4]
  Register contents shown: b[2] b[3]; Junk Junk; b[4]; b[1]+b[2]+b[3]; Junk
  Output: Junk on next cycle

Page 61

Variations

Example, cont.
  Break previous rules - use extra delay registers

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

Cycle 7
  Inputs: b[3] b[4] Junk
  Register contents shown: Junk Junk; b[2]+b[3] b[4]; Junk; Junk
  Output: a[1]

Second output 2 cycles later

Valid output every 2 cycles - approximately 1/2 the performance
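The cost of this variation can be estimated with a one-line formula, sketched here with a hypothetical function name: the cycle of the first valid output plus a fixed interval for each later output.

```c
/* Total cycles when the first valid output appears in cycle first_output
   and a further valid output appears every `interval` cycles. */
int total_cycles(int first_output, int interval, int outputs)
{
    return first_output + interval * (outputs - 1);
}
```

For 100 outputs, the original array gives `total_cycles(4, 1, 100)` = 103 cycles, while the two-inputs-per-cycle variation gives `total_cycles(5, 2, 100)` = 203 cycles, which is close to twice as long.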

Page 62

Entire Circuit

[Figure: block diagram of a Controller, Input Address Generator, Output Address Generator, two RAMs, two Buffers, and the Datapath.]

Buffers handle differences in speed between RAM and datapath