Systolic Architectures: Why is RC fast?
Greg Stitt
ECE Department
University of Florida
Why are microprocessors slow?
  Von Neumann architecture: a "stored-program" machine
    Memory holds instructions (and data)

Von Neumann Architecture Summary
  1) Fetch instruction
  2) Decode instruction, fetch data
  3) Execute
  4) Store results
  5) Repeat from 1 until end of program

Problem 1: Inherently sequential
  Executes only one instruction at a time
  Does not take the application's parallelism into consideration

Problem 2: The Von Neumann bottleneck
  Constantly reading/writing data for every instruction requires high memory bandwidth
  Performance is limited by the bandwidth of memory
  (Figure: control unit and datapath connected to RAM; the bandwidth between the datapath and RAM is not sufficient - the "Von Neumann bottleneck")
Improvements
  Increase resources in the datapath to execute multiple instructions in parallel:
    VLIW (very long instruction word): the compiler encodes parallelism into "very long" instructions
    Superscalar: the architecture determines parallelism at run time (out-of-order instruction execution)
  The Von Neumann bottleneck is still a problem
  (Figure: one control unit driving multiple datapaths, all sharing a single RAM)
Why is RC fast?
  RC implements custom circuits for an application
  Circuits can exploit a massive amount of parallelism:
    VLIW/superscalar parallelism: ~5 instructions/cycle in the best case (which rarely occurs)
    RC: potentially thousands of operations - as many as will fit in the device
  RC also supports different types of parallelism
Types of Parallelism: Bit-level

C code for bit reversal:

  x = (x >> 16) | (x << 16);
  x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
  x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
  x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
  x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

Compiled to binary for a processor (MIPS):

  sll $v1,$v0,0x10
  srl $v0,$v0,0x10
  or  $v0,$v1,$v0
  srl $v1,$v0,0x8
  and $v1,$v1,$t5
  sll $v0,$v0,0x8
  and $v0,$v0,$t4
  or  $v0,$v1,$v0
  srl $v1,$v0,0x4
  and $v1,$v1,$t3
  sll $v0,$v0,0x4
  and $v0,$v0,$t2
  ...

On the processor this requires between 32 and 128 cycles.

Circuit for bit reversal on an FPGA: simply wire each bit of the original x value to the corresponding position of the bit-reversed x value (Figure: original x value wired, crossing over, to the bit-reversed x value). Requires only 1 cycle - a speedup of 32x to 128x at the same clock rate.
Types of Parallelism: Arithmetic-level

C code (a 128-tap inner product):

  for (i=0; i < 128; i++)
    y += c[i] * x[i];

On a processor: thousands of instructions, several thousand cycles.
Circuit on an FPGA: 128 multipliers in parallel feeding a balanced adder tree (Figure: a row of multipliers at the top, then levels of 64, 32, ..., 2, 1 adders below). ~7 cycles - a speedup of more than 100x at the same clock rate.
Types of Parallelism: Pipeline-level

  for (j=0; j < n; j++) {
    y = a[j];
    x = b[j];
    for (i=0; i < 128; i++)
      y += c[i] * x[i];
    // output y
    y = 0;
  }

Pipeline the multiplier/adder-tree circuit and start a new inner loop every cycle (Figure: the 128-multiplier adder tree with pipeline registers between levels). After filling up the pipeline, it performs 128 mults + 127 adds every cycle.
Types of Parallelism: Task-level
  e.g., MPEG-2: execute each task of the application in parallel
How to exploit parallelism?
  General idea:
    Identify tasks
    Create a circuit for each task
    Handle communication between tasks with buffers
  How to create the circuit for each task? We want to exploit bit-level, arithmetic-level, and pipeline-level parallelism.
  Solution: systolic architectures (also called systolic arrays or systolic computing)
Systolic Architectures
  Systolic definition: "The rhythmic contraction of the heart, especially of the ventricles, by which blood is driven through the aorta and pulmonary artery after each dilation or diastole."
  The analogy with the heart pumping blood: we want an architecture that pumps data through efficiently.
  "Data flows from memory in a rhythmic fashion, passing through many processing elements before it returns to memory." [Hung]
Systolic Architecture
  General idea: a fully pipelined circuit, with I/O at the top and bottom levels
  Local connections: each element communicates only with elements at the same level or the level below
  Inputs arrive each cycle
  Outputs depart each cycle, once the pipeline is full
Simple Example
  Create a DFG (data flow graph) for the body of the loop, representing the data dependencies of the code:

  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];

  (Figure: DFG - b[i] and b[i+1] feed one adder; that adder's output and b[i+2] feed a second adder, which produces a[i])
Simple Example
  Add pipeline stages to each level of the DFG (Figure: the same DFG with a register after each adder, and a delay register carrying b[i+2] past the first level)
Simple Example
  Allocate one resource (adder, ALU, etc.) for each operation in the DFG. The resulting systolic architecture, cycle by cycle:

  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];

  Cycle 1: b[0], b[1], b[2] enter the array.
  Cycle 2: the first adder holds b[0]+b[1], the delay register holds b[2]; b[1], b[2], b[3] enter.
  Cycle 3: the second adder holds b[0]+b[1]+b[2]; the first level holds b[1]+b[2] and b[3]; b[2], b[3], b[4] enter.
  Cycle 4: the first output, a[0], appears - it takes 4 cycles to fill the pipeline.
  Cycle 5: a[1] appears. From this point there is one output per cycle, with 99 more until completion.
  Total cycles => 4 (to fill the pipeline) + 99 = 103
uP Performance Comparison
  Assumptions: 10 instructions for the loop body, CPI = 1.5, uP clock 10x faster than the FPGA
  Total SW cycles: 100 * 10 * 1.5 = 1,500 cycles
  RC speedup: (1500/103) * (1/10) = 1.46x

uP Performance Comparison
  What if the uP clock is 15x faster (e.g., 3 GHz vs. 200 MHz)?
  RC speedup: (1500/103) * (1/15) = 0.97x - RC is slightly slower
  But! RC requires much less power: several watts vs. ~100 watts
  SW may be practical for embedded uPs => low power; but an embedded uP's clock may be only 2x faster: (1500/103) * (1/2) = 7.3x faster for RC
  RC may also be cheaper, depending on the area needed - this example would certainly be cheaper

  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
Simple Example, Cont.
  Improvement to the systolic array: why not execute multiple iterations at the same time? There are no data dependencies between iterations, so we can use loop unrolling.

  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];

  Unrolled DFG (Figure: copies of the DFG side by side - one computing a[i] from b[i], b[i+1], b[i+2]; the next computing a[i+1] from b[i+1], b[i+2], b[i+3]; and so on)
Simple Example, Cont.
  How much to unroll? Limited by memory bandwidth and area:
    Must read all inputs once per cycle
    Must write all outputs once per cycle
    Must have sufficient area for all operations in the unrolled DFG
Unrolling Example
  Original circuit: the 1st iteration requires 3 inputs (b[0], b[1], b[2]).
  Each unrolled iteration requires only one additional input, because adjacent iterations share inputs - unrolling once adds just b[3] and produces a[0] and a[1] together.
  After unrolling, each cycle brings in 4 inputs (instead of 6).

  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
Performance After Unrolling
  How much unrolling?
    Assume b[] elements are 8 bits
    The first iteration requires 3 elements = 24 bits
    Each additional unrolled iteration requires 1 element = 8 bits (due to the overlapping inputs)
    Assume memory bandwidth = 64 bits/cycle
    => Can perform 6 iterations in parallel: 24 + 8 + 8 + 8 + 8 + 8 = 64 bits
  New performance: the unrolled systolic architecture requires 4 cycles to fill the pipeline + 100/6 iterations ~ 21 cycles total
  With unrolling, RC is (1500/21) * (1/15) = 4.8x faster than the 3 GHz microprocessor!!!

Importance of Memory Bandwidth
  Performance with wider memories: a 128-bit bus allows 14 iterations in parallel
    64 extra bits / 8 bits per iteration = 8 additional parallel iterations + the 6 original = 14 total
  Total cycles = 4 (to fill the pipeline) + 100/14 = ~11
  Speedup: (1500/11) * (1/15) = 9.1x
  Doubling the memory width increased the speedup from 4.8x to 9.1x!!!
Important Point
  The performance of hardware is often limited by memory bandwidth.
  More bandwidth => more unrolling => more parallelism => BIG SPEEDUP
Delay Registers
  Common mistake: forgetting to add registers for values not used during a cycle. Such values must be "delayed" (passed along through registers) until needed.
  (Figure: incorrect circuit with b[i+2] wired straight to the second adder, vs. correct circuit with a delay register between the levels)

Delay Registers
  Illustration of incorrect delays for a[i] = b[i] + b[i+1] + b[i+2]:
  Cycle 1: b[0], b[1], b[2] enter.
  Cycle 2: the first adder holds b[0]+b[1], but b[2] was never registered - the second adder's other input is undefined.
  Cycle 3: the second adder produces b[0] + b[1] + b[3] (whatever is currently on the input) instead of b[0] + b[1] + b[2].
Another Example: Your Turn
  Steps:
    Build the DFG for the body of the loop
    Add pipeline stages
    Map operations to hardware resources (assume divide takes one cycle)
    Determine the maximum amount of unrolling (memory bandwidth = 128 bits/cycle)
    Determine performance compared to the uP (assume 15 instructions per iteration, CPI = 1.5, uP clock 15x faster than RC)

  short b[1004], a[1000];
  for (i=0; i < 1000; i++)
    a[i] = avg(b[i], b[i+1], b[i+2], b[i+3], b[i+4]);
Another Example, Cont.
  What if the divider takes 20 cycles but is fully pipelined? Calculate the effect on performance.
  In systolic architectures, performance is usually dominated by the throughput of the pipeline, not its latency.
Dealing with Dependencies
  op2 is dependent on op1 when an input of op2 is an output of op1
  Problem: dependencies limit arithmetic parallelism and increase latency - op2 can't execute before op1
  This is a serious problem: FPGAs rely on parallelism for performance, and little parallelism = bad performance
  (Figure: op1 feeding op2)
Dealing with Dependencies
  Partial solution: parallelizing transformations, e.g., tree height reduction
  (Figure: the serial chain ((a + b) + c) + d, with depth = # of adders, vs. the balanced tree (a + b) + (c + d), with depth = log2(# of adders))
Dealing with Dependencies
  Simple example with an inter-iteration dependency - a potential problem for systolic arrays, since the pipeline can't be kept full:

  a[0] = 0;
  for (i=1; i < 8; i++)
    a[i] = b[i] + b[i+1] + a[i-1];

  (Figure: the DFG for iteration 2 needs a[1] as an input, so it can't execute until the 1st iteration completes - limited arithmetic parallelism, increased latency)
Dealing with Dependencies
  But systolic arrays also have pipeline-level parallelism, so latency is less of an issue.
  Chain the iterations together: feed each iteration's result a[i] directly into the DFG of iteration i+1 (Figure: a cascade of DFGs - b[1], b[2], a[0] produce a[1], which combines with b[2], b[3] to produce a[2], and so on; outputs not shown).
  Add pipeline stages to the cascade => a systolic array.
  This only works if the loop is fully unrolled, and it requires sufficient memory bandwidth.

  a[0] = 0;
  for (i=1; i < 8; i++)
    a[i] = b[i] + b[i+1] + a[i-1];
Dealing with Dependencies: Your Turn

  char b[1006];
  for (i=0; i < 1000; i++) {
    acc = 0;
    for (j=0; j < 6; j++)
      acc += b[i+j];
  }

  Steps:
    Build the DFG for the inner loop (note the dependencies on acc)
    Fully unroll the inner loop (check that memory bandwidth allows it; assume bandwidth = 64 bits/cycle)
    Add pipeline stages
    Map operations to hardware resources
    Determine performance compared to the uP (assume 15 instructions per iteration, CPI = 1.5, uP clock 15x faster than RC)
Dealing with Control
  If statements:

  char b[1006], a[1000];
  for (i=0; i < 1000; i++) {
    if (i % 2 == 0)
      a[i] = b[i] * b[i+1];
    else
      a[i] = b[i+2] + b[i+3];
  }

  The pipeline can't wait for the result of the condition - that would stall it.
  Instead, convert control into computation ("if conversion"): compute both the multiply and the add every cycle, and let i % 2 drive a MUX that selects which result is written to a[i].
  (Figure: multiplier of b[i], b[i+1] and adder of b[i+2], b[i+3] in parallel, with i % 2 selecting between them through a MUX)
Dealing with Control
  If conversion is not always so easy - here the two branches write to different arrays:

  char b[1006], a[1000], a2[1000];
  for (i=0; i < 1000; i++) {
    if (i % 2 == 0)
      a[i] = b[i] * b[i+1];
    else
      a2[i] = b[i+2] + b[i+3];
  }

  (Figure: now both a[i] and a2[i] need their own MUX, each selecting between the newly computed value and the element's previous value based on i % 2)
Other Challenges
  Outputs can also limit unrolling.
  Example: 4 outputs, 1 input. Each output is 32 bits, so the total output bandwidth for one iteration is already 128 bits. With a 128-bit memory bus there is no room to unroll, even though the inputs use only 32 bits.

  long b[1004], a[1000];
  for (i=0, j=0; i < 1000; i+=4, j++) {
    a[i]   = b[j] + 10;
    a[i+1] = b[j] * 23;
    a[i+2] = b[j] - 12;
    a[i+3] = b[j] * b[j];
  }
Other Challenges
  Systolic arrays require streaming data to work well. With a small data stream the pipelining is wasted:

  for (i=0; i < 4; i++)
    a[i] = b[i] + b[i+1];

  (Figure: a systolic array of four adders computing a[0]..a[3] from b[0]..b[4] - the stream ends before the pipelining pays off)
  Point: systolic arrays work best with repeated computation.
Other Challenges
  Memory bandwidth: the values so far are "peak" values, achievable only if all input data is stored sequentially in memory - often not the case.
  Example: two-dimensional arrays, where vertically adjacent elements are far apart in memory:

  long a[100][100], b[100][100];
  for (i=1; i < 99; i++) {
    for (j=1; j < 99; j++) {
      a[i][j] = avg(b[i-1][j], b[i][j-1], b[i+1][j], b[i][j+1]);
    }
  }
Other Challenges
  Memory bandwidth, cont. Example 2: multiple array inputs.

  long a[100], b[100], c[100];
  for (i=0; i < 100; i++)
    a[i] = b[i] + c[i];

  b[] and c[] are stored in different locations, so memory accesses may jump back and forth.
  Possible solutions:
    Use multiple memories, or a multiported memory (high cost)
    Interleave the data from b[] and c[] in memory (programming effort; with no compiler support, this requires a manual rewrite)
Other Challenges
  Dynamic memory access patterns: the sequence of addresses is not known until run time, and is clearly not sequential.
  Possible solution: something creative enough for a Ph.D. thesis.

  int f(int val) {
    long a[100], b[100], c[100];
    for (i=0; i < 100; i++)
      a[i] = b[rand() % 100] + c[i * val];
  }
Other Challenges
  Pointer-based data structures: even when scanning through a list, the data could be scattered all over memory - very unlikely to be sequential.
  Pointers can also cause aliasing problems, which greatly limit the optimization potential. Solutions are another Ph.D.
  Pointers are fine if used as arrays:

  int f(int val) {
    long a[100], b[100];
    long *p = b;
    for (i=0; i < 100; i++, p++)
      a[i] = *p + 1;
  }

  is equivalent to

  int f(int val) {
    long a[100], b[100];
    for (i=0; i < 100; i++)
      a[i] = b[i] + 1;
  }
Other Challenges
  Not all code is just one loop - handling general code is yet another Ph.D.
  Main point to remember: systolic arrays are extremely fast, but only certain types of code work.
  What can we do instead of systolic arrays?
Other Options
  Try something completely different, or try a slight variation.
  Example: 3 inputs, but memory can only read 2 per cycle:

  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];

  The original systolic array is not possible - it needs three new inputs every cycle but can only read two.
Variations
  Example, cont. Break the previous rules: add extra delay registers so each iteration's three inputs are gathered over two cycles (Figure: the two-adder array with additional delay registers on the input paths). Cycle by cycle:
  Cycle 1: read b[0] and b[1] on the first two inputs; the third input holds junk.
  Cycle 2: read b[2] on the third input (the first two now hold junk); delay registers hold b[0] and b[1] so they line up with b[2].
  Cycle 3: the first adder produces b[0]+b[1], with b[2] delayed alongside; read b[1] and b[2] for the next iteration.
  Cycle 4: read b[3]; b[0]+b[1]+b[2] forms at the second adder, with junk in the alternate slots.
  Cycle 5: the first output, a[0], appears - after 5 cycles.
  Cycle 6: a junk bubble reaches the output (junk on this cycle).
  Cycle 7: the second output, a[1], appears, 2 cycles after the first.
  A valid output appears every 2 cycles - approximately 1/2 the performance of the original array, but it fits within the 2-inputs-per-cycle memory limit.

  for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
Entire Circuit
  (Figure: complete system - a controller coordinating an input address generator and an output address generator, with buffers between the RAM and the datapath on both the input and output sides)
  The buffers handle differences in speed between the RAM and the datapath.