Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware
Transcript of Liquid Metal's OPTIMUS: Synthesis of Efficient Streaming Hardware
Scott Mahlke (University of Michigan) and Rodric Rabbah (IBM), DAC HLS Tutorial, San Francisco 2009
1
Dynamic Application-Specific Customization of HW
[Diagram: a JIT compiler configures logic on the FPGA for "your app here".]
2
Inspired by the ASIC paradigm:
• High performance
• Low power
Liquid Metal: “JIT the Hardware”
3
• Single language for programming HW & SW
• Run in a standard JVM, or synthesize to HW
• Fluidly move computation between HW & SW
• Do for HW (viz. FPGAs) what FORTRAN did for computing
• Address critical technology trends:
  Power: address impractical growth of power and cooling demands
  Architecture: enable million-way parallelism vs. small-scale multicores
  Versatility: in-the-field & on-the-fly customization to end-user applications
  Applications: demand for pervasive streaming and mobile content (WWW, multimedia, gaming)
[Diagram: spectrum from ASIC-like to Reconfigurable]
Lime: the Liquid Metal Language
4
Design principles:
• Object-oriented, Java-like, Java-compatible
• Raise the level of abstraction
• Parallel constructs that simplify code
• Target synthesis while retaining generality
4 reasons this is not another *C-to-HDL approach:
• Emphasis on programmer productivity: leverage rich Java IDEs, libraries, and analyses
• Not an auto-parallelization approach: Lime is explicitly parallel and synthesizable
• Fast fail-safe mechanism: Lime may be refined into a parallel SW implementation
• Intrinsic opportunity for online optimizations: static optimizations with dynamic refinement
Lime Overview
6
HW (FPGA):
• Computation is well encapsulated
• Data-flow driven computation
• Multiple "clock domains"
• Bit-level control and reasoning
• Memory usage statically determined before layout
Lime:
• Tasks, value types
• Streaming primitives
• Rate "matching" operators
• Ordinal-indexed arrays, bounded loops
• Template-like generics
Abstract OO programming down to the bit level!
Streams: Exposing Computational Structure
7
Stream primitives are integral to the language.
• Tasks in streams are strongly isolated: only the endpoints may perform side-effects
• Streams provide a macro-level functional programming abstraction… while allowing traditional imperative programming inside
A Brief Introduction to Stream Operations
8
A finite stream literal:
  int stream s1 = { 1, 1, 2, 3, 5, 8 };

An infinite stream of 3's:
  int stream s2 = task 3;

Stream expressions:
  int stream s3 = s2 * 17;
  double stream s4 = Math.sin(s1);
  double stream s5 = s3 + s4;

These operations create and connect tasks. Execution occurs later: lazy computation, functional.
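Lime's lazy, task-building semantics can be mimicked in plain Java; the sketch below (class and method names are hypothetical, not part of Lime) shows how composing stream expressions can merely record the computation, with execution deferred until values are pulled:

```java
import java.util.function.IntSupplier;
import java.util.function.IntUnaryOperator;

// Hypothetical Java analogue of lazy stream composition: each
// operation only composes suppliers (building the "task graph");
// nothing executes until take() pulls elements.
final class LazyIntStream {
    private final IntSupplier source;

    private LazyIntStream(IntSupplier source) { this.source = source; }

    // Analogue of `task 3`: an endless stream of a constant.
    static LazyIntStream constant(int v) {
        return new LazyIntStream(() -> v);
    }

    // Analogue of lifting an operation over a stream (e.g. s2 * 17).
    LazyIntStream map(IntUnaryOperator f) {
        return new LazyIntStream(() -> f.applyAsInt(source.getAsInt()));
    }

    // Execution happens only here, element by element.
    int[] take(int n) {
        int[] out = new int[n];
        for (int i = 0; i < n; i++) out[i] = source.getAsInt();
        return out;
    }
}
```

For example, `LazyIntStream.constant(3).map(x -> x * 17)` builds the pipeline without computing anything; `take(2)` then yields {51, 51}.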
Simple Audio Processing9
value int[] squareWave(int freq, int rate, int amplitude) { int wavelength = rate / freq; int[] samples = new int[wavelength];
for (int s: 1::wavelength) samples[s] = (s <= wavelength/2) ? 0 : amplitude;
return (value int[]) samples;}
int stream sqwaves = task squareWave(1000, 44100, 80));
task AudioSink(44100).play(sqwaves);
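For readers without a Lime toolchain, here is a rough plain-Java rendering of the squareWave task (0-based indexing replaces Lime's ordinal `1::wavelength` loop, so the comparison uses `<` rather than `<=`):

```java
// Plain-Java sketch of the square-wave generator above: one period of
// `wavelength` samples, low for the first half and `amplitude` for the rest.
final class SquareWave {
    static int[] squareWave(int freq, int rate, int amplitude) {
        int wavelength = rate / freq;        // samples per period
        int[] samples = new int[wavelength];
        for (int s = 0; s < wavelength; s++)
            samples[s] = (s < wavelength / 2) ? 0 : amplitude;
        return samples;
    }
}
```

At freq = 1000 Hz and rate = 44100 Hz this yields a 44-sample period: 22 zeros followed by 22 samples at the given amplitude.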
Liquid Metal Tool Chain
10
[Diagram: Lime source → Quicklime front-end compiler → Streaming IR. From the Streaming IR, two back ends: the Crucible back-end compiler emits C, built with the Cell SDK into a Cell binary for the LM VM on a Cell BE; the Optimus back-end compiler, guided by an FPGA model, emits HDL, compiled as Xilinx VHDL into a Xilinx bitfile for the LM VM on a Virtex5 FPGA.]
Streaming Intermediate Representation (SIR)
11
[Diagram: SIR building blocks: Task; Pipeline; SplitJoin (splitter → parallel branches → joiner); Feedback Loop (joiner → body → splitter); Switch (switch → branches → joiner).]
• A task may be stateless or have state
• Each task is mapped to a "module" with FIFO I/O
• Task graphs are hierarchical & structured
SIR Compiler Optimizations
12
Address FPGA compilation challenges:
• Finite, non-virtualizable device
• Complex optimization space: throughput, latency, power, area
• Very long synthesis times (minutes to hours)
Key optimizations:
• Task fusion and fission: load balancing, scalability
• Stream buffer allocation: enhances locality; manages cache footprint or SRAM and control-logic complexity
• Data access fusion: reduces critical path length, improves communication-to-computation balance
Preliminary Liquid Metal Results on Energy Consumption: FPGA vs PPC 405
13
[Bar chart: fraction of PowerPC energy used by each benchmark (FFT, Parallel A..., Bubble Sort, Merge Sort, Discrete..., DES, Matrix Mult..., Matrix Bloc..., Average); y-axis 0 to 0.8. Bars labeled ~1.4, ~1.4, ~1.4, and 2.25 exceed the chart's scale.]
• Liquid Metal on Virtex 4 FPGA, 1.6W• C reference implementation on PPC 405, 0.5W
Preliminary Liquid Metal Results on Parallelism: FPGA vs PPC 405
14
• Liquid Metal on Virtex 4 FPGA, 1.6W• C reference implementation on PPC 405, 0.5W
Handel-C Comparison
Compared DES and DCT with hand-optimized Handel-C implementations.
Performance:
• 5% faster before optimizations
• 12x faster after optimizations
Area:
• 66% larger before optimizations
• 90% larger after optimizations
15
Overview
Compilation Flow
Scheduling
Optimizations
Results
16
Top Level Compilation
[Diagram: a stream graph (Source → Round-Robin Splitter(8,8,8,8) → parallel Filters A through J → Round-Robin Joiner(1,1,1,1) → Sink) is compiled into hardware modules M0 ... Mn. Each filter module contains an Init block, a Controller, and a Work block, with inputs i0, i1, ..., ix and outputs O0, ..., Om, plus local array storage a[ ].]
17
Filter Compilation
The filter's work function is lowered to a control-flow graph of basic blocks; for example, a filter that sums 8 popped values:
  bb1: sum = 0; i = 0
  bb2: temp = pop()
  bb3: sum = sum + temp; i = i + 1; branch bb2 if i < 8
  bb4: push(sum)
[Diagram: each basic block becomes a module with control in/out ports, live-data in/out ports, memory/queue ports, and an Ack signal. Registers and muxes carry live data between blocks, FIFO Read/FIFO Write units access the channels, and a control token passes between blocks to sequence execution.]
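In software terms, the four basic blocks implement the following behavior; this hypothetical Java model (queues standing in for the FIFO channels) is a sketch of the filter's semantics, not of the generated hardware:

```java
import java.util.ArrayDeque;

// Software model of the filter above: bb1 initializes, the bb2/bb3
// loop pops and accumulates 8 values, and bb4 pushes the sum.
final class SumFilter {
    static void work(ArrayDeque<Integer> in, ArrayDeque<Integer> out) {
        int sum = 0;                      // bb1: sum = 0, i = 0
        for (int i = 0; i < 8; i++) {
            int temp = in.removeFirst();  // bb2: temp = pop()
            sum += temp;                  // bb3: sum += temp; i++; branch
        }
        out.addLast(sum);                 // bb4: push(sum)
    }
}
```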
18
Operation Compilation
[Diagram: each operation maps to a functional unit (FU) with inputs i0, ..., im, outputs o0, ..., on, and a predicate input. For bb3 (sum = sum + temp; i = i + 1; branch bb2 if i < 8), one adder computes sum + temp, another computes i + 1, and a comparator against the constant 8 drives control out 3 or control out 4.]
19
Static Stream Scheduling
20
[Diagram: Filter 1 pushes 2 items per firing into a channel; Filter 2 pops 3 per firing.]
• Each queue has to be deep enough to hold the values generated by a single execution of the connected filter
• Double buffering is needed
• Buffer access is non-blocking
• A controller module is needed to orchestrate the schedule
• The controller uses a finite state machine to execute the steady-state schedule
Greedy Stream Scheduling
[Diagram: Filter 1 → channel → Filter 2.]
• Filters fire eagerly; channel access is blocking
• Allows for potentially smaller channels
• A controller is not needed
• Results are produced with lower latency
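Greedy firing with blocking channels can be illustrated in software with a bounded queue (the class below is a hypothetical illustration, not an Optimus API): pushes block when the channel is full and pops block when it is empty, so no schedule controller is required.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of greedy scheduling: the producer fires eagerly, and a
// bounded BlockingQueue models a small hardware channel.
final class GreedyPipeline {
    static int runOnce(int[] inputs, int channelCapacity) {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(channelCapacity);
        Thread producer = new Thread(() -> {
            try {
                for (int v : inputs) channel.put(v);   // blocks when channel is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < inputs.length; i++)
                sum += channel.take();                  // blocks when channel is empty
            producer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return sum;
    }
}
```

With capacity 2, the producer stalls whenever it runs more than two pushes ahead of the consumer, mimicking a small hardware FIFO.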
21
Latency Comparison
[Bar chart: latency of the static schedule relative to greedy, for FFT, Parallel Adder, Bubble Sort, Merge Sort, Discrete Cos..., DES, Matrix Multiply, Matrix Block M..., Average; y-axis 0 to 18.]
22
Area Comparison
[Bar chart: percent of FPGA area for circuits with the static scheduler vs. circuits with the greedy scheduler, for FFT, Parallel Adder, Bubble Sort, Merge Sort, Discrete Cos..., DES, Matrix Multiply, Matrix Block M..., Average; y-axis 0 to 100%.]
23
Optimizations
Streaming optimizations (macro-functional):
• Channel allocation, channel access fusion, critical path balancing, filter fission and fusion, etc.
• These optimizations need global information about the stream graph
• Typically performed manually using existing tools
Classic optimizations (micro-functional):
• Flip-flop elimination, common subexpression elimination, constant folding, loop unrolling, etc.
• Typically included in existing compilers and tools
24
Channel Allocation
Larger channels mean:
• More SRAM
• More control logic
• Fewer stalls
Interlocking makes sure that each filter gets the right data or blocks.
What is the right channel size?
25
Channel Allocation Algorithm
1. Set the size of the channels to infinity.
2. Warm up the queues.
3. Record the steady-state instruction schedules for each producer–consumer pair.
4. Unroll the schedules to have the same number of pushes and pops.
5. Find the maximum number of overlapping lifetimes; this is the channel size to allocate.
26
Channel Allocation Example
Producer: --, --, push, --, push, --, push, push, push, --, --, push
Consumer: --, --, pop, --, --, --, pop, --, pop, pop, pop, pop
Max overlap = 3
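The final step (finding the maximum number of overlapping lifetimes) amounts to tracking peak channel occupancy over the unrolled schedules; a small Java sketch (hypothetical names, assuming the two schedules are already aligned cycle by cycle):

```java
// Sketch of channel sizing: given aligned producer/consumer schedules
// (true = a push or pop fires in that cycle), count the peak number of
// values simultaneously live in the channel.
final class ChannelSizing {
    static int maxOverlap(boolean[] pushes, boolean[] pops) {
        int live = 0, peak = 0;
        for (int cycle = 0; cycle < pushes.length; cycle++) {
            if (pushes[cycle]) live++;        // value enters the channel
            if (live > peak) peak = live;     // record peak occupancy
            if (pops[cycle]) live--;          // value leaves the channel
        }
        return peak;                          // = required channel depth
    }
}
```

For instance, three pushes followed three cycles later by three pops gives a peak occupancy, and hence a required channel depth, of 3.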
Producer Consumer
Source
Filter 1
Filter 2
Sink
27
Channel Allocation
28
[Bar chart: relative channel size after optimization for FFT, Parallel A..., Bubble Sort, Merge Sort, Discrete..., DES, Matrix Mul..., Matrix Blo..., Average; y-axis 0 to 0.8.]
Channel Access Fusion
Each channel access (push or pop) takes one cycle. Many fine-grained accesses:
• worsen the communication-to-computation ratio
• lengthen the critical path
• limit task-level parallelism
29
Channel Access Fusion Algorithm
• Cluster channel access operations: loop unrolling, code motion, balancing the groups
• Similar to vectorization: uses wide channels
30
[Diagram: fused accesses on wide channels with different multiplicities, e.g. write mult. = 1 with read mult. = 8; write mult. = 8 with read mult. = 8; write mult. = 4 with read mult. = 1.]
Access Fusion Example

Original:
  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += pop();
  push(sum);

Fused:
  int sum = 0;
  int t1, t2, t3, t4;
  for (int i = 0; i < 8; i++) {
    (t1, t2, t3, t4) = pop4();
    sum += t1 + t2 + t3 + t4;
  }
  push(sum);

Some caveats: leftover pops that do not divide evenly complicate fusion:
  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += pop();
  pop(); pop();
  push(sum);

becomes:
  int sum = 0;
  for (int i = 0; i < 8; i++) {
    sum += pop();
    sum += pop();
    sum += pop();
    sum += pop();
  }
  pop(); pop();
  push(sum);
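The first rewrite above preserves semantics; the Java check below (with pop4 modeled as four dequeues from the same channel) confirms that the fused loop computes the same sum as the original:

```java
import java.util.ArrayDeque;

// Software check of access fusion: the fused loop performs the same 32
// pops as the original, four at a time, and must yield the same sum.
final class AccessFusionDemo {
    static int original(ArrayDeque<Integer> in) {
        int sum = 0;
        for (int i = 0; i < 32; i++) sum += in.removeFirst();
        return sum;
    }

    static int fused(ArrayDeque<Integer> in) {
        int sum = 0;
        for (int i = 0; i < 8; i++) {
            // (t1, t2, t3, t4) = pop4(): one wide access replaces four
            int t1 = in.removeFirst(), t2 = in.removeFirst();
            int t3 = in.removeFirst(), t4 = in.removeFirst();
            sum += t1 + t2 + t3 + t4;
        }
        return sum;
    }
}
```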
31
Access Fusion
32
[Bar chart: speedup (x100%) from channel access fusion for FFT, Parallel A..., Bubble Sort, Merge Sort, Discrete..., DES, Matrix Mult..., Matrix Bloc..., Average; y-axis 0 to 8.]
Critical Path Balancing
• The critical path is set by the longest combinational path in the filters
• Optimus uses its internal FPGA model to estimate how this impacts throughput and latency
Balancing algorithm:
1. Optimus takes the target clock as input
2. Start with the least number of basic blocks
3. Form USE/DEF chains for the filter
4. Use the internal FPGA model to measure critical-path latency
5. Break the paths whose latency exceeds the target
33
Critical Path Balancing Example
[Diagram: an expression tree of Mul, Add, Sub, and Shift operations is partitioned into pipeline stages (labeled 1 through 4) so that no stage's combinational delay exceeds the target clock.]
Operation delays: Add/Sub = 4, Shift = 2, Multiply = 10
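Using the delay table above (Add/Sub = 4, Shift = 2, Multiply = 10), the path-breaking step can be sketched for a straight-line chain of operations; real filters have USE/DEF DAGs rather than chains, so this linear version is only an illustration:

```java
// Sketch of path breaking for a linear chain of operations: accumulate
// combinational delay and start a new pipeline stage (i.e., insert a
// register) whenever the target clock would be exceeded.
final class PathBalancer {
    static int delayOf(String op) {
        switch (op) {
            case "Mul":   return 10;
            case "Shift": return 2;
            default:      return 4;   // Add / Sub
        }
    }

    // Returns how many pipeline stages the chain is split into.
    static int stages(String[] chain, int targetClock) {
        int stages = 1, delay = 0;
        for (String op : chain) {
            int d = delayOf(op);
            if (delay + d > targetClock) {  // break the path here
                stages++;
                delay = 0;
            }
            delay += d;
        }
        return stages;
    }
}
```

With a target clock of 14, the chain Mul, Add, Mul, Sub splits into two stages of delay 14 each; tightening the target to 12 forces one operation per stage.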
34
Liquid Metal
35
• Interdisciplinary effort addressing the entire stack
• One language for programming HW (FPGAs) and SW
• Liquid Metal VM: JIT the hardware!
[Diagram: program GPU, multicore CPU, FPGA, and future targets all with Lime, via the Liquid Metal VM.]
Streaming IR:
• Exposes structure: computation and communication
• Uniform framework for pipeline and data parallelism
• Canonical representation for stream-aware optimizations
Streaming optimizations:
• Macro-functional: fold streaming IR graphs into the FPGA; fusion, fission, replication, subject to latency, area, and throughput constraints
• Micro-functional: micro-pipelining, channel allocation, access fusion, flip-flop elimination
Ongoing Effort
• Application development: streaming for enterprise and consumer; real-time applications
• Compiler and JIT: pre-provisioning profitable HW implementations; runtime opportunities to "JIT" the HW
• Advanced dynamic reconfiguration support in the VM: predictive, hides latency
• New platforms: tightly coupled, higher-bandwidth, lower-latency communication; heterogeneous MPSoC systems (FPGA + processors)
38