Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware
Transcript of Liquid Metal's OPTIMUS: Synthesis of Efficient Streaming Hardware
Scott Mahlke (University of Michigan) and Rodric Rabbah (IBM), DAC HLS Tutorial, San Francisco 2009
1
Dynamic Application-Specific Customization of HW
[Diagram: a JIT compiler configures logic on the FPGA for "your app here".]
2
Inspired by the ASIC paradigm:
• High performance
• Low power
Liquid Metal: “JIT the Hardware”
3
• Single language for programming HW & SW
• Run in a standard JVM, or synthesize to HW
• Fluidly move computation between HW & SW
• Do for HW (viz. FPGAs) what FORTRAN did for computing
• Address critical technology trends:
  Power: address impractical growth of power and cooling demands
  Architecture: enable million-way parallelism vs. small-scale multicores
  Versatility: in-the-field & on-the-fly customization to end-user applications
  Applications: demand for pervasive streaming and mobile content (WWW, multimedia, gaming)
[Diagram: spectrum from ASIC-like to Reconfigurable]
Lime: the Liquid Metal Language
4
Design principles:
• Object-oriented, Java-like, Java-compatible
• Raise the level of abstraction
• Parallel constructs that simplify code
• Target synthesis while retaining generality
4 reasons this is not another *C-to-HDL approach:
• Emphasis on programmer productivity: leverage rich Java IDEs, libraries, and analyses
• Not an auto-parallelization approach: Lime is explicitly parallel and synthesizable
• Fast fail-safe mechanism: Lime may be refined into a parallel SW implementation
• Intrinsic opportunity for online optimizations: static optimizations with dynamic refinement
Lime Overview
6
HW (FPGA):
• Computation is well encapsulated
• Data-flow driven computation
• Multiple "clock domains"
• Bit-level control and reasoning
• Memory usage statically determined before layout
Lime:
• Tasks, value types
• Streaming primitives
• Rate "matching" operators
• Ordinal-indexed arrays, bounded loops
• Template-like generics
Abstract OO programming down to the bit level!
Streams: Exposing Computational Structure
7
Stream primitives are integral to the language.
• Tasks in streams are strongly isolated: only the endpoints may perform side-effects
• Streams provide a macro-level functional programming abstraction… while allowing traditional imperative programming inside
A Brief Introduction to Stream Operations
8
A finite stream literal:
  int stream s1 = { 1, 1, 2, 3, 5, 8 };

An infinite stream of 3's:
  int stream s2 = task 3;

Stream expressions:
  int stream s3 = s2 * 17;
  double stream s4 = Math.sin(s1);
  double stream s5 = s3 + s4;

These operations create and connect tasks. Execution occurs later: lazy computation, functional.
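Lime's lazy, task-building semantics can be mimicked in plain Java; the sketch below (class and method names are hypothetical, not part of Lime) shows how composing stream expressions can merely record the computation, with execution deferred until values are pulled:

```java
import java.util.function.IntSupplier;
import java.util.function.IntUnaryOperator;

// Hypothetical Java analogue of lazy stream composition: each
// operation only composes suppliers (building the "task graph");
// nothing executes until take() pulls elements.
final class LazyIntStream {
    private final IntSupplier source;

    private LazyIntStream(IntSupplier source) { this.source = source; }

    // Analogue of `task 3`: an endless stream of a constant.
    static LazyIntStream constant(int v) {
        return new LazyIntStream(() -> v);
    }

    // Analogue of lifting an operation over a stream (e.g. s2 * 17).
    LazyIntStream map(IntUnaryOperator f) {
        return new LazyIntStream(() -> f.applyAsInt(source.getAsInt()));
    }

    // Execution happens only here, element by element.
    int[] take(int n) {
        int[] out = new int[n];
        for (int i = 0; i < n; i++) out[i] = source.getAsInt();
        return out;
    }
}
```

For example, `LazyIntStream.constant(3).map(x -> x * 17)` builds the pipeline without computing anything; `take(2)` then yields {51, 51}.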
Simple Audio Processing9
value int[] squareWave(int freq, int rate, int amplitude) { int wavelength = rate / freq; int[] samples = new int[wavelength];
for (int s: 1::wavelength) samples[s] = (s <= wavelength/2) ? 0 : amplitude;
return (value int[]) samples;}
int stream sqwaves = task squareWave(1000, 44100, 80));
task AudioSink(44100).play(sqwaves);
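For readers without a Lime toolchain, here is a rough plain-Java rendering of the squareWave task (0-based indexing replaces Lime's ordinal `1::wavelength` loop, so the comparison uses `<` rather than `<=`):

```java
// Plain-Java sketch of the square-wave generator above: one period of
// `wavelength` samples, low for the first half and `amplitude` for the rest.
final class SquareWave {
    static int[] squareWave(int freq, int rate, int amplitude) {
        int wavelength = rate / freq;        // samples per period
        int[] samples = new int[wavelength];
        for (int s = 0; s < wavelength; s++)
            samples[s] = (s < wavelength / 2) ? 0 : amplitude;
        return samples;
    }
}
```

At freq = 1000 Hz and rate = 44100 Hz this yields a 44-sample period: 22 zeros followed by 22 samples at the given amplitude.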
Liquid Metal Tool Chain
10
[Diagram: Lime source → Quicklime front-end compiler → Streaming IR. From the Streaming IR, two back ends: the Crucible back-end compiler emits C, built with the Cell SDK into a Cell binary for the LM VM on a Cell BE; the Optimus back-end compiler, guided by an FPGA model, emits HDL, compiled as Xilinx VHDL into a Xilinx bitfile for the LM VM on a Virtex5 FPGA.]
Streaming Intermediate Representation (SIR)
11
[Diagram: SIR building blocks: Task; Pipeline; SplitJoin (splitter → parallel branches → joiner); Feedback Loop (joiner → body → splitter); Switch (switch → branches → joiner).]
• A task may be stateless or have state
• Each task is mapped to a "module" with FIFO I/O
• Task graphs are hierarchical & structured
SIR Compiler Optimizations
12
Address FPGA compilation challenges:
• Finite, non-virtualizable device
• Complex optimization space: throughput, latency, power, area
• Very long synthesis times (minutes to hours)
Key optimizations:
• Task fusion and fission: load balancing, scalability
• Stream buffer allocation: enhances locality; manages cache footprint or SRAM and control-logic complexity
• Data access fusion: reduces critical path length, improves communication-to-computation balance
Preliminary Liquid Metal Results on Energy Consumption: FPGA vs PPC 405
13
[Bar chart: fraction of PowerPC energy used by each benchmark (FFT, Parallel A..., Bubble Sort, Merge Sort, Discrete..., DES, Matrix Mult..., Matrix Bloc..., Average); y-axis 0 to 0.8. Bars labeled ~1.4, ~1.4, ~1.4, and 2.25 exceed the chart's scale.]
• Liquid Metal on Virtex 4 FPGA, 1.6W• C reference implementation on PPC 405, 0.5W
Preliminary Liquid Metal Results on Parallelism: FPGA vs PPC 405
14
• Liquid Metal on Virtex 4 FPGA, 1.6W• C reference implementation on PPC 405, 0.5W
Handel-C Comparison
Compared DES and DCT with hand-optimized Handel-C implementations.
Performance:
• 5% faster before optimizations
• 12x faster after optimizations
Area:
• 66% larger before optimizations
• 90% larger after optimizations
15
Overview
Compilation Flow
Scheduling
Optimizations
Results
16
Top Level Compilation
[Diagram: a stream graph (Source → Round-Robin Splitter(8,8,8,8) → parallel Filters A through J → Round-Robin Joiner(1,1,1,1) → Sink) is compiled into hardware modules M0 ... Mn. Each filter module contains an Init block, a Controller, and a Work block, with inputs i0, i1, ..., ix and outputs O0, ..., Om, plus local array storage a[ ].]
17
Filter Compilation
The filter's work function is lowered to a control-flow graph of basic blocks; for example, a filter that sums 8 popped values:
  bb1: sum = 0; i = 0
  bb2: temp = pop()
  bb3: sum = sum + temp; i = i + 1; branch bb2 if i < 8
  bb4: push(sum)
[Diagram: each basic block becomes a module with control in/out ports, live-data in/out ports, memory/queue ports, and an Ack signal. Registers and muxes carry live data between blocks, FIFO Read/FIFO Write units access the channels, and a control token passes between blocks to sequence execution.]
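In software terms, the four basic blocks implement the following behavior; this hypothetical Java model (queues standing in for the FIFO channels) is a sketch of the filter's semantics, not of the generated hardware:

```java
import java.util.ArrayDeque;

// Software model of the filter above: bb1 initializes, the bb2/bb3
// loop pops and accumulates 8 values, and bb4 pushes the sum.
final class SumFilter {
    static void work(ArrayDeque<Integer> in, ArrayDeque<Integer> out) {
        int sum = 0;                      // bb1: sum = 0, i = 0
        for (int i = 0; i < 8; i++) {
            int temp = in.removeFirst();  // bb2: temp = pop()
            sum += temp;                  // bb3: sum += temp; i++; branch
        }
        out.addLast(sum);                 // bb4: push(sum)
    }
}
```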
18
Operation Compilation
[Diagram: each operation maps to a functional unit (FU) with inputs i0, ..., im, outputs o0, ..., on, and a predicate input. For bb3 (sum = sum + temp; i = i + 1; branch bb2 if i < 8), one adder computes sum + temp, another computes i + 1, and a comparator against the constant 8 drives control out 3 or control out 4.]
19
Static Stream Scheduling
20
[Diagram: Filter 1 pushes 2 items per firing into a channel; Filter 2 pops 3 per firing.]
• Each queue has to be deep enough to hold the values generated by a single execution of the connected filter
• Double buffering is needed
• Buffer access is non-blocking
• A controller module is needed to orchestrate the schedule
• The controller uses a finite state machine to execute the steady-state schedule
Greedy Stream Scheduling
[Diagram: Filter 1 → channel → Filter 2.]
• Filters fire eagerly; channel access is blocking
• Allows for potentially smaller channels
• A controller is not needed
• Results are produced with lower latency
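Greedy firing with blocking channels can be illustrated in software with a bounded queue (the class below is a hypothetical illustration, not an Optimus API): pushes block when the channel is full and pops block when it is empty, so no schedule controller is required.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of greedy scheduling: the producer fires eagerly, and a
// bounded BlockingQueue models a small hardware channel.
final class GreedyPipeline {
    static int runOnce(int[] inputs, int channelCapacity) {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(channelCapacity);
        Thread producer = new Thread(() -> {
            try {
                for (int v : inputs) channel.put(v);   // blocks when channel is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < inputs.length; i++)
                sum += channel.take();                  // blocks when channel is empty
            producer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return sum;
    }
}
```

With capacity 2, the producer stalls whenever it runs more than two pushes ahead of the consumer, mimicking a small hardware FIFO.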
21
Latency Comparison
[Bar chart: latency of the static schedule relative to greedy, for FFT, Parallel Adder, Bubble Sort, Merge Sort, Discrete Cos..., DES, Matrix Multiply, Matrix Block M..., Average; y-axis 0 to 18.]
22
Area Comparison
[Bar chart: percent of FPGA area for circuits with the static scheduler vs. circuits with the greedy scheduler, for FFT, Parallel Adder, Bubble Sort, Merge Sort, Discrete Cos..., DES, Matrix Multiply, Matrix Block M..., Average; y-axis 0 to 100%.]
23
Optimizations
Streaming optimizations (macro-functional):
• Channel allocation, channel access fusion, critical path balancing, filter fission and fusion, etc.
• These optimizations need global information about the stream graph
• Typically performed manually using existing tools
Classic optimizations (micro-functional):
• Flip-flop elimination, common subexpression elimination, constant folding, loop unrolling, etc.
• Typically included in existing compilers and tools
24
Channel Allocation
Larger channels mean:
• More SRAM
• More control logic
• Fewer stalls
Interlocking makes sure that each filter gets the right data or blocks.
What is the right channel size?
25
Channel Allocation Algorithm
1. Set the size of the channels to infinity.
2. Warm up the queues.
3. Record the steady-state instruction schedules for each producer–consumer pair.
4. Unroll the schedules to have the same number of pushes and pops.
5. Find the maximum number of overlapping lifetimes; this is the channel size to allocate.
26
Channel Allocation Example
Producer: --, --, push, --, push, --, push, push, push, --, --, push
Consumer: --, --, pop, --, --, --, pop, --, pop, pop, pop, pop
Max overlap = 3
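The final step (finding the maximum number of overlapping lifetimes) amounts to tracking peak channel occupancy over the unrolled schedules; a small Java sketch (hypothetical names, assuming the two schedules are already aligned cycle by cycle):

```java
// Sketch of channel sizing: given aligned producer/consumer schedules
// (true = a push or pop fires in that cycle), count the peak number of
// values simultaneously live in the channel.
final class ChannelSizing {
    static int maxOverlap(boolean[] pushes, boolean[] pops) {
        int live = 0, peak = 0;
        for (int cycle = 0; cycle < pushes.length; cycle++) {
            if (pushes[cycle]) live++;        // value enters the channel
            if (live > peak) peak = live;     // record peak occupancy
            if (pops[cycle]) live--;          // value leaves the channel
        }
        return peak;                          // = required channel depth
    }
}
```

For instance, three pushes followed three cycles later by three pops gives a peak occupancy, and hence a required channel depth, of 3.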
Producer Consumer
Source
Filter 1
Filter 2
Sink
27
Channel Allocation
28
[Bar chart: relative channel size after optimization for FFT, Parallel A..., Bubble Sort, Merge Sort, Discrete..., DES, Matrix Mul..., Matrix Blo..., Average; y-axis 0 to 0.8.]
Channel Access Fusion
Each channel access (push or pop) takes one cycle. Many fine-grained accesses:
• worsen the communication-to-computation ratio
• lengthen the critical path
• limit task-level parallelism
29
Channel Access Fusion Algorithm
• Cluster channel access operations: loop unrolling, code motion, balancing the groups
• Similar to vectorization: uses wide channels
30
[Diagram: fused accesses on wide channels with different multiplicities, e.g. write mult. = 1 with read mult. = 8; write mult. = 8 with read mult. = 8; write mult. = 4 with read mult. = 1.]
Access Fusion Example

Original:
  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += pop();
  push(sum);

Fused:
  int sum = 0;
  int t1, t2, t3, t4;
  for (int i = 0; i < 8; i++) {
    (t1, t2, t3, t4) = pop4();
    sum += t1 + t2 + t3 + t4;
  }
  push(sum);

Some caveats: leftover pops that do not divide evenly complicate fusion:
  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += pop();
  pop(); pop();
  push(sum);

becomes:
  int sum = 0;
  for (int i = 0; i < 8; i++) {
    sum += pop();
    sum += pop();
    sum += pop();
    sum += pop();
  }
  pop(); pop();
  push(sum);
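The first rewrite above preserves semantics; the Java check below (with pop4 modeled as four dequeues from the same channel) confirms that the fused loop computes the same sum as the original:

```java
import java.util.ArrayDeque;

// Software check of access fusion: the fused loop performs the same 32
// pops as the original, four at a time, and must yield the same sum.
final class AccessFusionDemo {
    static int original(ArrayDeque<Integer> in) {
        int sum = 0;
        for (int i = 0; i < 32; i++) sum += in.removeFirst();
        return sum;
    }

    static int fused(ArrayDeque<Integer> in) {
        int sum = 0;
        for (int i = 0; i < 8; i++) {
            // (t1, t2, t3, t4) = pop4(): one wide access replaces four
            int t1 = in.removeFirst(), t2 = in.removeFirst();
            int t3 = in.removeFirst(), t4 = in.removeFirst();
            sum += t1 + t2 + t3 + t4;
        }
        return sum;
    }
}
```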
31
Access Fusion
32
[Bar chart: speedup (x100%) from channel access fusion for FFT, Parallel A..., Bubble Sort, Merge Sort, Discrete..., DES, Matrix Mult..., Matrix Bloc..., Average; y-axis 0 to 8.]
Critical Path Balancing
• The critical path is set by the longest combinational path in the filters
• Optimus uses its internal FPGA model to estimate how this impacts throughput and latency
Balancing algorithm:
1. Optimus takes the target clock as input
2. Start with the least number of basic blocks
3. Form USE/DEF chains for the filter
4. Use the internal FPGA model to measure critical-path latency
5. Break the paths whose latency exceeds the target
33
Critical Path Balancing Example
[Diagram: an expression tree of Mul, Add, Sub, and Shift operations is partitioned into pipeline stages (labeled 1 through 4) so that no stage's combinational delay exceeds the target clock.]
Operation delays: Add/Sub = 4, Shift = 2, Multiply = 10
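Using the delay table above (Add/Sub = 4, Shift = 2, Multiply = 10), the path-breaking step can be sketched for a straight-line chain of operations; real filters have USE/DEF DAGs rather than chains, so this linear version is only an illustration:

```java
// Sketch of path breaking for a linear chain of operations: accumulate
// combinational delay and start a new pipeline stage (i.e., insert a
// register) whenever the target clock would be exceeded.
final class PathBalancer {
    static int delayOf(String op) {
        switch (op) {
            case "Mul":   return 10;
            case "Shift": return 2;
            default:      return 4;   // Add / Sub
        }
    }

    // Returns how many pipeline stages the chain is split into.
    static int stages(String[] chain, int targetClock) {
        int stages = 1, delay = 0;
        for (String op : chain) {
            int d = delayOf(op);
            if (delay + d > targetClock) {  // break the path here
                stages++;
                delay = 0;
            }
            delay += d;
        }
        return stages;
    }
}
```

With a target clock of 14, the chain Mul, Add, Mul, Sub splits into two stages of delay 14 each; tightening the target to 12 forces one operation per stage.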
34
Liquid Metal
35
• Interdisciplinary effort addressing the entire stack
• One language for programming HW (FPGAs) and SW
• Liquid Metal VM: JIT the hardware!
[Diagram: program GPU, multicore CPU, FPGA, and future targets all with Lime, via the Liquid Metal VM.]
Streaming IR:
• Exposes structure: computation and communication
• Uniform framework for pipeline and data parallelism
• Canonical representation for stream-aware optimizations
Streaming optimizations:
• Macro-functional: fold streaming IR graphs into the FPGA; fusion, fission, replication, subject to latency, area, and throughput constraints
• Micro-functional: micro-pipelining, channel allocation, access fusion, flip-flop elimination
Ongoing Effort
• Application development: streaming for enterprise and consumer; real-time applications
• Compiler and JIT: pre-provisioning profitable HW implementations; runtime opportunities to "JIT" the HW
• Advanced dynamic reconfiguration support in the VM: predictive, hides latency
• New platforms: tightly coupled, higher-bandwidth, lower-latency communication; heterogeneous MPSoC systems (FPGA + processors)
38