Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research Frank Vahid...

Application-Specific Customization of Microblaze Processors, and other

UCR FPGA Research

Frank Vahid Professor

Department of Computer Science and Engineering, University of California, Riverside

Associate Director, Center for Embedded Computer Systems, UC Irvine

Work supported by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx

Collaborators: David Sheldon (4th yr UCR PhD student), Roman Lysecky (PhD UCR 2005, now Asst. Prof. at U. Arizona), Rakesh Kumar (PhD UCSD 2006, now Asst. Prof. at UIUC), Dean Tullsen (Prof. at UCSD)

Frank Vahid, UC Riverside

Outline

Two UCR ICCAD'06 papers
  Microblaze customization
  Microblaze conjoining (and customization)
Current work targeting Microblaze users
  "Design of Experiments" paradigm
  System-level synthesis for multi-core systems
Related FPGA work
  "Warp processing"
  Standard binaries for FPGAs

Microblaze Customization (ICCAD paper #1)

FPGAs an increasingly popular software platform
FPGA soft core processor
  Microprocessor synthesized onto FPGA fabric
Soft core customization
  Cores come with configurable parameters
  Xilinx Microblaze comes with several instantiable units: multiplier, barrel shifter, divider, FPU, or cache
  Customization: tuning soft core parameters to a specific application

[Figure: microprocessor with optional units Mul, BS, Div, FPU, I$; App1 core with Mul and I$; App2 core with Mul, FPU, and Div]

Instantiable Unit Speedups

Instantiating units can yield significant speedups
"base": Microblaze without any optional units instantiated

[Figure: bar chart of speedup per benchmark for Base, Barrel Shifter, Divider, Multiplier, and FPU; axis from 0.00 to 1.40, off-scale bars labeled 4.09, 2.64, 1.97, 1.54, 1.93, 6.54]

Customization Tradeoffs

Data for aifir EEMBC benchmark on Xilinx Microblaze synthesized to Virtex device

[Figure: application runtime (ms) versus size (equivalent LUTs) for all 32 combinations of multiplier, barrel shifter, divider, floating point unit, and MCH cache; labeled points: base, bs, mul, mul+bs, bs+cache, mul+bs+cache, FPU; annotations: 2x performance tradeoff, 4.5x size tradeoff]

Key Problem Related to Core Customization

Problem: Synthesis of one soft core configuration, and execution/simulation on that configuration, requires about 15 minutes

Thus, for reasonable customization tool runtimes, can only synthesize 5-10 configurations in search of best one
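A rough sense of this budget can be sketched as follows, assuming the five optional units named earlier (multiplier, barrel shifter, divider, FPU, cache) are each simply on or off and that every configuration costs the 15 minutes quoted above. The unit abbreviations and function names are illustrative only and do not come from any Xilinx tool.

```python
from itertools import product

# Optional units named on the slides; 15 minutes per run is the slide's figure.
UNITS = ["mul", "bs", "div", "fpu", "cache"]
MINUTES_PER_RUN = 15

def all_configurations(units):
    """Return every on/off combination of the optional units as a tuple of names."""
    configs = []
    for bits in product([0, 1], repeat=len(units)):
        configs.append(tuple(u for u, b in zip(units, bits) if b))
    return configs

def exploration_minutes(num_runs, minutes_per_run=MINUTES_PER_RUN):
    """Tool runtime if each configuration needs one synthesis + execution run."""
    return num_runs * minutes_per_run

configs = all_configurations(UNITS)
exhaustive_hours = exploration_minutes(len(configs)) / 60
budget_minutes = [exploration_minutes(n) for n in (5, 10)]
print(len(configs), exhaustive_hours, budget_minutes)
```

Under these assumptions, exhaustive search over 32 configurations takes 8 hours, while 5-10 runs take 75 to 150 minutes.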



Two Solution Approaches

Traditional CAD approach
  Pre-characterize using synthesis and execution/simulation, create abstract problem model, solve using CAD exploration algorithms
  Used 0-1 knapsack formulation
Synthesis-in-the-loop approach
  Run synthesis and execute/simulate application while exploring
  More accurate

[Figure: traditional flow: start, Pre-characterize (synthesis and execution/simulation, 5-10 executions), Model (typically some form of graph), Explore, finish; synthesis-in-the-loop flow: start, Explore with synthesis and execution/simulation in the loop (5-10 iterations), finish]
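The 0-1 knapsack formulation used in the traditional CAD approach can be sketched as below. This is a minimal textbook dynamic program, not the paper's implementation: it assumes each unit has a pre-characterized size and an independent, additive gain, and the item numbers are invented for illustration.

```python
def knapsack(items, capacity):
    """items: list of (name, size, gain); capacity: size budget in the same units.
    Returns (best_total_gain, chosen_names) using the classic 0-1 knapsack DP."""
    # best[c] = (gain, names) achievable with total size <= c
    best = [(0, ())] * (capacity + 1)
    for name, size, gain in items:
        # Walk capacity downward so each unit is instantiated at most once.
        for c in range(capacity, size - 1, -1):
            cand_gain = best[c - size][0] + gain
            if cand_gain > best[c][0]:
                best[c] = (cand_gain, best[c - size][1] + (name,))
    return best[capacity]

# Hypothetical pre-characterization data: (unit, size in LUTs, gain in percent).
ITEMS = [("mul", 300, 50), ("bs", 200, 60), ("div", 400, 10), ("fpu", 1500, 90)]
print(knapsack(ITEMS, 600))
```

The additive-gain assumption is what makes the abstract model fast to solve and also why it can be less accurate than running synthesis in the loop.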

Synthesis-in-the-Loop Approach

View solution space as tree, each level a decision for a unit; order levels by unit speedup/size for application
  11 synthesis runs may take a few hours
To reduce, can consider using pre-determined order
  Determined by soft core vendor based on averages over many benchmarks

[Figure: bar chart of 1000*Speedup/Size for Multiplier, Barrel Shifter, Floating Point Unit, Divider, MCH Cache]
[Figure: application-specific impact-ordered tree with Yes/No branches from base (for example base+div); levels: Barrel shifter, Multiplier, Divider, FPU, Cache]
[Figure: fixed impact-ordered tree; levels: Divider, Barrel Shifter, Multiplier, FPU, Cache]
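One way to read the impact-ordered tree is as a greedy walk with one keep-or-drop decision per unit. The sketch below is an interpretation, not the authors' code: `evaluate` stands in for one synthesis plus execution run, and the toy speed and size numbers are invented. Under this reading, an application-specific order costs 1 base run + 5 single-unit runs + 5 tree decisions = 11 runs, while a fixed order skips the single-unit runs.

```python
def impact_ordered_search(units, evaluate, size_limit=None, order=None):
    """Greedy walk down a decision tree with one level per unit.
    evaluate(config) -> (runtime, size) stands in for one synthesis + execution run.
    order: fixed unit order; if None, rank units by speedup/size from single-unit runs.
    Returns (chosen_units, best_runtime, number_of_runs)."""
    runs = 0

    def run(config):
        nonlocal runs
        runs += 1
        return evaluate(tuple(config))

    base_time, _ = run([])
    if order is None:
        impact = {}
        for u in units:
            t, s = run([u])
            impact[u] = (base_time / t) / s
        order = sorted(units, key=impact.get, reverse=True)

    chosen, best_time = [], base_time
    for u in order:
        t, s = run(chosen + [u])
        # Keep the unit only if it helps and the core still fits.
        if t < best_time and (size_limit is None or s <= size_limit):
            chosen.append(u)
            best_time = t
    return chosen, best_time, runs

# Hypothetical toy model of one application (not measured data).
SPEED = {"mul": 1.5, "bs": 2.0, "div": 1.0, "fpu": 1.0, "cache": 1.2}
SIZE = {"mul": 300, "bs": 200, "div": 400, "fpu": 1500, "cache": 600}

def toy_evaluate(config):
    time = 100.0
    for u in sorted(config):  # fixed order keeps the arithmetic reproducible
        time /= SPEED[u]
    return time, 1000 + sum(SIZE[u] for u in config)

UNITS = ["mul", "bs", "div", "fpu", "cache"]
app_specific = impact_ordered_search(UNITS, toy_evaluate)
fixed = impact_ordered_search(UNITS, toy_evaluate,
                              order=["bs", "mul", "div", "fpu", "cache"])
print(app_specific[0], app_specific[2], fixed[0], fixed[2])
```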

Synthesis-in-the-Loop Approach

Data for fixed impact-ordered tree for 11 EEMBC benchmarks

[Figure: three bar charts: Speedup, Size (Equiv LUTs), and Speedup/Size; category labels include FPU, Divider, Cache]

Customization Results

Fixed tree approach generally best
App-spec tree better for certain apps, but 2x runtime
ICCAD'06, David Sheldon et al.

[Figure: four plots of tool run time (m) versus speedup, comparing Fixed Order Impact-ordered Tree, Application-Specific Impact-ordered Tree, Random Impact-ordered Tree, Exhaustive, and Knapsack. Panels: no size constraint, Virtex II; size constraint = 80% of full MB size, Virtex II; size constraint = 80% of application-specific optimal MB configuration (guaranteed to "hurt"), Virtex II; no size constraint, Spartan2 device. Results are averages for 11 EEMBC benchmarks]

Conjoined Processors (ICCAD paper #2)

Conjoined processors
  Two processors sharing a hardware unit to save size (Kumar/Jouppi/Tullsen ISCA 2004)
  Showed little performance overhead for desktop processors
  Only research customer is Intel; for soft core processors, research customers are every soft core user
How much size savings and performance overhead for conjoined Microblazes?

[Figure: unconjoined: Processor 1 and Processor 2, each with its own Multiplier; conjoined: Processor 1 and Processor 2 sharing one Multiplier]

Conjoined Processors – Size Savings

[Figure: bar chart of equivalent LUTs (0 to 10000) for separate (sep) versus conjoined (conj) processor pairs, by unit instantiated with base processor: bs, div, mul, fpu; bars labeled 6%, 4%, 23%, 32%]

Conjoined Processors – Performance Overhead

We created a trace simulator
  Reads two instruction traces output by MB simulator
  Adds 1-cycle delay for every access to a conjoined unit (pessimistic assumption about contention detection scheme)
  Looks for simultaneous access of shared unit, stalls one MB entirely until unit becomes available

[Figure: trace simulator fed by traces of two benchmarks, brev and bitmnp]

Configuration          Cycle Latency
Barrel Shifter         2
Divider                34
Multiplier             3
FPU (Add, Sub, Mul)    6
FPU (Div)              30
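The trace simulator described above could look roughly like the following. This is a simplified sketch under stated assumptions, not the authors' tool: every non-shared instruction takes one cycle, a shared-unit instruction occupies the unit for its latency plus the 1-cycle conjoinment penalty, and processor 0 wins ties. The trace format (a list of opcode strings) is invented for illustration.

```python
def simulate_conjoined(trace_a, trace_b, shared_op, latency, penalty=1):
    """trace_a / trace_b: lists of opcode strings, one per instruction.
    A shared_op instruction holds the shared unit for latency + penalty cycles;
    a processor that finds the unit busy stalls until it is free.
    Returns the cycle at which each processor finishes."""
    traces = (trace_a, trace_b)
    pc = [0, 0]        # next instruction index per processor
    ready = [0, 0]     # cycle at which each processor may issue again
    unit_free = 0      # cycle at which the shared unit becomes available
    cycle = 0
    while pc[0] < len(trace_a) or pc[1] < len(trace_b):
        for p in (0, 1):                       # assumption: processor 0 wins ties
            if pc[p] >= len(traces[p]) or ready[p] > cycle:
                continue
            if traces[p][pc[p]] == shared_op:
                if unit_free > cycle:
                    continue                   # unit busy: stall this processor
                unit_free = cycle + latency + penalty
                ready[p] = unit_free
            else:
                ready[p] = cycle + 1
            pc[p] += 1
        cycle += 1
    return tuple(ready)

# Two made-up traces sharing a 3-cycle multiplier (latency from the table above).
a = ["add", "mul", "add"]
b = ["mul", "add"]
print(simulate_conjoined(a, b, "mul", 3))        # conjoined
print(simulate_conjoined(a, [], "mul", 3, 0))    # processor A alone, no penalty
```

Comparing the conjoined finish times with each trace run alone (penalty 0) gives the performance overhead for a given pairing.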

Conjoined Processors – Performance Overhead

Data shown for benchmarks that benefit (>1.3x speedup) from barrel shifter

Performance overheads are small

[Figure: bar chart of speedup (0 to 4.5), conjoined versus unconjoined, for the pairings (brev),canrdr; brev,(canrdr); (brev),bitmnp; brev,(bitmnp); (brev),brev; brev,(brev); (bitmnp),canrdr; bitmnp,(canrdr); (bitmnp),bitmnp; bitmnp,(bitmnp); (canrdr),canrdr; canrdr,(canrdr)]

Customization Considering Conjoinment

Developed 0-1 knapsack approach, "Disjunctively-Constrained Knapsack Solution", to accommodate conjoinment

[Figure: two bar charts, speedup (0 to 8) and size in equiv. LUTs (0 to 12000), comparing knapsack, exhaustive w/ conj., and exhaustive w/o conj. for the pairings BaseFP01, BaseFP01; BaseFP01, bitmnp; BaseFP01, canrdr; bitmnp, bitmnp; canrdr, canrdr; tblook, bitmnp; tblook, canrdr; tblook, tblook; and AVERAGE]

Note: To avoid exaggerating the benefits of conjoinment, data only considers benchmark pairs that significantly use a shared unit
Only 8 pairings shown due to space limits
ICCAD'06, David Sheldon et al.



Current work targeting Microblaze users
  "Design of Experiments" paradigm
  System-level synthesis for multi-core systems
Related FPGA work
  "Warp processing"
  Standard binaries for FPGAs

Ongoing Work – Design of Experiments Paradigm

"Design of Experiments"
  Well-established discipline (>80 yrs) for tuning parameters
    For factories, crops, management, etc.
  Want to set parameter values for best output
    But each experiment costly, so can't try all combinations
Clear mapping of soft core customization to DOE problem
  Given parameters and # of possible experiments
  Generates which experiments to run (parameter values)
  Analyzes resulting data
  Sound mathematical foundations
Present focus of David Sheldon (4th yr Ph.D.)


Suppose time for 12 experiments

DOE tool generates which 12 experiments to run

User fills in results column

Factor  A   B    C    D    E    F     G            H            I            J
Row #   BS  FPU  MUL  DIV  MSR  COMP  ICACHE_type  ICACHE_size  DCACHE_type  DCACHE_size
1       0   0    0    0    0    0     0            0            0            0
2       0   0    0    0    0    1     1            1            1            1
3       0   0    1    1    1    0     0            0            1            1
4       0   1    0    1    1    0     1            1            0            0
5       0   1    1    0    1    1     0            1            0            1
6       0   1    1    1    0    1     1            0            1            0
7       1   0    1    1    0    0     1            1            0            1
8       1   0    1    0    1    1     1            0            0            0
9       1   0    0    1    1    1     0            1            1            0
10      1   1    1    0    0    0     0            1            1            0
11      1   1    0    1    0    1     0            0            0            1
12      1   1    0    0    1    0     1            0            1            1

CyclesY1126962653544216826221438086472860019

10818171264450980461713601399330845092089467276392




DOE tool analyzes results
Finds most important factors for given application

[Figure: Y bar Marginal Means Plot; y axis from 0 to 10,000,000; x axis: effect levels 0 and 1 for each of BS, FPU, MUL, DIV, MSR, COMP, ICACHE_type, ICACHE_size, DCACHE_type, DCACHE_size]
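The marginal-means analysis can be illustrated with the 12-run design from the experiment table. The design rows below are taken from that table; the cycle counts are invented (generated as 100 - 40*BS - 10*MUL) purely to show the mechanics, and the function names are hypothetical, not those of any DOE tool.

```python
FACTORS = ["BS", "FPU", "MUL", "DIV", "MSR", "COMP",
           "ICACHE_type", "ICACHE_size", "DCACHE_type", "DCACHE_size"]

# 12-run design matrix as listed on the slide (one row per experiment).
DESIGN = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 0, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 0, 1, 1],
]

def marginal_means(design, results):
    """For each factor, average the results over runs at level 0 and at level 1."""
    means = []
    for col in range(len(design[0])):
        lo = [y for row, y in zip(design, results) if row[col] == 0]
        hi = [y for row, y in zip(design, results) if row[col] == 1]
        means.append((sum(lo) / len(lo), sum(hi) / len(hi)))
    return means

def rank_factors(names, means):
    """Order factors by the gap between their two marginal means."""
    effects = {n: abs(hi - lo) for n, (lo, hi) in zip(names, means)}
    return sorted(names, key=effects.get, reverse=True)

# Hypothetical cycle counts (arbitrary units), NOT the measured values.
results = [100 - 40 * row[0] - 10 * row[2] for row in DESIGN]
means = marginal_means(DESIGN, results)
ranked = rank_factors(FACTORS, means)
print(means[0], ranked[:2])
```

Because the response here is a cycle count, a lower marginal mean at level 1 means instantiating that unit helps.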


Results for a different application

[Figure: Y bar Marginal Means Plot; y axis from 0 to 30; x axis: effect levels 0 and 1 for each of BS, FPU, MUL, DIV, MSR, COMP, ICACHE_type, ICACHE_size, DCACHE_type, DCACHE_size]


Interactions among parameters also automatically determined

[Figure: 10 x 10 matrix of interaction plots over the factors BS, FPU, MUL, DIV, MSR, COMP, ICACHE_type, ICACHE_size, DCACHE_type, DCACHE_size. Each off-diagonal panel (for example "BS vs FPU") plots the response at levels 0.00 and 1.00 of the first factor, with one line per level of the second factor (for example FPU = 0 and FPU = 1), on a y axis from 2937659.33 to 10907794.00. Each diagonal panel is a Marginal Means plot on a y axis from 0.00 to 9384693.17]

Ongoing work – System synthesis

Given N applications
Create customized soft core for each app
  Criteria: Meet size constraint, minimize total applications' runtime
  Other criteria possible (e.g., meet runtime constraint, minimize size)
Present focus of Ryan Mannion, 3rd yr Ph.D.

[Figure: App1, App2, ..., AppN, each mapped to its own core: a Microblaze with Mul and I$, a PicoBlaze, and a Microblaze with Mul, FPU, and Div]

Ongoing work – System synthesis

Presently use Integer Linear Program
Solutions for large set of Xilinx devices generated in seconds

Device  LUTs  PPCs  Area  Utilization  Cycles  bitmnp  canrdr  aifir  brev  g3fax  g721_ps  idct  matmul  tblook  ttsprk  BaseFP01  raytrace
XC2V2000  21504  0  21439  99.70%  2.46E+08  PBBase  MBShift  PBBase  MBShift  PBBase  PBBase  MBMult  MBMult  MBBase  MBShiftDiv  MBShiftFPU  MBBase
XC2VP2  2816  0  0  0.00%  0  infeasible (all applications)
XC4VLX80  71680  0  69065  96.35%  1.86E+08  MBMultShift  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultShift  MBMultShiftFPU  MBMultShiftDiv  MBMultShiftFPU  MBMultShiftFPU
XC4VLX15  12288  0  12247  99.67%  4.62E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  PBBase  MBBase  PBBase  MBShiftFPU  MBShift
XC2S300E  6140  0  6036  98.31%  8.21E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  MBShift
XC2V4000  46080  0  45313  98.34%  1.87E+08  MBShift  MBShift  MBMultShift  MBMultShift  MBBase  MBShift  MBMultShiftFPU  MBMultShift  MBMultShiftFPU  MBMultShiftDiv  MBShiftFPU  MBShiftFPU
XC2VP40  38784  2  38712  99.81%  1.3E+08  MBShift  MBMultShift  MBMultShift  MBMultShift  MBBase  MBMultShiftDiv  MBMultShiftFPU  MBMultShift  PPCBase  MBMultShiftDiv  MBMultShiftFPU  PPCBase
XC4VSX25  20480  0  20341  99.32%  2.6E+08  PBBase  PBBase  PBBase  MBShift  PBBase  PBBase  MBMultShift  MBMult  MBBase  MBShiftDiv  MBShiftFPU  MBShift
XC4VSX35  30720  0  30681  99.87%  1.96E+08  MBShift  MBShift  MBBase  MBShift  MBBase  PBBase  MBMult  MBMult  MBBase  MBMultDiv  MBShiftFPU  MBShiftFPU
XC4VFX20  17088  1  17023  99.62%  2.25E+08  PBBase  PBBase  PBBase  MBShift  PBBase  PBBase  MBShift  MBMult  MBShift  MBShiftDiv  MBShiftFPU  PPCBase
XC2S150E  3456  0  2928  84.72%  1.33E+09  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase
XC2VP30  27392  2  26488  96.70%  1.31E+08  MBShift  MBShift  MBShift  MBShift  MBBase  MBShift  MBMultShift  MBMultShift  PPCBase  MBMultShiftDiv  MBShiftFPU  PPCBase
XC4VLX60  53248  0  52136  97.91%  1.87E+08  MBShift  MBMultShift  MBMultShift  MBMultShift  MBBase  MBMultShiftFPU  MBMultShiftFPU  MBMultShift  MBMultShiftFPU  MBMultShiftDiv  MBShiftFPU  MBShiftFPU
XC2S600E  13824  0  13801  99.83%  3.96E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  MBBase  MBShift  PBBase  MBShiftFPU  MBShift
XC2VP20  18560  2  18527  99.82%  1.53E+08  PBBase  MBShift  PBBase  MBShift  PBBase  PBBase  MBMultShift  MBMult  PPCBase  MBShiftDiv  MBShiftFPU  PPCBase
XC2V500  6144  0  6036  98.24%  8.21E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  MBShift
XC2VPX70  66176  2  61858  93.47%  1.3E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultDivFPU  PPCBase  MBMultShiftDiv  MBMultShiftFPU  PPCBase
XC4VLX40  36864  0  36746  99.68%  1.88E+08  MBShift  MBShift  MBShift  MBShift  MBBase  MBShift  MBMultShift  MBMultShift  MBShiftFPU  MBMultShiftDiv  MBShiftFPU  MBShiftFPU
XC2V6000  67584  0  65738  97.27%  1.86E+08  MBMultShift  MBMultShift  MBMultShiftFPU  MBMultShift  MBMultFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultShift  MBMultShiftFPU  MBMultShiftDiv  MBMultShiftFPU  MBMultShiftFPU
XC4VFX60  50560  2  48665  96.25%  1.3E+08  MBMultShift  MBMultShift  MBMultShift  MBMultShift  MBMultFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultShift  PPCBase  MBMultShiftDiv  MBMultShiftFPU  PPCBase
XC4VFX100  84352  2  62218  73.76%  1.3E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  PPCBase  MBMultShiftDiv  MBMultShiftDivFPU  PPCBase
XC2VP4  6016  1  5792  96.28%  6E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  PBBase  PPCBase  MBShift
XC2VP70  66176  2  61858  93.47%  1.3E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultDivFPU  PPCBase  MBMultShiftDiv  MBMultShiftFPU  PPCBase
XC2V40  512  0  0  0.00%  0  infeasible (all applications)
XC2V1500  15360  0  15291  99.55%  3.42E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  MBBase  MBBase  MBShiftDiv  MBShiftFPU  MBShift
XC2V8000  93184  0  76084  81.65%  1.86E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  MBMultShiftFPU  MBMultShiftDiv  MBMultShiftDivFPU  MBMultShiftDivFPU
XC2V1000  10240  0  10014  97.79%  5.48E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  MBBase  MBBase  PBBase  MBShift  MBBase
XC4VSX55  49152  0  48970  99.63%  1.87E+08  MBShift  MBMultShift  MBMultShift  MBMultShift  MBBase  MBMultShiftDiv  MBMultShiftFPU  MBMultShift  MBMultShiftFPU  MBMultShiftDiv  MBShiftFPU  MBShiftFPU
XC2S50E  1536  0  0  0.00%  0  infeasible (all applications)
XC2V3000  28672  0  28562  99.62%  2.01E+08  MBShift  MBShift  MBShift  MBShift  PBBase  PBBase  MBMultShift  MBMult  MBShift  MBShiftDiv  MBShiftFPU  MBShiftFPU
XC4VFX40  37284  2  36968  99.15%  1.3E+08  MBShift  MBMultShift  MBMultShift  MBMultShift  MBBase  MBMultShiftDiv  MBMultShiftFPU  MBMultShift  PPCBase  MBMultShiftDiv  MBShiftFPU  PPCBase
XC2VPX20  19584  1  19562  99.89%  1.92E+08  PBBase  MBShift  PBBase  MBShift  PBBase  PBBase  MBMultShift  MBMult  MBBase  MBDiv  MBFPU  PPCBase
XC2S100E  2400  0  0  0.00%  0  infeasible (all applications)
XC4VFX140  126336  2  62218  49.25%  1.3E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  PPCBase  MBMultShiftDiv  MBMultShiftDivFPU  PPCBase
XC4VLX25  21504  0  21439  99.70%  2.46E+08  PBBase  MBShift  PBBase  MBShift  PBBase  PBBase  MBMult  MBMult  MBBase  MBShiftDiv  MBShiftFPU  MBBase
XC2VP50  47232  2  46917  99.33%  1.3E+08  MBShift  MBMultShift  MBMultShift  MBMultShift  MBMultFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultShift  PPCBase  MBMultShiftDiv  MBMultShiftFPU  PPCBase
XC2VP7  9856  1  9664  98.05%  3.9E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  MBBase  MBBase  MBDiv  PPCBase  MBBase
XC2S400E  9600  0  9558  99.56%  5.62E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBBase  MBBase  MBBase  PBBase  MBBase  MBBase
XC2S200E  4704  0  4482  95.28%  9.96E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  PBBase
XC4VLX160  135168  0  76084  56.29%  1.86E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  MBMultShiftFPU  MBMultShiftDiv  MBMultShiftDivFPU  MBMultShiftDivFPU
XC2V250  3072  0  2928  95.31%  1.33E+09  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase
XC2V80  1024  0  0  0.00%  0  infeasible (all applications)
XC4VLX100  98304  0  76084  77.40%  1.86E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  MBMultShiftFPU  MBMultShiftDiv  MBMultShiftDivFPU  MBMultShiftDivFPU
XC2VP100  88192  2  62218  70.55%  1.3E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  PPCBase  MBMultShiftDiv  MBMultShiftDivFPU  PPCBase
XC4VLX200  178176  0  76084  42.70%  1.86E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  MBMultShiftFPU  MBMultShiftDiv  MBMultShiftDivFPU  MBMultShiftDivFPU
XC4VFX12  10944  1  10868  99.31%  3.72E+08  PBBase  PBBase  PBBase  MBBase  PBBase  PBBase  MBShift  MBBase  MBBase  MBBase  PPCBase  MBBase

Excerpt (first three applications, exact cycle counts):

Device  LUTs  PPCs  Area  Utilization  Cycles  bitmnp  canrdr  aifir
XC2V2000  21504  0  21439  99.70%  246,157,335  PBBase  MBShift  PBBase
XC2VP2  2816  0  0  0.00%  0  infeasible  infeasible  infeasible
XC4VLX80  71680  0  69065  96.35%  186,482,222  MBMultShift  MBMultShift  MBMultShiftFPU
XC4VLX15  12288  0  12247  99.67%  462,384,156  PBBase  PBBase  PBBase
XC2S300E  6140  0  6036  98.31%  821,383,431  PBBase  PBBase  PBBase
XC2V4000  46080  0  45313  98.34%  186,747,116  MBShift  MBShift  MBMultShift
XC2VP40  38784  2  38712  99.81%  129,796,303  MBShift  MBMultShift  MBMultShift
XC4VSX25  20480  0  20341  99.32%  260,363,903  PBBase  PBBase  PBBase
XC4VSX35  30720  0  30681  99.87%  196,496,340  MBShift  MBShift  MBBase
XC4VFX20  17088  1  17023  99.62%  225,019,722  PBBase  PBBase  PBBase
XC2S150E  3456  0  2928  84.72%  1,329,161,797  PBBase  PBBase  PBBase
XC2VP30  27392  2  26488  96.70%  131,131,074  MBShift  MBShift  MBShift
XC4VLX60  53248  0  52136  97.91%  186,547,866  MBShift  MBMultShift  MBMultShift
XC2S600E  13824  0  13801  99.83%  395,723,730  PBBase  PBBase  PBBase
XC2VP20  18560  2  18527  99.82%  152,516,142  PBBase  MBShift  PBBase

Graduate Student: Ryan Mannion, 3rd yr Ph.D.
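The system synthesis problem (one core per application, total size within the device, minimum total cycles) can be sketched without an ILP solver as a small multiple-choice knapsack. This dynamic program is a stand-in for the Integer Linear Program mentioned above, not the authors' formulation; the core names echo the table, but every size and cycle count here is invented.

```python
def synthesize(apps, lut_budget):
    """apps: {app_name: [(core_name, size_luts, cycles), ...]}.
    Picks exactly one core per application, minimizing total cycles subject to
    total size <= lut_budget.
    Returns (total_cycles, {app: core}) or None if no assignment fits."""
    # states: total size used -> (total cycles, choices so far)
    states = {0: (0, {})}
    for app, options in apps.items():
        nxt = {}
        for used, (cycles, picks) in states.items():
            for core, size, cyc in options:
                s = used + size
                if s > lut_budget:
                    continue
                cand = (cycles + cyc, {**picks, app: core})
                if s not in nxt or cand[0] < nxt[s][0]:
                    nxt[s] = cand
        states = nxt
    if not states:
        return None  # corresponds to an "infeasible" row
    return min(states.values(), key=lambda v: v[0])

# Hypothetical per-application options: (core, size in LUTs, cycles).
APPS = {
    "app1": [("PBBase", 200, 900), ("MBBase", 1000, 300), ("MBShift", 1200, 200)],
    "app2": [("PBBase", 200, 800), ("MBBase", 1000, 400), ("MBMult", 1300, 150)],
}
for budget in (300, 1500, 3000):
    print(budget, synthesize(APPS, budget))
```

As in the table, small devices come out infeasible or all-PicoBlaze, and larger budgets let more applications move to richer Microblaze configurations.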



Current work targeting Microblaze users
  "Design of Experiments" paradigm
  System-level synthesis for multi-core systems
Related FPGA work
  Warp processing
  Standard binaries for FPGAs

Binary-Level Synthesis

Binary-level FPGA compiler developed 2002-2006 (Greg Stitt, Ph.D. UCR 2007)
Source-level FPGA compiler provides a limited solution
Binary-level FPGA compiler provides a more general solution, at the expense of lost high-level information

[Figure: sources (C++, Java, asm, obj) pass through Compiler, Compiler, Assembler, and Linker to a microprocessor binary; the binary-level FPGA compiler takes the microprocessor binary and produces an FPGA binary plus a microprocessor binary]

Binary Synthesis Competitive with Source Level

Aggressive decompilation recovers most high-level constructs needed for good synthesis – Makes binary-level synthesis competitive with source level

[Figure: speedup (0 to 10) versus number of functions in hardware (1 to 49), source-level versus binary-level]

Freescale H264 decoder example, from ISSS/CODES 2005

Binary Synthesis Enables Dynamic Hardware/Software Partitioning

Called “Warp Processing” (Vahid/Stitt/Lysecky 2003-2007)

Direct collaborators: Intel, IBM, and Freescale

[Figure: chip or board containing a microprocessor, an FPGA, and an on-chip binary-level FPGA compiler; a downloader loads the microprocessor binary, and the on-chip compiler produces the FPGA binary from it]

Warp Processing Idea

Step 1: Initially, software binary loaded into instruction memory

Software binary:
  Mov reg3, 0
  Mov reg4, 0
  loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4

[Figure: block diagram with µP, I Mem, D$, Profiler, FPGA, and On-chip CAD]

Step 2: Microprocessor executes instructions in software binary

[Figure: block diagram with µP, I Mem, D$, Profiler, FPGA, and On-chip CAD; bars show time and energy for µP execution]

Step 3: Profiler monitors instructions and detects critical regions in binary

[Figure: the profiler observes a repeating stream of add and beq instructions; critical loop detected]
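The slides do not say how the profiler finds a critical region. One plausible mechanism, assumed here purely for illustration, is to count taken backward branches and flag the most frequent one as the critical loop. With instructions numbered from 0 and the branch offset taken relative to the branch itself, the example binary's Beq sits at index 7 and jumps back to index 2.

```python
from collections import Counter

def find_critical_loop(branch_trace, threshold):
    """branch_trace: iterable of (branch_address, target_address) for taken branches.
    Counts taken backward branches (target <= branch address) and returns the
    (loop_start, loop_end) pair of the most frequent one if its count reaches
    threshold; otherwise returns None."""
    counts = Counter()
    for addr, target in branch_trace:
        if target <= addr:
            counts[(target, addr)] += 1
    if not counts:
        return None
    loop, n = counts.most_common(1)[0]
    return loop if n >= threshold else None

# A 10-iteration loop takes its backward branch 9 times; (20, 30) is a forward branch.
trace = [(7, 2)] * 9 + [(20, 30)]
print(find_critical_loop(trace, threshold=5))
```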

Step 4: On-chip CAD reads in critical region

[Figure: block diagram with µP, I Mem, D$, Profiler, FPGA, and On-chip CAD; time and energy bars]

Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)

  reg3 := 0
  reg4 := 0
  loop:
  reg4 := reg4 + mem[reg2 + (reg3 << 1)]
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4

[Figure: block diagram with µP, I Mem, D$, Profiler, FPGA, and On-chip CAD, the latter labeled Dynamic Part. Module (DPM); time and energy bars]

Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit

[Figure: between reg3 := 0, reg4 := 0 and ret reg4, the loop body is replaced by a tree of adders]
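To see why a parallel circuit helps, the decompiled loop (ten additions in sequence) can be compared with a balanced adder tree over the same ten values. The sketch below is only a software model of that idea: the memory contents and base address are made up, and the tree depth is a rough proxy for circuit delay rather than a synthesis result.

```python
def software_loop(mem, base):
    """Direct transcription of the decompiled loop: sum ten entries at stride 2."""
    reg3, reg4 = 0, 0
    while True:
        reg4 = reg4 + mem[base + (reg3 << 1)]
        reg3 = reg3 + 1
        if not reg3 < 10:
            break
    return reg4

def adder_tree(values):
    """Sum values level by level, as a balanced tree of adders would.
    Returns (sum, number_of_adder_levels)."""
    levels = 0
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element passes through to the next level
        values = paired
        levels += 1
    return values[0], levels

mem = list(range(100))            # hypothetical memory contents
base = 4                          # hypothetical value of reg2
operands = [mem[base + (i << 1)] for i in range(10)]
print(software_loop(mem, base), adder_tree(operands))
```

Both compute the same sum, but the tree needs only 4 adder levels instead of 10 sequential additions.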

Step 7: On-chip CAD maps circuit onto FPGA

[Figure: the adder tree is placed onto the FPGA's CLB and SM blocks]

Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more

  Mov reg3, 0
  Mov reg4, 0
  loop:
  // instructions that interact with FPGA
  Ret reg4

[Figure: time and energy bars, software-only versus "warped"]

DAC'03, DAC'04, DATE'04, ISSS/CODES'04, FPGA'04, DATE'05, FCCM'05, ICCAD'05, ISSS/CODES'05, TECS'06, U.S. Patent Pending

Warp Processors: Performance Speedup (Most Frequent Kernel Only)

Average kernel speedup of 41, vs. 21 for Virtex-E
WCLA simplicity results in faster HW circuits

[Figure: bar chart of kernel speedup over SW-only execution, Warp Proc. versus Xilinx Virtex-E; y axis 0 to 80, off-scale bars labeled 191, 113, 130]

Warp Processors: Performance Speedup (Overall, Multiple Kernels)

Average speedup of 7.4
Energy reduction of 38% - 94%
Assuming 100 MHz ARM, and fabric clocked at rate determined by synthesis

[Figure: bar chart of overall speedup of Warp Proc. over SW-only execution; y axis 0 to 18]

Warp Processors: Speedups Compared with Digital Signal Processor

[Figure: bar chart of speedup (0 to 14) for ARM, DSP, and Warp (w/ ARM) on brev, g3fax, rocm, pktflow, canrdr, bitmnp, tblook, ttsprk, matrix, idct, mpeg2, fir, matmul, and Average]

Warp Processors: Speedups for Multi-Threaded Application Benchmarks

Compelling computing advantage of FPGAs: parallelism from bit level up to processor level, and everywhere in between

[Figure: bar chart of speedup for 4-uP, 8-uP, 16-uP, 32-uP, 64-uP, and Warp; y axis 0 to 140, off-scale bars labeled 307.7 and 501.9]

FPGA Ubiquity via Obscurity

Warp processing hides FPGA from languages and tools
  ANY microprocessor platform extendible with FPGA
  Maintains "ecosystem": application, tool, and architecture developers
New platforms with FPGAs appearing

[Figure: SW passes through a standard compiler to a standard binary, then through profiling and a translator onto a processor plus FPGA; applications, tools, and architectures form an ecosystem around standard binaries; new processor platforms with FPGA evolving]

FPGA Standard Binaries?

Microprocessor binary represents one form of a "standard binary for FPGAs"
  Missing is explicit concurrency
    Parallelism, pipelining, queues, etc.
As FPGAs appear in more platforms, might a more general FPGA binary evolve?

[Figure: SW passes through a standard compiler to a standard binary, then through profiling and a translator onto a processor plus FPGA; a second path takes SystemC? through a standard FPGA compiler to a standard FPGA binary?; applications, tools, and architectures surround standard binaries, while an ecosystem for FPGAs around standard FPGA binaries is presently sorely missing]

FPGA Standard Binaries?

Translator would make best use of existing FPGA resources

Could even add FPGA, like adding memory, to improve performance

Add more FPGA to your PDA to implement compute-intensive application?

[Figure: an FPGA binary (a circuit of multipliers feeding an adder tree) passes through a translator onto the available FPGA; on a low-end PDA it runs in 100 sec, on a high-end PDA in 1 sec]

FPGA Standard Binaries

NSF funding received for 2006-2009
  Xilinx letter of support was helpful
Graduate Student: Scott Sirowy, 2nd year Ph.D.

Conclusions

Soft core customization increasingly important to make best use of limited FPGA resources
  Good initial automatic customization results
  "Design of Experiments" paradigm looks promising
  System-level synthesis may yield very useful MB user tool, perhaps web based
Warp processing and standard FPGA binary work can help make FPGAs ubiquitous
Accomplishments made possible by Xilinx donations and interactions
  Continued and close collaboration sought
