Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research
Frank Vahid
Professor, Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Work supported by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx
Collaborators: David Sheldon (4th yr UCR PhD student), Roman Lysecky (PhD UCR 2005, now Asst. Prof. at U. Arizona), Rakesh Kumar (PhD UCSD 2006, now Asst. Prof. at UIUC), Dean Tullsen (Prof. at UCSD)
Frank Vahid, UC Riverside
Outline
Two UCR ICCAD’06 papers
  Microblaze customization
  Microblaze conjoining (and customization)
Current work targeting Microblaze users
  “Design of Experiments” paradigm
  System-level synthesis for multi-core systems
Related FPGA work
  "Warp processing"
  Standard binaries for FPGAs
Microblaze Customization (ICCAD paper #1)
FPGAs an increasingly popular software platform
  FPGA soft core processor
    Microprocessor synthesized onto FPGA fabric
Soft core customization
  Cores come with configurable parameters
    Xilinx Microblaze comes with several instantiatable units: multiplier, barrel shifter, divider, FPU, or cache
  Customization: Tuning soft core parameters to a specific application
[Figure: a microprocessor with all optional units (Mul, BS, Div, FPU, I$) next to cores customized per application, e.g. one with Mul and I$ for App1 and one with Mul, FPU, and Div for App2]
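The size of the customization space follows directly from the list of optional units. A minimal sketch, assuming each of the five units named on this slide is simply either instantiated or not (the unit names as Python strings are illustrative only):

```python
from itertools import combinations

# The five optional units named on the slide; treating each as on/off is an assumption.
UNITS = ("multiplier", "barrel shifter", "divider", "FPU", "cache")

def all_configurations(units=UNITS):
    """Return every subset of optional units, from the bare core to the full core."""
    configs = []
    for k in range(len(units) + 1):
        for subset in combinations(units, k):
            configs.append(frozenset(subset))
    return configs

configs = all_configurations()
print(len(configs))  # 2**5 = 32 configurations
```

Under this on/off assumption there are 32 configurations to choose from for each application.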
Instantiable Unit Speedups
Instantiating units can yield significant speedups
  “base” – Microblaze without any optional units instantiated
[Figure: bar chart of speedup per benchmark for Base, Barrel Shifter, Divider, Multiplier, and FPU; bar labels 4.09, 2.64, 1.97, 1.54, 1.93, 6.54]
Customization Tradeoffs
Data for aifir EEMBC benchmark on Xilinx Microblaze synthesized to Virtex device
[Figure: scatter plot of application runtime (ms) versus size (equivalent LUTs) for all 32 combinations of multiplier, barrel shifter, divider, floating point unit, and MCH cache; labeled points: base, bs, mul, mul+bs, bs+cache, mul+bs+cache, FPU]
2x performance tradeoff
4.5x size tradeoff
Key Problem Related to Core Customization
Problem: Synthesis of one soft core configuration, and execution/simulation on that configuration, requires about 15 minutes
Thus, for reasonable customization tool runtimes, can only synthesize 5-10 configurations in search of best one
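The arithmetic behind this limit can be sketched as follows. The 15-minute figure is from the slide; the count of 32 configurations assumes five on/off units and is not stated on this slide.

```python
MINUTES_PER_CONFIG = 15  # approximate synthesis + execution/simulation time from the slide

def exploration_hours(num_configs, minutes_per_config=MINUTES_PER_CONFIG):
    """Total tool runtime in hours when every configuration is synthesized and run."""
    return num_configs * minutes_per_config / 60

print(exploration_hours(32))  # exhaustive search over 32 configurations: 8.0 hours
print(exploration_hours(5))   # 1.25 hours
print(exploration_hours(10))  # 2.5 hours
```

Exhaustive search over 32 configurations would take about a working day per application, while 5-10 configurations keep the tool runtime at roughly one to two and a half hours.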
Two Solution Approaches
Traditional CAD approach
  Pre-characterize using synthesis and execution/simulation, create abstract problem model, solve using CAD exploration algorithms
  Used 0-1 knapsack formulation
Synthesis-in-the-loop approach
  Run synthesis and execute/simulate application while exploring
  More accurate
[Figure: two flows from start to finish. Traditional CAD: pre-characterize (synthesis and execution/simulation, 5-10 executions), then model (typically some form of graph), then explore. Synthesis-in-the-loop: explore alternating with synthesis and execution/simulation (5-10 iterations)]
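A 0-1 knapsack formulation of the traditional approach can be sketched as below. This is a generic dynamic-programming knapsack, not the authors' tool: the unit sizes and gains are hypothetical pre-characterization numbers, and treating per-unit gains as additive is an assumption of the model.

```python
def knapsack(items, capacity):
    """0-1 knapsack by dynamic programming.

    items: list of (name, size, gain) with integer sizes.
    Returns (best total gain, frozenset of chosen names) within the capacity.
    """
    best = [(0, frozenset())] * (capacity + 1)
    for name, size, gain in items:
        # Walk capacities downward so each item is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + gain
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] | {name})
    return best[capacity]

# Hypothetical data: size in hundreds of LUTs, gain in percent runtime improvement.
units = [("mul", 3, 30), ("bs", 2, 50), ("div", 4, 5), ("fpu", 30, 0), ("cache", 15, 10)]
gain, chosen = knapsack(units, capacity=6)
print(gain, sorted(chosen))  # 80 ['bs', 'mul']
```

The model is only as good as the pre-characterized numbers, which is the accuracy concern that motivates the synthesis-in-the-loop alternative.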
Synthesis-in-the-Loop Approach
View solution space as tree, each level a decision for a unit, order levels by unit speedup/size for application
  11 synthesis runs may take a few hours
To reduce, can consider using pre-determined order
  Determined by soft core vendor based on averages over many benchmarks
[Figure: bar chart of 1000*Speedup/Size for Multiplier, Barrel Shifter, Floating Point Unit, Divider, and MCH Cache; application-specific impact-ordered tree with a yes/no decision per level (e.g. base vs. base+div), levels listed as Barrel shifter, Multiplier, Divider, FPU, Cache; fixed impact-ordered tree, levels listed as Divider, Barrel Shifter, Multiplier, FPU, Cache]
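One way to read the impact-ordered tree is as a greedy walk: measure each unit alone, order units by impact, then decide yes/no for one unit per level. The sketch below is an interpretation under stated assumptions, not the authors' implementation: `evaluate` stands in for synthesis plus execution/simulation, the impact metric (speedup gain per added LUT) is assumed, and the toy unit data is hypothetical.

```python
def make_toy_evaluator(base_runtime, base_size, unit_data):
    """Stand-in for synthesis + simulation; unit_data maps unit -> (speedup factor, size)."""
    def evaluate(config):
        runtime, size = base_runtime, base_size
        for unit in sorted(config):
            factor, unit_size = unit_data[unit]
            runtime /= factor
            size += unit_size
        return runtime, size
    return evaluate

def impact_ordered_search(units, evaluate, size_limit=float("inf")):
    """Greedy yes/no decision per unit, visiting units in decreasing impact order."""
    runs = 0

    def synthesize(config):
        nonlocal runs
        runs += 1  # each call models one costly synthesis + run
        return evaluate(frozenset(config))

    base_runtime, base_size = synthesize(())
    impact = {}
    for unit in units:  # one run per unit to rank the tree levels
        runtime, size = synthesize((unit,))
        impact[unit] = (base_runtime / runtime - 1.0) / (size - base_size)
    order = sorted(units, key=lambda u: impact[u], reverse=True)

    chosen, best_runtime = frozenset(), base_runtime
    for unit in order:  # one run per tree level
        trial = chosen | {unit}
        runtime, size = synthesize(trial)
        if size <= size_limit and runtime < best_runtime:
            chosen, best_runtime = trial, runtime
    return chosen, best_runtime, runs

# Hypothetical per-unit (speedup factor, size in LUTs) for one application.
toy = {"mul": (1.3, 300), "bs": (1.5, 200), "div": (1.0, 400),
       "fpu": (1.0, 3000), "cache": (1.1, 1500)}
evaluate = make_toy_evaluator(100.0, 1500, toy)
chosen, runtime, runs = impact_ordered_search(list(toy), evaluate)
print(sorted(chosen), runs)  # ['bs', 'cache', 'mul'] 11
```

With five units this sketch performs 1 + 5 + 5 = 11 evaluations, which matches the run count on the slide if every level is re-synthesized. A fixed, vendor-supplied order would skip the five ranking runs.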
Synthesis-in-the-Loop Approach
Data for fixed impact-ordered tree for 11 EEMBC benchmarks
[Figure: three bar charts over Multiplier, Barrel Shifter, FPU, Divider, and Cache: Speedup, Size (equiv. LUTs), and Speedup/Size]
Customization Results
Fixed tree approach generally best
  App-spec tree better for certain apps, but 2x runtime
ICCAD'06 David Sheldon et al
[Figure: four plots of tool run time (m) versus speedup, comparing fixed-order impact-ordered tree, application-specific impact-ordered tree, random impact-ordered tree, exhaustive, and knapsack. Panels: no size constraint, Virtex II; size constraint = 80% of full MB size, Virtex II; size constraint = 80% of application-specific optimal MB configuration (guaranteed to “hurt”), Virtex II; no size constraint, Spartan2 device]
Results are averages for 11 EEMBC benchmarks
Conjoined Processors (ICCAD paper #2)
Conjoined processors
  Two processors sharing a hardware unit to save size (Kumar/Jouppi/Tullsen ISCA 2004)
  Showed little performance overhead for desktop processors
  Only research customer is Intel; for soft core processors, research customers are every soft core user
How much size savings and performance overhead for conjoined Microblazes?
[Figure: two processors, each with its own multiplier, versus two conjoined processors sharing one multiplier]
Conjoined Processors – Size Savings
[Figure: bar chart of equivalent LUTs (0 to 10000) for separate (sep) versus conjoined (conj) processor pairs, by unit instantiated with base processor (bs, div, mul, fpu); savings labels 6%, 4%, 23%, 32%]
Conjoined Processors – Performance Overhead
We created a trace simulator
  Reads two instruction traces output by MB simulator
  Adds 1-cycle delay for every access to a conjoined unit (pessimistic assumption about contention detection scheme)
  Looks for simultaneous access of shared unit, stalls one MB entirely until unit becomes available
[Figure: trace simulator with brev and bitmnp traces as inputs]

Configuration           Cycle Latency
Barrel Shifter          2
Divider                 34
Multiplier              3
FPU (Add, Sub, Mul)     6
FPU (Div)               30
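The behavior described for the trace simulator can be sketched as a small event-driven loop. This is an illustration of the stated rules only, not the authors' simulator: the trace format (a list of operation names), the arbitration rule on ties, and the function name are all assumptions.

```python
def simulate_conjoined(trace_a, trace_b, latency, shared):
    """Return the finish cycle of each processor when units in `shared` are conjoined.

    trace_a, trace_b: lists of operation names (hypothetical trace format).
    latency: operation name -> cycles. shared: set of operation names using a shared unit.
    """
    traces = (trace_a, trace_b)
    ready = [0, 0]                      # cycle at which each processor can issue next
    idx = [0, 0]                        # next instruction index per processor
    unit_free_at = {u: 0 for u in shared}
    while idx[0] < len(trace_a) or idx[1] < len(trace_b):
        # Serve the processor that is ready first; ties go to processor 0 (assumption).
        p = min((q for q in (0, 1) if idx[q] < len(traces[q])), key=lambda q: ready[q])
        op = traces[p][idx[p]]
        start, cost = ready[p], latency[op]
        if op in shared:
            start = max(start, unit_free_at[op])  # stall until the shared unit is free
            cost += 1                             # pessimistic 1-cycle access delay
            unit_free_at[op] = start + cost
        ready[p] = start + cost
        idx[p] += 1
    return tuple(ready)

lat = {"mul": 3, "alu": 1}
print(simulate_conjoined(["mul", "alu"], ["mul"], lat, shared={"mul"}))  # (5, 8)
print(simulate_conjoined(["mul", "alu"], ["mul"], lat, shared=set()))    # (4, 3)
```

Comparing the two calls shows the overhead of conjoining for a worst-case pair of traces that hit the multiplier at the same time.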
Conjoined Processors – Performance Overhead
Data shown for benchmarks that benefit (>1.3x speedup) from barrel shifter
Performance overheads are small
[Figure: bar chart of speedup (0 to 4.5), conjoined versus unconjoined, for benchmark pairs (brev),canrdr; brev,(canrdr); (brev),bitmnp; brev,(bitmnp); (brev),brev; brev,(brev); (bitmnp),canrdr; bitmnp,(canrdr); (bitmnp),bitmnp; bitmnp,(bitmnp); (canrdr),canrdr; canrdr,(canrdr)]
Customization Considering Conjoinment
Developed 0-1 knapsack approach
  “Disjunctively-Constrained Knapsack Solution” to accommodate conjoinment
[Figure: two bar charts, speedup and size (equiv. LUTs), comparing knapsack, exhaustive w/ conj., and exhaustive w/o conj. for benchmark pairs BaseFP01,BaseFP01; BaseFP01,bitmnp; BaseFP01,canrdr; bitmnp,bitmnp; canrdr,canrdr; tblook,bitmnp; tblook,canrdr; tblook,tblook; and AVERAGE]
Note: To avoid exaggerating the benefits of conjoinment, data only considers benchmark pairs that significantly use a shared unit
Only 8 pairings shown due to space limits
ICCAD'06 David Sheldon et al
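A disjunctively constrained knapsack is a 0-1 knapsack with extra constraints that certain pairs of items may not both be chosen. The brute-force sketch below illustrates the idea only; how the authors encode conjoinment is not given on the slide, so the mapping (a shared unit conflicts with each private copy) and all sizes and gains are hypothetical.

```python
from itertools import combinations

def conflict_knapsack(items, capacity, conflicts):
    """Exhaustive 0-1 knapsack with pairwise conflicts.

    items: name -> (size, gain). conflicts: iterable of frozenset pairs that
    must not be selected together. Returns (best gain, chosen names).
    """
    best_gain, best_set = 0, frozenset()
    names = sorted(items)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            chosen = frozenset(subset)
            if any(pair <= chosen for pair in conflicts):
                continue  # violates a disjunctive constraint
            size = sum(items[n][0] for n in chosen)
            gain = sum(items[n][1] for n in chosen)
            if size <= capacity and gain > best_gain:
                best_gain, best_set = gain, chosen
    return best_gain, best_set

# Hypothetical two-processor example: private multipliers versus one conjoined multiplier.
items = {"p1_mul_private": (300, 30), "p2_mul_private": (300, 30), "mul_conjoined": (400, 56)}
conflicts = [frozenset({"mul_conjoined", "p1_mul_private"}),
             frozenset({"mul_conjoined", "p2_mul_private"})]
print(conflict_knapsack(items, 500, conflicts))  # tight budget: conjoined unit wins
print(conflict_knapsack(items, 700, conflicts))  # looser budget: two private units win
```

Exhaustive enumeration is used here only for clarity; it would not scale to many units and processors.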
Outline
Two UCR ICCAD’06 papers
  Microblaze customization
  Microblaze conjoining (and customization)
Current work targeting Microblaze users
  “Design of Experiments” paradigm
  System-level synthesis for multi-core systems
Related FPGA work
  "Warp processing"
  Standard binaries for FPGAs
Ongoing Work – Design of Experiments Paradigm
"Design of Experiments"
  Well-established discipline (>80 yrs) for tuning parameters
    For factories, crops, management, etc.
  Want to set parameter values for best output
    But each experiment costly, so can't try all combinations
Clear mapping of soft core customization to DOE problem
  Given parameters and # of possible experiments
  Generates which experiments to run (parameter values)
  Analyzes resulting data
  Sound mathematical foundations
Present focus of David Sheldon (4th yr Ph.D.)
Ongoing Work – Design of Experiments Paradigm
Suppose time for 12 experiments
DOE tool generates which 12 experiments to run
User fills in results column
Factor  A   B    C    D    E    F     G            H            I            J
Row #   BS  FPU  MUL  DIV  MSR  COMP  ICACHE_type  ICACHE_size  DCACHE_type  DCACHE_size
1       0   0    0    0    0    0     0            0            0            0
2       0   0    0    0    0    1     1            1            1            1
3       0   0    1    1    1    0     0            0            1            1
4       0   1    0    1    1    0     1            1            0            0
5       0   1    1    0    1    1     0            1            0            1
6       0   1    1    1    0    1     1            0            1            0
7       1   0    1    1    0    0     1            1            0            1
8       1   0    1    0    1    1     1            0            0            0
9       1   0    0    1    1    1     0            1            1            0
10      1   1    1    0    0    0     0            1            1            0
11      1   1    0    1    0    1     0            0            0            1
12      1   1    0    0    1    0     1            0            1            1
CyclesY1126962653544216826221438086472860019
10818171264450980461713601399330845092089467276392
Ongoing Work – Design of Experiments Paradigm
DOE tool analyzes results
  Finds most important factors for given application
[Figure: Y bar marginal means plot; mean cycles (0 to 10,000,000) at effect levels 0 and 1 for each factor: BS, FPU, MUL, DIV, MSR, COMP, ICACHE_type, ICACHE_size, DCACHE_type, DCACHE_size]
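The marginal-means analysis can be sketched in a few lines. The 12-run design below is the table from the previous slide; the response values are hypothetical (a made-up application where the barrel shifter saves 300 cycles and the multiplier 100), and the function name is illustrative rather than part of any DOE tool.

```python
FACTORS = ["BS", "FPU", "MUL", "DIV", "MSR", "COMP",
           "ICACHE_type", "ICACHE_size", "DCACHE_type", "DCACHE_size"]

# One string per experiment; character j is the level (0/1) of FACTORS[j].
DESIGN = ["0000000000", "0000011111", "0011100011", "0101101100",
          "0110110101", "0111011010", "1011001101", "1010111000",
          "1001110110", "1110000110", "1101010001", "1100101011"]

def marginal_means(design, responses, factors=FACTORS):
    """For each factor, return (mean response at level 0, mean response at level 1)."""
    result = {}
    for j, name in enumerate(factors):
        by_level = {"0": [], "1": []}
        for row, y in zip(design, responses):
            by_level[row[j]].append(y)
        result[name] = (sum(by_level["0"]) / len(by_level["0"]),
                        sum(by_level["1"]) / len(by_level["1"]))
    return result

# Hypothetical cycle counts, not measured data.
responses = [1000 - 300 * int(r[0]) - 100 * int(r[2]) for r in DESIGN]
means = marginal_means(DESIGN, responses)
ranked = sorted(FACTORS, key=lambda f: abs(means[f][1] - means[f][0]), reverse=True)
print(means["BS"], means["MUL"], means["FPU"])
print(ranked[0])  # factor with the largest effect on cycles
```

Because every column of the design is balanced against the others, the effect of each factor can be read from its two means even though only 12 of the 1024 possible configurations were run.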
Ongoing Work – Design of Experiments Paradigm
Results for a different application
[Figure: Y bar marginal means plot (0 to 30) at effect levels 0 and 1 for each factor: BS, FPU, MUL, DIV, MSR, COMP, ICACHE_type, ICACHE_size, DCACHE_type, DCACHE_size]
Ongoing Work – Design of Experiments Paradigm
Interactions among parameters also automatically determined
[Figure: grid of marginal means plots, one per pair of factors (e.g. BS vs FPU, with one line for FPU = 0 and one for FPU = 1), over BS, FPU, MUL, DIV, MSR, COMP, ICACHE_type, ICACHE_size, DCACHE_type, and DCACHE_size; each factor at levels 0 and 1]
Ongoing work – System synthesis
Given N applications
  Create customized soft core for each app
  Criteria: Meet size constraint, minimize total applications' runtime
    Other criteria possible (e.g., meet runtime constraint, minimize size)
Present focus of Ryan Mannion, 3rd yr Ph.D.
[Figure: applications App1, App2, ..., AppN mapped to cores; cores shown include a Microblaze with Mul and I$, a PicoBlaze, and a Microblaze with Mul, FPU, and Div]
Ongoing work – System synthesis
Presently use Integer Linear Program
Solutions for large set of Xilinx devices generated in seconds

Device  LUTs  PPCs  Area  Utilization  Cycles  bitmnp  canrdr  aifir  brev  g3fax  g721_ps  idct  matmul  tblook  ttsprk  BaseFP01  raytrace
XC2V2000  21504  0  21439  99.70%  2.46E+08  PBBase  MBShift  PBBase  MBShift  PBBase  PBBase  MBMult  MBMult  MBBase  MBShiftDiv  MBShiftFPU  MBBase
XC2VP2  2816  0  0  0.00%  0  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible
XC4VLX80  71680  0  69065  96.35%  1.86E+08  MBMultShift  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultShift  MBMultShiftFPU  MBMultShiftDiv  MBMultShiftFPU  MBMultShiftFPU
XC4VLX15  12288  0  12247  99.67%  4.62E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  PBBase  MBBase  PBBase  MBShiftFPU  MBShift
XC2S300E  6140  0  6036  98.31%  8.21E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  MBShift
XC2V4000  46080  0  45313  98.34%  1.87E+08  MBShift  MBShift  MBMultShift  MBMultShift  MBBase  MBShift  MBMultShiftFPU  MBMultShift  MBMultShiftFPU  MBMultShiftDiv  MBShiftFPU  MBShiftFPU
XC2VP40  38784  2  38712  99.81%  1.3E+08  MBShift  MBMultShift  MBMultShift  MBMultShift  MBBase  MBMultShiftDiv  MBMultShiftFPU  MBMultShift  PPCBase  MBMultShiftDiv  MBMultShiftFPU  PPCBase
XC4VSX25  20480  0  20341  99.32%  2.6E+08  PBBase  PBBase  PBBase  MBShift  PBBase  PBBase  MBMultShift  MBMult  MBBase  MBShiftDiv  MBShiftFPU  MBShift
XC4VSX35  30720  0  30681  99.87%  1.96E+08  MBShift  MBShift  MBBase  MBShift  MBBase  PBBase  MBMult  MBMult  MBBase  MBMultDiv  MBShiftFPU  MBShiftFPU
XC4VFX20  17088  1  17023  99.62%  2.25E+08  PBBase  PBBase  PBBase  MBShift  PBBase  PBBase  MBShift  MBMult  MBShift  MBShiftDiv  MBShiftFPU  PPCBase
XC2S150E  3456  0  2928  84.72%  1.33E+09  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase
XC2VP30  27392  2  26488  96.70%  1.31E+08  MBShift  MBShift  MBShift  MBShift  MBBase  MBShift  MBMultShift  MBMultShift  PPCBase  MBMultShiftDiv  MBShiftFPU  PPCBase
XC4VLX60  53248  0  52136  97.91%  1.87E+08  MBShift  MBMultShift  MBMultShift  MBMultShift  MBBase  MBMultShiftFPU  MBMultShiftFPU  MBMultShift  MBMultShiftFPU  MBMultShiftDiv  MBShiftFPU  MBShiftFPU
XC2S600E  13824  0  13801  99.83%  3.96E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  MBBase  MBShift  PBBase  MBShiftFPU  MBShift
XC2VP20  18560  2  18527  99.82%  1.53E+08  PBBase  MBShift  PBBase  MBShift  PBBase  PBBase  MBMultShift  MBMult  PPCBase  MBShiftDiv  MBShiftFPU  PPCBase
XC2V500  6144  0  6036  98.24%  8.21E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  MBShift
XC2VPX70  66176  2  61858  93.47%  1.3E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultDivFPU  PPCBase  MBMultShiftDiv  MBMultShiftFPU  PPCBase
XC4VLX40  36864  0  36746  99.68%  1.88E+08  MBShift  MBShift  MBShift  MBShift  MBBase  MBShift  MBMultShift  MBMultShift  MBShiftFPU  MBMultShiftDiv  MBShiftFPU  MBShiftFPU
XC2V6000  67584  0  65738  97.27%  1.86E+08  MBMultShift  MBMultShift  MBMultShiftFPU  MBMultShift  MBMultFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultShift  MBMultShiftFPU  MBMultShiftDiv  MBMultShiftFPU  MBMultShiftFPU
XC4VFX60  50560  2  48665  96.25%  1.3E+08  MBMultShift  MBMultShift  MBMultShift  MBMultShift  MBMultFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultShift  PPCBase  MBMultShiftDiv  MBMultShiftFPU  PPCBase
XC4VFX100  84352  2  62218  73.76%  1.3E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  PPCBase  MBMultShiftDiv  MBMultShiftDivFPU  PPCBase
XC2VP4  6016  1  5792  96.28%  6E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  PBBase  PPCBase  MBShift
XC2VP70  66176  2  61858  93.47%  1.3E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultDivFPU  PPCBase  MBMultShiftDiv  MBMultShiftFPU  PPCBase
XC2V40  512  0  0  0.00%  0  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible
XC2V1500  15360  0  15291  99.55%  3.42E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  MBBase  MBBase  MBShiftDiv  MBShiftFPU  MBShift
XC2V8000  93184  0  76084  81.65%  1.86E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  MBMultShiftFPU  MBMultShiftDiv  MBMultShiftDivFPU  MBMultShiftDivFPU
XC2V1000  10240  0  10014  97.79%  5.48E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  MBBase  MBBase  PBBase  MBShift  MBBase
XC4VSX55  49152  0  48970  99.63%  1.87E+08  MBShift  MBMultShift  MBMultShift  MBMultShift  MBBase  MBMultShiftDiv  MBMultShiftFPU  MBMultShift  MBMultShiftFPU  MBMultShiftDiv  MBShiftFPU  MBShiftFPU
XC2S50E  1536  0  0  0.00%  0  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible
XC2V3000  28672  0  28562  99.62%  2.01E+08  MBShift  MBShift  MBShift  MBShift  PBBase  PBBase  MBMultShift  MBMult  MBShift  MBShiftDiv  MBShiftFPU  MBShiftFPU
XC4VFX40  37284  2  36968  99.15%  1.3E+08  MBShift  MBMultShift  MBMultShift  MBMultShift  MBBase  MBMultShiftDiv  MBMultShiftFPU  MBMultShift  PPCBase  MBMultShiftDiv  MBShiftFPU  PPCBase
XC2VPX20  19584  1  19562  99.89%  1.92E+08  PBBase  MBShift  PBBase  MBShift  PBBase  PBBase  MBMultShift  MBMult  MBBase  MBDiv  MBFPU  PPCBase
XC2S100E  2400  0  0  0.00%  0  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible
XC4VFX140  126336  2  62218  49.25%  1.3E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  PPCBase  MBMultShiftDiv  MBMultShiftDivFPU  PPCBase
XC4VLX25  21504  0  21439  99.70%  2.46E+08  PBBase  MBShift  PBBase  MBShift  PBBase  PBBase  MBMult  MBMult  MBBase  MBShiftDiv  MBShiftFPU  MBBase
XC2VP50  47232  2  46917  99.33%  1.3E+08  MBShift  MBMultShift  MBMultShift  MBMultShift  MBMultFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultShift  PPCBase  MBMultShiftDiv  MBMultShiftFPU  PPCBase
XC2VP7  9856  1  9664  98.05%  3.9E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  MBBase  MBBase  MBDiv  PPCBase  MBBase
XC2S400E  9600  0  9558  99.56%  5.62E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBBase  MBBase  MBBase  PBBase  MBBase  MBBase
XC2S200E  4704  0  4482  95.28%  9.96E+08  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  MBShift  PBBase
XC4VLX160  135168  0  76084  56.29%  1.86E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  MBMultShiftFPU  MBMultShiftDiv  MBMultShiftDivFPU  MBMultShiftDivFPU
XC2V250  3072  0  2928  95.31%  1.33E+09  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase  PBBase
XC2V80  1024  0  0  0.00%  0  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible  infeasible
XC4VLX100  98304  0  76084  77.40%  1.86E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  MBMultShiftFPU  MBMultShiftDiv  MBMultShiftDivFPU  MBMultShiftDivFPU
XC2VP100  88192  2  62218  70.55%  1.3E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  PPCBase  MBMultShiftDiv  MBMultShiftDivFPU  PPCBase
XC4VLX200  178176  0  76084  42.70%  1.86E+08  MBMultShift  MBMultShiftFPU  MBMultShiftFPU  MBMultShiftFPU  MBMultFPU  MBMultShiftDivFPU  MBMultShiftDivFPU  MBMultDivFPU  MBMultShiftFPU  MBMultShiftDiv  MBMultShiftDivFPU  MBMultShiftDivFPU
XC4VFX12  10944  1  10868  99.31%  3.72E+08  PBBase  PBBase  PBBase  MBBase  PBBase  PBBase  MBShift  MBBase  MBBase  MBBase  PPCBase  MBBase

Detail of the first rows and columns, with exact cycle counts:

Device    LUTs   PPCs  Area   Utilization  Cycles         bitmnp       canrdr       aifir
XC2V2000  21504  0     21439  99.70%       246,157,335    PBBase       MBShift      PBBase
XC2VP2    2816   0     0      0.00%        0              infeasible   infeasible   infeasible
XC4VLX80  71680  0     69065  96.35%       186,482,222    MBMultShift  MBMultShift  MBMultShiftFPU
XC4VLX15  12288  0     12247  99.67%       462,384,156    PBBase       PBBase       PBBase
XC2S300E  6140   0     6036   98.31%       821,383,431    PBBase       PBBase       PBBase
XC2V4000  46080  0     45313  98.34%       186,747,116    MBShift      MBShift      MBMultShift
XC2VP40   38784  2     38712  99.81%       129,796,303    MBShift      MBMultShift  MBMultShift
XC4VSX25  20480  0     20341  99.32%       260,363,903    PBBase       PBBase       PBBase
XC4VSX35  30720  0     30681  99.87%       196,496,340    MBShift      MBShift      MBBase
XC4VFX20  17088  1     17023  99.62%       225,019,722    PBBase       PBBase       PBBase
XC2S150E  3456   0     2928   84.72%       1,329,161,797  PBBase       PBBase       PBBase
XC2VP30   27392  2     26488  96.70%       131,131,074    MBShift      MBShift      MBShift
XC4VLX60  53248  0     52136  97.91%       186,547,866    MBShift      MBMultShift  MBMultShift
XC2S600E  13824  0     13801  99.83%       395,723,730    PBBase       PBBase       PBBase
XC2VP20   18560  2     18527  99.82%       152,516,142    PBBase       MBShift      PBBase

Graduate Student: Ryan Mannion, 3rd yr Ph.D.
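The selection problem behind the table can be sketched on a toy instance. The slide states that an Integer Linear Program is used; the sketch below swaps in plain exhaustive enumeration so that it needs no solver library. The core names echo the table, but every size and cycle count here is hypothetical.

```python
from itertools import product

def choose_cores(options, lut_budget):
    """Pick one core per application to minimize total cycles within a LUT budget.

    options: app -> list of (core_name, luts, cycles).
    Returns (total_cycles, {app: core_name}) or None if no assignment fits.
    """
    apps = sorted(options)
    best = None
    for combo in product(*(options[a] for a in apps)):
        luts = sum(core[1] for core in combo)
        cycles = sum(core[2] for core in combo)
        if luts <= lut_budget and (best is None or cycles < best[0]):
            best = (cycles, {a: core[0] for a, core in zip(apps, combo)})
    return best

# Hypothetical per-application options: (core, LUTs, cycles).
options = {
    "app1": [("PBBase", 200, 900), ("MBBase", 1500, 300), ("MBShift", 1700, 200)],
    "app2": [("PBBase", 200, 800), ("MBBase", 1500, 400), ("MBMult", 1800, 250)],
}
print(choose_cores(options, 2000))  # (1000, {'app1': 'MBShift', 'app2': 'PBBase'})
print(choose_cores(options, 100))   # None, i.e. infeasible on this device
```

Enumeration grows exponentially with the number of applications, which is one reason an ILP solver is the more practical choice for 12 applications and dozens of devices.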
Outline
Two UCR ICCAD’06 papers
  Microblaze customization
  Microblaze conjoining (and customization)
Current work targeting Microblaze users
  “Design of Experiments” paradigm
  System-level synthesis for multi-core systems
Related FPGA work
  Warp processing
  Standard binaries for FPGAs
Binary-Level Synthesis
Binary-level FPGA compiler developed 2002-2006 (Greg Stitt, Ph.D. UCR 2007)
Source-level FPGA compiler provides a limited solution
Binary-level FPGA compiler provides a more general solution, at the expense of lost high-level information
[Figure: tool flow in which C++, Java, asm, and obj inputs pass through compilers, assembler, and linker to a microprocessor binary; a source-level FPGA compiler covers only part of this flow, while a binary-level FPGA compiler takes the microprocessor binary and produces an FPGA binary plus a microprocessor binary]
Binary Synthesis Competitive with Source Level
Aggressive decompilation recovers most high-level constructs needed for good synthesis – Makes binary-level synthesis competitive with source level
[Figure: speedup (0 to 10) versus number of functions in hardware (1 to 49), source-level versus binary-level synthesis]
Freescale H264 decoder example, from ISSS/CODES 2005
Binary Synthesis Enables Dynamic Hardware/Software Partitioning
Called “Warp Processing” (Vahid/Stitt/Lysecky 2003-2007)
Direct collaborators: Intel, IBM, and Freescale
[Figure: chip or board containing a microprocessor, an FPGA, and an on-chip binary-level FPGA compiler; a downloader supplies the microprocessor binary, from which the on-chip compiler produces the FPGA binary]
Warp Processing Idea
Step 1: Initially, software binary loaded into instruction memory
[Figure: µP with I Mem and D$, profiler, FPGA, and on-chip CAD]
Software Binary:
  Mov reg3, 0
  Mov reg4, 0
  loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
Warp Processing Idea
Step 2: Microprocessor executes instructions in software binary
[Figure: same system; time and energy bars shown for the µP]
Warp Processing Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: profiler observing a repeating stream of add and beq instructions; critical loop detected]
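One simple way a profiler could find a critical loop is to count taken backward branches, since each one closes a loop iteration. The sketch below is an assumption about the mechanism, not a description of the actual on-chip profiler; the trace format of (branch address, target address) pairs is hypothetical.

```python
from collections import Counter

def hottest_loop(branch_trace):
    """Return ((loop_start, loop_end), count) for the most frequent backward branch.

    branch_trace: iterable of (branch_pc, target_pc) pairs for taken branches.
    Returns None if no backward branch was seen.
    """
    counts = Counter()
    for pc, target in branch_trace:
        if target <= pc:  # a backward branch closes a loop body [target, pc]
            counts[(target, pc)] += 1
    return counts.most_common(1)[0] if counts else None

# Hypothetical trace: a 9-times-taken loop branch, one forward branch, one short loop.
trace = [(0x20, 0x0C)] * 9 + [(0x40, 0x60)] + [(0x80, 0x70)] * 2
print(hottest_loop(trace))  # ((12, 32), 9)
```

The address range of the winning branch identifies the region of the binary that the on-chip CAD tools would read in next.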
Warp Processing Idea
Step 4: On-chip CAD reads in critical region
Warp Processing Idea
Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)
[Figure: same system, with the on-chip CAD shown as the Dynamic Part. Module (DPM)]
Decompiled region:
  reg3 := 0
  reg4 := 0
  loop:
  reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
  reg3 := reg3 + 1
  if (reg3 < 10) goto loop
  ret reg4
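The decompiled region on this slide is small enough to transcribe directly. The sketch below mirrors the register-level statements in Python; modeling memory as a Python list indexed by address is an assumption made for illustration.

```python
def critical_region(mem, reg2):
    """Python transcription of the decompiled loop: sums 10 elements spaced 2 apart."""
    reg3 = 0
    reg4 = 0
    while True:
        reg4 = reg4 + mem[reg2 + (reg3 << 1)]  # load at base + 2*i and accumulate
        reg3 = reg3 + 1
        if not (reg3 < 10):                    # "if (reg3 < 10) goto loop"
            break
    return reg4

mem = list(range(100))          # hypothetical memory where mem[a] == a
print(critical_region(mem, 4))  # 4 + 6 + ... + 22 = 130
```

Seen this way, the region is an array sum with a fixed trip count, which is the kind of high-level structure that decompilation recovers for synthesis.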
Warp Processing Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: the decompiled loop implemented as a tree of adders]
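The gain from a parallel circuit can be illustrated with an adder tree: once all operands are available, pairs are added level by level instead of one after another. This is a software model of the idea under the assumption that the loop is fully unrolled; it is not output of the on-chip CAD tools.

```python
def adder_tree_sum(values):
    """Sum a non-empty sequence pairwise; return (total, number of adder levels)."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        # All additions within one level are independent and could run in parallel.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through to the next level
        level = nxt
        depth += 1
    return level[0], depth

print(adder_tree_sum(range(1, 11)))  # (55, 4)
```

For the ten values of the example loop, the tree needs four adder levels, compared with ten dependent additions in the sequential software loop.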
Warp Processing Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: adder circuit placed onto the FPGA's configurable logic blocks (CLB) and routed through switch matrices (SM)]
Warp Processing Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more
Updated software binary:
  Mov reg3, 0
  Mov reg4, 0
  loop:
  // instructions that interact with FPGA
  Ret reg4
[Figure: time and energy bars, software-only versus “warped”]
DAC'03, DAC'04, DATE'04, ISSS/CODES'04, FPGA'04, DATE'05, FCCM'05, ICCAD'05, ISSS/CODES'05, TECS'06, U.S. Patent Pending
Warp Processors: Performance Speedup (Most Frequent Kernel Only)
Average kernel speedup of 41, vs. 21 for Virtex-E
WCLA simplicity results in faster HW circuits
[Figure: bar chart of kernel speedup over SW only execution (axis 0 to 80), Warp Proc. versus Xilinx Virtex-E; off-scale bars labeled 191, 113, and 130]
Warp Processors: Performance Speedup (Overall, Multiple Kernels)
Average speedup of 7.4
Energy reduction of 38% - 94%
Assuming 100 MHz ARM, and fabric clocked at rate determined by synthesis
[Figure: bar chart of overall speedup (0 to 18) of Warp Proc. over SW only execution]
Warp Processors: Speedups Compared with Digital Signal Processor
[Figure: bar chart of speedup (0 to 14) for ARM, DSP, and Warp (w/ ARM) on benchmarks brev, g3fax, rocm, pktflow, canrdr, bitmnp, tblook, ttsprk, matrix, idct, mpeg2, fir, matmul, and Average]
Warp Processors: Speedups for Multi-Threaded Application Benchmarks
Compelling computing advantage of FPGAs: Parallelism from bit level up to processor level, and everywhere in between
[Figure: bar chart of speedup (axis 0 to 140) for 4-uP, 8-uP, 16-uP, 32-uP, and 64-uP systems versus Warp; off-scale bars labeled 307.7 and 501.9]
FPGA Ubiquity via Obscurity
Warp processing hides FPGA from languages and tools
  ANY microprocessor platform extendible with FPGA
  Maintains "ecosystem": application, tool, and architecture developers
New platforms with FPGAs appearing
[Figure: SW is compiled by a standard compiler into a standard binary; profiling and a translator map the binary onto a processor plus FPGA; standard binaries connect applications, tools, and architectures, with new processor platforms with FPGA evolving]
FPGA Standard Binaries?
Microprocessor binary represents one form of a "standard binary for FPGAs"
  Missing is explicit concurrency
    Parallelism, pipelining, queues, etc.
As FPGAs appear in more platforms, might a more general FPGA binary evolve?
[Figure: the standard-binary ecosystem (applications, tools, architectures) for processors, alongside a possible FPGA counterpart: SystemC? compiled by a standard FPGA compiler into a standard FPGA binary?; ecosystem for FPGAs presently sorely missing]
FPGA Standard Binaries?
Translator would make best use of existing FPGA resources
Could even add FPGA, like adding memory, to improve performance
Add more FPGA to your PDA to implement compute-intensive application?
[Figure: a translator maps a binary onto processor plus FPGA; the same FPGA binary is translated for a low-end PDA (100 sec) and for a high-end PDA with more FPGA (1 sec)]
FPGA Standard Binaries
NSF funding received for 2006-2009
  Xilinx letter of support was helpful
Graduate Student: Scott Sirowy, 2nd year Ph.D.
Conclusions
Soft core customization increasingly important to make best use of limited FPGA resources
  Good initial automatic customization results
  “Design of Experiments” paradigm looks promising
  System-level synthesis may yield very useful MB user tool, perhaps web based
Warp processing and standard FPGA binary work can help make FPGAs ubiquitous
Accomplishments made possible by Xilinx donations and interactions
  Continued and close collaboration sought