Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
![Page 1: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/1.jpg)
Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
Greg Stitt
Department of Electrical and Computer Engineering
University of Florida
![Page 2: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/2.jpg)
Introduction
Improved performance enables new applications.
Past decade: MP3 players, portable game consoles, cell phones, etc.
Future architectures: speech/image recognition, self-guiding cars, computational biology, etc.
![Page 3: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/3.jpg)
Introduction
FPGAs (Field Programmable Gate Arrays) implement custom circuits.
Speedups of 10x, 100x, even 1000x for scientific and embedded apps [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05], …
But FPGAs are not mainstream.
Warp Processing goal: bring FPGAs into the mainstream by making FPGAs “invisible”.
[Figure: performance of uP vs. FPGA. FPGAs are capable of large performance improvements.]
![Page 4: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/4.jpg)
Introduction – Hardware/Software Partitioning
C code for FIR filter:
for (i=0; i < 128; i++)
  y[i] += c[i] * x[i];
...
for (i=0; i < 16; i++)
  y[i] += c[i] * x[i];
...
Software only: the compiler targets the processor, and the loop takes ~1000 cycles.
Hardware/software partitioning selects performance-critical regions for hardware implementation [Ernst, Henkel 93] [Gupta, DeMicheli 97] [Vahid, Gajski 94] [Eles et al. 97] [Sangiovanni-Vincentelli 94].
Processor + FPGA: the designer creates custom hardware for the loop (drawn as a tree of multipliers and adders) using a hardware description language (HDL).
The hardware loop takes ~10 cycles. Speedup = 1000 cycles / 10 cycles = 100x.
[Figure: bar charts of time and energy, 0 to 100, comparing Sw with Hw/Sw.]
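The loop and the speedup arithmetic on this slide can be sketched in C. This is only an illustration: the function names `fir_accumulate` and `speedup` are hypothetical, and the cycle counts are the slide's rough estimates rather than measurements.

```c
#include <stddef.h>

/* Software version of the slide's loop: y[i] += c[i] * x[i]. */
void fir_accumulate(int *y, const int *c, const int *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] += c[i] * x[i];
    }
}

/* Speedup = software cycles / hardware cycles. */
double speedup(double sw_cycles, double hw_cycles)
{
    return sw_cycles / hw_cycles;
}
```

Calling `speedup(1000.0, 10.0)` reproduces the 100x figure quoted on the slide.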
![Page 5: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/5.jpg)
Introduction – High-level Synthesis
Problem: Describing a circuit using HDL is time-consuming and difficult.
Solution: High-level synthesis creates the circuit from high-level code [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92].
Allows developers to use a higher-level specification.
Potentially enables synthesis for software developers.
[Figure: tool flow with high-level code, Hw/Sw partitioning, compiler (software), high-level synthesis (hardware), libraries/object code, linker, updated binary (uP), and bitstream (FPGA).]
![Page 6: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/6.jpg)
![Page 7: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/7.jpg)
Introduction – High-level Synthesis
Example: high-level synthesis takes the loop
for (i=0; i < 16; i++)
  y[i] += c[i] * x[i];
and creates a custom circuit, drawn on the slide as a row of multipliers feeding a tree of adders.
![Page 8: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/8.jpg)
Outline
Introduction
Warp Processing Overview
Enabling Technology – Binary Synthesis: key techniques for synthesis from binaries, decompilation
Current and Future Directions: multi-threaded warp processing, custom communication
![Page 9: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/9.jpg)
Problems with High-Level Synthesis
Problem: High-level synthesis is unattractive to software developers.
Requires a specialized language: SystemC, NapaC, HandelC, …
Requires a specialized compiler: Spark, ROCCC, CatapultC, …
Limited commercial success: software developers are reluctant to change tools.
[Figure: non-standard software tool flow, in which a specialized language and a specialized compiler take the place of the high-level code and compiler.]
![Page 10: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/10.jpg)
Warp Processing – “Invisible” Synthesis
Solution: Make synthesis “invisible”. Two requirements:
Standard software tool flow: perform compilation before synthesis.
Hide the synthesis tool: move synthesis on chip. This is similar to dynamic binary translation [Transmeta], but translates to hardware.
[Figure: standard software tool flow. Compilation is moved before synthesis, so decompilation and synthesis start from the software binary.]
![Page 11: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/11.jpg)
Warp Processing – “Invisible” Synthesis
A warp processor looks like a standard uP but invisibly synthesizes hardware.
![Page 12: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/12.jpg)
Warp Processing – “Invisible” Synthesis
Advantages:
Supports all languages, compilers, and IDEs (for example C, C++, Java, Matlab with gcc, g++, javac, keil).
Supports synthesis of assembly code.
Supports synthesis of library code.
Also enables dynamic optimizations.
![Page 13: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/13.jpg)
Warp Processing Background: Basic Idea
[Figure: warp processor with µP, I Mem, D$, profiler, FPGA, and on-chip CAD.]
Step 1: Initially, the software binary is loaded into instruction memory.
Software binary:
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
![Page 14: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/14.jpg)
Warp Processing Background: Basic Idea
Step 2: The microprocessor executes the instructions in the software binary.
![Page 15: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/15.jpg)
Warp Processing Background: Basic Idea
Step 3: The profiler monitors instructions and detects critical regions in the binary.
[Figure: the profiler observes a repeating stream of add and beq instructions and reports “Critical Loop Detected”.]
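One way to picture the profiling step is a small table that counts taken backward branches per loop head. The sketch below is an assumption for illustration, not the profiler described in the talk: the table size, the `LoopProfiler` type, and both function names are hypothetical.

```c
#include <stddef.h>

#define MAX_TRACKED 16  /* hypothetical table size */

typedef struct {
    unsigned target[MAX_TRACKED];      /* loop-head addresses seen so far */
    unsigned long count[MAX_TRACKED];  /* times each head was branched to */
    size_t used;
} LoopProfiler;

/* Record one taken branch. Only backward branches (target <= pc) can close
   a loop, so forward branches are ignored. */
void profiler_observe(LoopProfiler *p, unsigned pc, unsigned target)
{
    if (target > pc) {
        return;
    }
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < MAX_TRACKED) {
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Return the most frequently targeted loop head, or 0 if none was recorded. */
unsigned profiler_hottest(const LoopProfiler *p)
{
    unsigned best = 0;
    unsigned long best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best = p->target[i];
        }
    }
    return best;
}
```

In the slide's binary, the branch with offset -5 is the backward branch such a table would count on every iteration.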
![Page 16: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/16.jpg)
Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in the critical region.
![Page 17: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/17.jpg)
Warp Processing Background: Basic Idea
Step 5: On-chip CAD, the dynamic partitioning module (DPM), converts the critical region into a control/data flow graph (CDFG).
reg3 := 0
reg4 := 0
loop:
reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
![Page 18: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/18.jpg)
Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit.
[Figure: the CDFG next to the synthesized circuit, drawn as a tree of adders.]
![Page 19: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/19.jpg)
Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps the circuit onto the FPGA.
[Figure: FPGA fabric of configurable logic blocks (CLB) and switch matrices (SM) holding the adder circuit.]
![Page 20: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/20.jpg)
Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to “warp” by an order of magnitude or more.
Updated binary:
Mov reg3, 0
Mov reg4, 0
loop:
// instructions that interact with FPGA
Ret reg4
[Figure: time and energy bars, software-only vs. “warped”.]
![Page 21: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/21.jpg)
Expandable Logic
[Figure: system with µP, cache, DMA, profiler, warp tools, RAM, and several FPGAs, with a performance axis.]
Expandable RAM: the system detects RAM during start and improves performance invisibly.
Expandable logic: warp tools detect the amount of FPGA and invisibly adapt the application to use less or more hardware.
![Page 22: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/22.jpg)
Expandable Logic
Allows for customization of platforms: the user can select FPGAs based on the applications used.
[Figure: performance bar for the application “Portable Gaming”, labeled “Unacceptable Performance”.]
![Page 23: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/23.jpg)
Expandable Logic
[Figure: performance bars for the application “Portable Gaming”.]
The user can customize FPGAs to the desired amount of performance.
The performance improvement is invisible: it doesn’t require a new binary from the developer.
![Page 24: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/24.jpg)
Expandable Logic
[Figure: performance bar for the application “Web Browser” with no FPGA, labeled “Acceptable Performance”.]
The platform designer doesn’t have to decide on a fixed amount of FPGA.
The user doesn’t have to pay for FPGA that isn’t needed.
![Page 25: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/25.jpg)
Warp Processing Background: Basic Technology
Challenge: CAD tools normally require powerful workstations.
Develop extremely efficient on-chip CAD tools.
Requires efficient synthesis.
Requires a specialized FPGA and physical design tools (JIT FPGA compilation) [Lysecky FCCM05/DAC04], University of Arizona.
[Figure: on-chip CAD flow from binary to hardware and updated binary, with stages synthesis, logic optimization, technology mapping, and placement & routing, labeled JIT FPGA compilation.]
![Page 26: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/26.jpg)
Warp Processing Background: On-Chip CAD
Xilinx ISE: 9.1 s, 60 MB.
On-chip CAD: 0.2 s, 3.6 MB.
46x improvement, 30% performance penalty.
On a 75 MHz ARM7: only 1.4 s.
[Figure: tool stages compared: synthesis, RT synthesis, logic optimization, technology mapping, place, route; some stages are marked “Manually performed”.]
![Page 27: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/27.jpg)
Warp Processing: Initial Results - Embedded Applications
Average speedup of 6.3x, achieved completely transparently.
Also, energy savings of 66%.
[Figure: speedup (0 to 15) per benchmark: brev, g3fax, url, rocm, pktflow, canrdr, bitmnp, tblook, ttsprk, matrix, idct, g721, mpeg2, fir, matmul, average, median.]
![Page 28: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/28.jpg)
![Page 29: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/29.jpg)
Binary Synthesis
Warp processors perform synthesis from the software binary: “binary synthesis”.
Problem: No high-level information. The binary has no high-level constructs such as arrays and loops.
Synthesis needs high-level constructs. Without them: > 10x slowdown.
Can we recover high-level information for synthesis?
Goal: make binary synthesis (and warp processing) competitive with high-level synthesis.
Example: the compiler turns
for (i=0; i < 128; i++)
  y[i] += c[i] * x[i];
...
into
Addi r1, r0, 0
Ld r3, 256(r1)
Ld r4, 512(r1)
Subi r2, r1, 128
Jnz r2, -5
[Figure: binary synthesis maps the binary onto processor + FPGA, with the label “Hardware can be > 10x to 100x”.]
![Page 30: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/30.jpg)
Decompilation
We realized that decompilation recovers high-level information.
But it is generally used for binary translation or source-code recovery, and may not be suitable for synthesis.
We studied existing approaches [Cifuentes 94, 99, 01][Mycroft 99, 01]: DisC, dcc, Boomerang, Mocha, SourceAgain.
We determined the relevant techniques and adapted existing techniques for synthesis.
![Page 31: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/31.jpg)
Decompilation – Control/Data Flow Graph Recovery
Recovery of the control/data flow graph (CDFG), the format used by synthesis.
Difficult because of indirect jumps: control flow cannot be statically analyzed.
But heuristics are over 99% successful on standard benchmarks [Cifuentes 99, 00].
Original C code:
long f( short a[10] ) {
  long accum = 0;
  for (int i=0; i < 10; i++) {
    accum += a[i];
  }
  return accum;
}
Corresponding assembly:
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
Control/data flow graph creation:
reg3 := 0
reg4 := 0
loop:
reg1 := reg3 << 1
reg5 := reg2 + reg1
reg6 := mem[reg5 + 0]
reg4 := reg4 + reg6
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
![Page 32: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/32.jpg)
Decompilation – Data Flow Analysis
Original purpose: remove temporary registers.
Area overhead: 130%.
Need new techniques for binary synthesis.
CDFG for function f after data flow analysis:
reg3 := 0
reg4 := 0
loop:
reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
![Page 33: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/33.jpg)
Decompilation – Data Flow Analysis
Strength reduction (compare-with-zero instructions):
Sub reg3, reg4, reg5
Bz reg3, -5
The direct DFG subtracts reg5 from reg4 into reg3 and compares reg3 with 0 to decide the branch. The subtractor is not needed and wastes area. The optimized DFG compares reg4 with reg5 directly.
Operator size reduction:
Lb reg4, 0(reg1)
Mvi reg5, 16
Add reg3, reg4, reg5
The direct DFG uses a 32-bit adder with 32-bit reg4, reg5, and reg3. Because reg4 comes from a load byte (8 bits) and reg5 holds the constant 16 (5 bits), the optimized DFG needs only an 8-bit adder and an 8-bit reg3.
Area overhead reduced to 10%.
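The operator size reduction above can be illustrated with a small width calculation. This is a sketch under stated assumptions: `bits_for_constant` and `adder_width` are hypothetical helper names, and `adder_width` ignores the carry-out so that it matches the slide's 8-bit result register.

```c
/* Minimum number of bits needed to hold an unsigned constant (at least 1). */
unsigned bits_for_constant(unsigned long value)
{
    unsigned bits = 1;
    while (value >>= 1) {
        bits++;
    }
    return bits;
}

/* Adder width taken as the wider of the two operands. Assumption: the
   carry-out is dropped, as in the slide's 8-bit reg3. */
unsigned adder_width(unsigned a_bits, unsigned b_bits)
{
    return a_bits > b_bits ? a_bits : b_bits;
}
```

With an 8-bit loaded byte and the constant 16, `adder_width(8, bits_for_constant(16))` yields 8, the adder size shown in the optimized DFG.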
![Page 34: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/34.jpg)
Decompilation – Function Recovery
Recover parameters and return values.
Def-use analysis of prologue/epilogue.
100% success rate.
After function recovery:
long f( long reg2 ) {
  int reg3 = 0;
  int reg4 = 0;
  loop:
  reg4 = reg4 + mem[reg2 + (reg3 << 1)];
  reg3 = reg3 + 1;
  if (reg3 < 10) goto loop;
  return reg4;
}
![Page 35: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/35.jpg)
Decompilation – Control Structure Recovery
Recover loops and if statements.
Uses interval analysis techniques [Cifuentes 94].
100% success rate.
After control structure recovery:
long f( long reg2 ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) {
    reg4 += mem[reg2 + (reg3 << 1)];
  }
  return reg4;
}
![Page 36: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/36.jpg)
Decompilation – Array Recovery
Detect linear memory patterns and row-major ordering calculations.
~95% success rate [Stitt, Guo, Najjar, Vahid 05] [Cifuentes 00].
After array recovery:
long f( short array[10] ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) {
    reg4 += array[reg3];
  }
  return reg4;
}
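A linear memory pattern can be checked by testing whether consecutive accessed addresses differ by a constant stride. The sketch below is a simplified stand-in for the recovery technique, not the cited method: `linear_stride` is a hypothetical name, and it handles only one-dimensional access sequences.

```c
#include <stddef.h>

/* Return the constant stride between consecutive addresses, or 0 when the
   sequence is not linear or has fewer than two accesses. */
long linear_stride(const long *addr, size_t n)
{
    if (n < 2) {
        return 0;
    }
    long stride = addr[1] - addr[0];
    for (size_t i = 2; i < n; i++) {
        if (addr[i] - addr[i - 1] != stride) {
            return 0;
        }
    }
    return stride;
}
```

For the addresses `reg2 + (reg3 << 1)` produced by the example loop, the stride is 2, which is consistent with an array of 2-byte `short` elements.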
![Page 37: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/37.jpg)
Comparison of Decompiled Code and Original Code
Decompiled code is almost identical to the original code. The only difference is variable names.
Binary synthesis is competitive with high-level synthesis.
Original C code:
long f( short a[10] ) {
  long accum = 0;
  for (int i=0; i < 10; i++) {
    accum += a[i];
  }
  return accum;
}
Decompiled code:
long f( short array[10] ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) {
    reg4 += array[reg3];
  }
  return reg4;
}
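The two versions can be placed side by side and compared on the same input. In this sketch the functions are renamed `f_original` and `f_decompiled` so both fit in one file, and the accumulator is assumed to start at zero in both.

```c
/* Original C code from the slide; accum is assumed to start at zero. */
long f_original(const short a[10])
{
    long accum = 0;
    for (int i = 0; i < 10; i++) {
        accum += a[i];
    }
    return accum;
}

/* Decompiled code from the slide, keeping the register-based names. */
long f_decompiled(const short array[10])
{
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
    }
    return reg4;
}
```

Both functions return the same sum for any 10-element input, which is the sense in which the representations are almost identical.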
![Page 38: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/38.jpg)
Binary Synthesis Tool Flow
Initially, high-level source is compiled and linked with libraries/object code to form a binary.
Binary synthesis stages: decompilation, profiling, Hw/Sw estimation, Hw/Sw partitioning, synthesis, and binary updater.
Decompilation recovers the high-level information needed for synthesis.
The binary updater modifies the binary to use the synthesized hardware.
Outputs: hardware netlists and a bitstream for the FPGA, and an updated binary for the uP.
~30,000 lines of C code.
![Page 39: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/39.jpg)
Binary Synthesis is Competitive with High-Level Synthesis
Binary speedup: 8x. High-level speedup: 8.2x. High-level synthesis is only 2.5% better.
Commercial products are beginning to appear: Critical Blue, Binachip.
[Figure: speedup (0 to 15) of high-level vs. binary-level synthesis for FIR Filter, Beamformer, Viterbi, Brev, Url, BITMNP01, IDCTRN01, PNTRCH01, and average. Small difference in speedup.]
![Page 40: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/40.jpg)
Binary Synthesis with Software Compiler Optimizations
But those binaries were generated with few optimizations.
Optimizations for software may hurt hardware.
Need new decompilation techniques.
[Figure: C code, SW compiler, optimized binary, binary synthesis, uP + FPGA. The binary is optimized for software, so hardware synthesized from the optimized binary may be inefficient.]
![Page 41: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/41.jpg)
Loop Rerolling
Problem: Loop unrolling may cause inefficient hardware.
Longer synthesis times: with super-linear heuristics, unrolling 100 times => synthesis time is 100^2 times longer.
Larger area requirements.
Unrolling by the compiler is unlikely to match unrolling by synthesis.
Loop structure is needed for advanced synthesis techniques.
Solution: We introduce loop rerolling to undo loop unrolling.
[Figure: synthesis execution times, non-unrolled vs. unrolled.]
Non-unrolled loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Unrolled loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
![Page 42: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/42.jpg)
Loop Rerolling – Identifying Unrolled Loops
Idea: Identify consecutively repeating instruction sequences.
Original C code:
x = x + 1;
for (i=0; i < 2; i++)
  a[i]=b[i]+1;
y=x;
Unrolled loop:
x = x + 1;
a[0] = b[0]+1;
a[1] = b[1]+1;
y = x;
Binary, mapped to a string:
Add r3, r3, 1 => B
Ld r0, b(0) => A
Add r1, r0, 1 => B
St a(0), r1 => C
Ld r0, b(1) => A
Add r1, r0, 1 => B
St a(1), r1 => C
Mov r4, r3 => D
String representation: BABCABCD
Find consecutive repeating substrings with a suffix tree [Ukkonen 95]: adjacent nodes with the same substring indicate an unrolled loop.
Result: 2 unrolled iterations, each iteration = ABC (Ld, Add, St).
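The search for consecutively repeating substrings can be sketched with a brute-force scan. Note the swap: the slide uses a suffix tree [Ukkonen 95], while this illustration uses a simple nested scan that is only practical for short strings. The function name `find_unrolled_body` is hypothetical.

```c
#include <string.h>

/* Find the back-to-back repeated unit that covers the most characters of s.
   Returns the unit length and sets *start and *repeats, or returns 0 when
   nothing repeats consecutively. Brute force, not a suffix tree. */
size_t find_unrolled_body(const char *s, size_t *start, size_t *repeats)
{
    size_t n = strlen(s);
    size_t best_len = 0;
    size_t best_cover = 0;
    for (size_t len = 1; len <= n / 2; len++) {
        for (size_t i = 0; i + 2 * len <= n; i++) {
            size_t k = 1;  /* number of consecutive copies found at i */
            while (i + (k + 1) * len <= n &&
                   memcmp(s + i, s + i + k * len, len) == 0) {
                k++;
            }
            if (k >= 2 && k * len > best_cover) {
                best_cover = k * len;
                best_len = len;
                *start = i;
                *repeats = k;
            }
        }
    }
    return best_len;
}
```

On the slide's string `BABCABCD`, the scan reports a unit of length 3 starting at index 1 and repeated twice, which corresponds to two unrolled iterations of ABC.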
![Page 43: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/43.jpg)
Loop Rerolling
After unrolled loop identification:
Add r3, r3, 1
Ld r0, b(0)
Add r1, r0, 1
St a(0), r1
Ld r0, b(1)
Add r1, r0, 1
St a(1), r1
Mov r4, r3
1) Determine the relationship of constants.
2) Replace constants with an induction variable expression:
Add r3, r3, 1
i=0
loop:
Ld r0, b(i)
Add r1, r0, 1
St a(i), r1
Bne i, 2, loop
Mov r4, r3
3) Rerolled, decompiled code:
reg3 = reg3 + 1;
for (i=0; i < 2; i++)
  array1[i]=array2[i]+1;
reg4=reg3;
Original C code:
x = x + 1;
for (i=0; i < 2; i++)
  a[i]=b[i]+1;
y=x;
Average speedup of 1.6x.
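To see that rerolling preserves behavior, the unrolled and rerolled forms of the example can be written as two C functions and run on the same data. The names `run_unrolled` and `run_rerolled` are hypothetical, and passing the variables through pointers is an assumption made so the results can be compared.

```c
/* Unrolled form, as a software compiler might emit it. */
void run_unrolled(int *a, const int *b, int *x, int *y)
{
    *x = *x + 1;
    a[0] = b[0] + 1;
    a[1] = b[1] + 1;
    *y = *x;
}

/* Rerolled form: the constants 0 and 1 become the induction variable i. */
void run_rerolled(int *a, const int *b, int *x, int *y)
{
    *x = *x + 1;
    for (int i = 0; i < 2; i++) {
        a[i] = b[i] + 1;
    }
    *y = *x;
}
```

Both functions leave `a`, `x`, and `y` in the same state, while the rerolled form keeps the loop structure that synthesis can then unroll to its own preferred degree.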
![Page 44: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/44.jpg)
Strength Promotion
Problem: Strength reduction may cause inefficient hardware.
However, some of the strength reduction was beneficial.
Strength promotion lets synthesis, not the software compiler, decide on strength reduction.
Identify strength-reduced subgraphs: A[i] is the sum of the shift-and-add terms (B[i] << 3) + (B[i] << 1), (B[i+1] << 4) + (B[i+1] << 1), (B[i+2] << 5) + (B[i+2] << 1), and (B[i+3] << 6) + (B[i+3] << 1).
Replace each subgraph with a multiplication, one at a time: B[i] * 10, B[i+1] * 18, B[i+2] * 34, B[i+3] * 66.
Synthesis reapplies strength reduction to get the optimal DFG.
Average speedup of 1.5x.
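The multipliers on this slide follow directly from the shift amounts: a subgraph (b << s1) + (b << s2) equals b times 2^s1 + 2^s2. The sketch below checks that identity, with `shift_add` and `promoted_multiplier` as hypothetical names and unsigned operands assumed so the shifts are well defined.

```c
/* Strength-reduced form as emitted by a software compiler. */
unsigned long shift_add(unsigned long b, unsigned s1, unsigned s2)
{
    return (b << s1) + (b << s2);
}

/* Constant of the equivalent multiplication: 2^s1 + 2^s2. */
unsigned long promoted_multiplier(unsigned s1, unsigned s2)
{
    return (1UL << s1) + (1UL << s2);
}
```

Shift pairs (3, 1), (4, 1), (5, 1), and (6, 1) give the constants 10, 18, 34, and 66 that appear in the promoted graph.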
![Page 45: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/45.jpg)
Multiple ISA/Optimization Results
What about aggressive software compiler optimizations? They may obscure the binary, making decompilation impossible.
What about different instruction sets? Side effects may degrade hardware performance.
Speedups are similar on MIPS for -O1 and -O3 optimizations.
Speedups are similar on ARM for -O1 and -O3 optimizations.
Speedups are similar between ARM and MIPS. The complex instructions of ARM didn’t hurt synthesis.
MicroBlaze speedups are much larger. MicroBlaze is a slower microprocessor, and -O3 optimizations were very beneficial to hardware.
[Figure: speedup (0 to 30) for MIPS -O1, MIPS -O3, ARM -O1, ARM -O3, MicroBlaze -O1, MicroBlaze -O3.]
![Page 46: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/46.jpg)
High-level vs. Binary Synthesis: Proprietary H.264 Decoder
High-level synthesis vs. binary synthesis, in collaboration with Freescale Semiconductor.
H.264 decoder: MPEG-4 Part 10 Advanced Video Coding (AVC). 3x smaller than MPEG-2, with better quality.
[Figure: MPEG2 vs. H.264.]
![Page 47: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/47.jpg)
High-level vs. Binary Synthesis: Proprietary H.264 Decoder
Binary synthesis was competitive with high-level synthesis: high-level speedup 6.56x, binary speedup 6.55x.
[Figure: speedup (0 to 10) vs. number of functions in hardware (1 to 51), for high-level and binary synthesis.]
![Page 48: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/48.jpg)
![Page 49: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/49.jpg)
49/55
Thread Warping - Overview
Function a( ):
for (i=0; i < 10; i++) createThread( b );
[Diagram: µPs running a( ) and b( ) under the OS, with Profiler, Warp Tools, and Warp FPGA; the OS thread queue holds waiting b( ) threads, and the Warp FPGA holds b( ) accelerators built by the Warp Tools]
OS can only schedule 2 threads
Remaining 8 threads placed in thread queue
Warp tools create custom accelerators for b( )
OS schedules 4 threads to custom accelerators
3x more thread parallelism
Architectural Trend – Include more cores on chip
Result – More multi-threaded applications
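The thread counts on this slide can be checked with a small sketch. The helper names `concurrent_threads` and `queued_threads` are hypothetical, and the model (one running thread per core or accelerator) is a simplifying assumption rather than a description of the actual OS scheduler.

```c
#include <assert.h>

/* Threads that can run at once under a simplified model: one per
 * microprocessor core plus one per custom accelerator, capped by the
 * number of threads that exist. (Illustrative, not the real scheduler.) */
int concurrent_threads(int total_threads, int cores, int accelerators)
{
    assert(total_threads >= 0 && cores >= 0 && accelerators >= 0);
    int slots = cores + accelerators;
    return total_threads < slots ? total_threads : slots;
}

/* Threads left waiting in the OS thread queue under the same model. */
int queued_threads(int total_threads, int cores, int accelerators)
{
    return total_threads - concurrent_threads(total_threads, cores, accelerators);
}
```

For the slide's example of 10 threads on 2 cores, `concurrent_threads(10, 2, 0)` is 2 and `queued_threads(10, 2, 0)` is 8. Adding 4 accelerators gives `concurrent_threads(10, 2, 4)` equal to 6, which is the 3x increase in thread parallelism.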
![Page 50: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/50.jpg)
50/55
Thread Warping - Overview
Function a( ):
for (i=0; i < 10; i++) createThread( b );
[Diagram: µPs running a( ) and b( ) under the OS, with Profiler, Warp Tools, and Warp FPGA]
Profiler detects performance critical loop in b( )
Warp tools create larger/faster accelerators
[Diagram: Warp FPGA holding b( ) accelerators]
Potentially > 100x speedup
![Page 51: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/51.jpg)
51/55
Thread Warping - Results
Thread warping 120x faster than 4-uP (ARM) system
[Chart: y-axis 0 to 50; benchmarks: Fir, Prewitt, Linear, Moravec, Wavelet, Maxfilter, 3DTrans, N-body, Avg., Geo.Mean; series: 4-uP, TW, 8-uP, 16-uP, 32-uP, 64-uP; off-scale bar labels as extracted: 130 502 63 130 38308]
Comparison of thread warping (TW) and multi-core
Simulated multi-cores ranging from 4 to 64
Thread warping – 4 cores + FPGA
![Page 52: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/52.jpg)
52/55
Warp Processing – Custom Communication
NoC – Network on a Chip provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar]
Problem: Best topology is application dependent
[Diagram: four µPs. Charts: performance of Bus vs. Mesh topologies for App1 and App2]
![Page 53: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/53.jpg)
53/55
Warp Processing – Custom Communication
NoC – Network on a Chip provides communication between multiple cores [Benini, DeMicheli][Hemani][Kumar]
Problem: Best topology is application dependent
[Diagram: four µPs. Charts: performance of Bus vs. Mesh topologies for App1 and App2]
Warp processing can dynamically choose topology – 2x to 100x improvement
[Diagrams: FPGA with four µPs, shown twice]
Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign: “Amoebic Computing”
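The idea of choosing a topology per application can be sketched as a simple selection over measured performance. Everything here is an assumption made for illustration: the `topology` enum, the `best_topology` function, and the performance scores are hypothetical, and the slides do not describe how warp processing actually makes the choice.

```c
#include <assert.h>
#include <stddef.h>

/* Candidate interconnect topologies named on the slide. */
enum topology { TOPOLOGY_BUS, TOPOLOGY_MESH, TOPOLOGY_COUNT };

/* Pick the topology with the highest score for one application.
 * perf[t] is a hypothetical performance score for topology t
 * (higher is better); ties keep the earlier topology. */
enum topology best_topology(const double perf[TOPOLOGY_COUNT])
{
    assert(perf != NULL);
    enum topology best = TOPOLOGY_BUS;
    for (int t = 1; t < TOPOLOGY_COUNT; t++) {
        if (perf[t] > perf[best])
            best = (enum topology)t;
    }
    return best;
}
```

Given made-up scores where one application runs better on a bus and another on a mesh, calling `best_topology` once per application returns a different topology for each, which is the application dependence the slide points out.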
![Page 54: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/54.jpg)
54/55
Summary
[Diagram labels: uP, I$, D$, FPGA, Profiler, On-chip CAD, Binary, Updated Binary, HW, Expandable Logic]
Any Language, Any Compiler, Standard Binary
Developer is unaware of FPGA/synthesis
Binary Synthesis
JIT FPGA Compilation
Decompilation makes possible
[Chart: performance of uP vs. Warp Processing]
Warp processing invisibly achieves > 100x speedups
![Page 55: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813a3c550346895da22550/html5/thumbnails/55.jpg)
55/55
References
Patent: Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent Pending, 2004.
1. Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp. 164-170.
2. Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp. 659-681.
3. Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES).
4. Expandable Logic. G. Stitt and F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC), 2007.
5. New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp. 547-554.
6. Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, and B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp. 285-290.
7. A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp. 396-397.
8. Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky, and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003, pp. 250-255.
Supported by NSF, SRC, Intel, IBM, Xilinx