Hardware-Accelerated Dynamic Binary Translation - Inria
Rokicki Simon - Irisa / Université de Rennes 1
Steven Derrien - Irisa / Université de Rennes 1
Erven Rohou - Inria
Hardware-Accelerated Dynamic Binary Translation
Embedded Systems
Tight constraints in:
• Power consumption
• Production cost
• Performance
Systems on a Chip
• Complex heterogeneous designs
• Heterogeneity brings new power/performance trade-offs
[Figure: power versus performance for an in-order core and an out-of-order superscalar core, showing the overhead from in-order to out-of-order]
Systems on a Chip
• Complex heterogeneous designs
• Heterogeneity brings new power/performance trade-offs
• Are there better trade-offs?
[Figure: power versus performance for an in-order core, an out-of-order superscalar core and a VLIW core]
Architectural choice
Out-of-Order processor
• Dynamic scheduling
• Performance portability
• Poor energy efficiency
VLIW processor
• Static scheduling
• No portability
• High energy efficiency
[Figure: a VLIW core executing statically scheduled bundles of instructions (ins1 to ins4), compared with an out-of-order pipeline with decode & renaming stages and a reorder buffer (ROB)]
The best of both worlds?
Dynamically translate native binaries into VLIW binaries:
• Performance close to that of an out-of-order processor
• Energy consumption close to that of a VLIW processor
[Figure: native binaries (RISC-V) -> Dynamic Binary Translation -> VLIW binaries -> VLIW core]
Existing approaches
• Transmeta Code Morphing Software & Crusoe architectures
  • x86 on VLIW architecture
  • User experience polluted by cold-code execution penalty
• Nvidia Denver architecture
  • ARM on VLIW architecture
• Translation overhead is critical
• Too little information on closed platforms
Our contribution
• Hardware-accelerated DBT framework
  • Makes the DBT cheaper (time & energy)
  • First approach that tries to accelerate binary translation
• Open-source framework
  • Allows research
[Figure: native binaries (RISC-V) -> Dynamic Binary Translation, assisted by hardware accelerators -> VLIW binaries -> VLIW core]
Outline
• Hybrid-DBT Platform
  • How does it work?
  • What does it cost?
  • Focus on optimization levels
• Experimental Study
  • Impact on translation overhead
  • Impact on translation energy overhead
  • Impact on area utilization
• Conclusion & Future work
Hybrid-DBT Platform
How does it work?
[Figure: RISC-V binaries next to the VLIW core]
• RISC-V binaries cannot be executed on the VLIW
How does it work?
• Direct, naive translation from native to VLIW binaries
• Does not take advantage of Instruction Level Parallelism (ILP)
[Figure: optimization level 0 (no ILP): RISC-V binaries -> Direct Translation -> VLIW binaries -> VLIW core]
How does it work?
• Build an Intermediate Representation (CFG + dependencies)
• Reschedule instructions on VLIW execution units
[Figure: optimization level 0 (no ILP): RISC-V binaries -> Direct Translation -> VLIW binaries; optimization level 1 (ILP): IR Builder -> IR -> IR Scheduler -> VLIW binaries]
How does it work?
• Code profiling to detect hotspots
• Optimization level 1 is applied only to hotspots
[Figure: optimization level 0 (no ILP): RISC-V binaries -> Direct Translation, with Insert Profiling -> VLIW binaries; optimization level 1 (ILP): IR Builder -> IR -> IR Scheduler -> VLIW binaries]
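The hotspot detection described above can be sketched as a per-block execution counter. This is only an illustration: the `Profiler` class, the `on_block_entry` hook and the threshold value are hypothetical names and numbers, not taken from the Hybrid-DBT sources.

```python
from collections import defaultdict

HOT_THRESHOLD = 3  # hypothetical value, chosen only for this example


class Profiler:
    """Counts block executions and flags a block once when it becomes hot."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.counts = defaultdict(int)  # block address -> execution count
        self.optimized = set()          # blocks already sent to level 1

    def on_block_entry(self, block_address):
        """Return True exactly once, when the block reaches the threshold."""
        self.counts[block_address] += 1
        if (self.counts[block_address] >= self.threshold
                and block_address not in self.optimized):
            self.optimized.add(block_address)
            return True
        return False


profiler = Profiler()
trace = [0x100, 0x200, 0x100, 0x100, 0x100, 0x200]
promoted = [hex(addr) for addr in trace if profiler.on_block_entry(addr)]
print(promoted)
```

Only the block at `0x100` is promoted in this trace; the block at `0x200` stays at optimization level 0 because it runs too rarely to justify the more expensive pass.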
What does it cost?
[Figure: translation flow annotated with software translation costs: Direct Translation 150 cycles/instr (optimization level 0, no ILP); IR Builder 400 cycles/instr and IR Scheduler 500 cycles/instr (optimization level 1, ILP)]
• Cycle/instr: number of cycles needed to translate one RISC-V instruction
• Need to accelerate the time-consuming parts of the translation
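As a rough worked example of these figures, the sketch below adds up the software translation cost of a program in which only part of the code is hot. It assumes that the level 1 cost (IR Builder plus IR Scheduler) comes on top of the level 0 cost for hot code; that additive model and the function name `translation_cycles` are assumptions made for this illustration.

```python
# Software translation costs quoted on the slide, in cycles per instruction.
SOFTWARE_COST = {"first_pass": 150, "ir_builder": 400, "ir_scheduler": 500}


def translation_cycles(n_instructions, n_hot_instructions):
    """Total translation cycles, assuming level 1 is paid on top of level 0."""
    level0 = n_instructions * SOFTWARE_COST["first_pass"]
    level1 = n_hot_instructions * (
        SOFTWARE_COST["ir_builder"] + SOFTWARE_COST["ir_scheduler"]
    )
    return level0 + level1


# 1000 translated instructions, of which 100 belong to hotspots.
total = translation_cycles(1000, 100)
print(total)
```

Under these assumptions the 100 hot instructions account for 90 000 of the 240 000 cycles, which illustrates why the IR Builder and IR Scheduler are the natural targets for acceleration.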
Hybrid-DBT framework
[Figure: optimization level 0 (no ILP): RISC-V binaries -> First-Pass Translation, with Insert Profiling -> VLIW binaries; optimization level 1 (ILP): IR Builder -> IR -> IR Scheduler -> VLIW binaries; the legend distinguishes software passes from hardware accelerators]
• Hardware acceleration of the critical steps of DBT
• Can be seen as a hardware-accelerated compiler back-end
Focus on optimization level 0
[Figure: optimization level 0 (no ILP): RISC-V binaries -> First-Pass Translation, with Insert Profiling -> VLIW binaries]
• Critical for system reactivity
First-Pass Translation
[Figure: a RISC-V instruction word (imm12, rs1, funct, rd, opcode) translated into a VLIW instruction word (imm13, rs1, rd, opcode)]
• Implemented as a Finite State Machine
• Translates each native instruction separately
• Produces 1 VLIW instruction per cycle
• 1 RISC-V instruction => up to 2 VLIW instructions
• Simple because the ISAs are similar
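A software sketch of this per-instruction translation is given below. The field extraction follows the standard RISC-V I-type encoding; the output format (a list of tuples) and the `VLIW_MNEMONIC` table are hypothetical stand-ins, since the transcript does not give the VLIW encoding. The function returns a list because one RISC-V instruction may expand to up to two VLIW instructions.

```python
OP_IMM = 0x13                # RISC-V major opcode for register-immediate ALU ops
VLIW_MNEMONIC = {0: "addi"}  # hypothetical mapping: funct3 -> VLIW operation


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_itype(word):
    """Split a 32-bit RISC-V I-type instruction into its fields."""
    return {
        "opcode": word & 0x7F,
        "rd": (word >> 7) & 0x1F,
        "funct3": (word >> 12) & 0x7,
        "rs1": (word >> 15) & 0x1F,
        "imm": sign_extend(word >> 20, 12),
    }


def first_pass(word):
    """Translate one instruction on its own, without looking at its neighbours."""
    f = decode_itype(word)
    if f["opcode"] == OP_IMM and f["funct3"] in VLIW_MNEMONIC:
        return [(VLIW_MNEMONIC[f["funct3"]], f["rd"], f["rs1"], f["imm"])]
    raise NotImplementedError("only addi is handled in this sketch")


print(first_pass(0x00108213))  # addi x4, x1, 1
```

Because each instruction is handled independently, the same logic maps naturally onto a small state machine, which is what makes this step cheap to implement in hardware.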
Focus on optimization level 1
[Figure: optimization level 0 (no ILP): RISC-V binaries -> First-Pass Translation, with Insert Profiling -> VLIW binaries; optimization level 1 (ILP): IR Builder -> IR -> IR Scheduler -> VLIW binaries]
• Critical to start exploiting VLIW capabilities
Goal of optimization level 1
Exploit available ILP
• Compute dependencies
• Perform instruction scheduling
Native binaries:
    stw r5,0(r3)
    ldw r3,0(r2)
    addi r4,r1,1
    sub r4,r4,r3
    stw r4,0(r3)
    movi r3,0
VLIW binaries (one bundle per line):
    nop             stw r5,0(r3)
    addi r4,r1,1    ldw r3,0(r2)
    sub r4,r4,r3    nop
    movi r3,0       stw r4,0(r3)
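The two steps above can be sketched on the six-instruction block from the slide. This is a simplified, greedy scheduler that walks the block in program order; it is not the Hybrid-DBT scheduler. The machine model is an assumption: one ALU slot and one memory slot per bundle, a one-cycle latency for true (RAW), output (WAW) and store-to-memory dependencies, and a zero-cycle latency for anti (WAR) dependencies, so that a write may share a bundle with an earlier read of the same register.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "text unit dst srcs is_store")

BLOCK = [
    Instr("stw r5,0(r3)", "mem", None, ("r5", "r3"), True),
    Instr("ldw r3,0(r2)", "mem", "r3", ("r2",), False),
    Instr("addi r4,r1,1", "alu", "r4", ("r1",), False),
    Instr("sub r4,r4,r3", "alu", "r4", ("r4", "r3"), False),
    Instr("stw r4,0(r3)", "mem", None, ("r4", "r3"), True),
    Instr("movi r3,0", "alu", "r3", (), False),
]


def edge_latency(prev, ins):
    """Cycles that must separate prev from ins, or None if they are independent."""
    raw = prev.dst is not None and prev.dst in ins.srcs
    waw = prev.dst is not None and prev.dst == ins.dst
    war = ins.dst is not None and ins.dst in prev.srcs
    both_mem = prev.unit == "mem" and ins.unit == "mem"
    if raw or waw or (both_mem and prev.is_store):
        return 1
    if war or (both_mem and ins.is_store):
        return 0
    return None


def list_schedule(block):
    """Place each instruction in the earliest bundle that respects its dependencies."""
    cycle_of = []
    busy = set()  # (cycle, unit) pairs that are already occupied
    for i, ins in enumerate(block):
        earliest = 0
        for j in range(i):
            lat = edge_latency(block[j], ins)
            if lat is not None:
                earliest = max(earliest, cycle_of[j] + lat)
        cycle = earliest
        while (cycle, ins.unit) in busy:
            cycle += 1
        busy.add((cycle, ins.unit))
        cycle_of.append(cycle)
    return cycle_of


cycles = list_schedule(BLOCK)
for ins, cycle in zip(BLOCK, cycles):
    print(cycle, ins.text)
```

Under these assumptions the block fits in four bundles, the same length as the schedule on the slide. The only difference is that `addi` is placed in the first bundle here, since nothing forces it to wait for the store.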
Cost of optimization level 1
[Figure: VLIW binaries -> IR Builder (400 cycles/instr) -> IR -> IR Scheduler (500 cycles/instr) -> VLIW binaries]
• Generate high-level IR
• Instruction scheduling on the IR
IR Builder: acceleration is simple
• Instruction decoding/encoding
• Single FOR loop
• Regular computations
IR Scheduler: acceleration is challenging
• Difficult to parallelize
• Complex control flow structure
Cost of optimization level 1
[Figure: VLIW binaries -> IR Builder (400 cycles/instr, acceleration is simple) -> IR -> IR Scheduler (500 cycles/instr, acceleration is challenging) -> VLIW binaries]
• Generate high-level IR
• Instruction scheduling on the IR
• Instruction scheduling is the bottleneck
• The IR is designed to speed up scheduling
Choosing an Intermediate Representation
[Figure: data-flow graph of the block, with nodes 0 to 5 (st, ld, +, -, st, mv), register r3 and global values g1, g2, g3]
Native binaries:
    stw r5,0(r3)
    ldw r3,0(r2)
    addi r4,r1,1
    sub r4,r4,r3
    stw r4,0(r3)
    movi r3,0
VLIW binaries (one bundle per line):
    nop             stw r5,0(r3)
    addi r4,r1,1    ldw r3,0(r2)
    sub r4,r4,r3    nop
    movi r3,0       stw r4,0(r3)
Choosing an Intermediate Representation
[Figure: data-flow graph of the block (nodes 0 to 5) next to its encoding in the IR]
Fields of an IR instruction (bit positions 96, 64, 32 and 0 are marked on the figure): op, registers[4], nbSucc, nbDSucc, nbDep, succNames[8]
IR instructions:
    0 - st      @g3 = 0
    1 - ld      r1 = @g2
    2 - addi    g1 = g1 + 1
    3 - sub     r3 = r1 - g1
    4 - st      @g2 = r3
    5 - mov     r3 = 0
[Table: for each IR instruction, the figure lists the values of nbSucc, nbDSucc and nbDep and up to eight successor names]
IR advantages:
• Direct access to dependencies and successors
• Regular structure (no pointers / variable size)
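The benefit of a regular, pointer-free structure can be illustrated with a fixed-size record. The field names below follow the slide (op, registers[4], nbSucc, nbDSucc, nbDep, succNames[8]); the one-byte width given to every field, the `NO_SUCC` filler value and the helper names are assumptions made for this sketch, chosen so that a record occupies 128 bits.

```python
import struct

# Hypothetical layout: op, 4 registers, 3 counters, 8 successor names,
# one byte each, for a total of 16 bytes (128 bits).
IR_FORMAT = "<B4B3B8B"
NO_SUCC = 0xFF  # filler for unused successor slots (assumption)


def pack_ir(op, registers, nb_succ, nb_dsucc, nb_dep, succ_names):
    """Encode one IR instruction as a fixed-size record."""
    regs = list(registers) + [0] * (4 - len(registers))
    succs = list(succ_names) + [NO_SUCC] * (8 - len(succ_names))
    return struct.pack(IR_FORMAT, op, *regs, nb_succ, nb_dsucc, nb_dep, *succs)


def successors(record):
    """Read the successor list directly from the record, without any traversal."""
    fields = struct.unpack(IR_FORMAT, record)
    nb_succ = fields[5]
    return list(fields[8:8 + nb_succ])


record = pack_ir(op=1, registers=[3, 2], nb_succ=2, nb_dsucc=1, nb_dep=1,
                 succ_names=[3, 4])
print(len(record), successors(record))
```

Because every record has the same size and stores its successors inline, instruction `i` of a block lives at offset `16 * i`, and a scheduler can reach the dependents of an instruction with a single fixed-width read.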
Details on hardware accelerators
• Developing such accelerators in VHDL is out of reach
• Accelerators are developed using High-Level Synthesis
  • Loop unrolling/pipelining
  • Memory partitioning
  • Memory access factorization
  • Explicit forwarding
[Figure: VLIW binaries -> IR Builder (one-pass dependency analysis) -> IR -> IR Scheduler (list-scheduling algorithm) -> VLIW binaries]
See the paper for more details!
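A one-pass dependency analysis can be sketched by remembering, for each register, its last writer and the instructions that have read it since. The version below handles register dependencies only and ignores memory ordering; it is a generic illustration of the idea, not the IR Builder from Hybrid-DBT, and the `(dst, srcs)` tuple format is an assumption.

```python
def one_pass_dependencies(block):
    """For each instruction, list the earlier instructions it depends on.

    block is a list of (dst, srcs) pairs; dst is None when nothing is written.
    """
    last_writer = {}          # register -> index of its most recent writer
    readers_since_write = {}  # register -> readers since that write
    deps = []
    for i, (dst, srcs) in enumerate(block):
        preds = set()
        for reg in srcs:
            if reg in last_writer:
                preds.add(last_writer[reg])                  # true dependency (RAW)
        if dst is not None:
            if dst in last_writer:
                preds.add(last_writer[dst])                  # output dependency (WAW)
            preds.update(readers_since_write.get(dst, ()))   # anti dependency (WAR)
        for reg in srcs:
            readers_since_write.setdefault(reg, []).append(i)
        if dst is not None:
            last_writer[dst] = i
            readers_since_write[dst] = []
        deps.append(sorted(preds))
    return deps


# Register view of the six-instruction block used earlier in the talk.
BLOCK = [
    (None, ("r5", "r3")),  # stw r5,0(r3)
    ("r3", ("r2",)),       # ldw r3,0(r2)
    ("r4", ("r1",)),       # addi r4,r1,1
    ("r4", ("r4", "r3")),  # sub r4,r4,r3
    (None, ("r4", "r3")),  # stw r4,0(r3)
    ("r3", ()),            # movi r3,0
]
deps = one_pass_dependencies(BLOCK)
print(deps)
```

Each instruction is visited once and only touches small per-register tables, which is the kind of single regular loop that pipelines well under High-Level Synthesis.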
Experimental Study
Impact on translation overhead
[Figure: translation flow annotated with software translation costs: First-Pass Translation 150 cycles/instr (optimization level 0); IR Builder 400 cycles/instr and IR Scheduler 500 cycles/instr (optimization level 1)]
• VLIW baseline is executed with ST200sim
• Fully functional Hybrid-DBT platform on FPGA
  • JIT processor: Nios II
  • Altera DE2-115
Impact on translation overhead
[Figure: bar chart of translation cost (cycle/instruction, axis 0 to 200) and speed-up vs software DBT (axis 0 to 5); legend: First-Pass Translator, IR Builder, IR Scheduler]
• Cost of optimization level 0 using the hardware accelerator
Impact on translation overhead
[Figure: bar chart of translation cost (cycle/instruction, axis 0 to 200) and speed-up vs software DBT (axis 0 to 40); legend: First-Pass Translator, IR Builder, IR Scheduler. Data labels on the chart: 14, 15, 13, 14, 13, 13, 13, 13, 12, 13 and 37, 144, 61, 132, 106, 112, 106, 116, 60, 105]
• Cost of optimization level 1 using the hardware accelerator
Impact on translation energy overhead
[Figure: translation flow in which the energy cost of First-Pass Translation, IR Builder and IR Scheduler is still to be determined (?? J)]
• Hybrid-DBT platform on ASIC:
  • Compiled with Design Compiler for a 65 nm ASIC technology
  • Design frequency: 250 MHz
  • Gate-level simulation with ModelSim
  • Accurate power estimation
Impact on translation energy overhead
[Figure: bar chart of energy efficiency vs software DBT (axis 0 to 400); legend: First-Pass Translator, IR Builder, IR Scheduler]
• Energy-efficiency improvement using the hardware accelerator
Impact on translation energy overhead
[Figure: bar chart of energy efficiency vs software DBT (axis 0 to 100); legend: First-Pass Translator, IR Builder, IR Scheduler]
• Energy-efficiency improvement using the hardware accelerator
Impact on area/resource cost
[Figure: bar chart of area in NAND-equivalent gates (axis 0 to 25 000), with the overhead from Hybrid-DBT highlighted]
Component: NAND-equivalent gates
    VLIW: 19 220
    DBT Processor: 6 300
    IR Scheduler: 7 626
    First-Pass Translator: 779
    IR Builder: 5 019
• Resource usage for all our platform components
Conclusion
• Presentation of the Hybrid-DBT framework
  • Hardware-accelerated DBT
  • Open-source DBT framework, RISC-V to VLIW
  • Tested FPGA prototype
• Sources are available on GitHub: https://github.com/srokicki/HybridDBT
Future Work
• DBT to support hardware adaptability
• Exploring cost/impact of optimizations
• Comparison with existing RISC-V implementations (BOOM)
[Figure: native binaries (RISC-V) -> Dynamic Binary Translation, assisted by hardware accelerators -> several sets of VLIW binaries]
Questions?
https://github.com/srokicki/HybridDBT