Hardware-Accelerated Dynamic Binary Translation - Inria

Rokicki Simon - Irisa / Université de Rennes 1; Steven Derrien - Irisa / Université de Rennes 1; Erven Rohou - Inria

Transcript of Hardware-Accelerated Dynamic Binary Translation - Inria

Page 1: Hardware-Accelerated Dynamic Binary Translation - Inria

Rokicki Simon - Irisa / Université de Rennes 1

Steven Derrien - Irisa / Université de Rennes 1

Erven Rohou - Inria

Hardware-Accelerated Dynamic Binary Translation

Page 2: Hardware-Accelerated Dynamic Binary Translation - Inria

Embedded Systems

Tight constraints in:
• Power consumption
• Production cost
• Performance


Page 3: Hardware-Accelerated Dynamic Binary Translation - Inria

Systems on a Chip

• Complex heterogeneous designs
• Heterogeneity brings new power/performance trade-offs

[Figure: power versus performance for an in-order core and an out-of-order superscalar core, annotated with the overhead from in-order to out-of-order.]

Page 4: Hardware-Accelerated Dynamic Binary Translation - Inria

Systems on a Chip

• Complex heterogeneous designs
• Heterogeneity brings new power/performance trade-offs
• Are there better trade-offs?

[Figure: power versus performance plot placing a VLIW core alongside the in-order core and the out-of-order superscalar core.]

Page 5: Hardware-Accelerated Dynamic Binary Translation - Inria

Out-of-Order processor

• Dynamic Scheduling

• Performance portability

• Poor energy efficiency

VLIW processor

• Static scheduling

• No portability

• High energy efficiency

Architectural choice

[Figure: instruction streams (ins1 to ins4) flowing into a VLIW core, contrasted with an out-of-order core's decode & renaming stage and reorder buffer (ROB).]

Page 6: Hardware-Accelerated Dynamic Binary Translation - Inria

The best of both worlds?

Dynamically translate native binaries into VLIW binaries:
• Performance close to Out-of-Order processor
• Energy consumption close to VLIW processor

[Figure: Binaries (RISC-V) -> Dynamic Binary Translation -> VLIW Binaries -> VLIW]

Page 7: Hardware-Accelerated Dynamic Binary Translation - Inria

Existing approaches

• Transmeta Code Morphing Software & Crusoe architectures
  • x86 on VLIW architecture
  • User experience polluted by cold-code execution penalty
• Nvidia Denver architecture
  • ARM on VLIW architecture

• Translation overhead is critical
• Too little information on closed platforms

Page 8: Hardware-Accelerated Dynamic Binary Translation - Inria

Our contribution

• Hardware accelerated DBT framework
  • Make the DBT cheaper (time & energy)
  • First approach that tries to accelerate binary translation
• Open source framework
  • Allows research

[Figure: Binaries (RISC-V) -> Dynamic Binary Translation with Hardware Accelerators -> VLIW Binaries -> VLIW]

Page 9: Hardware-Accelerated Dynamic Binary Translation - Inria

Outline

• Hybrid-DBT Platform
  • How does it work?
  • What does it cost?
  • Focus on optimization levels
• Experimental Study
  • Impact on translation overhead
  • Impact on translation energy overhead
  • Impact on area utilization
• Conclusion & Future work


Page 11: Hardware-Accelerated Dynamic Binary Translation - Inria

How does it work?

• RISC-V binaries cannot be executed on VLIW

[Figure: RISC-V binaries next to the VLIW core.]

Page 12: Hardware-Accelerated Dynamic Binary Translation - Inria

How does it work?

• Direct, naive translation from native to VLIW binaries
• Does not take advantage of Instruction Level Parallelism

[Figure: RISC-V binaries -> Direct Translation -> VLIW binaries -> VLIW. Optimization level 0: no ILP.]

Page 13: Hardware-Accelerated Dynamic Binary Translation - Inria

How does it work?

• Build an Intermediate Representation (CFG + dependencies)
• Reschedule instructions on VLIW execution units

[Figure: RISC-V binaries -> Direct Translation -> VLIW binaries -> VLIW (optimization level 0: no ILP). VLIW binaries -> IR Builder -> IR -> IR Scheduler -> VLIW binaries (optimization level 1: ILP).]

Page 14: Hardware-Accelerated Dynamic Binary Translation - Inria

How does it work?

• Code profiling to detect hotspots
• Optimization level 1 only on hotspots

[Figure: RISC-V binaries -> Direct Translation and Insert Profiling -> VLIW binaries -> VLIW (optimization level 0: no ILP). VLIW binaries -> IR Builder -> IR -> IR Scheduler -> VLIW binaries (optimization level 1: ILP).]
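The hotspot detection described on this slide can be sketched as a per-block execution counter with a threshold. This is only an illustration: the class name, the threshold value and the block addresses are assumptions made for the example, and the slides do not say how Hybrid-DBT implements its profiling counters.

```python
from collections import Counter

HOT_THRESHOLD = 3  # assumed value, chosen only to keep the example short


class BlockProfiler:
    """Counts block executions and reports each hotspot exactly once."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()   # block address -> number of executions
        self.optimized = set()    # blocks already handed to level 1

    def record(self, block_addr):
        """Record one execution; return True when the block becomes hot."""
        self.counts[block_addr] += 1
        if (self.counts[block_addr] >= self.threshold
                and block_addr not in self.optimized):
            self.optimized.add(block_addr)
            return True
        return False


def blocks_to_optimize(trace, threshold=HOT_THRESHOLD):
    """Return, in detection order, the blocks that reach the threshold."""
    profiler = BlockProfiler(threshold)
    return [addr for addr in trace if profiler.record(addr)]
```

With a trace such as `[0x100, 0x200, 0x100, 0x100, 0x200, 0x100]`, only block `0x100` reaches the assumed threshold of 3, so only that block would be sent to optimization level 1; every other block keeps its level 0 translation.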

Page 15: Hardware-Accelerated Dynamic Binary Translation - Inria

What does it cost?

• Cycle/instr: number of cycles to translate one RISC-V instruction
• Need to accelerate time-consuming parts of the translation

[Figure: the translation flow annotated with software costs. Optimization level 0 (Direct Translation, Insert Profiling; no ILP): 150 cycle/instr. Optimization level 1 (ILP): IR Builder 400 cycle/instr, IR Scheduler 500 cycle/instr.]

Page 16: Hardware-Accelerated Dynamic Binary Translation - Inria

Hybrid-DBT framework

• Hardware acceleration of critical steps of DBT
• Can be seen as a hardware accelerated compiler back-end

[Figure: RISC-V binaries -> First-Pass Translation and Insert Profiling -> VLIW binaries -> VLIW (optimization level 0: no ILP). VLIW binaries -> IR Builder -> IR -> IR Scheduler -> VLIW binaries (optimization level 1: ILP). A legend distinguishes software passes from hardware accelerators.]

Page 17: Hardware-Accelerated Dynamic Binary Translation - Inria

Focus on optimization level 0

• Critical for system reactivity

[Figure: RISC-V binaries -> First-Pass Translation and Insert Profiling -> VLIW binaries -> VLIW. Optimization level 0: no ILP.]

Page 18: Hardware-Accelerated Dynamic Binary Translation - Inria

First-Pass Translation

• Implemented as a Finite State Machine
• Translates each native instruction separately
• Produces 1 VLIW instruction per cycle
• 1 RISC-V instruction => up to 2 VLIW instructions
• Simple because the ISAs are similar

[Figure: the First-Pass Translation block maps a RISC-V instruction word (fields imm12, rs1, funct, rd, opcode) to a VLIW instruction word (fields imm13, rs1, rd, opcode).]
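To make the field-by-field nature of the first pass concrete, here is a minimal software sketch for the one-to-one case. The RISC-V I-type field positions are those of the base ISA. Everything on the VLIW side is an assumption for illustration: the field widths (imm13, 6-bit rs1 and rd, 7-bit opcode), the opcode numbers and the function names are hypothetical and are not taken from Hybrid-DBT.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    mask = 1 << (bits - 1)
    return (value & (mask - 1)) - (value & mask)


def decode_itype(insn):
    """Split a 32-bit RISC-V I-type word into its fields."""
    return {
        "opcode": insn & 0x7F,
        "rd": (insn >> 7) & 0x1F,
        "funct3": (insn >> 12) & 0x7,
        "rs1": (insn >> 15) & 0x1F,
        "imm": sign_extend(insn >> 20, 12),
    }


# Hypothetical VLIW opcodes, keyed by (RISC-V opcode, funct3).
VLIW_OPCODES = {
    (0x13, 0x0): 0x01,  # addi
    (0x13, 0x4): 0x02,  # xori
    (0x13, 0x6): 0x03,  # ori
    (0x13, 0x7): 0x04,  # andi
}


def encode_vliw(opcode, rd, rs1, imm):
    """Pack fields into the assumed layout: imm13 | rs1 | rd | opcode."""
    return ((imm & 0x1FFF) << 19) | (rs1 << 13) | (rd << 7) | opcode


def first_pass_translate(insn):
    """Translate one RISC-V I-type ALU instruction into one VLIW word."""
    f = decode_itype(insn)
    vliw_opcode = VLIW_OPCODES[(f["opcode"], f["funct3"])]
    return encode_vliw(vliw_opcode, f["rd"], f["rs1"], f["imm"])
```

For example, `first_pass_translate(0x00108213)` decodes `addi x4, x1, 1` and re-packs the same register numbers and immediate into the assumed VLIW layout. Instructions that expand into two VLIW instructions, which the slide mentions, are outside the scope of this sketch.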

Page 19: Hardware-Accelerated Dynamic Binary Translation - Inria

Focus on optimization level 1

• Critical to start exploiting VLIW capabilities

[Figure: RISC-V binaries -> First-Pass Translation and Insert Profiling -> VLIW binaries -> VLIW (optimization level 0: no ILP). VLIW binaries -> IR Builder -> IR -> IR Scheduler -> VLIW binaries (optimization level 1: ILP).]

Page 20: Hardware-Accelerated Dynamic Binary Translation - Inria

Goal of optimization level 1

Exploit available ILP:
• Compute dependencies
• Perform Instruction Scheduling

Native Binaries:
  stw r5,0(r3)
  ldw r3,0(r2)
  addi r4,r1,1
  sub r4,r4,r3
  stw r4,0(r3)
  movi r3,0

VLIW Binaries (one line per VLIW instruction):
  nop          | stw r5,0(r3)
  addi r4,r1,1 | ldw r3,0(r2)
  sub r4,r4,r3 | nop
  movi r3,0    | stw r4,0(r3)
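The "compute dependencies" step for the six-instruction example can be sketched as a single forward pass that tracks, per register, the last writer and the readers since that write. The sketch makes several assumptions: the first operand of stw is the stored value and the register in parentheses is the address, ldw writes its first operand, and memory accesses are ordered conservatively (no alias analysis). The names `Instr`, `BLOCK` and `build_dependencies` are hypothetical.

```python
from collections import namedtuple

# mem is None, "load" or "store"
Instr = namedtuple("Instr", "text writes reads mem")

BLOCK = [
    Instr("stw r5,0(r3)", (), ("r5", "r3"), "store"),
    Instr("ldw r3,0(r2)", ("r3",), ("r2",), "load"),
    Instr("addi r4,r1,1", ("r4",), ("r1",), None),
    Instr("sub r4,r4,r3", ("r4",), ("r4", "r3"), None),
    Instr("stw r4,0(r3)", (), ("r4", "r3"), "store"),
    Instr("movi r3,0", ("r3",), (), None),
]


def build_dependencies(block):
    """Return a set of (producer, consumer, kind) edges in one forward pass."""
    last_writer = {}        # register -> index of its most recent writer
    readers = {}            # register -> readers since that write
    last_store = None
    loads_since_store = []
    edges = set()
    for i, ins in enumerate(block):
        for reg in ins.reads:
            if reg in last_writer:
                edges.add((last_writer[reg], i, "RAW"))
        for reg in ins.writes:
            for r in readers.get(reg, []):
                edges.add((r, i, "WAR"))
            if reg in last_writer:
                edges.add((last_writer[reg], i, "WAW"))
        if ins.mem is not None and last_store is not None:
            edges.add((last_store, i, "MEM"))
        if ins.mem == "store":
            for load in loads_since_store:
                edges.add((load, i, "MEM"))
        # Update the tracking state only after all edges for i are recorded.
        for reg in ins.reads:
            readers.setdefault(reg, []).append(i)
        for reg in ins.writes:
            last_writer[reg] = i
            readers[reg] = []
        if ins.mem == "load":
            loads_since_store.append(i)
        elif ins.mem == "store":
            last_store = i
            loads_since_store = []
    return edges
```

Calling `build_dependencies(BLOCK)` shows, for instance, that `sub` depends on both `ldw` (through r3) and `addi` (through r4), while `addi` has no predecessor at all. That independence is the parallelism the scheduler can exploit.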

Page 21: Hardware-Accelerated Dynamic Binary Translation - Inria

Cost of optimization level 1

• IR Builder (400 cycle/instr): acceleration is simple
  • Generate high-level IR
  • Instruction decoding/encoding
  • Single FOR loop
  • Regular computations
• IR Scheduler (500 cycle/instr): acceleration is challenging
  • Instruction scheduling on the IR
  • Difficult to parallelize
  • Complex control flow structure

[Figure: VLIW binaries -> IR Builder -> IR -> IR Scheduler -> VLIW binaries]

Page 22: Hardware-Accelerated Dynamic Binary Translation - Inria

Cost of optimization level 1

• IR Builder (400 cycle/instr): generate high-level IR; acceleration is simple
• IR Scheduler (500 cycle/instr): instruction scheduling on the IR; acceleration is challenging

• Instruction Scheduling is the bottleneck
• IR is designed to speed up scheduling

[Figure: VLIW binaries -> IR Builder -> IR -> IR Scheduler -> VLIW binaries]

Page 23: Hardware-Accelerated Dynamic Binary Translation - Inria

Choosing an Intermediate Representation

[Figure: data-flow graph of the example block, with nodes 0 to 5 (st, ld, +, -, st, mv) and values labelled g1, g2, g3 and r, shown between the native listing and the VLIW listing from the "Goal of optimization level 1" slide.]

Page 24: Hardware-Accelerated Dynamic Binary Translation - Inria

Choosing an Intermediate Representation

[Figure: the data-flow graph of the example block (nodes 0 to 5: st, ld, +, -, st, mv) next to its encoding as an array of IR entries. Each entry holds the fields op, registers[4], nbSucc, nbDSucc, nbDep and succNames[8]; bit offsets 96, 64, 32 and 0 are marked.]

IR instructions of the example block:
  0 - st    @g3 = 0
  1 - ld    r1 = @g2
  2 - addi  g1 = g1 + 1
  3 - sub   r3 = r1 - g1
  4 - st    @g2 = r3
  5 - mov   r3 = 0

IR advantages:
• Direct access to dependencies and successors
• Regular structure (no pointers / variable size)
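The "regular structure" advantage can be illustrated with a fixed-size record. The sketch below assumes a 128-bit entry, as suggested by the bit offsets on the slide, and uses the field names from the slide. The width of each field (one byte each), the unused-slot marker, and the reading of nbSucc as the number of valid succNames entries are assumptions made for the example, not the actual Hybrid-DBT layout.

```python
import struct

# Assumed layout, one byte per field element, 16 bytes (128 bits) per entry:
# op | registers[4] | nbSucc | nbDSucc | nbDep | succNames[8]
ENTRY = struct.Struct(">B4BBBB8B")

UNUSED = 0xFF  # assumed marker for an unused register or successor slot


def pack_entry(op, registers, nb_succ, nb_dsucc, nb_dep, succ_names):
    """Encode one IR entry as exactly ENTRY.size bytes."""
    regs = list(registers) + [UNUSED] * (4 - len(registers))
    succs = list(succ_names) + [UNUSED] * (8 - len(succ_names))
    return ENTRY.pack(op, *regs, nb_succ, nb_dsucc, nb_dep, *succs)


def unpack_entry(raw):
    """Decode one IR entry back into named fields."""
    f = ENTRY.unpack(raw)
    return {
        "op": f[0],
        "registers": list(f[1:5]),
        "nbSucc": f[5],
        "nbDSucc": f[6],
        "nbDep": f[7],
        "succNames": list(f[8:16]),
    }


def successors(block, index):
    """Read the successors of entry `index` by offset, with no pointer chasing."""
    raw = block[index * ENTRY.size:(index + 1) * ENTRY.size]
    entry = unpack_entry(raw)
    return entry["succNames"][:entry["nbSucc"]]
```

Because every entry has the same size, entry `i` always starts at byte `i * ENTRY.size`. That constant stride is what lets a hardware scheduler fetch an instruction and its successor list with simple address arithmetic.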

Page 25: Hardware-Accelerated Dynamic Binary Translation - Inria

Details on hardware accelerators

• Developing such accelerators using VHDL is out of reach
• Accelerators are developed using High-Level Synthesis

• Loop unrolling/pipelining
• Memory partitioning
• Memory access factorization
• Explicit forwarding

[Figure: VLIW binaries -> IR Builder (one-pass dependency analysis) -> IR -> IR Scheduler (list-scheduling algorithm) -> VLIW binaries]

See paper for more details!
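As a software illustration of list scheduling (not the HLS accelerator itself), the sketch below fills one bundle per cycle with ready instructions. It assumes a two-slot machine with one ALU slot and one memory slot, uses program order as the priority, and expresses each dependency as a minimum issue distance (1 for true and memory dependencies, 0 when a later write only has to avoid preceding an earlier read). It also assumes every instruction needs a unit kind that exists in `slots`. All names and the dependency list for the example block are hypothetical and hand-written.

```python
def list_schedule(units, deps, slots=("alu", "mem")):
    """Greedy list scheduling.

    units[i] is the unit kind instruction i needs. deps holds
    (pred, succ, distance) triples: succ must issue at least `distance`
    cycles after pred. Returns one list per cycle; position s holds the
    instruction issued in slot s, or None for a nop.
    """
    n = len(units)
    preds = {i: [] for i in range(n)}
    for p, s, d in deps:
        preds[s].append((p, d))
    issue = {}      # instruction -> cycle in which it was issued
    bundles = []
    cycle = 0
    while len(issue) < n:
        bundle = [None] * len(slots)
        for i in range(n):      # program order doubles as priority
            if i in issue:
                continue
            ready = all(p in issue and issue[p] + d <= cycle
                        for p, d in preds[i])
            if not ready:
                continue
            for s, kind in enumerate(slots):
                if bundle[s] is None and kind == units[i]:
                    bundle[s] = i
                    issue[i] = cycle
                    break
        bundles.append(bundle)
        cycle += 1
    return bundles


# Example block: 0 stw, 1 ldw, 2 addi, 3 sub, 4 stw, 5 movi
EXAMPLE_UNITS = ["mem", "mem", "alu", "alu", "mem", "alu"]
EXAMPLE_DEPS = [
    (0, 1, 1),  # store before load (memory order)
    (1, 3, 1),  # ldw writes r3, sub reads it
    (2, 3, 1),  # addi writes r4, sub reads it
    (3, 4, 1),  # sub writes r4, stw reads it
    (1, 4, 1),  # ldw writes r3, stw uses it as the address
    (3, 5, 0),  # sub reads r3 before movi overwrites it
    (4, 5, 0),  # stw reads r3 before movi overwrites it
]
```

Under these assumptions `list_schedule(EXAMPLE_UNITS, EXAMPLE_DEPS)` returns `[[2, 0], [None, 1], [3, None], [5, 4]]`: four bundles, the same length as the VLIW listing on the "Goal of optimization level 1" slide, although this sketch issues addi one cycle earlier than that listing does.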


Page 27: Hardware-Accelerated Dynamic Binary Translation - Inria

Impact on translation overhead

[Figure: the translation flow annotated with software costs. Optimization level 0 (First-Pass Translation): 150 cycle/instr. Optimization level 1: IR Builder 400 cycle/instr, IR Scheduler 500 cycle/instr.]

• VLIW baseline is executed with ST200simVLIW
• Fully functional Hybrid-DBT platform on FPGA
  • JIT processor: Nios II
  • Altera DE2-115

Page 28: Hardware-Accelerated Dynamic Binary Translation - Inria

Impact on translation overhead

[Figure: bar chart with a cycle/instruction axis (0 to 5) and a speed-up vs. software DBT axis (0 to 200). Legend: First-Pass Translator, IR Builder, IR Scheduler.]

• Cost of optimization level 0 using the hardware accelerator

Page 29: Hardware-Accelerated Dynamic Binary Translation - Inria

Impact on translation overhead

[Figure: bar chart with a cycle/instruction axis (0 to 200) and a speed-up vs. software DBT axis (0 to 40). Legend: First-Pass Translator, IR Builder, IR Scheduler. Data labels: 14, 15, 13, 14, 13, 13, 13, 13, 12, 13 and 37, 144, 61, 132, 106, 112, 106, 116, 60, 105.]

• Cost of optimization level 1 using the hardware accelerator

Page 30: Hardware-Accelerated Dynamic Binary Translation - Inria

Impact on translation energy overhead

[Figure: the translation flow (First-Pass Translation at optimization level 0; IR Builder and IR Scheduler at optimization level 1) with the energy cost of each step shown as an open question ("?? J").]

• Hybrid-DBT platform on ASIC:
  • Compiled with Design Compiler for ASIC 65 nm
  • Design frequency: 250 MHz
  • Gate-level simulation with ModelSim
  • Accurate power estimation

Page 31: Hardware-Accelerated Dynamic Binary Translation - Inria

Impact on translation energy overhead

[Figure: bar chart of energy efficiency vs. software DBT (axis 0 to 400). Legend: First-Pass Translator, IR Builder, IR Scheduler.]

• Energy-efficiency improvement using the hardware accelerator

Page 32: Hardware-Accelerated Dynamic Binary Translation - Inria

Impact on translation energy overhead

[Figure: bar chart of energy efficiency vs. software DBT (axis 0 to 100). Legend: First-Pass Translator, IR Builder, IR Scheduler.]

• Energy-efficiency improvement using the hardware accelerator

Page 33: Hardware-Accelerated Dynamic Binary Translation - Inria

Impact on area/resource cost

[Figure: bar chart of resource usage in NAND equivalent gates (axis 0 to 25000), with an "Overhead from Hybrid-DBT" annotation.]

Component              NAND equivalent gates
VLIW                   19 220
DBT Processor           6 300
IR Scheduler            7 626
First-Pass Translator     779
IR Builder              5 019

• Resource usage for all our platform components

Page 34: Hardware-Accelerated Dynamic Binary Translation - Inria

Conclusion

• Presentation of Hybrid-DBT framework
  • Hardware accelerated DBT
• Open-source DBT framework RISC-V to VLIW
• Tested FPGA prototype
• Sources are available on GitHub: https://github.com/srokicki/HybridDBT

[Figure: Binaries (RISC-V) -> Dynamic Binary Translation with Hardware Accelerators -> VLIW Binaries -> VLIW]

Page 35: Hardware-Accelerated Dynamic Binary Translation - Inria

Future Work

• DBT to support hardware adaptability
• Exploring cost/impact of optimizations
• Comparison with existing RISC-V implementations (BOOM)

[Figure: Binaries (RISC-V) -> Dynamic Binary Translation with Hardware Accelerators -> several sets of VLIW Binaries -> VLIW]

Page 36: Hardware-Accelerated Dynamic Binary Translation - Inria

Questions?

https://github.com/srokicki/HybridDBT