Fast Dynamic Binary Translation for the Kernel
Piyus Kedia and Sorav Bansal, IIT Delhi
Applications of Dynamic Binary Translation (DBT)
• OS Virtualization
• Testing and Verification of Compiled Programs
• Profiling and Debugging
• Software Fault Isolation
• Dynamic Optimizations
• Program Shepherding
• … and more
A Short Introduction to Dynamic Binary Translation (DBT)
[Figure: basic DBT loop. Start → Dispatcher → Translate Block → Execute Block. The translated block is native code and terminates with a branch back to the dispatcher.]
Code Cache
[Figure: DBT loop with a code cache. Start → Dispatcher → cached? If no: Translate Block, then store the native code in the code cache. If yes: execute from the code cache.]
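The loop in the figure can be sketched as a toy model. This is only an illustration under stated assumptions: the guest program is a made-up dictionary of blocks, "translation" just copies the block, and every name (GUEST_PROGRAM, translate_block, run) is hypothetical rather than part of any real translator.

```python
# Toy model of the DBT dispatch loop. A "guest program" maps a block's
# start PC to (instructions, next PC). All names are hypothetical.
GUEST_PROGRAM = {
    0x100: (["push", "add"], 0x200),
    0x200: (["load", "sub"], 0x100),
}


def translate_block(pc):
    """Pretend to translate the guest block at `pc` into native code."""
    instructions, next_pc = GUEST_PROGRAM[pc]
    # A real translator would emit machine code; here we copy the block and
    # record where control goes when the block branches back to the dispatcher.
    return {"code": list(instructions), "next_pc": next_pc}


def run(start_pc, max_blocks):
    """Dispatch up to `max_blocks` blocks; return the trace and translation count."""
    code_cache = {}                      # guest PC -> translated block
    translations = 0
    executed = []
    pc = start_pc
    for _ in range(max_blocks):
        if pc is None:
            break
        block = code_cache.get(pc)       # "cached?"
        if block is None:                # no: translate and store in code cache
            block = translate_block(pc)
            code_cache[pc] = block
            translations += 1
        executed.extend(block["code"])   # yes: execute from code cache
        pc = block["next_pc"]            # block ends with a branch to the dispatcher
    return executed, translations


trace, translations = run(0x100, max_blocks=4)
```

Four blocks are dispatched here but only two are translated, because the second visit to each block hits the code cache.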
DBT Overheads
• User-level DBT is well understood
  • Near-native performance for application-level workloads
• DBT for the kernel requires more mechanisms
  • Efficiently handling exceptions and interrupts
  • Case studies:
    • VMware’s Software Virtualization
    • DynamoRIO-Kernel (DRK) [ASPLOS ’12]
Interposition on Starting (Entry) Points
[Figure: the same dispatcher and code-cache loop, now entered through the Interrupt Descriptor Table (IDT). The IDT now points to the dispatcher.]
What does the dispatcher do?
Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on the stack to native values (e.g., PC)
[Figure: guest stack before and after conversion. The interrupt frame at SP holds Flags, the CS register and the PC; the dispatcher replaces the saved PC with the native PC.]
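The conversion of the saved PC can be sketched as a table lookup. This is a minimal sketch, assuming the translator records, for each translated instruction, its code-cache address and its native address; the addresses, the frame layout as a dictionary, and the names ADDRESS_MAP, to_native_pc and fix_interrupt_frame are all hypothetical.

```python
import bisect

# Hypothetical map recorded at translation time: one entry per instruction,
# (code-cache address of its translation, native address of the original).
ADDRESS_MAP = sorted([
    (0x9000, 0x1000),
    (0x9004, 0x1002),
    (0x900A, 0x1005),
])
CACHE_ADDRS = [cache for cache, _ in ADDRESS_MAP]


def to_native_pc(cache_pc):
    """Map a code-cache PC to the native PC of the instruction that contains it."""
    i = bisect.bisect_right(CACHE_ADDRS, cache_pc) - 1
    if i < 0:
        raise ValueError("address lies below the code cache")
    return ADDRESS_MAP[i][1]


def fix_interrupt_frame(frame):
    """Return a copy of the saved frame with its PC converted to a native value."""
    fixed = dict(frame)
    fixed["pc"] = to_native_pc(frame["pc"])
    return fixed


frame = {"flags": 0x202, "cs": 0x10, "pc": 0x9004}
native_frame = fix_interrupt_frame(frame)
```

A PC that falls in the middle of a translation (for example 0x9006 above) maps to the native instruction whose translation contains it, which is the starting point for any rollback.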
Precise Exceptions
Before an exception handler executes, all instructions up to the faulting instruction must have executed, and no later instruction may have executed.
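[Figure: Precise Exceptions. For the instruction sequence push, store, add, sub, load, mov, pop, the instructions before the faulting one are marked Executed, and the exception handler executes at that point.]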
What does the dispatcher do?
Before transferring control to the code cache, the dispatcher:
1. Converts interrupt state on the stack to native values (e.g., PC)
2. Emulates precise exceptions
   • Rolls back partially executed translations
3. Emulates precise interrupts
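Precise Interrupts
[Figure: for the instruction sequence push, store, add, sub, load, mov, pop, the instructions before the interrupt point are marked Executed, and the interrupt handler executes at an instruction boundary.]
• Delays interrupt delivery until the start of the next native instruction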
Effect on Performance
Applications with high interrupt and exception activity exhibit large DBT overheads
VMware’s Software Virtualization Overheads
Percentage overhead over native:

Category           Benchmark        Overhead (%)
benchmarks         SpecInt          2.9
benchmarks         kernel-compile   27.11
benchmarks         apache           123.48
micro-benchmarks   2D-graphics      57.81
micro-benchmarks   large-RAM        91.68
micro-benchmarks   forkwait         603.44
nano-benchmarks    divzero          262.54
nano-benchmarks    syscall          853.72

Data from “Comparison of Software and Hardware Techniques for x86 Virtualization”, K. Adams, O. Agesen, VMware, ASPLOS 2006.
DynamoRIO-Kernel (DRK) Overheads
Percentage overhead over native:

Benchmark      DRK overhead (%)
fileserver     212.3
webserver      351.85
webproxy       325.37
varmail        44.44
apachebench    184.13

Data from “Comprehensive Kernel Instrumentation via Dynamic Binary Translation”, P. Feiner, A.D. Brown, A. Goel, U. Toronto, ASPLOS 2012.
DRK vs BTKernel
Percentage overhead over native:

Benchmark      DRK (%)    BTKernel (%)
fileserver     212.3      0.36
webserver      351.85     2.19
webproxy       325.37     2.44
varmail        44.44      10.6
apachebench    184.13     0.42
Fully Transparent Execution is not required
• The OS kernel rarely relies on precise exceptions
• The OS kernel rarely relies on precise interrupts
• The OS kernel seldom inspects the PC address pushed on the stack. It is used only when returning from the interrupt with the iret instruction.
Faster Execution is Possible
• Leave code cache addresses in kernel stacks.
• An interrupt/exception directly jumps into the code cache, bypassing the dispatcher.
• Allow imprecise interrupts and exceptions.
• Handle special cases specially.
IDT now points to the code cache
[Figure: the dispatcher and code-cache loop as before, but the Interrupt Descriptor Table entries now lead directly into the code cache instead of the dispatcher.]
Correctness Concerns
1. Read / Write of the interrupted PC address on the stack will return incorrect values.
   • Fortunately, this is rare in practice and can be handled specially
Read of an Interrupted PC Address
[Figure: guest stack whose interrupt frame at SP holds Flags, the CS register and the translated PC; a load reads the address of the saved PC slot.]
Examples:
1. Exception tables in the Linux page fault handler
Exception Tables in Linux
• Page faults are allowed in certain functions
  • e.g., copy_from_user(), copy_to_user().
• An exception table is constructed at compile time
  • It contains the range of PC addresses that are allowed to page fault.
• At runtime, the faulting PC value is compared against the exception table
  • Panic only if the PC is not present in the exception table
Problem: The faulting PC value is now a code-cache address.
Solution: The dispatcher adds potentially faulting code-cache addresses to the exception table.
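The problem and its fix can be illustrated with a simplified model. This is a sketch under assumptions, not the Linux data structure: the exception table is modelled as a dictionary from a PC that may fault to a fixup address, and the hook on_translate and all addresses are hypothetical. How the fixup target itself is translated is outside this sketch.

```python
# Simplified model: the exception table maps a PC that is allowed to fault
# to the address of its fixup code. Addresses and names are made up.
exception_table = {0x1010: 0x2000}   # native faulting PC -> fixup PC


def on_translate(native_pc, cache_pc):
    """Hypothetical translator hook, called for each translated instruction."""
    if native_pc in exception_table:
        # Register the code-cache copy with the same fixup as the original.
        exception_table[cache_pc] = exception_table[native_pc]


def page_fault_handler(faulting_pc):
    """Return the fixup address, or "panic" if the PC may not fault."""
    fixup = exception_table.get(faulting_pc)
    if fixup is None:
        return "panic"
    return fixup


before = page_fault_handler(0x9010)   # code-cache PC not registered yet
on_translate(0x1010, 0x9010)          # translator registers the translated PC
after = page_fault_handler(0x9010)
```

Before registration the handler sees an unknown PC and would panic; afterwards the code-cache address resolves to the same fixup as the native one.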
Read of an Interrupted PC Address
Examples:
1. Exception tables in Linux
2. MS Windows NT Structured Exception Handling: __try / __except constructs in C/C++
__try / __except blocks in MS Windows NT
Syntax:
    __try { <potentially faulting code> } __except { <fault handler> }
Example usage:
    __try { copy_from_user(); } __except { signal_process(); }
Also implemented using exception tables in the Windows kernel
More examples in the paper.
In our experience, all such cases can be handled nicely!
Correctness Concerns
1. Read / Write of the faulting PC address on the stack will return incorrect values.
2. Code-cache addresses will now live in kernel stacks.
   • What if code-cache addresses become invalid?
Code Cache Addresses can now live in Kernel Data Structures
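[Figure: Thread 1's stack holds an interrupt frame (Flags, CS register, translated PC) whose saved PC points into the code cache. After a context switch to Thread 2, that code-cache address remains on Thread 1's stack.]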
Code Cache Addresses can now live in Kernel Data Structures
• Disallow cache replacement
  • A code cache of around 10 MB suffices for Linux
• Do not move or modify code cache blocks once they are created
  • Ensures that a code cache address remains valid for the execution lifetime
• If the code cache gets full, switch the translator off and back on
  • Switchoff is implemented by reverting to the original IDT and other entry points.
  • This effectively flushes the code cache and starts afresh
Dynamic Switchon / Switchoff
• Replace all entry points with shadow / original values
  • e.g., for switchoff, replace the shadow interrupt descriptor table with the original
• Iterate over the kernel’s list of threads
  • Identify PC values in thread stacks and convert them to code cache / native values
• Translator reboot (switchoff followed by switchon) flushes the code cache
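The switchoff steps above can be sketched as follows. This is a toy model with several assumptions: threads are dictionaries, the list saved_pcs stands in for the PC slots already identified on each thread's stack (identifying them is the hard part and is not shown), and the names switch_off and cache_to_native are hypothetical.

```python
def switch_off(threads, idt, original_idt, cache_to_native):
    """Revert entry points and rewrite saved code-cache PCs to native PCs."""
    # Step 1: replace the shadow entry points with the original ones.
    idt.clear()
    idt.update(original_idt)
    # Step 2: walk every thread and convert saved code-cache PCs to native PCs.
    for thread in threads:
        thread["saved_pcs"] = [
            cache_to_native.get(pc, pc)   # values outside the code cache are left alone
            for pc in thread["saved_pcs"]
        ]
    # Step 3: no stack refers to the code cache any more, so it can be flushed.
    cache_to_native.clear()


cache_to_native = {0x9000: 0x1000, 0x9004: 0x1002}
threads = [{"saved_pcs": [0x9004, 0x5000]}, {"saved_pcs": [0x9000]}]
idt = {14: 0xA000}                        # shadow entry for the page fault vector
switch_off(threads, idt, {14: 0x3000}, cache_to_native)
```

Switchon would be the mirror image: install the shadow entry points and let translation repopulate the code cache from native PCs.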
Correctness Concerns
1. Read / Write of the faulting PC address on the stack will return incorrect values.
2. Code-cache addresses will now live in kernel stacks. What if code-cache addresses become invalid?
3. Imprecise Interrupts and Exceptions.
Imprecise Exceptions and Interrupts
Interestingly, an OS kernel rarely depends on precise exceptions and interrupts.
Reentrancy and Concurrency
Direct entries into the code cache introduce new reentrancy and concurrency issues
Detailed discussion in the paper.
Optimizations that worked
• L1 cache-aware Code Cache Layout
• Function call/return optimization
Code Cache Layout for Direct Branch Chaining
[Figure: blocks in the Code Cache reach the Dispatcher through edge code held in a separate Edge Cache. Edge code is the code for branching to the dispatcher.]
Edge code:
• is executed only once, on the first execution of the block.
• However, it shares the same cache lines as all other code.
Allocate edge code from a separate memory pool for better cache locality.
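The effect of a separate pool can be illustrated by counting the cache lines that hot block bodies occupy under each layout. This is a sketch with assumed numbers: 64-byte cache lines, four blocks with 40-byte bodies and 24-byte edge stubs, and a simple bump allocator; BumpPool, layout and lines_touched are hypothetical names.

```python
CACHE_LINE = 64  # bytes; an assumed line size


class BumpPool:
    """Minimal bump allocator over an address range."""

    def __init__(self, base, size):
        self.next = base
        self.limit = base + size

    def alloc(self, nbytes):
        if self.next + nbytes > self.limit:
            raise MemoryError("pool full")
        addr = self.next
        self.next += nbytes
        return addr


def layout(blocks, code_pool, edge_pool):
    """Allocate each block; return the (start, size) ranges of the hot bodies."""
    hot = []
    for body_size, edge_size in blocks:
        hot.append((code_pool.alloc(body_size), body_size))
        edge_pool.alloc(edge_size)   # edge code: used only on first execution
    return hot


def lines_touched(ranges):
    """Count the distinct cache lines covered by the given byte ranges."""
    lines = set()
    for start, size in ranges:
        lines.update(range(start // CACHE_LINE, (start + size - 1) // CACHE_LINE + 1))
    return len(lines)


blocks = [(40, 24)] * 4

# Layout A: bodies and edge code interleaved in one pool.
shared = BumpPool(0x100000, 4096)
interleaved_lines = lines_touched(layout(blocks, shared, shared))

# Layout B: edge code allocated from its own pool.
separate_lines = lines_touched(
    layout(blocks, BumpPool(0x100000, 4096), BumpPool(0x200000, 4096))
)
```

With these made-up sizes the hot code spans four lines when interleaved and three when the edge code is moved out, since the bodies pack contiguously.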
Function call/return optimization
Use identity translations for ‘call’ and ‘ret’ instructions instead of treating ‘ret’ as another indirect branch.
Involves careful handling of call instructions with indirect targets (discussed in the paper)
Experiments
• BTKernel Performance vs. Native
• BTKernel Statistics
• Experience with some applications
Apache: 1, 2, 4, 8 and 12 processors
[Figure: throughput (MBps) against number of processors (1, 2, 4, 8, 12) for Native, BTKernel and BTKernel-no-callret. Higher is better.]
Fileserver: 1, 4, 8 and 12 processors
[Figure: throughput (ops/s) against number of processors (1, 4, 8, 12) for Native, BTKernel and BTKernel-no-callret. Higher is better.]
lmbench fork operations
[Figure: time (microseconds) for the lmbench microbenchmarks execve, exit and sh, for Native, BTKernel and BTKernel-no-callret. Lower is better.]
Number of Dispatcher Exits

               Without call/ret optimization      With call/ret optimization
               Instructions   Dispatcher Exits    Instructions   Dispatcher Exits
Apache         56 b           7 m                 59 b           125
Linux Build    570 b          72 m                590 b          33059

m = million, b = billion
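To compare the rows on a common scale, the counts can be normalized to dispatcher exits per billion instructions. The figures below are copied from the table; the helper name exits_per_billion is hypothetical.

```python
# (instructions, dispatcher exits) taken from the table above.
rows = {
    "Apache": {"without": (56e9, 7e6), "with": (59e9, 125)},
    "Linux Build": {"without": (570e9, 72e6), "with": (590e9, 33059)},
}


def exits_per_billion(instructions, exits):
    """Dispatcher exits per one billion executed instructions."""
    return exits / (instructions / 1e9)


rates = {
    name: {variant: exits_per_billion(*counts) for variant, counts in row.items()}
    for name, row in rows.items()
}
```

For Apache this works out to 125,000 exits per billion instructions without the call/ret optimization and roughly 2 with it; for the Linux build, roughly 126,000 versus roughly 56.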
Applications
• We implemented Shadow Memory for a Linux guest
  • Identifies the CPU-private (read/write) and CPU-shared (read/write) bytes in the kernel address space
• Overheads range from 20% - 300%
• Significant improvement over the 10x overheads reported in DRK
Summary and Conclusion
• Avoid back-and-forth translation between native and translated values of the interrupted PC
• Relax precision requirements on exceptions and interrupts
• Use cache-aware layout for the code cache
• Use identity translations for the function call/ret instructions
Near-native performance DBT implementation for unmodified Linux
Availability: https://github.com/piyus/btkernel
Thank You.