Dynamic Binary Optimization – Part 1
2006. 9.25
Nam, E Hyun
Contents
  Overview
  Dynamic program behavior
  Profiling
  Optimizing translation blocks
Overview : Optimization
  Optimization
    Migration of VM consideration from compatibility to performance
  Goal
    To close the gap between a guest's emulated performance and native platform performance
  Type
    Translation block chaining
    Enlarging the translation block
    Reordering translated instructions
    Conventional compiler optimization techniques
  [Figure: translation example]
    Source (x86):
      addl %edx,4(%eax)
      movl 4(%eax),%edx
    Translation (PowerPC):
      addi r16,r4,4
      lwzx r17,r2,r16
      add  r7,r17,r7
      addi r16,r4,4
      stwx r7,r2,r16
    Same translation with the repeated addi removed:
      addi r16,r4,4
      lwzx r17,r2,r16
      add  r7,r17,r7
      stwx r7,r2,r16
Overview : Profile
  Profile
    Statistics regarding a program's behavior
    A guide for making optimization decisions
  A common optimization strategy is to use profiling to determine the paths that are predominantly followed by control flow
  Type of profile information
    Instructions (or basic blocks, BBs) that are more heavily executed
    Sequences in which BBs are most commonly executed
    Behavior of particular data variables and addresses
Overview : Profile
  Advantage of profile information
    Providing information that may not have been available when a program was originally compiled
  [Figure: the same three basic blocks shown in three versions]
    Original
      BB A:  ... ; R3 ← ... ; R7 ← ... ; R1 ← R2 + R3 ; Br L1 if R3==0
      BB B:  ... ; R6 ← R1 + R6 ; ...
      BB C:  L1: R1 ← 0 ; ...
    R1 ← R2 + R3 removed from BB A
      BB A:  ... ; R3 ← ... ; R7 ← ... ; Br L1 if R3==0
      BB B:  ... ; R6 ← R1 + R6 ; ...
      BB C:  L1: R1 ← 0 ; ...
    Same as the second version, plus compensation code
      Compensation code: R1 ← R2 + R3
Overview : BB rearrangement
  Definition
    A method of arranging BBs so that the predominant path has its instructions in consecutive memory locations
  Advantages
    Good locality
    Efficient instruction fetching
  Type
    Trace
    Superblock
    Tree group
  [Figure: original basic blocks and the superblock formed from them]
    Original
      BB A:  ... ; R3 ← ... ; R7 ← ... ; R1 ← R2 + R3 ; Br L1 if R3==0
      BB B:  ... ; R6 ← R1 + R6 ; ...
      BB C:  L1: R1 ← 0 ; ...
    Superblock
      ... ; R3 ← ... ; R7 ← ... ; Br L1 if R3!=0
      L1: R1 ← 0 ; ...
    Outside the superblock
      Compensation code: R1 ← R2 + R3
      BB B:  ... ; R6 ← R1 + R6 ; ...
Overview : Staged emulation
  Relation between emulation and optimization
    Optimization is tightly integrated with emulation
    Optimization is part of an emulation framework that supports staged emulation
  Staged emulation
    Based on the tradeoff between start-up time and steady-state performance
    Interpretation
    Binary translation
    Dynamic binary optimization
Overview : Staged emulation
  Stages of staged emulation
    Interpretation
    BB translation (e.g. chaining)
    Optimized translation (e.g. superblock)
    Highly optimized translation
  [Figure: components of the emulation framework: binary memory image, interpreter, translator/optimizer, emulation manager, BB cache, code cache, profile data]
Overview : Spectrum of emulation
  Interpret, basic translation, optimized blocks, highly optimized blocks
  Toward interpretation
    Fast startup
    Slow steady state
    Simple profiling
    Low overhead
  Toward highly optimized blocks
    Very slow startup
    Fast steady state
    Extensive profiling
    High overhead
Overview : Staged emulation strategy
  Strategy decision factors
    Source and target ISA
    Type of VM being implemented
    Design objective
    Tradeoff between the performance obtained from optimization and the overhead of optimization and profiling
  Example
    Original HP Dynamo system, Digital FX!32: interpretation, then optimized translated code
    DynamoRIO: simple binary translation, then optimization
    Shade: interpretation, then simple binary translation
Contents
  Overview
  Dynamic program behavior
  Profiling
  Optimizing translation blocks
Dynamic program behavior
  Goal
    Optimization depends on a program's structure and dynamic behavior
    By profiling, the optimization system can learn about the program's structure and dynamic behavior
  Important characteristics of programs
    High predictability of dynamic control flow
    Correlation of branch direction between the current and the most recent previous execution
  [Figure: histogram. x-axis: percent taken (0-10%, 10-20%, ..., 80-90%, >90%); y-axis: fraction of static conditional branches (0 to 50)]
  [Figure: bar chart. x-axis: benchmarks 176.gcc, 181.mcf, 197.parser, 252.eon, 256.bzip2, 171.swim, 173.applu, 177.mesa, 187.facerec, 189.lucas; y-axis: percent of dynamic branches decided the same as the previous time (0 to 100)]
Dynamic program behavior
  Important characteristics of programs
    A backward branch is typically taken
    Predictability of indirect jumps
      Switch statement
      Return from procedure call
    Predictability of data values
  [Figure: histogram. x-axis: number of different destinations (1 to 9, >9); y-axis: percent of indirect jumps (0 to 25)]
  [Figure: bar chart. x-axis: instruction type (All, Add/Sub, Load, Logic, Shift, Set); y-axis: fraction with constant value (0 to 0.7); series: static, dynamic]
Contents
  Overview
  Dynamic program behavior
  Profiling
    Overview
    Role
    Type
    Collecting the profile data
    Profile during interpretation
    Profiling translated code
    Overhead
  Optimizing translation blocks
Profiling : Role
  Definition
    The process of collecting instruction and data statistics for an executing program
  Usage
    Input to the code-optimization process
  Principle of profiling
    Predictability of programs
    Past behavior will often hold for future behavior
Profiling : Role
  Traditional profiling & optimization procedure
    Decompose the source program into a control flow graph
    Analyze the graph and insert probes to collect profile information
    Run the program with a typical data input
    Generate profile data
    Analyze the static profile log
    Generate optimized code
  Property
    Fully analyzed
    Optimal placement of probes
    Entire program run and complete profile
  [Figure: traditional profiling flow: HLL program, compiler frontend, control flow graph (nodes A to F), compiler backend, instrumented code, program execution on test data, program statistics, optimizing compiler, optimized binary]
Profiling : Role
  Difficulty, requirement and limitation in dynamic optimization
    Program structure is not known when a program begins
      Program structure must be discovered in an incremental way
      Inserting profiling probes in a globally optimal manner is difficult
    Optimization decisions must be made as early as possible
      Statistics come from a partial execution of the program
  [Figure: dynamic profiling flow: program binary and program data, interpreter, partially discovered control flow graph (nodes A, B, D, E), partial program statistics, translator/optimizer]
Profiling : Role
  Tradeoff between overhead and benefit
    Overhead: initial analysis + actual collection of profile data
    Benefit: execution time reduction due to optimization
  Static optimization
    Overhead is paid once
  Dynamic optimization
    Overhead is paid every time a guest program runs
    Benefits must outweigh the overhead
Profiling : Type of profile data
  Frequency of execution of different code regions
    Hotspot
    Interpretation vs. binary translation
  Profile data based on control flow (branch and jump) predictability
    Can be used to determine aspects of a program's dynamic execution behavior
    Used as a basis for gathering and rearranging BBs into larger units
    Used to guide specific optimizations
  Address
  Data
Profiling : Type of profile data
  Basics
    Nodes: BBs
    Edges: flow of control
  BB profile
    Numbers are counts of the corresponding BB's executions
  Edge profile
    The BB profile can be derived from the edge profile
  Path profile
    The path profile can be approximated with a heuristic based on the edge profile
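  [Figure: the same control flow graph of blocks A to F, drawn twice]
    BB profile (node counts): A(65), B(50), C(15), D(25), E(48), F(17)
    Edge profile (edge counts): 50, 12, 13, 2, 10, 15, 38, 48, 17, 15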
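
The claim that a BB profile can be derived from an edge profile can be sketched as a sum over incoming edges. The edge list below is an assumption: the figure layout did not survive, so the slide's ten edge counts are attached to edges in one way that happens to reproduce the slide's node counts. The names `struct edge`, `slide_edges` and `bb_counts_from_edges` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_BLOCKS = 6 };            /* blocks A..F are indices 0..5 */

struct edge {
    int      from;                  /* source basic block */
    int      to;                    /* destination basic block */
    unsigned count;                 /* times this edge was followed */
};

/* Assumed assignment of the slide's edge counts to edges. */
static const struct edge slide_edges[] = {
    {0, 1, 50}, {0, 2, 15},         /* A->B, A->C */
    {1, 3, 12}, {1, 4, 38},         /* B->D, B->E */
    {2, 3, 13}, {2, 5,  2},         /* C->D, C->F */
    {3, 4, 10}, {3, 5, 15},         /* D->E, D->F */
    {4, 0, 48}, {5, 0, 17},         /* E->A, F->A (assumed back edges) */
};

/* A block's count is the sum of the counts on its incoming edges.
 * The initial entry into the program is not an edge and is not counted. */
void bb_counts_from_edges(const struct edge *edges, size_t n_edges,
                          unsigned bb_count[NUM_BLOCKS])
{
    assert(edges != NULL && bb_count != NULL);
    for (int b = 0; b < NUM_BLOCKS; b++)
        bb_count[b] = 0;
    for (size_t i = 0; i < n_edges; i++) {
        assert(edges[i].to >= 0 && edges[i].to < NUM_BLOCKS);
        bb_count[edges[i].to] += edges[i].count;
    }
}
```

The reverse derivation is not possible in general: node counts alone do not say how a block's executions were split among its outgoing edges, which is why the edge profile is the richer of the two.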
Profile : collecting the profile
  Instrumentation-based profiling
    Targets program-related events
    Counts all instances of the event being profiled
    Many different events can be monitored simultaneously
    Monitoring method: HW, SW
  Sampling-based profiling
    Program runs in its unmodified form
    The program is interrupted and an instance of a program-related event is captured
  Tradeoff
    Instrumentation-based: slow, but collects a given amount of profile data over a much shorter period of time
    Sampling-based: fast, but requires a longer time to collect the same amount of profile information
Profile : collecting the profile
  Strategy
    The collection technique depends on the position in the emulation spectrum
    Interpretation
      SW instrumentation is about the only choice
    Optimizing binary translation, dynamic optimization system
      Instrumentation
    Already well-optimized, longer-running programs
      Sampling
Profile : profiling during interpretation
  Key points
    Source instructions are accessed as data
    Profiling code must be added to the interpreter routines
    Profiling is applied to a class of instructions rather than to a specific instruction
      E.g. backward branches
  Method
    BB profile
      Profile code should be added to all control transfer instructions, after the PC has been updated
    Edge profile
      Both the PC of the control transfer instruction and the target PC are used to define a specific edge
Profile : profiling during interpretation
  Profile table
    Access method
      BB profile: via the PC value of the control transfer destination
      Edge profile: via the PC values that define an edge
      Hash function
    Contents of an entry
      Basic block or edge count
      For a conditional branch, taken count and not-taken count
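
A minimal sketch of such a profile table, assuming a fixed-size open-addressing table keyed by the branch PC. The table size, the hash function, linear probing, and the names `profile_entry`, `profile_lookup` and `profile_branch` are all illustrative assumptions, not details given on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024u            /* hypothetical size; must be a power of two */

struct profile_entry {
    uint32_t pc;                    /* branch PC that owns this entry */
    uint32_t taken;                 /* taken count */
    uint32_t not_taken;             /* not-taken count */
    int      valid;                 /* nonzero once the entry is in use */
};

static struct profile_entry table[TABLE_SIZE];

/* Hypothetical hash: drop the two low bits of a 4-byte-aligned PC. */
static uint32_t hash_pc(uint32_t pc)
{
    return (pc >> 2) & (TABLE_SIZE - 1u);
}

/* Find the entry for pc, creating it on first use.
 * Collisions are resolved by linear probing; returns NULL if the table is full. */
struct profile_entry *profile_lookup(uint32_t pc)
{
    assert((TABLE_SIZE & (TABLE_SIZE - 1u)) == 0);
    uint32_t slot = hash_pc(pc);
    for (uint32_t probes = 0; probes < TABLE_SIZE; probes++) {
        struct profile_entry *e = &table[slot];
        if (!e->valid) {
            e->valid = 1;
            e->pc = pc;
            e->taken = 0;
            e->not_taken = 0;
            return e;
        }
        if (e->pc == pc)
            return e;
        slot = (slot + 1u) & (TABLE_SIZE - 1u);
    }
    return NULL;
}

/* Record one execution of the conditional branch at pc. */
void profile_branch(uint32_t pc, int was_taken)
{
    struct profile_entry *e = profile_lookup(pc);
    if (e == NULL)
        return;                     /* table full: drop the sample */
    if (was_taken)
        e->taken++;
    else
        e->not_taken++;
}
```

For an edge profile the key would combine the branch PC and the target PC instead of the branch PC alone; the rest of the structure stays the same.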
Profile : profiling during interpretation
  Instruction function list
    ..
    Branch_conditional(inst) {
        BO = extract(inst, 25, 5);
        BI = extract(inst, 20, 5);
        displacement = extract(inst, 15, 14) * 4;
        ..
        // code to compute whether branch should be taken
        ..
        profile_addr = lookup(PC);
        if (branch_taken) {
            profile_cnt(profile_addr, taken);
            PC = PC + displacement;
        } else {
            profile_cnt(profile_addr, nontaken);
            PC = PC + 4;
        }
    }
  [Figure: the PC is hashed to select a profile table entry holding the branch PC, the taken count and the not-taken count]
Profile : profiling during interpretation
  Profile count decaying
    Problem of the profile table
      A count field can overflow
    Solution
      Key points
        Optimization methods focus not on absolute counts but on relative frequency
        Recent program event history is more valuable than older history
      Decay process
        Periodically divide all the profile counts by 2
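
The decay step can be sketched as a pass that halves every counter. The decay interval, the saturating increment, and the names `profile_decay`, `profile_increment` and `profile_tick` are assumptions added for illustration; the slide only specifies the periodic division by 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DECAY_INTERVAL 4096u        /* hypothetical: profiled events between decays */

/* Halve every counter. Relative frequencies are preserved up to rounding,
 * and older history loses weight with every pass. */
void profile_decay(uint32_t *counts, size_t n)
{
    assert(counts != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        counts[i] >>= 1;
}

/* Saturating increment (an added safeguard): a counter never wraps
 * around to zero between two decay passes. */
void profile_increment(uint32_t *count)
{
    assert(count != NULL);
    if (*count != UINT32_MAX)
        (*count)++;
}

static uint32_t events_since_decay;

/* Call once per profiled event; triggers a decay every DECAY_INTERVAL events. */
void profile_tick(uint32_t *counts, size_t n)
{
    if (++events_since_decay >= DECAY_INTERVAL) {
        events_since_decay = 0;
        profile_decay(counts, n);
    }
}
```

A shift is enough here because only the ordering and rough ratio of counters matter to the optimizer, not their absolute values.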
Profile : profiling during interpretation
  Profiling jump instructions
    Difficulties of indirect jumps compared with conditional branches
      Switch statement: target changes frequently
      Return from procedure call: many target addresses
    Solution
      Key point
        Profile-driven optimizations of indirect jumps tend to focus on those jumps that very frequently have the same target
      Maintain a profile table with a small number of target addresses and track only the more recently used targets
Profile : profiling translated code
  Instrumenting individual instructions
    Each individual instruction can have its own custom profiling code
      Profiling can be selectively applied
      Profile counters can be assigned to each static instruction
    Profile counters can be directly addressed without hashing
    Profile code can be easily inserted and removed as needed
  [Figure: a translated basic block followed by two stubs]
    Fall-through stub
      Increment edge counter(i)
      If (counter(i) > trigger) invoke optimizer
      Else branch to fall-through BB
    Branch target stub
      Increment edge counter(j)
      If (counter(j) > trigger) invoke optimizer
      Else branch to target BB
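
The stub logic can be sketched in C as follows. In a real translator this would be a few inline target-ISA instructions rather than a function call; the trigger value, the two-edge counter array, and the names `edge_stub` and `next_action` are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_EDGES 2                 /* hypothetical: fall-through edge i and target edge j */
#define TRIGGER   50u               /* hypothetical hot threshold */

enum next_action { GO_TO_BLOCK, INVOKE_OPTIMIZER };

/* One counter per static edge, addressed directly by index: no hashing. */
static uint32_t edge_counter[NUM_EDGES];

/* Body of a stub: count the edge, then either continue to the successor
 * block or hand control to the optimizer once the edge is hot. */
enum next_action edge_stub(int edge)
{
    assert(edge >= 0 && edge < NUM_EDGES);
    edge_counter[edge]++;
    if (edge_counter[edge] > TRIGGER)
        return INVOKE_OPTIMIZER;
    return GO_TO_BLOCK;
}
```

Because each stub owns a fixed counter, removing the instrumentation later amounts to patching the stub into a plain branch.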
Profiling : Overhead
  Performance overhead
    Example
      To access the hash table: hash function + 1 load + 1 compare
      To increment the proper count: 1 load + 1 store + 1 add
    Profiling during interpretation vs. profiling translated code
      Absolute overhead vs. relative overhead
  Memory overhead
    Profile table
  Overhead reduction methods
    Reducing the number of instrumentation points
      Heuristic + using collected data
    Code duplication
      Attractive for same-ISA optimization (4.7)
Contents
  Overview
  Dynamic program behavior
  Profiling
  Optimizing translation blocks
    Overview
    Improving locality
    Traces
    Superblocks
    Dynamic superblock formation
    Tree group
Optimizing translation blocks : Overview
  Two strategies
    Improving locality
    Optimization on enlarged translation blocks
Optimizing translation blocks : Improving locality
  Locality
    Temporal
    Spatial
  Problem
    Cache space
    Performance
      Low instruction fetch bandwidth
  [Figure: control flow graph of blocks A to G annotated with edge counts, beside the original memory layout A, B, C, D, E, F, G. Branches shown in the layout: Br cond1 == true, Br cond2 == false, Br uncond, Br cond3 == true, Br uncond, Br cond4 == true. A cache line is shown holding E (Br uncond), F, F, F]
Optimizing translation blocks : Improving locality
  Rearrange the layout of the blocks in memory
    Conditional branch tests are reversed
    Unconditional branches are removed/added
    Instruction fetch efficiency is improved
  [Figure: rearranged memory layout (A, D, E, G, B, C, F) beside the original layout (A, B, C, D, E, F, G). In the rearranged layout the test on cond1 is reversed to "Br cond1 == false", and one "Br uncond" is marked as removed]
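Optimizing translation blocks : Improving locality
  Procedure inlining
  [Figure: original code with blocks A, "Call proc xyz", B, ..., K, "Call proc xyz", L, and procedure xyz made of blocks X, Y, Z and a return; after inlining, the procedure body (X, Y, Z) is copied in place of the call]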
Optimizing translation blocks : Improving locality
  Partial procedure inlining
    In a dynamic optimization system
  [Figure: same original code as for procedure inlining; after partial inlining only part of the procedure body is copied: X, Y in one copy and X, Z in the other]
Optimizing translation blocks : Improving locality
  Pros and cons of procedure inlining
    Pros
      Increases spatial locality
      Removes overhead
        Call and return instructions are removed
        Save/restore instructions are removed
    Cons
      Increases code size
      Increases register “pressure”
        Inlined code needs more registers than a procedure call
    Consequently, procedure inlining is typically used only for procedures that are very frequently called and very small
Optimizing translation blocks
  Three ways of rearranging basic blocks according to control flow
    Trace formation
    Superblock formation
      Most widely used in VM implementations
    Tree group
      Useful when control flow is difficult to predict
      Provides a wider scope for optimization
Optimizing translation blocks : Traces
  Traces
    Chunks of contiguous instructions containing multiple BBs
    Traces > Superblock
  Static trace formation steps
    1. Collect a profile using test data
    2. Begin with a start point
         The most frequently executed BB that is not already part of a trace
    3. Collect BBs along the most common control path until a stopping condition is met
         A block already belonging to another trace is reached
         A procedure call/return boundary is reached
    4. Collect the BBs into a trace
         Reverse branch tests, removing/adding unconditional branches
    5. Stop if no BB remains outside a trace; otherwise go to step 2
  In a dynamic environment, traces are not commonly used as translation blocks
Optimizing translation blocks : Traces
  [Figure: the control flow graph of blocks A to G, annotated with edge counts and divided into Trace1, Trace2 and Trace3, beside the rearranged memory layout (A, D, E, G, B, C, F) with the note "Br uncond is removed"]
Optimizing translation blocks : Superblocks
  Superblocks vs. traces
    Side entrances: a trace may have them, a superblock may not
  Problems in forming superblocks
    Many small superblocks
    Too small to provide many opportunities for optimization
  Tail duplication
    The process of replicating code that appears at the end of a superblock in order to form other superblocks
Optimizing translation blocks : Superblocks
  [Figure: the original control flow graph of blocks A to G with edge counts, beside the same graph after tail duplication, in which block G appears three times]
Optimizing translation blocks : Dynamic superblock formation : Overview
  Dynamic
    Formed incrementally as the source code is being emulated
  Complication
    BB replication leads to more choices
  Key questions
    Starting point
    Continuation
    Stopping point
Optimizing translation blocks : Dynamic superblock formation : starting point
  Heavily used block
    Found by using profile information
  Methods for determining profile points
    All basic blocks
    Heuristics
      Targets of backward branches are candidate starting points
      Exit arcs from an existing superblock
  Start threshold
    When a profiled BB's execution frequency reaches this value, a new superblock is started
    Depends on the emulation tradeoff
    A few tens to hundreds of executions is typical
Optimizing translation blocks : Dynamic superblock formation : Continuation
  Continuation
    Which subsequent blocks should be collected and added as the superblock is grown
  Most frequently used approach
    Node profile information is used to identify the most likely successor BB
    Continuation threshold
      A relatively complete set of profile data must be collected for all BBs
      Typically half of the start threshold
    Continuation set
      At the time superblock formation is to begin, the set of all BBs that have reached the continuation threshold is collected
Optimizing translation blocks : Dynamic superblock formation : Continuation
  Most frequently used procedure
    1. The start threshold is reached
    2. Collect the continuation set
    3. Build a superblock from the hottest BB, following control flow edges and including only BBs in the continuation set
    4. The superblock is completed
    5. Take the hottest remaining BB as a new start point and repeat from step 3 until all blocks in the continuation set are exhausted
    6. The emulation process resumes with profiling until another BB reaches the start threshold
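
The procedure above can be sketched on a small control flow graph. Several things here are assumptions: the graph and its counts in `example` are illustrative values, not read from the slide's figure; each block is limited to two successors; and, as a simplification, a block is consumed by the first superblock that takes it, so no tail duplication is performed. The names `cfg`, `grow_superblock` and `form_superblocks` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 7, MAX_SUCC = 2, NONE = -1 };    /* blocks A..G are indices 0..6 */

struct cfg {
    int      succ[N][MAX_SUCC];         /* successor block or NONE */
    unsigned edge_count[N][MAX_SUCC];   /* profile count of each edge */
    unsigned bb_count[N];               /* profile count of each block */
};

/* Illustrative graph: A->B,D  B->C,F  C->G  D->E,F  E->G  F->G  G->A. */
static const struct cfg example = {
    .succ       = {{1, 3}, {2, 5}, {6, NONE}, {4, 5}, {6, NONE}, {6, NONE}, {0, NONE}},
    .edge_count = {{30, 70}, {29, 1}, {29, 0}, {68, 2}, {68, 0}, {3, 0}, {97, 0}},
    .bb_count   = {100, 30, 29, 70, 68, 3, 100},
};

/* Grow one superblock from start, following the most frequent outgoing edge
 * for as long as the next block is still in the continuation set. */
static int grow_superblock(const struct cfg *g, int start, int in_set[N], int out[N])
{
    int len = 0;
    int cur = start;
    while (cur != NONE && in_set[cur]) {
        in_set[cur] = 0;                /* consume the block */
        out[len++] = cur;
        int best = NONE;
        unsigned best_count = 0;
        for (int s = 0; s < MAX_SUCC; s++) {
            if (g->succ[cur][s] != NONE && g->edge_count[cur][s] > best_count) {
                best = g->succ[cur][s];
                best_count = g->edge_count[cur][s];
            }
        }
        cur = best;
    }
    return len;
}

/* Run once a block has reached the start threshold. sb_of[b] receives the
 * superblock number of block b, or NONE. Returns the number of superblocks. */
int form_superblocks(const struct cfg *g, unsigned continuation_threshold, int sb_of[N])
{
    assert(g != NULL && sb_of != NULL);
    int in_set[N], out[N];
    int formed = 0;
    for (int b = 0; b < N; b++) {
        in_set[b] = g->bb_count[b] >= continuation_threshold;
        sb_of[b] = NONE;
    }
    for (;;) {
        int hottest = NONE;
        for (int b = 0; b < N; b++)
            if (in_set[b] && (hottest == NONE || g->bb_count[b] > g->bb_count[hottest]))
                hottest = b;
        if (hottest == NONE)
            break;                      /* continuation set exhausted */
        int len = grow_superblock(g, hottest, in_set, out);
        for (int i = 0; i < len; i++)
            sb_of[out[i]] = formed;
        formed++;
    }
    return formed;
}
```

With a continuation threshold of 50 on this graph, only A, D, E and G qualify and they form a single superblock; lowering the threshold admits B and C, which then form a second one. A fuller version would replicate shared tail blocks such as G instead of stopping at them.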
Optimizing translation blocks : Dynamic superblock formation : Continuation
  Most recently used approach
    Uses edge profile information
    Algorithm
      Assumption: the very next sequence of blocks following a start point is also likely to be a common path
      Simply follows the actual dynamic control flow path one edge at a time
    Advantage
      Only candidate start points need to be profiled
      No need to use profiling for continuation blocks
      Profile overhead is substantially reduced
Optimizing translation blocks : Dynamic superblock formation : stopping point
  Types of heuristics to determine the stop condition
    The start point of the same superblock is reached
    A start point of some other superblock is reached
    A superblock has reached some maximum length
      A BB can be used in more than one superblock
      There may be multiple copies of a given BB
      Explosion of code size
    When using the most frequently used heuristic, there are no more candidate BBs that have reached the candidate threshold
    An indirect jump is reached, or there is a procedure call
Optimizing translation blocks : Dynamic superblock formation : Example
  Most frequently used
    Start point threshold: 100
    Continuation threshold: 50
  [Figure: control flow graph of blocks A to G annotated with edge counts]
Optimizing translation blocks : Dynamic superblock formation : Example
  Most recently used
    The profile point is just A, because A is the target of a backward branch
    Most likely: ADEG, BCG, FG
    However, there is about a 30% chance of: ABCG, DEG, FG
    There are cases where the most recently executed method may not select superblocks quite as well as the most frequently executed method
    Start point threshold: 100
    Continuation threshold: 50
  [Figure: control flow graph of blocks A to G annotated with edge counts]
Optimizing translation blocks : Tree group
  Background
    Problems when applying superblocks to branches that tend to split their decisions almost evenly
      Side exits are frequently taken, causing compensation code overhead
      Optimizations are typically not done along the side exits, losing performance improvement opportunities
  Traces, superblocks vs. tree group
    Tree group
      Conditional branch outcomes are more evenly balanced
      Generalization of superblock
      Multiple flows of control
    Superblocks
      Conditional branches are predominantly decided one way
      Single flow of control
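Optimizing translation blocks : Tree group
  [Figure: tree group formed from the control flow graph of blocks A to G: a tree rooted at A containing B, C, D, E, F and three copies of G, annotated with edge counts]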