1
Fast and Efficient Partial Code Reordering
Xianglong Huang (UT Austin, Adverplex), Stephen M. Blackburn (Intel), David Grove (IBM), Kathryn McKinley (UT Austin)
2
Software Trends
• By 2008, 80% of software will be written in Java or C#. [Gartner report]
• Java and C# are coming to your OS soon: JNode, Singularity
• Advantages of modern programming languages:
  – Productivity, security, reliability…
• Performance?
3
Hardware Trends
[Figure: CPU vs. DRAM performance, 1980 to 2005. CPU performance doubles every 1.5 years; DRAM performance doubles every 10 years. The processor-memory performance gap grows about 50% per year; caches bridge it.]
4
Fully Associative IL1
[Figure: Improvement over the base case with a fully associative IL1, per benchmark (antlr, bloat, fop, jython, pmd, xalan, _201_compress, _202_jess, _209_db, _213_javac, _222_mpegaudio, _228_jack, pseudojbb) and geometric mean. Improvements range from about -2% to 18%, showing the improvement potential.]
Base case: Jikes RVM default with a separate code space. Cache configuration: 32K IL1 direct-mapped, 512K L2 (small programs on a big cache).
5
New and Better Opportunities
• Virtual machine monitors application behavior at runtime
• Dynamic recompilation
  – With dynamic feedback
  – Allocates instructions at runtime
6
Previous Work on Instruction Locality
• Static schemes
  – Profile-guided call correlation: reorder code at compile and link time [Pettis and Hansen 90]
  – Cache coloring [Hashemi et al. 97]
  – Profile-guided procedure interleaving [Gloy et al. 99]
  – Static schemes are not flexible
• Dynamic scheme
  – JIT code reordering [Chen et al. 97]
  – Used as our base case
7
Optimizations in Virtual Machine
• Static instruction allocation used at runtime
  – e.g., just-in-time compilation
  – Invocation order
[Diagram: Runtime with Compiler and Memory Manager; static optimizations.]
8
Optimizations in Virtual Machine
• Dynamic instruction allocation/reordering adapts to program behavior with low overhead
[Diagram: Runtime with Compiler and Memory Manager; static optimizations.]
9
Opportunity for Instruction Locality
• Dynamic detection of hot methods, hot basic blocks
• Dynamic recompilation relocates methods at runtime
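The bullets above can be sketched in code. Below is a minimal, hypothetical model of sampling-based hot-method detection in the spirit of an adaptive optimization system; the class name, threshold, and API are illustrative assumptions, not the actual Jikes RVM interfaces.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a timer-based sampler counts which method is on
// top of the stack at each tick; methods whose count crosses a threshold
// are declared hot and handed to the optimizing compiler.
public class AdaptiveSampler {
    private final Map<String, Integer> sampleCounts = new HashMap<>();
    private final int hotThreshold;  // illustrative tuning parameter

    public AdaptiveSampler(int hotThreshold) {
        this.hotThreshold = hotThreshold;
    }

    /** Called on each timer tick with the currently executing method. */
    public void sample(String methodName) {
        sampleCounts.merge(methodName, 1, Integer::sum);
    }

    /** A method is "hot" once its sample count reaches the threshold. */
    public boolean isHot(String methodName) {
        return sampleCounts.getOrDefault(methodName, 0) >= hotThreshold;
    }

    public static void main(String[] args) {
        AdaptiveSampler sampler = new AdaptiveSampler(3);
        for (int i = 0; i < 5; i++) sampler.sample("Foo.bar");
        sampler.sample("Foo.baz");
        System.out.println(sampler.isHot("Foo.bar")); // true: 5 >= 3
        System.out.println(sampler.isHot("Foo.baz")); // false: 1 < 3
    }
}
```

Sampling keeps profiling overhead low because only a tiny fraction of executed instructions trigger any bookkeeping, which is what makes runtime relocation of methods affordable.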
10
PCR Optimizations
• Reduce instruction capacity misses
  – Code space
  – Method separation
  – Code splitting
• Reduce instruction conflict misses
  – Code padding
11
PCR System
[Diagram: PCR system. Source code → Baseline Compiler → baseline methods in the executing code; the Adaptive Sampler identifies hot methods, which the Optimizing Compiler recompiles into optimized methods. Legend: Jikes RVM component, input/output, data.]
12
PCR System: Method Separation
[Diagram: Method separation. Hot methods (optimized code) and cold methods (baseline code) are placed in separate code regions, apart from data.]
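A minimal sketch of the separation idea: optimized (hot) and baseline (cold) code are bump-allocated into disjoint regions so hot code stays contiguous. The class and region model are illustrative assumptions, not the Jikes RVM allocator.

```java
// Hypothetical sketch of method separation: two bump-pointer code
// regions, one for optimized (hot) methods and one for baseline (cold)
// methods, so hot code packs densely into few cache lines.
public class CodeSpaces {
    private int hotCursor = 0;   // next free offset in the hot region
    private int coldCursor = 0;  // next free offset in the cold region

    /** Place a method's code; returns its offset within its region. */
    public int allocate(int codeSize, boolean optimized) {
        if (optimized) {
            int at = hotCursor;
            hotCursor += codeSize;
            return at;
        } else {
            int at = coldCursor;
            coldCursor += codeSize;
            return at;
        }
    }

    public int hotBytes()  { return hotCursor; }
    public int coldBytes() { return coldCursor; }
}
```

Because only hot methods occupy the hot region, the working set of frequently executed code covers fewer cache lines than when hot and cold methods are interleaved.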
13
PCR System: Code Splitting
• An online edge profile identifies hot basic blocks in a method
• Code reordering moves hot basic blocks to the beginning of the method
• Code splitting separates hot and cold basic blocks in the heap
[Diagram: Method A split into hot and cold basic blocks.]
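The steps above can be sketched as a simple partition over profiled blocks. The `Block` record and threshold are illustrative assumptions; a real splitter works on compiler IR, not strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of edge-profile-driven code splitting: blocks whose
// execution count reaches a threshold are emitted first (the hot region);
// the rest move to a separate cold region.
public class CodeSplitter {
    record Block(String name, long execCount) {}

    /**
     * Returns blocks reordered hot-first. Relative order is preserved
     * within each partition so likely fall-through chains stay intact.
     */
    static List<Block> split(List<Block> blocks, long hotThreshold) {
        List<Block> hot = new ArrayList<>();
        List<Block> cold = new ArrayList<>();
        for (Block b : blocks) {
            (b.execCount() >= hotThreshold ? hot : cold).add(b);
        }
        hot.addAll(cold);  // hot blocks first, cold blocks after
        return hot;
    }
}
```

Placing the hot blocks contiguously means the common paths through a method touch fewer instruction cache lines, even though the cold blocks still exist elsewhere.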
14
PCR System: Code Splitting
[Diagram: After splitting, hot basic blocks of hot methods (optimized code) are grouped in the hot code region; cold basic blocks and cold methods (baseline code) go to the cold region, separate from data.]
15
PCR Optimizations
• Reduce instruction capacity misses
  – Code space
  – Method separation
  – Code splitting
• Reduce instruction conflict misses
  – Code padding
16
PCR System: Code Padding
[Diagram: Code padding. Source code → Baseline Compiler → binary code; the Adaptive Sampler and a dynamic call graph identify hot methods for the Optimizing Compiler, which pads code placement. Legend: Jikes RVM component, input/output.]
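Why padding helps: in a direct-mapped cache the set index is fully determined by the address, so two hot methods whose code maps to the same sets evict each other on every alternation. The sketch below is a hypothetical illustration; the 32K size matches the talk's IL1 configuration, but the 64-byte line size is an assumption.

```java
// Hypothetical sketch of code padding against a direct-mapped I-cache.
// Set index = (address / lineSize) % numSets, so shifting a method's
// start address by whole cache lines moves it to different sets.
public class CodePadding {
    static final int LINE = 64;                  // assumed line size
    static final int SETS = 32 * 1024 / LINE;    // 512 sets, 32K direct-mapped

    static int setIndex(int address) {
        return (address / LINE) % SETS;
    }

    /** Pad 'start' forward one line at a time until its first line no
     *  longer maps to the set occupied by a conflicting hot method. */
    static int padToAvoid(int start, int avoidSet) {
        while (setIndex(start) == avoidSet) {
            start += LINE;
        }
        return start;
    }
}
```

The dynamic call graph tells the compiler which method pairs actually alternate at runtime, so padding is spent only where a conflict would hurt.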
18
Methodology
• Java virtual machine: Jikes RVM
• Various architectures
  – x86 (Pentium 4)
  – PowerPC
  – Simulator: Dynamic SimpleScalar
• Use a direct-mapped I-cache
  – Shorter latency
  – More conflict misses
21
Impact of Code Padding

[Figure: Improvement over the code-space configuration for code splitting, code padding + splitting, and a fully associative IL1. Improvements range from about -10% to 20%.]

Base case: Jikes RVM default + a separate code space. Cache configuration: 32K IL1 direct-mapped, 512K L2.
22
Conclusion
• A separate code space improves program performance by 6% on average (up to 30%) on the Pentium 4
• PCR has negligible overhead
• PCR shows no clear performance improvement
  – On the Pentium 4, no improvement on average
  – In simulation:
    • PCR improves one program by 14%
    • Not consistent; no improvement on average
• Potential opportunities remain for dynamic optimizations
24
Cache: Small vs. Large
       IL1                      DL1                      L2
Size   Assoc   Latency   Size   Assoc   Latency   Size   Assoc   Latency
8K     1       2         8K     2       2         128K   2       5
16K    1       2         16K    2       3         256K   2       8
64K    1       4         64K    2       4         512K   2       10
Cacti, 90nm technology, 3GHz frequency
25
Cache-Size Comparison
[Figure: Two panels plotting total cycles (in billions, 0 to 50) for _213_javac and _202_jess across cache configurations.]
26
Direct-Mapped vs. Two-Way

[Figure: Cycles (x10^6, 0 to 8) comparing a direct-mapped IL1 with 2-cycle hit latency against a two-way IL1 with 3-cycle hit latency.]

Cacti, 90nm technology, 3GHz
27
Improving Performance
• Classic optimizations are not sufficient!
• Different programming styles
  – Automatic memory management
  – Pointer-based data structures
  – Many small methods
• Optimization costs are incurred at runtime
• The virtual machine (VM) adds complexity
  – Class loading, memory management, just-in-time compiler…
28
Instruction Locality
• Do instructions have better locality?
  – More instruction accesses
  – About the same number of data cache misses
• Penalty in a pipelined processor
  – Instruction cache misses create bubbles in the pipeline
• Instruction locality can be more critical