Using Platform-Specific Performance Counters for Dynamic Compilation
Transcript of the slides for "Using Platform-Specific Performance Counters for Dynamic Compilation"
Florian Schneider and Thomas Gross
ETH Zurich
Oct 2005
Introduction & Motivation
• Dynamic compilers are a common execution platform for OO languages (Java, C#)
• Properties of OO programs are difficult to analyze at compile-time
• A JIT compiler can immediately use information obtained at run-time
Introduction & Motivation
Types of information:
1. Profiles: e.g. execution frequency of methods / basic blocks
2. Hardware-specific properties: cache misses, TLB misses, branch prediction failures
Outline
1. Introduction
2. Requirements
3. Related work
4. Implementation
5. Results
6. Conclusions
Requirements
• Infrastructure flexible enough to measure different execution metrics
  – Hide machine-specific details from the VM
  – Keep changes to the VM/compiler minimal
• Runtime overhead of collecting information from the CPU must be low
• Information must be precise to be useful for online optimization
Related work
• Profile-guided optimization
  – Code positioning [Pettis PLDI 90]
• Hardware performance monitors
  – Relating HPM data to basic blocks [Ammons PLDI 97]
  – "Vertical profiling" [Hauswirth OOPSLA 2004]
• Dynamic optimization
  – Mississippi delta [Adl-Tabatabai PLDI 2004]
  – Object reordering [Huang OOPSLA 2004]
• Our work:
  – No instrumentation
  – Uses profile data + hardware info
  – Targets fully automatic dynamic optimization
Hardware performance monitors
• Sampling-based counting
  – The CPU reports its state every n events
  – Precision is platform-dependent (pipelines, out-of-order execution)
• Sampling provides method-, basic-block-, or instruction-level information
  – Newer CPUs support precise sampling (e.g. Pentium 4, Itanium)
Hardware performance monitors
• A way to localize performance bottlenecks
  – The sampling interval determines how fine-grained the information is
• A smaller sampling interval yields more data
  – Trade-off: precision vs. runtime overhead
  – Enough samples are needed for a representative picture of the program behavior
Implementation
Main parts:
1. Kernel module: low-level access to the hardware, per-process counting
2. User-space library: hides kernel and device-driver details from the VM
3. Java VM thread: collects samples periodically and maps them to Java code
   – Implemented on top of Jikes RVM
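The VM-side thread (part 3) could be wired up roughly as follows. This is a minimal sketch, not the actual Jikes RVM code: the SampleSource and SampleSink interfaces and all names are invented for illustration. SampleSource stands in for the user-space library (a JNI call in the real system), SampleSink for the component that maps raw PCs to Java code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the VM-side collector thread.
class SampleCollector implements Runnable {
    interface SampleSource { long[] drain(); }     // raw PCs from the kernel buffer
    interface SampleSink   { void consume(long pc); }

    private final SampleSource source;
    private final SampleSink sink;
    private final long pollIntervalMs;
    private volatile boolean running = true;
    final AtomicLong totalSamples = new AtomicLong();

    SampleCollector(SampleSource source, SampleSink sink, long pollIntervalMs) {
        this.source = source;
        this.sink = sink;
        this.pollIntervalMs = pollIntervalMs;
    }

    // One polling step: drain the buffer and hand every PC to the mapper.
    int drainOnce() {
        long[] pcs = source.drain();
        for (long pc : pcs) sink.consume(pc);
        totalSamples.addAndGet(pcs.length);
        return pcs.length;
    }

    @Override public void run() {
        while (running) {
            drainOnce();
            try { Thread.sleep(pollIntervalMs); } catch (InterruptedException e) { return; }
        }
    }

    void stop() { running = false; }
}
```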
System overview
Implementation
• Supported events:
  – L1 and L2 cache misses
  – DTLB misses
  – Branch mispredictions
• Parameters of the monitoring module:
  – Buffer size (fixed)
  – Polling interval (fixed)
  – Sampling interval (adaptive)
• The runtime overhead is kept constant by changing the sampling interval automatically at run-time
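The slides state that the interval adapts to keep overhead roughly constant, but not the exact policy; the proportional update below is our assumption (the ~3000 cycles/sample figure appears on a later slide).

```java
// Hypothetical sketch of adaptive sampling-interval control. The update
// rule (scale the interval by observed/target overhead, clamped) is an
// assumption; only the goal of roughly constant overhead is from the talk.
class AdaptiveInterval {
    static final long CYCLES_PER_SAMPLE = 3000;  // measured cost per sample

    // currentInterval: events between samples; samples/cycles: observed in
    // the last polling window; targetOverhead: e.g. 0.02 for ~2%.
    static long adjust(long currentInterval, long samplesInWindow,
                       long cyclesInWindow, double targetOverhead) {
        double overhead =
            (double) (samplesInWindow * CYCLES_PER_SAMPLE) / cyclesInWindow;
        double factor = overhead / targetOverhead;     // >1: too costly, back off
        factor = Math.max(0.5, Math.min(2.0, factor)); // clamp to avoid oscillation
        return Math.max(1000, Math.round(currentInterval * factor));
    }
}
```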
From raw data to Java
• Determine the method + bytecode instruction
  – Build a sorted method table
  – Map the machine-code offset to a bytecode instruction

Example compiled code (in the original slide, individual instructions are mapped back to the bytecodes GETFIELD, ARRAYLOAD, and INVOKEVIRTUAL):

0x080485e1: mov 0x4(%esi),%esi
0x080485e4: mov $0x4,%edi
0x080485e9: mov (%esi,%edi,4),%esi
0x080485ec: mov %ebx,0x4(%esi)
0x080485ef: mov $0x4,%ebx
0x080485f4: push %ebx
0x080485f5: mov $0x0,%ebx
0x080485fa: push %ebx
0x080485fb: mov 0x8(%ebp),%ebx
0x080485fe: push %ebx
0x080485ff: mov (%ebx),%ebx
0x08048601: call *0x4(%ebx)
0x08048604: add $0xc,%esp
0x08048607: mov 0x8(%ebp),%ebx
0x0804860a: mov 0x4(%ebx),%ebx
From raw data to Java
• A sample gives the PC + register contents
• PC → machine code → compiled Java code → bytecode instruction
• For the data address: use the registers + the machine code to calculate the target address
  – GETFIELD → indirect load
    mov 12(eax), eax  // 12 = offset of field
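For instance, the address actually touched by a sampled load can be reconstructed from the sampled register file plus the addressing mode decoded at the PC. A small sketch under assumed register numbering, covering the two x86 addressing forms seen in the machine code above (displacement load for GETFIELD, scaled-index load for ARRAYLOAD); class and method names are invented:

```java
// Hypothetical sketch: reconstructing the data address of a sampled load.
// Register indices are assumed; the real system decodes the instruction
// at the PC to discover base/index registers, scale, and displacement.
class DataAddress {
    static final int EAX = 0, EBX = 3, ESI = 6, EDI = 7;  // assumed numbering

    // GETFIELD-style load: mov disp(base), reg  ->  address = base + disp
    static long direct(long[] regs, int baseReg, long disp) {
        return regs[baseReg] + disp;
    }

    // ARRAYLOAD-style load: mov disp(base,index,scale), reg
    //   ->  address = base + index*scale + disp
    static long scaled(long[] regs, int baseReg, int indexReg, int scale, long disp) {
        return regs[baseReg] + regs[indexReg] * scale + disp;
    }
}
```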
Engineering issues
• Lookup of the PC to get the method / bytecode instruction must be efficient
  – Done in parallel with the user program
  – Use binary search / a hash table
  – Update at recompilation and GC
• Identify 100% of instructions (PCs):
  – Include samples from application, VM, and library code
  – Deal with native parts
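A sketch of how the binary-search variant of this lookup could work: a table of methods sorted by code start address, searched for the entry covering the sampled PC. Class and field names are ours, not Jikes RVM's.

```java
import java.util.Arrays;

// Hypothetical sketch of the PC-to-method lookup. The table would be
// rebuilt (or patched) at recompilation and GC, since code may move.
class MethodTable {
    private final long[] starts;   // code start addresses, sorted ascending
    private final long[] ends;     // code end addresses (exclusive)
    private final String[] names;  // method names, parallel to starts/ends

    MethodTable(long[] starts, long[] ends, String[] names) {
        this.starts = starts; this.ends = ends; this.names = names;
    }

    // Returns the method containing pc, or null for PCs outside the
    // table (e.g. VM-internal or native code handled separately).
    String lookup(long pc) {
        int i = Arrays.binarySearch(starts, pc);
        if (i < 0) i = -i - 2;               // last start <= pc
        if (i < 0 || pc >= ends[i]) return null;
        return names[i];
    }
}
```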
Infrastructure
• Jikes RVM 2.3.5 on a Linux 2.4 kernel as the runtime platform
• Pentium 4, 3 GHz, 1 GB RAM, 1 MB L2 cache
• Measured data show:
  – Runtime overhead
  – Extraction of meaningful information
Runtime overhead
Program     Orig [sec]/[score]   Sampling interval 10000   Sampling interval 1000
javac             7.18                   2.0%                     2.4%
raytrace          4.04                   2.4%                     2.0%
jess              2.93                   0.6%                     0.1%
jack              2.73                   3.5%                     2.7%
db               10.49                   0.1%                     3.1%
compress          6.50                   0.9%                     1.5%
mpegaudio         6.54                   1.3%                     0.3%
jbb            6209.67                   2.4%                     4.6%
average                                  1.6%                     2.1%

• Experiment setup: monitor L2 cache misses
Runtime overhead: specJBB
[Chart: performance relative to "no sampling" (1.00 to 1.06) plotted against the sampling interval (0 to 40000)]
Total cost / sample: ~ 3000 cycles
Measurements
• Measure which instructions produce the most events (cache misses, branch mispredictions)
  – Potential for data-locality and control-flow optimizations
• Compare different SPEC benchmarks
  – Find "hot spots": instructions that produce 80% of all measured events
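That 80% cut-off, as quoted on the result slides, can be computed by sorting instructions by sample count and accumulating from the hottest downward. A small sketch (the method name is ours):

```java
import java.util.Arrays;

// Sketch: smallest number of instructions whose sample counts cover at
// least 80% of all samples (the "80% quantile" on the result slides).
class HotSpots {
    static int quantile80(long[] samplesPerInstruction) {
        long[] sorted = samplesPerInstruction.clone();
        Arrays.sort(sorted);                            // ascending
        long total = 0;
        for (long s : sorted) total += s;
        long acc = 0;
        int count = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {  // walk from hottest down
            acc += sorted[i];
            count++;
            if (acc * 100 >= total * 80) break;         // reached 80% of all events
        }
        return count;
    }
}
```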
L1/L2 Cache misses
[Chart: db L1 misses, # of samples for the top-100 memory load instructions; 80% quantile = 21 instructions (N=571)]
[Chart: db L2 misses, # of samples for the top-100 memory load instructions; 80% quantile = 13 (N=295)]
L1/L2 Cache misses
[Chart: specJBB L1 misses, # of samples per memory load instruction; 80% quantile = 477 (N=8526)]
[Chart: specJBB L2 misses, # of samples per memory load instruction; 80% quantile = 76 (N=2361)]
L1/L2 Cache misses
[Chart: javac L1 misses, # of samples per memory load instruction; 80% quantile = 1296 (N=3172)]
[Chart: javac L2 misses, # of samples per memory load instruction; 80% quantile = 153 (N=672)]
Branch prediction
[Chart: specJBB branch mispredictions, # of samples per branch instruction; 80% quantile = 307 (N=4193)]
[Chart: javac branch mispredictions, # of samples per branch instruction; 80% quantile = 1575 (N=7478)]
[Chart: db branch mispredictions, # of samples per branch instruction]
Summary
80%-quantile in % of total   L1 misses   L2 misses   Branch pred.
specJBB                         5.6%        3.2%         7.3%
javac                          40.9%       22.7%        21.1%
db                              3.7%        4.4%         0.8%

• The distribution of events over the program differs significantly between benchmarks
• Challenge: Are data precise enough to guide optimizations in a dynamic compiler?
Further work
• Apply the information in the optimizer
  – Data: access-path expressions (p.x.y)
  – Control flow: inlining, I-cache locality
• Investigate a flexible sampling interval
• Further optimizations of the monitoring system
  – Replace expensive JNI calls
  – Avoid copying of samples
Concluding remarks
• Precise performance-event monitoring is possible with low overhead (~2%)
• The monitoring infrastructure is tied into the Jikes RVM compiler
• Instruction-level information allows optimizations to focus on "hot spots"
• A good platform for studying how compiler decisions can be coupled to hardware-specific platform properties