Scalable Support for Multithreaded Applications on Dynamic Binary Instrumentation Systems
Kim Hazelwood, Greg Lueck, Robert Cohn
Dynamic Binary Instrumentation
Inserts or modifies arbitrary instructions in executing binaries, e.g., counting instructions by inserting counter++ before each one:

    counter++;
    sub $0xff, %edx
    counter++;
    cmp %esi, %edx
    counter++;
    jle <L1>
    counter++;
    mov $0x1, %edi
    counter++;
    add $0x10, %eax
Instruction Count Output
    $ /bin/ls
    Makefile  imageload.out  itrace  proccount  imageload  inscount  atrace  itrace.out
    $ pin -t inscount.so -- /bin/ls
    Makefile  imageload.out  itrace  proccount  imageload  inscount  atrace  itrace.out
    Count 422838
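For reference, a tool producing output like this takes only a few dozen lines against the Pin API. The sketch below is modeled on the classic instruction-counting example from the Pin manual (it counts every instruction; the InsCount tool evaluated later in this talk counts at basic-block granularity). It is an illustration, not the authors' exact tool, and needs the Pin kit to build:

    #include "pin.H"
    #include <iostream>

    static UINT64 icount = 0;

    // Analysis routine: runs before every instrumented instruction.
    VOID docount() { icount++; }

    // Instrumentation routine: Pin calls this once per instruction it
    // translates; we ask for docount() to be inserted before it.
    VOID Instruction(INS ins, VOID *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
    }

    // Called when the application exits.
    VOID Fini(INT32 code, VOID *v) {
        std::cerr << "Count " << icount << std::endl;
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();   // never returns
        return 0;
    }

Note the single shared counter: exactly the kind of state that becomes a problem for multithreaded guests, as discussed below.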
How Does it Work?
Generates and caches modified copies of instructions
Modified (cached) instructions are executed in lieu of original instructions
[Diagram: the EXE feeds a Transform / Code Cache / Execute loop, with profile information feeding back into transformation.]
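To make that loop concrete, here is a toy dispatcher in C++ (all names invented; a real DBI system emits machine code into an executable buffer rather than using C++ callables):

    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <unordered_map>

    using AppPC = uint64_t;                      // program counter in the original binary
    using Translation = std::function<AppPC()>;  // stands in for emitted machine code

    static std::unordered_map<AppPC, Translation> code_cache;
    static uint64_t counter = 0;                 // the inserted instrumentation

    // "Transform": produce an instrumented copy of the block at pc.
    static Translation translate(AppPC pc) {
        return [pc]() -> AppPC {
            counter++;                           // analysis code inserted by the tool
            /* ...the block's original work would run here... */
            return pc < 3 ? pc + 1 : 0;          // toy control flow; 0 means exit
        };
    }

    int main() {
        for (AppPC pc = 1; pc != 0; ) {
            auto it = code_cache.find(pc);
            if (it == code_cache.end())          // miss: transform on demand, then cache
                it = code_cache.emplace(pc, translate(pc)).first;
            pc = it->second();                   // execute the cached copy, never the original
        }
        std::printf("Count %llu\n", (unsigned long long)counter);
        return 0;
    }

The essential property is visible in the loop: original instructions are transformed once, and thereafter only the cached copies run.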
Why “Dynamic” Instrumentation?
Robustness!
• No need to recompile or relink
• Discover code at runtime
• Handle dynamically-generated code
• Attach to running processes
The Code Discovery Problem on x86
[Diagram: an instruction stream in which an indirect jump (JumpReg) has an unknown target, data is interspersed with code, and padding is inserted for alignment, so instruction boundaries cannot be determined statically.]
Intel Pin
• A dynamic binary instrumentation system
• Easy-to-use instrumentation interface
• Supports multiple platforms
  – Four ISAs: IA32, Intel64, IPF, ARM
  – Four OSes: Linux, Windows, FreeBSD, MacOS
• Popular and well supported
  – 32,000+ downloads
  – 400+ citations
  – 500+ mailing list subscribers
Research Applications
• Gather profile information about applications
• Compare programs generated by competing compilers
• Generate a select stream of live information for event-driven simulation
• Add security features
• Emulate new hardware
• Anything and everything multicore
The Problem with Modern Tools
• Many research tools do not support multithreaded guest applications
• Providing support for MT apps is mostly straightforward
• Providing scalable support can be tricky!
Issues that Arise
• Gaining control of executing threads
• Determining what should be private vs. shared between threads
• Code cache maintenance and consistency
• Concurrent instruction writes
• Providing/handling thread-local storage (see the TLS sketch after this list)
• Handling indirect branches
• Handling signals / system calls
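On the thread-local storage point: Pin's TLS API lets a tool keep private analysis data per thread, so analysis routines avoid contended shared state. Below is a minimal per-thread instruction counter built on that API; the calls are real Pin API, but the tool itself is our illustration, not the paper's:

    #include "pin.H"

    // Per-thread counter stored via Pin's thread-local storage, so the
    // analysis routine never touches a shared (contended) cache line.
    struct ThreadData { UINT64 icount; };

    static TLS_KEY tls_key;

    VOID ThreadStart(THREADID tid, CONTEXT *ctxt, INT32 flags, VOID *v) {
        PIN_SetThreadData(tls_key, new ThreadData{0}, tid);
    }

    VOID docount(THREADID tid) {
        static_cast<ThreadData *>(PIN_GetThreadData(tls_key, tid))->icount++;
    }

    VOID Instruction(INS ins, VOID *v) {
        // IARG_THREAD_ID passes Pin's own thread ID to the analysis routine.
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount,
                       IARG_THREAD_ID, IARG_END);
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        tls_key = PIN_CreateThreadDataKey(NULL);   // no destructor in this sketch
        PIN_AddThreadStartFunction(ThreadStart, 0);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();
        return 0;
    }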
The Pin Architecture
[Diagram: Pin's architecture. Pin itself comprises the JIT compiler, dispatcher, syscall emulator, and signal emulator; the Pintool supplies instrumentation code, call-back handlers, and analysis code; translated code runs from the code cache. Threads T1 and T2 are serialized while inside Pin's VM but run in parallel inside the code cache.]
Code Cache Consistency
Cached code must be removed for a variety of reasons:
• Dynamically unloaded code
• Ephemeral/adaptive instrumentation
• Self-modifying code
• Bounded code caches
Motivating a Bounded Code Cache
[Chart: the Perl benchmark. Performance relative to native (100%–400%) on input1, input2, input3, and in total, comparing an unlimited code cache against 2.5 MB, 2.0 MB, 1.5 MB, and 1.0 MB bounded caches.]
Flushing the Code Cache
• Option 1: All threads have a private code cache (oops, doesn’t scale)
• Option 2: Shared code cache across threads
  – If one thread flushes the code cache, other threads may resume in stale memory
[Chart: trace memory increase (0%–600%) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp.]
Naïve Flush
Wait for all threads to return to the code cache
Could wait indefinitely!
[Timeline diagram: Thread1, Thread2, and Thread3 alternate between the VM and code cache regions CC1 and CC2; during a flush every thread must stall in the VM, producing a flush delay.]
Generational Flush
Allow threads to continue to make progress in a separate area of the code cache
[Timeline diagram: the same three threads keep alternating between the VM and code cache generations CC1 and CC2 without stalling, since after a flush they continue in a fresh region.]
Requires a high water mark
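A minimal sketch of the bookkeeping behind generational flush, under invented names (Pin manages raw executable memory; this models only the generation lifecycle and the high-water mark):

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <memory>
    #include <vector>

    // One generation of the code cache. Threads bump `occupants` when they
    // enter code living in this generation and drop it when they return to
    // the VM, so "drained" is just the counter reaching zero.
    struct Generation {
        std::vector<unsigned char> code;   // storage for translated code
        std::atomic<int> occupants{0};     // threads currently executing inside
    };

    class GenerationalCache {
        std::shared_ptr<Generation> current_;
        std::vector<std::shared_ptr<Generation>> retired_;
        std::size_t high_water_;

    public:
        explicit GenerationalCache(std::size_t limit)
            : current_(std::make_shared<Generation>()),
              high_water_(limit / 2) {}    // assumption: half the budget per generation

        // The high-water mark: flush before the cache is completely full, so
        // there is room to open a new generation while the old one drains.
        bool should_flush() const { return current_->code.size() >= high_water_; }

        // Flush: retire the current generation but leave its code mapped, so
        // threads still inside make progress instead of stalling in the VM.
        void flush() {
            retired_.push_back(current_);
            current_ = std::make_shared<Generation>();
            reap();
        }

        // Reclaim any retired generation that has fully drained.
        void reap() {
            retired_.erase(
                std::remove_if(retired_.begin(), retired_.end(),
                               [](const std::shared_ptr<Generation> &g) {
                                   return g->occupants.load() == 0;
                               }),
                retired_.end());
        }
    };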
Memory Scalability of the Code Cache
Ensuring scalability also requires carefully configuring the code stored in the cache
Trace Lengths
• First basic block is non-speculative, others are speculative
• Longer traces = fewer entries in the lookup table, but more unexecuted code
• Shorter traces = two off-trace paths at the end of each basic block that ends in a conditional branch = more exit stub code (trace selection is sketched below)
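A sketch of the underlying trace-selection policy, with invented names, showing where the speculation and the exit stubs come from:

    #include <cstddef>
    #include <vector>

    // Hypothetical basic-block record; only the fields the policy needs.
    struct BasicBlock {
        bool ends_in_cond_branch;   // conditional branch => speculative fallthrough
        BasicBlock *fallthrough;    // next block if the branch is not taken
    };

    // Select up to max_bbs blocks for one trace. The head is non-speculative;
    // every block appended past a conditional branch is a guess that the
    // branch falls through, and each such branch also costs an exit stub.
    std::vector<BasicBlock *> select_trace(BasicBlock *head, std::size_t max_bbs) {
        std::vector<BasicBlock *> trace{head};
        BasicBlock *bb = head;
        while (trace.size() < max_bbs && bb->ends_in_cond_branch && bb->fallthrough) {
            bb = bb->fallthrough;   // speculate: extend the trace past the branch
            trace.push_back(bb);
        }
        return trace;
    }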
Effect of Trace Length on Trace Count
[Chart: total trace count (0–16,000) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp as traces grow from 1 to 32 basic blocks.]
Effect of Trace Length on Memory
[Chart: code cache footprint in KB (0–10,500), broken down into lookup table, links, exit stubs, and traces, as basic blocks per trace grow from 1 to 32.]
Rewriting Instructions
• Pin must regularly rewrite branches
• No atomic branch write on x86
• We use a neat trick*:
[Diagram: the “old” 5-byte branch is first overwritten with a 2-byte self-branch, then the remaining n-2 bytes of the “new” branch are written, and finally the self-branch is replaced by the first bytes of the “new” 5-byte branch.]
* Sundaresan et al. 2006
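The trick can be sketched as three ordered stores. This is our reconstruction for a 5-byte jmp rel32 (opcode 0xE9 followed by a little-endian 32-bit displacement); a production implementation must also respect x86 cross-modifying-code rules, such as serializing other processors:

    #include <atomic>
    #include <cstdint>

    // Patch a 5-byte rel32 JMP in place without any atomic 5-byte write.
    // Assumptions: the page is writable, and the first two bytes do not
    // straddle a boundary that breaks 2-byte store atomicity.
    void patch_branch(uint8_t *branch, int32_t new_disp) {
        uint32_t d = static_cast<uint32_t>(new_disp);

        // Step 1: atomically plant a 2-byte self-branch (EB FE = "jmp -2").
        // A thread reaching the branch now spins harmlessly in place.
        reinterpret_cast<std::atomic<uint16_t> *>(branch)
            ->store(0xFEEB, std::memory_order_seq_cst);   // little-endian EB FE

        // Step 2: write the last n-2 bytes of the new branch. These bytes
        // are unreachable while the self-branch is in place.
        branch[2] = static_cast<uint8_t>(d >> 8);
        branch[3] = static_cast<uint8_t>(d >> 16);
        branch[4] = static_cast<uint8_t>(d >> 24);

        // Step 3: atomically replace the self-branch with the first two
        // bytes of the new instruction: opcode 0xE9 plus the low
        // displacement byte. Spinning threads now fall into the new branch.
        uint16_t head = static_cast<uint16_t>(0xE9u | ((d & 0xFFu) << 8));
        reinterpret_cast<std::atomic<uint16_t> *>(branch)
            ->store(head, std::memory_order_seq_cst);
    }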
Performance Results
We use the SPEC OMP 2001 benchmarks
• Thread count is set via the OMP_NUM_THREADS environment variable
We compare
• Native performance and scalability
• Pin (no Pintool) performance and scalability
• Pin (lightweight Pintool) scalability
  – InsCount Pintool: counts instructions at basic-block granularity
• Pin (middleweight Pintool) scalability
  – MemTrace Pintool: records memory addresses (a minimal version is sketched after this list)
• Pin (heavyweight Pintool) scalability
  – CMP$im: collects memory addresses and applies a software model of the CMP cache
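A minimal tool in the spirit of MemTrace (our sketch built on real Pin calls, not the authors' tool):

    #include "pin.H"
    #include <cstdio>

    static FILE *out;

    // Analysis routine: log the thread ID and effective address.
    // NOTE: a scalable tool would buffer per thread instead of sharing a FILE.
    VOID RecordAddr(ADDRINT ea, THREADID tid) {
        std::fprintf(out, "%u 0x%lx\n", tid, (unsigned long)ea);
    }

    VOID Instruction(INS ins, VOID *v) {
        if (INS_IsMemoryRead(ins))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordAddr,
                                     IARG_MEMORYREAD_EA, IARG_THREAD_ID, IARG_END);
        if (INS_IsMemoryWrite(ins))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordAddr,
                                     IARG_MEMORYWRITE_EA, IARG_THREAD_ID, IARG_END);
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        out = std::fopen("memtrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();   // never returns
        return 0;
    }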
Native Scalability of SPEC OMP 2001
[Chart: native speedup (0x–8x) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp with 1, 2, 4, and 8 threads.]
Performance Scalability (No Instrumentation)
[Chart: runtime relative to native (0%–160%) for the same benchmarks with 1, 2, 4, and 8 threads, running under Pin with no tool attached.]
Performance Scalability (LightWeight Instrumentation)
[Chart: runtime relative to native (0%–160%) with 1, 2, 4, and 8 threads under the lightweight InsCount tool.]
Performance Scalability (MiddleWeight Instrumentation)
[Chart: runtime relative to native (0%–500%) with 1, 2, 4, and 8 threads under the middleweight MemTrace tool.]
Performance Scalability (HeavyWeight Instrumentation)
[Chart: runtime relative to native (0x–500x) with 1, 2, 4, and 8 threads under the heavyweight CMP$im tool.]
Memory Scalability
[Chart: code cache size in KB (0–7,000) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp with 1, 2, 4, and 8 threads.]