Scalable Support for Multithreaded Applications on Dynamic Binary Instrumentation Systems
Kim Hazelwood, Greg Lueck, Robert Cohn
Dynamic Binary Instrumentation
Inserts or modifies arbitrary instructions in executing binaries, e.g., counting instructions by inserting counter++ before each one:

    counter++;
    sub $0xff, %edx
    counter++;
    cmp %esi, %edx
    counter++;
    jle <L1>
    counter++;
    mov $0x1, %edi
    counter++;
    add $0x10, %eax
Instruction Count Output
    $ /bin/ls
    Makefile  imageload.out  itrace  proccount  imageload  inscount  atrace  itrace.out
    $ pin -t inscount.so -- /bin/ls
    Makefile  imageload.out  itrace  proccount  imageload  inscount  atrace  itrace.out
    Count 422838
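For reference, a tool producing output like this takes only a few dozen lines against the Pin API. The sketch below is modeled on the classic instruction-counting example from the Pin manual (it counts every instruction; the InsCount tool evaluated later in this talk counts at basic-block granularity). It is an illustration, not the authors' exact tool, and needs the Pin kit to build:

    #include "pin.H"
    #include <iostream>

    static UINT64 icount = 0;

    // Analysis routine: runs before every instrumented instruction.
    VOID docount() { icount++; }

    // Instrumentation routine: Pin calls this once per instruction it
    // translates; we ask for docount() to be inserted before it.
    VOID Instruction(INS ins, VOID *v) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
    }

    // Called when the application exits.
    VOID Fini(INT32 code, VOID *v) {
        std::cerr << "Count " << icount << std::endl;
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();   // never returns
        return 0;
    }

Note the single shared counter: exactly the kind of state that becomes a problem for multithreaded guests, as discussed below.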
How Does it Work?
Generates and caches modified copies of instructions
Modified (cached) instructions are executed in lieu of original instructions
[Diagram: the EXE feeds a Transform / Code Cache / Execute loop, with profile information feeding back into transformation.]
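To make that loop concrete, here is a toy dispatcher in C++ (all names invented; a real DBI system emits machine code into an executable buffer rather than using C++ callables):

    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <unordered_map>

    using AppPC = uint64_t;                      // program counter in the original binary
    using Translation = std::function<AppPC()>;  // stands in for emitted machine code

    static std::unordered_map<AppPC, Translation> code_cache;
    static uint64_t counter = 0;                 // the inserted instrumentation

    // "Transform": produce an instrumented copy of the block at pc.
    static Translation translate(AppPC pc) {
        return [pc]() -> AppPC {
            counter++;                           // analysis code inserted by the tool
            /* ...the block's original work would run here... */
            return pc < 3 ? pc + 1 : 0;          // toy control flow; 0 means exit
        };
    }

    int main() {
        for (AppPC pc = 1; pc != 0; ) {
            auto it = code_cache.find(pc);
            if (it == code_cache.end())          // miss: transform on demand, then cache
                it = code_cache.emplace(pc, translate(pc)).first;
            pc = it->second();                   // execute the cached copy, never the original
        }
        std::printf("Count %llu\n", (unsigned long long)counter);
        return 0;
    }

The essential property is visible in the loop: original instructions are transformed once, and thereafter only the cached copies run.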
Why “Dynamic” Instrumentation?
Robustness!
• No need to recompile or relink
• Discover code at runtime
• Handle dynamically-generated code
• Attach to running processes
The Code Discovery Problem on x86
[Diagram: an instruction stream in which an indirect jump (JumpReg) has an unknown target, data is interspersed with code, and padding is inserted for alignment, so instruction boundaries cannot be determined statically.]
Intel Pin
• A dynamic binary instrumentation system
• Easy-to-use instrumentation interface
• Supports multiple platforms
  – Four ISAs: IA32, Intel64, IPF, ARM
  – Four OSes: Linux, Windows, FreeBSD, MacOS
• Popular and well supported
  – 32,000+ downloads
  – 400+ citations
  – 500+ mailing list subscribers
Research Applications
• Gather profile information about applications
• Compare programs generated by competing compilers
• Generate a select stream of live information for event-driven simulation
• Add security features
• Emulate new hardware
• Anything and everything multicore
The Problem with Modern Tools
• Many research tools do not support multithreaded guest applications
• Providing support for MT apps is mostly straightforward
• Providing scalable support can be tricky!
Issues that Arise
• Gaining control of executing threads
• Determining what should be private vs. shared between threads
• Code cache maintenance and consistency
• Concurrent instruction writes
• Providing/handling thread-local storage (see the TLS sketch after this list)
• Handling indirect branches
• Handling signals / system calls
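On the thread-local storage point: Pin's TLS API lets a tool keep private analysis data per thread, so analysis routines avoid contended shared state. Below is a minimal per-thread instruction counter built on that API; the calls are real Pin API, but the tool itself is our illustration, not the paper's:

    #include "pin.H"

    // Per-thread counter stored via Pin's thread-local storage, so the
    // analysis routine never touches a shared (contended) cache line.
    struct ThreadData { UINT64 icount; };

    static TLS_KEY tls_key;

    VOID ThreadStart(THREADID tid, CONTEXT *ctxt, INT32 flags, VOID *v) {
        PIN_SetThreadData(tls_key, new ThreadData{0}, tid);
    }

    VOID docount(THREADID tid) {
        static_cast<ThreadData *>(PIN_GetThreadData(tls_key, tid))->icount++;
    }

    VOID Instruction(INS ins, VOID *v) {
        // IARG_THREAD_ID passes Pin's own thread ID to the analysis routine.
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount,
                       IARG_THREAD_ID, IARG_END);
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        tls_key = PIN_CreateThreadDataKey(NULL);   // no destructor in this sketch
        PIN_AddThreadStartFunction(ThreadStart, 0);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();
        return 0;
    }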
The Pin Architecture
[Diagram: Pin's architecture. Pin itself comprises the JIT compiler, dispatcher, syscall emulator, and signal emulator; the Pintool supplies instrumentation code, call-back handlers, and analysis code; translated code runs from the code cache. Threads T1 and T2 are serialized while inside Pin's VM but run in parallel inside the code cache.]
Code Cache Consistency
Cached code must be removed for a variety of reasons:
• Dynamically unloaded code
• Ephemeral/adaptive instrumentation
• Self-modifying code
• Bounded code caches
Motivating a Bounded Code Cache
[Chart: the Perl benchmark. Performance relative to native (100%–400%) on input1, input2, input3, and in total, comparing an unlimited code cache against 2.5 MB, 2.0 MB, 1.5 MB, and 1.0 MB bounded caches.]
Flushing the Code Cache
• Option 1: All threads have a private code cache (oops, doesn’t scale)
• Option 2: Shared code cache across threads
  – If one thread flushes the code cache, other threads may resume in stale memory
[Chart: trace memory increase (0%–600%) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp.]
Naïve Flush
Wait for all threads to return to the code cache
Could wait indefinitely!
[Timeline diagram: Thread1, Thread2, and Thread3 alternate between the VM and code cache regions CC1 and CC2; during a flush every thread must stall in the VM, producing a flush delay.]
Generational Flush
Allow threads to continue to make progress in a separate area of the code cache
[Timeline diagram: the same three threads keep alternating between the VM and code cache generations CC1 and CC2 without stalling, since after a flush they continue in a fresh region.]
Requires a high water mark
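A minimal sketch of the bookkeeping behind generational flush, under invented names (Pin manages raw executable memory; this models only the generation lifecycle and the high-water mark):

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <memory>
    #include <vector>

    // One generation of the code cache. Threads bump `occupants` when they
    // enter code living in this generation and drop it when they return to
    // the VM, so "drained" is just the counter reaching zero.
    struct Generation {
        std::vector<unsigned char> code;   // storage for translated code
        std::atomic<int> occupants{0};     // threads currently executing inside
    };

    class GenerationalCache {
        std::shared_ptr<Generation> current_;
        std::vector<std::shared_ptr<Generation>> retired_;
        std::size_t high_water_;

    public:
        explicit GenerationalCache(std::size_t limit)
            : current_(std::make_shared<Generation>()),
              high_water_(limit / 2) {}    // assumption: half the budget per generation

        // The high-water mark: flush before the cache is completely full, so
        // there is room to open a new generation while the old one drains.
        bool should_flush() const { return current_->code.size() >= high_water_; }

        // Flush: retire the current generation but leave its code mapped, so
        // threads still inside make progress instead of stalling in the VM.
        void flush() {
            retired_.push_back(current_);
            current_ = std::make_shared<Generation>();
            reap();
        }

        // Reclaim any retired generation that has fully drained.
        void reap() {
            retired_.erase(
                std::remove_if(retired_.begin(), retired_.end(),
                               [](const std::shared_ptr<Generation> &g) {
                                   return g->occupants.load() == 0;
                               }),
                retired_.end());
        }
    };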
Memory Scalability of the Code Cache
Ensuring scalability also requires carefully configuring the code stored in the cache
Trace Lengths
• First basic block is non-speculative, others are speculative
• Longer traces = fewer entries in the lookup table, but more unexecuted code
• Shorter traces = two off-trace paths at the end of each basic block that ends in a conditional branch = more exit stub code (trace selection is sketched below)
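A sketch of the underlying trace-selection policy, with invented names, showing where the speculation and the exit stubs come from:

    #include <cstddef>
    #include <vector>

    // Hypothetical basic-block record; only the fields the policy needs.
    struct BasicBlock {
        bool ends_in_cond_branch;   // conditional branch => speculative fallthrough
        BasicBlock *fallthrough;    // next block if the branch is not taken
    };

    // Select up to max_bbs blocks for one trace. The head is non-speculative;
    // every block appended past a conditional branch is a guess that the
    // branch falls through, and each such branch also costs an exit stub.
    std::vector<BasicBlock *> select_trace(BasicBlock *head, std::size_t max_bbs) {
        std::vector<BasicBlock *> trace{head};
        BasicBlock *bb = head;
        while (trace.size() < max_bbs && bb->ends_in_cond_branch && bb->fallthrough) {
            bb = bb->fallthrough;   // speculate: extend the trace past the branch
            trace.push_back(bb);
        }
        return trace;
    }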
Effect of Trace Length on Trace Count
[Chart: total trace count (0–16,000) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp as traces grow from 1 to 32 basic blocks.]
Effect of Trace Length on Memory
[Chart: code cache footprint in KB (0–10,500), broken down into lookup table, links, exit stubs, and traces, as basic blocks per trace grow from 1 to 32.]
Rewriting Instructions
• Pin must regularly rewrite branches
• No atomic branch write on x86
• We use a neat trick*:
[Diagram: the “old” 5-byte branch is first overwritten with a 2-byte self-branch, then the remaining n-2 bytes of the “new” branch are written, and finally the self-branch is replaced by the first bytes of the “new” 5-byte branch.]
* Sundaresan et al. 2006
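The trick can be sketched as three ordered stores. This is our reconstruction for a 5-byte jmp rel32 (opcode 0xE9 followed by a little-endian 32-bit displacement); a production implementation must also respect x86 cross-modifying-code rules, such as serializing other processors:

    #include <atomic>
    #include <cstdint>

    // Patch a 5-byte rel32 JMP in place without any atomic 5-byte write.
    // Assumptions: the page is writable, and the first two bytes do not
    // straddle a boundary that breaks 2-byte store atomicity.
    void patch_branch(uint8_t *branch, int32_t new_disp) {
        uint32_t d = static_cast<uint32_t>(new_disp);

        // Step 1: atomically plant a 2-byte self-branch (EB FE = "jmp -2").
        // A thread reaching the branch now spins harmlessly in place.
        reinterpret_cast<std::atomic<uint16_t> *>(branch)
            ->store(0xFEEB, std::memory_order_seq_cst);   // little-endian EB FE

        // Step 2: write the last n-2 bytes of the new branch. These bytes
        // are unreachable while the self-branch is in place.
        branch[2] = static_cast<uint8_t>(d >> 8);
        branch[3] = static_cast<uint8_t>(d >> 16);
        branch[4] = static_cast<uint8_t>(d >> 24);

        // Step 3: atomically replace the self-branch with the first two
        // bytes of the new instruction: opcode 0xE9 plus the low
        // displacement byte. Spinning threads now fall into the new branch.
        uint16_t head = static_cast<uint16_t>(0xE9u | ((d & 0xFFu) << 8));
        reinterpret_cast<std::atomic<uint16_t> *>(branch)
            ->store(head, std::memory_order_seq_cst);
    }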
Performance Results
We use the SPEC OMP 2001 benchmarks
• Thread count is set via the OMP_NUM_THREADS environment variable
We compare
• Native performance and scalability
• Pin (no Pintool) performance and scalability
• Pin (lightweight Pintool) scalability
  – InsCount Pintool: counts instructions at basic-block granularity
• Pin (middleweight Pintool) scalability
  – MemTrace Pintool: records memory addresses (a minimal version is sketched after this list)
• Pin (heavyweight Pintool) scalability
  – CMP$im: collects memory addresses and applies a software model of the CMP cache
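A minimal tool in the spirit of MemTrace (our sketch built on real Pin calls, not the authors' tool):

    #include "pin.H"
    #include <cstdio>

    static FILE *out;

    // Analysis routine: log the thread ID and effective address.
    // NOTE: a scalable tool would buffer per thread instead of sharing a FILE.
    VOID RecordAddr(ADDRINT ea, THREADID tid) {
        std::fprintf(out, "%u 0x%lx\n", tid, (unsigned long)ea);
    }

    VOID Instruction(INS ins, VOID *v) {
        if (INS_IsMemoryRead(ins))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordAddr,
                                     IARG_MEMORYREAD_EA, IARG_THREAD_ID, IARG_END);
        if (INS_IsMemoryWrite(ins))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordAddr,
                                     IARG_MEMORYWRITE_EA, IARG_THREAD_ID, IARG_END);
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        out = std::fopen("memtrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();   // never returns
        return 0;
    }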
Native Scalability of SPEC OMP 2001
[Chart: native speedup (0x–8x) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp with 1, 2, 4, and 8 threads.]
Performance Scalability (No Instrumentation)
[Chart: runtime relative to native (0%–160%) for the same benchmarks with 1, 2, 4, and 8 threads, running under Pin with no tool attached.]
Performance Scalability (LightWeight Instrumentation)
[Chart: runtime relative to native (0%–160%) with 1, 2, 4, and 8 threads under the lightweight InsCount tool.]
Performance Scalability (MiddleWeight Instrumentation)
[Chart: runtime relative to native (0%–500%) with 1, 2, 4, and 8 threads under the middleweight MemTrace tool.]
Performance Scalability (HeavyWeight Instrumentation)
[Chart: runtime relative to native (0x–500x) with 1, 2, 4, and 8 threads under the heavyweight CMP$im tool.]
Memory Scalability
[Chart: code cache size in KB (0–7,000) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp with 1, 2, 4, and 8 threads.]