DynamoRIO Tutorial Apr2010

241
Building Dynamic Instrumentation Tools with DynamoRIO Saman Amarasinghe Derek Bruening Qin Zhao

Transcript of DynamoRIO Tutorial Apr2010

Page 1: DynamoRIO Tutorial Apr2010

Building Dynamic Instrumentation Tools

with DynamoRIO

Saman AmarasingheDerek Bruening

Qin Zhao

Page 2: DynamoRIO Tutorial Apr2010

2DynamoRIO Tutorial at CGO 24 April 2010

Tutorial Outline

• 1:30-1:40 Welcome + DynamoRIO History• 1:40-2:40 DynamoRIO Internals• 2:40-3:00 Examples, Part 1• 3:00-3:15 Break• 3:15-4:15 DynamoRIO API• 4:15-5:15 Examples, Part 2• 5:15-5:30 Feedback

Page 3: DynamoRIO Tutorial Apr2010

3DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO History

• Dynamo– HP Labs: PA-RISC late 1990’s– x86 Dynamo: 2000

• RIO DynamoRIO– MIT: 2001-2004

• Prior releases– 0.9.1: Jun 2002 (PLDI tutorial)– 0.9.2: Oct 2002 (ASPLOS tutorial)– 0.9.3: Mar 2003 (CGO tutorial)– 0.9.4: Feb 2005

Page 4: DynamoRIO Tutorial Apr2010

4DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO History

– 0.9.5: Apr 2008 (CGO tutorial)– 1.0 (0.9.6): Sep 2008 (GoVirtual.org launch)

• Determina– 2003-2007– Security company

• VMware– Acquired Determina (and DynamoRIO) in 2007

• Open-source BSD license– Feb 2009: 1.3.1 release– Dec 2009: 1.5.0 release– Apr 2010: 2.0.0 release

Page 5: DynamoRIO Tutorial Apr2010

DynamoRIO Internals1:30-1:40 Welcome + DynamoRIO History1:40-2:40 DynamoRIO Internals2:40-3:00 Examples, Part 13:00-3:15 Break3:15-4:15 DynamoRIO API4:15-5:15 Examples, Part 25:15-5:30 Feedback

Page 6: DynamoRIO Tutorial Apr2010

6DynamoRIO Tutorial at CGO 24 April 2010

Typical Modern Application: IIS

Page 7: DynamoRIO Tutorial Apr2010

7DynamoRIO Tutorial at CGO 24 April 2010

Runtime Interposition Layer

Page 8: DynamoRIO Tutorial Apr2010

8DynamoRIO Tutorial at CGO 24 April 2010

Design Goals

• Efficient– Near-native performance

• Transparent– Match native behavior

• Comprehensive– Control every instruction, in any application

• Customizable– Adapt to satisfy disparate tool needs

Page 9: DynamoRIO Tutorial Apr2010

9DynamoRIO Tutorial at CGO 24 April 2010

Challenges of Real-World Apps

• Multiple threads– Synchronization

• Application introspection– Reading of return address

• Transparency corner cases are the norm– Example: access beyond top of stack

• Scalability– Must adapt to varying code sizes, thread counts, etc.

• Dynamically generated code– Performance challenges

Page 10: DynamoRIO Tutorial Apr2010

10DynamoRIO Tutorial at CGO 24 April 2010

Internals Outline

• Efficient– Software code cache overview– Thread-shared code cache– Cache capacity limits– Data structures

• Transparent• Comprehensive• Customizable

Page 11: DynamoRIO Tutorial Apr2010

11DynamoRIO Tutorial at CGO 24 April 2010

Direct Code Modification

Kernel32!TerminateProcess:7d4d1028 7c 05 jl 7d4d102f7d4d102a 33 c0 xor %eax,%eax7d4d102c 40 inc %eax7d4d102d eb 08 jmp 7d4d10377d4d102f 50 push %eax7d4d1030 e8 ed 7c 00 00 call 7d4d8d22

e9 37 6f 48 92 jmp <callout>

Page 12: DynamoRIO Tutorial Apr2010

12DynamoRIO Tutorial at CGO 24 April 2010

Debugger Trap Too Expensive

Kernel32!TerminateProcess:7d4d1028 7c 05 jl 7d4d102f7d4d102a 33 c0 xor %eax,%eax7d4d102c 40 inc %eax7d4d102d eb 08 jmp 7d4d10377d4d102f 50 push %eax7d4d1030 e8 ed 7c 00 00 call 7d4d8d22

cc int3 (breakpoint)

Page 13: DynamoRIO Tutorial Apr2010

13DynamoRIO Tutorial at CGO 24 April 2010

Variable-Length Instruction Complications

Kernel32!TerminateProcess:7d4d1028 7c 05 jl 7d4d102f7d4d102a 33 c0 xor %eax,%eax7d4d102c 40 inc %eax7d4d102d eb 08 jmp 7d4d10377d4d102f 50 push %eax7d4d1030 e8 ed 7c 00 00 call 7d4d8d22

e9 37 6f 48 92 jmp <callout>

Page 14: DynamoRIO Tutorial Apr2010

14DynamoRIO Tutorial at CGO 24 April 2010

Entry Point Complications

Kernel32!TerminateProcess:7d4d1028 7c 05 jl 7d4d102f7d4d102a 33 c0 xor %eax,%eax7d4d102c 40 inc %eax7d4d102d eb 08 jmp 7d4d10377d4d102f 50 push %eax7d4d1030 e8 ed 7c 00 00 call 7d4d8d22

e9 37 6f 48 92 jmp <callout>

Page 15: DynamoRIO Tutorial Apr2010

15DynamoRIO Tutorial at CGO 24 April 2010

Direct Code Modification: Too Limited

• Not transparent– Cannot write jump atomically if crosses cache line– Even if write is atomic, not safe if overwrites part of next

instruction– Jump may span code entry point

• Too limited– Not safe w/o suspending all threads and knowing all entry points– Limited to inserting callouts

• Code displaced by jump is a mini code cache– All the same consistency challenges of larger cache

• Inter-operation issues with other hooks

Page 16: DynamoRIO Tutorial Apr2010

16DynamoRIO Tutorial at CGO 24 April 2010

We Need Indirection

• Avoid transparency and granularity limitations of directly modifying application code

• Allow arbitrary modifications at unrestricted points in code stream

• Allow systematic, fine-grained modifications to code stream• Guarantee that all code is observed

Page 17: DynamoRIO Tutorial Apr2010

17DynamoRIO Tutorial at CGO 24 April 2010

Basic Interpreter

fetch

START

decode execute

Slowdown: ~300x

Page 18: DynamoRIO Tutorial Apr2010

18DynamoRIO Tutorial at CGO 24 April 2010

Improvement #1: Interpreter + Basic Block Cache

basic block builderSTART

dispatch

context switch

BASIC BLOCK CACHE

non-control-flow instructions

Non-control-flow instructions executed from software code cache

Slowdown: 300x 25x

Page 19: DynamoRIO Tutorial Apr2010

19DynamoRIO Tutorial at CGO 24 April 2010

Improvement #1: Interpreter + Basic Block Cache

basic block builderSTART

dispatch

context switch

BASIC BLOCK CACHE

A

B C

D

A B D

Page 20: DynamoRIO Tutorial Apr2010

20DynamoRIO Tutorial at CGO 24 April 2010

Example Basic Block Fragment

add %eax, %ecxcmp $4, %eaxjle $0x40106f

add %eax, %ecxcmp $4, %eaxjle <stub0>jmp <stub1>mov %eax, eax-slotmov &dstub0, %eaxjmp context_switchmov %eax, eax-slotmov &dstub1, %eaxjmp context_switch

frag7:

stub0:

stub1:

dstub0target: 0x40106f

dstub1target: fall-thru

Page 21: DynamoRIO Tutorial Apr2010

21DynamoRIO Tutorial at CGO 24 April 2010

Improvement #2: Linking Direct Branches

basic block builderSTART

dispatch

context switch

BASIC BLOCK CACHE

non-control-flow instructions

Direct branch to existing block can bypass dispatch

Slowdown: 300x 25x 3x

Page 22: DynamoRIO Tutorial Apr2010

22DynamoRIO Tutorial at CGO 24 April 2010

Improvement #2: Linking Direct Branches

basic block builderSTART

dispatch

context switch

BASIC BLOCK CACHE

A

B C

D

A B D

Page 23: DynamoRIO Tutorial Apr2010

23DynamoRIO Tutorial at CGO 24 April 2010

Direct Linking

add %eax, %ecxcmp $4, %eaxjle $0x40106f

add %eax, %ecxcmp $4, %eaxjle <frag8>jmp <stub1>mov %eax, eax-slotmov &dstub0, %eaxjmp context_switchmov %eax, eax-slotmov &dstub1, %eaxjmp context_switch

frag7:

stub0:

stub1:

dstub0target: 0x40106f

dstub1target: fall-thru

Page 24: DynamoRIO Tutorial Apr2010

24DynamoRIO Tutorial at CGO 24 April 2010

Improvement #3: Linking Indirect Branches

basic block builderSTART

dispatch

context switch

BASIC BLOCK CACHE

non-control-flow instructions

indirect branch lookup

Application address mapped to code cache

Slowdown: 300x 25x 3x 1.2x

Page 25: DynamoRIO Tutorial Apr2010

25DynamoRIO Tutorial at CGO 24 April 2010

Indirect Branch Transformation

retmov %ecx, ecx-slotpop %ecxjmp <ib_lookup>

frag8:

...

...

...

ib_lookup:

Page 26: DynamoRIO Tutorial Apr2010

26DynamoRIO Tutorial at CGO 24 April 2010

Improvement #4: Trace Building

• Traces reduce branching, improve layout and locality, and facilitate optimizations across blocks– We avoid indirect branch lookup

• Next Executing Tail (NET) trace building scheme [Duesterwald 2000]

A

B

D G

E

C F

H

I

J

K

L

ABEFHD

GKJ

Basic Block Cache Trace Cache

Page 27: DynamoRIO Tutorial Apr2010

27DynamoRIO Tutorial at CGO 24 April 2010

Incremental NET Trace Building

J

K

G

Basic Block Cache Trace Cache

G

J

K

GK

J

K

GKJ

G

G

Page 28: DynamoRIO Tutorial Apr2010

28DynamoRIO Tutorial at CGO 24 April 2010

Improvement #4: Trace Building

basic block builder trace selectorSTART

dispatch

context switch

BASIC BLOCK CACHE

TRACE CACHE

non-control-flow instructions

non-control-flow instructions

indirect branch lookup

Slowdown: 300x 26x 3x 1.2x 1.1x

indirect branch stays on trace?

Page 29: DynamoRIO Tutorial Apr2010

29DynamoRIO Tutorial at CGO 24 April 2010

Base Performance

SPEC CPU2000 DesktopServer

Page 30: DynamoRIO Tutorial Apr2010

30DynamoRIO Tutorial at CGO 24 April 2010

Sources of Overhead

• Extra instructions– Indirect branch target comparisons– Indirect branch hashtable lookups

• Extra data cache pressure– Indirect branch hashtable

• Branch mispredictions– ret becomes jmp*

• Application code modification

Page 31: DynamoRIO Tutorial Apr2010

31DynamoRIO Tutorial at CGO 24 April 2010

Time Breakdown for SPECINT

basic block builder trace selectorSTART

dispatch

context switch

BASIC BLOCK CACHE

TRACE CACHE

non-control-flow instructions

non-control-flow instructions

indirect branch lookup

indirect branch stays on trace?

< 1%

~ 0% ~ 2%~ 94%~ 4%

Page 32: DynamoRIO Tutorial Apr2010

32DynamoRIO Tutorial at CGO 24 April 2010

Not An Ordinary Application

• An application executing in DynamoRIO’s code cache looks different from what the underlying hardware has been tuned for

• The hardware expects:– Little or no dynamic code modification

• Writes to code are expensive– call and ret instructions

• Return Stack Buffer predictor

Page 33: DynamoRIO Tutorial Apr2010

33DynamoRIO Tutorial at CGO 24 April 2010

Performance Counter Data

Page 34: DynamoRIO Tutorial Apr2010

34DynamoRIO Tutorial at CGO 24 April 2010

Internals Outline

• Efficient– Software code cache overview– Thread-shared code cache– Cache capacity limits– Data structures

• Transparent• Comprehensive• Customizable

Page 35: DynamoRIO Tutorial Apr2010

35DynamoRIO Tutorial at CGO 24 April 2010

Threading Model

Thread1

Running Program

Code Caching Runtime System

Operating System

Thread1

Thread1

Thread2

Thread2

Thread2

Thread3

Thread3

Thread3

ThreadN

ThreadN

ThreadN

Page 36: DynamoRIO Tutorial Apr2010

36DynamoRIO Tutorial at CGO 24 April 2010

Code Space

Thread

Thread

Thread1

Thread

Thread

Thread2

Thread

Thread

Thread1

Thread

Thread

Thread2

Running Program

Thread-Shared Code Cache

Operating System

Thread-Private Code Caches

Page 37: DynamoRIO Tutorial Apr2010

37DynamoRIO Tutorial at CGO 24 April 2010

Thread-Private versus Thread-Shared

• Thread-private– Less synchronization needed– Absolute addressing for thread-local storage– Thread-specific optimization and instrumentation

• Thread-shared– Scales to many-threaded apps

Page 38: DynamoRIO Tutorial Apr2010

38DynamoRIO Tutorial at CGO 24 April 2010

Database and Web Server Suite

Benchmark Server Processes

ab low IIS low isolation inetinfo.exe

ab med IIS medium isolation inetinfo.exe, dllhost.exe

guest low IIS low isolation, SQL Server 2000

inetinfo.exe, sqlservr.exe

guest med IIS medium isolation, SQL Server 2000

inetinfo.exe, dllhost.exe, sqlservr.exe

Page 39: DynamoRIO Tutorial Apr2010

39DynamoRIO Tutorial at CGO 24 April 2010

Memory Impact

ab med guest low guest med

Page 40: DynamoRIO Tutorial Apr2010

40DynamoRIO Tutorial at CGO 24 April 2010

Performance Impact

Page 41: DynamoRIO Tutorial Apr2010

41DynamoRIO Tutorial at CGO 24 April 2010

Scalability Limit

Page 42: DynamoRIO Tutorial Apr2010

42DynamoRIO Tutorial at CGO 24 April 2010

Internals Outline

• Efficient– Software code cache overview– Thread-shared code cache– Cache capacity limits– Data structures

• Transparent• Comprehensive• Customizable

Page 43: DynamoRIO Tutorial Apr2010

43DynamoRIO Tutorial at CGO 24 April 2010

Added Memory Breakdown

Page 44: DynamoRIO Tutorial Apr2010

44DynamoRIO Tutorial at CGO 24 April 2010

Code Expansion

original code66%

net jumps8%

indirect branch target handling

7%

exit stubs19%

Page 45: DynamoRIO Tutorial Apr2010

45DynamoRIO Tutorial at CGO 24 April 2010

Cache Capacity Challenges

• How to set an upper limit on the cache size– Different applications have different working sets and different

total code sizes• Which fragments to evict when that limit is reached

– Without expensive profiling or extensive fragmentation

Page 46: DynamoRIO Tutorial Apr2010

46DynamoRIO Tutorial at CGO 24 April 2010

Adaptive Sizing Algorithm

• Enlarge cache if warranted by percentage of new fragments that are regenerated

• Target working set of application: don’t enlarge for once-only code

• Low-overhead, incremental, and reactive

Page 47: DynamoRIO Tutorial Apr2010

47DynamoRIO Tutorial at CGO 24 April 2010

Cache Capacity Settings

• Thread-private:– Working set size matching is on by default– Client may see blocks or traces being deleted in the absence of

any cache consistency event– Can disable capacity management via

• -no_finite_bb_cache• -no_finite_trace_cache

• Thread-shared:– Set to infinite size by default– Can enable capacity management via

• -finite_shared_bb_cache• -finite_shared_trace_cache

Page 48: DynamoRIO Tutorial Apr2010

48DynamoRIO Tutorial at CGO 24 April 2010

Internals Outline

• Efficient– Software code cache overview– Thread-shared code cache– Cache capacity limits– Data structures

• Transparent• Comprehensive• Customizable

Page 49: DynamoRIO Tutorial Apr2010

49DynamoRIO Tutorial at CGO 24 April 2010

Two Modes of Code Cache Operation

• Fine-grained scheme– Supports individual code fragment unlink and removal– Separate data structure per code fragment and each of its exits,

memory regions spanned, and incoming links• Coarse-grained scheme

– No individual code fragment control– Permanent intra-cache links– No per-fragment data structures at all– Treat entire cache as a unit for consistency

Page 50: DynamoRIO Tutorial Apr2010

50DynamoRIO Tutorial at CGO 24 April 2010

Data Structures

• Fine-grained scheme– Data structures are highly tuned and compact

• Coarse-grained scheme– There are no data structures– Savings on applications with large amounts of code are typically

15%-25% of committed memory and 5%-15% of working set

Page 51: DynamoRIO Tutorial Apr2010

51DynamoRIO Tutorial at CGO 24 April 2010

Status in Current Release

• Fine-grained scheme– Current default

• Coarse-grained scheme– Select with –opt_memory runtime option– Possible performance hit on certain benchmarks– In the future will be the default option– Required for persisted and process-shared caches

Page 52: DynamoRIO Tutorial Apr2010

52DynamoRIO Tutorial at CGO 24 April 2010

Adaptive Level of Granularity

• Start with coarse-grain caches– Plus freezing and sharing/persisting

• Switch to fine-grain for individual modules or sub-regions of modules after significant consistency events, to avoid expensive entire-module flushes– Support simultaneous fine-grain fragments within coarse-grain

regions for corner cases• Match amount of bookkeeping to amount of code change

– Majority of application code does not need fine-grain

Page 53: DynamoRIO Tutorial Apr2010

53DynamoRIO Tutorial at CGO 24 April 2010

Many Varieties of Code Caches

• Coarse-grained versus fine-grained• Thread-shared versus thread-private• Basic blocks versus traces

Page 54: DynamoRIO Tutorial Apr2010

54DynamoRIO Tutorial at CGO 24 April 2010

Internals Outline

• Efficient• Transparent

– Rules of transparency– Cache consistency– Synchronization

• Comprehensive• Customizable

Page 55: DynamoRIO Tutorial Apr2010

55DynamoRIO Tutorial at CGO 24 April 2010

Transparency

• Do not want to interfere with the semantics of the program• Dangerous to make any assumptions about:

– Register usage– Calling conventions– Stack layout– Memory/heap usage– I/O and other system call use

Page 56: DynamoRIO Tutorial Apr2010

56DynamoRIO Tutorial at CGO 24 April 2010

Painful, But Necessary

• Difficult and costly to handle corner cases• Many applications will not notice…• …but some will!

– Microsoft Office: Visual Basic generated code, stack convention violations

– COM, Star Office, MMC: trampolines– Adobe Premiere: self-modifying code– VirtualDub: UPX-packed executable– etc.

Page 57: DynamoRIO Tutorial Apr2010

57DynamoRIO Tutorial at CGO 24 April 2010

Rule 1: Avoid Resource Conflicts

• DynamoRIO system code executes at arbitrary points during application execution

• If DynamoRIO uses the same library routine as the application, it may call that routine in the middle of the same routine being called by the application

• Most library routines are not re-entrant!– Many are thread-safe, but that does not help us

Page 58: DynamoRIO Tutorial Apr2010

58DynamoRIO Tutorial at CGO 24 April 2010

Rule 1: Avoid Resource Conflicts

Linux Windows

Page 59: DynamoRIO Tutorial Apr2010

59DynamoRIO Tutorial at CGO 24 April 2010

Rule 2: If It’s Not Broken, Don’t Change It

• Threads• Executable on disk• Application data

– Including the stack!

Page 60: DynamoRIO Tutorial Apr2010

60DynamoRIO Tutorial at CGO 24 April 2010

Example Transparency Violation

Error

Error

Error

Error

SPEC CPU2000 DesktopServer

Error

Error

Error

Error

Error

Error

Page 61: DynamoRIO Tutorial Apr2010

61DynamoRIO Tutorial at CGO 24 April 2010

Rule 3: If You Change It, Emulate Original Behavior’s Visible Effects

• Application addresses• Address space• Error transparency• Code cache consistency

Page 62: DynamoRIO Tutorial Apr2010

62DynamoRIO Tutorial at CGO 24 April 2010

Internals Outline

• Efficient• Transparent

– Rules of transparency– Cache consistency– Synchronization

• Comprehensive• Customizable

Page 63: DynamoRIO Tutorial Apr2010

63DynamoRIO Tutorial at CGO 24 April 2010

Code Change Mechanisms

I-Cache

Store BFlush BJump B

A:B:C:D:

D-CacheA:B:C:D:

RISCI-Cache

Store BJump B

A:B:C:D:

D-CacheA:B:C:D:

x86

Page 64: DynamoRIO Tutorial Apr2010

64DynamoRIO Tutorial at CGO 24 April 2010

How Often Does Code Change?

• Not just modification of code!• Removal of code

– Shared library unloading• Replacement of code

– JIT region re-use– Trampoline on stack

Page 65: DynamoRIO Tutorial Apr2010

65DynamoRIO Tutorial at CGO 24 April 2010

Code Change Events

Memory Unmappings

Generated Code Regions

Modified Code Regions

SPECFP 112 0 0

SPECINT 29 0 0

SPECJVM 7 3373 4591

Excel 144 21 20

Photoshop 1168 40 0

Powerpoint 367 28 33

Word 345 20 6

Page 66: DynamoRIO Tutorial Apr2010

66DynamoRIO Tutorial at CGO 24 April 2010

Detecting Code Removal

• Example: shared library being unloaded• Requires explicit request by application to operating system• Detect by monitoring system calls (munmap,

NtUnmapViewOfSection)

Page 67: DynamoRIO Tutorial Apr2010

67DynamoRIO Tutorial at CGO 24 April 2010

Detecting Code Modification

• On x86, no explicit app request required, as the icache is kept consistent in hardware – so any memory write could modify code!

I-Cache

Store BJump B

A:B:C:D:

D-CacheA:B:C:D:

x86

Page 68: DynamoRIO Tutorial Apr2010

68DynamoRIO Tutorial at CGO 24 April 2010

Page Protection Plus Instrumentation

• Invariant: application code copied to code cache must be read-only– If writable, hide read-only status from application

• Some code cannot or should not be made read-only– Self-modifying code– Windows stack– Code on a page with frequently written data

• Use per-fragment instrumentation to ensure code is not stale on entry and to catch self-modification

Page 69: DynamoRIO Tutorial Apr2010

69DynamoRIO Tutorial at CGO 24 April 2010

Adaptive Consistency Algorithm

• Use page protection by default– Most code regions are always read-only

• Subdivide written-to regions to reduce flushing cost of write-execute cycle– Large read-only regions, small written-to regions

• Switch to instrumentation if write-execute cycle repeats too often (or on same page)– Switch back to page protection if writes decrease

Bruening et al. “Maintaining Consistency and Bounding Capacity of Software Code Caches” CGO’05

Page 70: DynamoRIO Tutorial Apr2010

70DynamoRIO Tutorial at CGO 24 April 2010

Internals Outline

• Efficient• Transparent

– Rules of transparency– Cache consistency– Synchronization

• Comprehensive• Customizable

Page 71: DynamoRIO Tutorial Apr2010

71DynamoRIO Tutorial at CGO 24 April 2010

Synchronization Transparency

• Application thread management should not interfere with the runtime system, and vice versa– Cannot allow the app to suspend a thread holding a runtime

system lock– Runtime system cannot use app locks

Page 72: DynamoRIO Tutorial Apr2010

72DynamoRIO Tutorial at CGO 24 April 2010

Code Cache Invariant

• App thread suspension requires safe spots where no runtime system locks are held

• Time spent in the code cache can be unbounded→ Our invariant: no runtime system lock can be held while executing in the code cache

Page 73: DynamoRIO Tutorial Apr2010

73DynamoRIO Tutorial at CGO 24 April 2010

Internals Outline

• Efficient• Transparent• Comprehensive• Customizable

Page 74: DynamoRIO Tutorial Apr2010

74DynamoRIO Tutorial at CGO 24 April 2010

Above the Operating System

Page 75: DynamoRIO Tutorial Apr2010

75DynamoRIO Tutorial at CGO 24 April 2010

Kernel-Mediated Control Transfers

message handler

kernel modeuser mode

message pendingsave user context

no message pendingrestore context

time

majority of executed code in a typical Windows

application

Page 76: DynamoRIO Tutorial Apr2010

76DynamoRIO Tutorial at CGO 24 April 2010

register our own signal handler

DynamoRIO handler

Intercepting Linux Signals

signal handler

kernel modeuser mode

signal pendingsave user context

no signal pendingrestore context

time

Page 77: DynamoRIO Tutorial Apr2010

77DynamoRIO Tutorial at CGO 24 April 2010

dispatcher

Windows Messages

message handler

kernel modeuser mode

message pendingsave user context

no message pendingrestore context

time

Page 78: DynamoRIO Tutorial Apr2010

78DynamoRIO Tutorial at CGO 24 April 2010

modify shared library memory image

dispatcher

dispatcher

Intercepting Windows Messages

message handler

kernel modeuser mode

message pendingsave user context

no message pendingrestore context

time

Page 79: DynamoRIO Tutorial Apr2010

79DynamoRIO Tutorial at CGO 24 April 2010

Must Monitor System Calls

• To maintain control:– Calls that affect the flow of control: register signal handler,

create thread, set thread context, etc.• To maintain transparency:

– Queries of modified state app should not see• To maintain cache consistency:

– Calls that affect the address space• To support cache eviction:

– Interruptible system calls must be redirected

Page 80: DynamoRIO Tutorial Apr2010

80DynamoRIO Tutorial at CGO 24 April 2010

Operating System Dependencies

• System calls and their numbers– Monitor application’s usage, as well as for our own resource

management– Windows changes the numbers each major rel

• Details of kernel-mediated control flow– Must emulate how kernel delivers events

• Initial injection– Once in, follow child processes

Page 81: DynamoRIO Tutorial Apr2010

81DynamoRIO Tutorial at CGO 24 April 2010

Internals Outline

• Efficient• Transparent• Comprehensive• Customizable

Page 82: DynamoRIO Tutorial Apr2010

82DynamoRIO Tutorial at CGO 24 April 2010

Clients

• The engine exports an API for building a client

• System details abstracted away: client focuses on manipulating the code stream

Page 83: DynamoRIO Tutorial Apr2010

83DynamoRIO Tutorial at CGO 24 April 2010

Client Events

basic block builder trace selectorSTART

dispatch

context switch

BASIC BLOCK CACHE

TRACE CACHE

non-control-flow instructions

non-control-flow instructions

indirect branch lookup

indirect branch stays on trace?

client client

client

Page 84: DynamoRIO Tutorial Apr2010

Examples: Part I1:30-1:40 Welcome + DynamoRIO History1:40-2:40 DynamoRIO Internals2:40-3:00 Examples, Part 13:00-3:15 Break3:15-4:15 DynamoRIO API4:15-5:15 Examples, Part 25:15-5:30 Feedback

Page 85: DynamoRIO Tutorial Apr2010

85DynamoRIO Tutorial at CGO 24 April 2010 85

DynamoRIO Examples Part I Outline

• Common Steps of writing a DynamoRIO client

• Dynamic Instruction Counting Example

Page 86: DynamoRIO Tutorial Apr2010

86DynamoRIO Tutorial at CGO 24 April 2010

Common Steps

• Step 1: Register Events– DR_EXPORT void dr_init(client_id_t id)

• Step2: Implementation– Initialization– Finalization– Instrumentation

• Step 3: Optimization– Optimize the instrumentation to improve the performance

86

Register Function Eventsdr_register_bb_event Basic Block Buildingdr_register_thread_init_event Thread Initializationdr_register_exit_event Process Exit

Page 87: DynamoRIO Tutorial Apr2010

87DynamoRIO Tutorial at CGO 24 April 2010 87

DynamoRIO Examples Part I Outline

• Common Steps of writing a DynamoRIO client

• Dynamic Instruction Counting Example

Page 88: DynamoRIO Tutorial Apr2010

88DynamoRIO Tutorial at CGO 24 April 2010 88

A Simplified View of DynamoRIO

basic block builderSTART

dispatch

context switch

BASIC BLOCK CACHE

Page 89: DynamoRIO Tutorial Apr2010

89DynamoRIO Tutorial at CGO 24 April 2010

Step 1: Register Events

uint num_dyn_instrs;

static void event_init(void);static void event_exit(void);static dr_emit_flags_t event_basic_block(void *drcontext, void *tag, instrlist_t *ilist, bool for_trace, bool translating);

DR_EXPORT void dr_init(client_id_t id) { /* register events */ dr_register_bb_event (event_basic_block); dr_register_exit_event(event_exit); /* process initialization event */ event_init();}

89

Page 90: DynamoRIO Tutorial Apr2010

90DynamoRIO Tutorial at CGO 24 April 2010

Step 2: Implementation (I)

static void event_init(void) { num_dyn_instrs = 0;}

static void event_exit(void) { dr_printf(“Total number of instruction executed: %u\n”, num_dyn_instrs);}

static dr_emit_flags_t event_basic_block(void *drcontext, void *tag, instrlist_t *ilist, bool for_trace, bool translating) { int num_instrs; num_instrs = ilist_num_instrs(ilist); insert_count_code(drcontext, ilist, num_instrs); return DR_EMIT_DEFAULT;}

90

Page 91: DynamoRIO Tutorial Apr2010

91DynamoRIO Tutorial at CGO 24 April 2010

Step 2: Implementation (II)

static int ilist_num_instrs(instrlist_t *ilist) { instr_t *instr; int num_instrs = 0; /* iterate over instruction list to count number of instructions */ for (instr = instrlist_first(ilist); instr != NULL; instr = instr_get_next(instr)) num_instrs++; return num_instrs;}

static void do_ins_count(int num_instrs) { num_dyn_instrs += num_instrs; }

static void insert_count_code(void * drcontext, instrlist_t * ilist, int num_instrs) { dr_insert_clean_call(drcontext, ilist, instrlist_first(ilist), do_ins_count, false, 1, OPND_CREATE_INT32(num_instrs));}

91

Page 92: DynamoRIO Tutorial Apr2010

92DynamoRIO Tutorial at CGO 24 April 2010

Instrumented Basic Block

# switch stack# switch aflags and errorno# save all registers# call do_ins_countpush $0x00000003call $0xb7ef73e4 (do_ins_count)# restore registers# switch aflags and errorno back# switch stack back# application codeadd $0x0000e574 %ebx %ebxtest %al $0x08jz $0xb80e8a98

92

Page 93: DynamoRIO Tutorial Apr2010

93DynamoRIO Tutorial at CGO 24 April 2010

Step 3: Optimization (I): counter update inlining

static void insert_count_code (void * drcontext, instrlist_t * ilist, int num_instrs) { instr_t *instr, *where; opnd_t opnd1, opnd2;

where = instrlist_first(ilist); /* save aflags */ dr_save_arith_flags(drcontext, ilist, where, SPILL_SLOT_1); /* num_dyn_instrs += num_instrs */ opnd1 = OPND_CREATE_ABSMEM(&num_dyn_instrs, OPSZ_PTR); opnd2 = OPND_CREATE_INT32(num_instrs); instr = INSTR_CREATE_add(drcontext, opnd1, opnd2); instrlist_meta_preinsert(ilist, where, instr); /* restore aflags */ dr_restore_arith_flags(drcontext, ilist, where, SPILL_SLOT_1); }

93

Page 94: DynamoRIO Tutorial Apr2010

94DynamoRIO Tutorial at CGO 24 April 2010

Instrumented Basic Block

mov %eax %fs:0x0clahf %ahseto %aladd $0x00000003, 0xb7d25030add $0x7f %al %alsahf %ahmov %fs:0x0c %eax# application codeadd $0x0000e574 %ebx %ebxtest %al $0x08jz $0xb7f14a98

94

Page 95: DynamoRIO Tutorial Apr2010

95DynamoRIO Tutorial at CGO 24 April 2010

Step 3: Optimization (II): aflags stealing

static void insert_count_code (void * drcontext, instrlist_t * ilist, int num_instrs) { … save_aflags = aflags_analysis(ilist); /* save aflags */ if (save_aflags) dr_save_arith_flags(drcontext, ilist, where, SPILL_SLOT_1); /* num_dyn_instrs += num_instrs */ opnd1 = OPND_CREATE_ABSMEM(&num_dyn_instrs, OPSZ_PTR); opnd2 = OPND_CREATE_INT32(num_instrs); instr = INSTR_CREATE_add(drcontext, opnd1, opnd2); instrlist_meta_preinsert(ilist, where, instr); /* restore aflags */ if (save_aflags) dr_restore_arith_flags(drcontext, ilist, where, SPILL_SLOT_1); }

95

Page 96: DynamoRIO Tutorial Apr2010

96DynamoRIO Tutorial at CGO 24 April 2010

Instrumented Basic Block

add $0x00000003, 0xb7d25030# application codeadd $0x0000e574 %ebx %ebxtest %al $0x08jz $0xb7f14a98

96

Page 97: DynamoRIO Tutorial Apr2010

97DynamoRIO Tutorial at CGO 24 April 2010

Step 3: Optimization (III): more optimizations

• Using lea (load effective address) instead of addlea [%reg, num_instr] %reg

• Register liveness analysis– Using dead register to avoid register save/restore for lea

• Global aflags/registers analysis– Analyze aflags/registers liveness over CFG

• Trace Optimization– Trace: single-entry multi-exit– Update counters only at trace exits

97

Page 98: DynamoRIO Tutorial Apr2010

98DynamoRIO Tutorial at CGO 24 April 2010

Other Issues

• Data race on counter update in multithreaded programs– Global lock for every update– Atomic update (lock prefixed add)

• LOCK(instr);– Thread private counter

• Thread-private code cache: different variable at different address• Thread-shared code cache: thread local storage

• 32-bit counter overflow– 64-bit counter:

• Two instructions on 32-bit architecture: add, adc– One 32-bit local counter and one 64-bit global counter

• Instrument to update 32-bit local counter• Update 64-bit global counter using time interrupt

98

Page 99: DynamoRIO Tutorial Apr2010

DynamoRIO API1:30-1:40 Welcome + DynamoRIO History1:40-2:40 DynamoRIO Internals2:40-3:00 Examples, Part 13:00-3:15 Break3:15-4:15 DynamoRIO API4:15-5:15 Examples, Part 25:15-5:30 Feedback

Page 100: DynamoRIO Tutorial Apr2010

100DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API Outline

• Building and Deploying• Events• Utilities• Instruction Manipulation• State Translation• Comparison with Pin• Troubleshooting

Page 101: DynamoRIO Tutorial Apr2010

101DynamoRIO Tutorial at CGO 24 April 2010

Clients

• The engine exports an API for building a client

• System details abstracted away: client focuses on manipulating the code stream

Page 102: DynamoRIO Tutorial Apr2010

102DynamoRIO Tutorial at CGO 24 April 2010

Cross-Platform Clients

• DynamoRIO API presents a consistent interface that works across platforms– Windows versus Linux– 32-bit versus 64-bit– Thread-private versus thread-shared

• Same client source code generally works on all combinations of platforms

• Some exceptions, noted in the documentation

Page 103: DynamoRIO Tutorial Apr2010

103DynamoRIO Tutorial at CGO 24 April 2010

Building a Client

• Include DR API header file– #include “dr_api.h”

• Set platform defines– WINDOWS or LINUX – X86_32 or X86_64

• Export a dr_init function– DR_EXPORT void dr_init (client_id_t client_id)

• Build a shared library

Page 104: DynamoRIO Tutorial Apr2010

104DynamoRIO Tutorial at CGO 24 April 2010

Auto-Configure Using CMake

add_library(myclient SHARED myclient.c) find_package(DynamoRIO)if (NOT DynamoRIO_FOUND) message(FATAL_ERROR "DynamoRIO package

required to build")endif(NOT DynamoRIO_FOUND)configure_DynamoRIO_client(myclient)

Page 105: DynamoRIO Tutorial Apr2010

105DynamoRIO Tutorial at CGO 24 April 2010

CMake

• Build system converted to CMake when open-sourced– Switch from frozen toolchain to supporting range of tools

• CMake generates build files for native compiler of choice– Makefiles for UNIX, nmake, etc.– Visual Studio project files

• http://www.cmake.org/

Page 106: DynamoRIO Tutorial Apr2010

106DynamoRIO Tutorial at CGO 24 April 2010

Library Usage and Transparency

• For best transparency: completely self-contained client– Imports only from DynamoRIO API– -nodefaultlibs or /nodefaultlib

• On Windows:– String and utility routines provided by forwards to ntdll– Cl.exe /MT static copy of C/C++ libraries– Custom loader loads private copy of client dependences

• On Linux:– Use ld –wrap to redirect malloc calls to DR’s heap– Older distributions shipped suitable static C/C++ lib– Newer distros: need to build yourself– Coming soon: custom loader for private copy of libs

Page 107: DynamoRIO Tutorial Apr2010

107DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO Extensions

• DynamoRIO API is extended via libraries called Extensions• Both static and shared supported• Built and packaged with DynamoRIO• Easy for a client to use

– use_DynamoRIO_extension(myclient drsyms)• Current Extensions:

– drsyms: symbol lookup (currently Windows-only)– drcontainers: hashtable

• Coming soon:– Umbra: shadow memory framework– Your utility library or framework contribution!

Page 108: DynamoRIO Tutorial Apr2010

108DynamoRIO Tutorial at CGO 24 April 2010

Application Configuration

• File-based scheme• Per-user local files

– $HOME/.dynamorio/ on Linux– $USERPROFILE/dynamorio/ on Windows

• Global files– /etc/dynamorio/ on Linux– Registry-specified directory on Windows

• Files are lists of var=value

Page 109: DynamoRIO Tutorial Apr2010

109DynamoRIO Tutorial at CGO 24 April 2010

Deploying Clients

• One-step configure-and-run usage model:– drrun <client> <options> <app cmdline>– Uses an invisible temporary one-time configuration file– Overrides any regular config file

• Two-step usage model giving control over children:– drconfig –reg <appname> <client> <options>– drinject <app cmdline>

• Systemwide injection:– drconfig –syswide_on –reg <appname> <client> <options>– <run app normally>

Page 110: DynamoRIO Tutorial Apr2010

110DynamoRIO Tutorial at CGO 24 April 2010

Deploying Clients On Linux

• drrun and drinject scripts: LD_PRELOAD-based– Take over after statically-dependent shared libs but before exe

• Suid apps ignore LD_PRELOAD– Place libdrpreload.so's full path in /etc/ld.so.preload– Copy libdynamorio.so to /usr/lib

• In the future:– Attach– Early injection

Page 111: DynamoRIO Tutorial Apr2010

111DynamoRIO Tutorial at CGO 24 April 2010

Deploying Clients On Windows

• drinject and drrun injection– Currently after all shared libs are initialized

• From-parent injection– Early: before any shared libs are loaded

• Systemwide injection via –syswide_on– Requires administrative privileges– Launch app normally: no need to run via drinject/drrun– Moderately early: during user32.dll initialization

• In the future:– Earliest injection for drrun/drinject and from-parent

Page 112: DynamoRIO Tutorial Apr2010

112DynamoRIO Tutorial at CGO 24 April 2010

Non-Standard Deployment

• Standalone API– Use DynamoRIO as a library of IA-32/AMD64 manipulation

routines• Start/Stop API

– Can instrument source code with where DynamoRIO should control the application

Page 113: DynamoRIO Tutorial Apr2010

113DynamoRIO Tutorial at CGO 24 April 2010

Runtime Options

• Pass options to drconfig/drrun• A large number of options; the most relevant are:

– -code_api– -client <client lib> <client ops> <client id>– -thread_private– -tracedump_text and –tracedump_binary– -prof_pcs

Page 114: DynamoRIO Tutorial Apr2010

114DynamoRIO Tutorial at CGO 24 April 2010

Runtime Options For Debugging

• Notifications:– -stderr_mask 0xN– -msgbox_mask 0xN

• Windows:– -no_hide

• Debug-build-only:– -loglevel N– -ignore_assert_list ‘*’

Page 115: DynamoRIO Tutorial Apr2010

115DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API Outline

• Building and Deploying• Events• Utilities• Instruction Manipulation• State Translation• Comparison with Pin• Troubleshooting

Page 116: DynamoRIO Tutorial Apr2010

116DynamoRIO Tutorial at CGO 24 April 2010

Client Events

basic block builder trace selectorSTART

dispatch

context switch

BASIC BLOCK CACHE

TRACE CACHE

non-control-flow instructions

non-control-flow instructions

indirect branch lookup

indirect branch stays on trace?

client client

client

Page 117: DynamoRIO Tutorial Apr2010

117DynamoRIO Tutorial at CGO 24 April 2010

Client Events: Code Stream

• Client has opportunity to inspect and potentially modify every single application instruction, immediately before it executes

• Entire application code stream– Basic block creation event: can modify the block– For comprehensive instrumentation tools

• Or, focus on hot code only– Trace creation event: can modify the trace– Custom trace creation: can determine trace end condition– For optimization and profiling tools

Page 118: DynamoRIO Tutorial Apr2010

118DynamoRIO Tutorial at CGO 24 April 2010

Simplifying Client View

• Several optimizations disabled– Elision of unconditional branches– Indirect call to direct call conversion– Shared cache sizing– Process-shared and persistent code caches

• Future release will give client control over optimizations

Page 119: DynamoRIO Tutorial Apr2010

119DynamoRIO Tutorial at CGO 24 April 2010

Basic Block Event

static dr_emit_flags_tevent_basic_block(void *drcontext, void *tag, instrlist_t *bb, bool for_trace, bool translating) { instr_t *inst; for (inst = instrlist_first(bb); inst != NULL; inst = instr_get_next(inst)) { /* … */ }

return DR_EMIT_DEFAULT;}DR_EXPORT void dr_init(client_id_t id) { dr_register_bb_event(event_basic_block);}

Page 120: DynamoRIO Tutorial Apr2010

120DynamoRIO Tutorial at CGO 24 April 2010

Trace Event

static dr_emit_flags_tevent_trace(void *drcontext, void *tag, instrlist_t *trace, bool translating) { instr_t *inst; for (inst = instrlist_first(trace); inst != NULL; inst = instr_get_next(inst)) { /* … */ }

return DR_EMIT_DEFAULT;}DR_EXPORT void dr_init(client_id_t id) { dr_register_trace_event(event_trace);}

Page 121: DynamoRIO Tutorial Apr2010

121DynamoRIO Tutorial at CGO 24 April 2010

Client Events: Application Actions

• Application thread creation and deletion• Application library load and unload• Application exception (Windows)

– Client chooses whether to deliver or suppress• Application signal (Linux)

– Client chooses whether to deliver, suppress, bypass the app handler, or redirect control

Page 122: DynamoRIO Tutorial Apr2010

122DynamoRIO Tutorial at CGO 24 April 2010

Client Events: Application System Calls

• Application pre- and post- system call– Platform-independent system call parameter access– Client can modify:

• Return value in post-, or set value and skip syscall in pre-• Call number• Params

– Client can invoke an additional system call as the app

Page 123: DynamoRIO Tutorial Apr2010

123DynamoRIO Tutorial at CGO 24 April 2010

Client Events: Bookkeeping

• Initialization and Exit– Entire process– Each thread– Child of fork (Linux-only)

• Basic block and trace deletion during cache management• Nudge received

– Used for communication into client• Itimer fired (Linux-only)

Page 124: DynamoRIO Tutorial Apr2010

124DynamoRIO Tutorial at CGO 24 April 2010

Multiple Clients

• It is each client's responsibility to ensure compatibility with other clients– Instruction stream modifications made by one client are visible to

other clients• At client registration each client is given a priority

– dr_init() called in priority order (priority 0 called first and thus registers its callbacks first)

• Event callbacks called in reverse order of registration– Gives precedence to first registered callback, which is given the

final opportunity to modify the instruction stream or influence DynamoRIO's operation

Page 125: DynamoRIO Tutorial Apr2010

125DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API Outline

• Building and Deploying• Events• Utilities• Instruction Manipulation• State Translation• Comparison with Pin• Troubleshooting

Page 126: DynamoRIO Tutorial Apr2010

126DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API: General Utilities

• Transparency support– Separate memory allocation and I/O– Alternate stack

• Thread support– Thread-local memory– Simple mutexes– Thread-private code caches, if requested

Page 127: DynamoRIO Tutorial Apr2010

127DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API: General Utilities, Cont’d

• Communication– Nudges: ping from external process– File creation, reading, and writing

• Sideline support– Create new client-only thread– Thread-private itimer (Linux-only)

Page 128: DynamoRIO Tutorial Apr2010

128DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API: General Utilities, Cont’d

• Application inspection– Address space querying– Module iterator– Processor feature identification– Symbol lookup (currently Windows-only)

• Third-party library support– If transparency is maintained!– -wrap support on Linux– ntdll.dll link support on Windows– Custom loader for private library copy on Windows

Page 129: DynamoRIO Tutorial Apr2010

129DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO Heap

• Three flavors:– Thread-private: no synchronization; thread lifetime– Global: synchronized, process lifetime– “Non-heap”: for generated code, etc.– No header on allocated memory: low overhead but must pass

size on free• Leak checking

– Debug build complains at exit if memory was not deallocated

Page 130: DynamoRIO Tutorial Apr2010

130DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API Outline

• Building and Deploying• Events• Utilities• Instruction Manipulation• State Translation• Comparison with Pin• Troubleshooting

Page 131: DynamoRIO Tutorial Apr2010

131DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API: Instruction Representation

• Full IA-32/AMD64 instruction representation• Instruction creation with auto-implicit-operands• Operand iteration• Instruction lists with iteration, insertion, removal• Decoding at various levels of detail• Encoding

Page 132: DynamoRIO Tutorial Apr2010

132DynamoRIO Tutorial at CGO 24 April 2010

Instruction Representation

cmp

shl

movzx

sub

mov

lea

jnl

8b 46 0c

8d 34 01

2b 46 1c

0f b7 4e 08

c1 e1 07

3b c1

0f 8d a2 0a 00 00 $0x77f52269

%eax %ecx

$0x07 %ecx -> %ecx

0x8(%esi) -> %ecx

0x1c(%esi) %eax -> %eax

0xc(%esi) -> %eax

(%ecx,%eax,1) -> %esi

WCPAZSO

WCPAZSO

-

WCPAZSO

-

-

RSOopcode operandsraw bytes eflags

Page 133: DynamoRIO Tutorial Apr2010

133DynamoRIO Tutorial at CGO 24 April 2010

Instruction Representation

cmp

shl

movzx

sub

mov

lea

jnl

c1 e1 07

3b c1

0f 8d a2 0a 00 00 $0x77f52269

%eax %ecx

$0x07 %ecx -> %ecx

0x8(%edi) -> %ecx

0x1c(%edi) %eax -> %eax

0xc(%edi) -> %eax

(%ecx,%eax,1) -> %edi

WCPAZSO

WCPAZSO

-

WCPAZSO

-

-

RSOopcode operandsraw bytes eflags

Page 134: DynamoRIO Tutorial Apr2010

134DynamoRIO Tutorial at CGO 24 April 2010

Instruction Creation

• Method 1: use the INSTR_CREATE_opcode macros that fill in implicit operands automatically:instr_t *instr = INSTR_CREATE_dec(dcontext,

opnd_create_reg(REG_EDX)); • Method 2: specify opcode + all operands (including implicit

operands):instr_t *instr = instr_create(dcontext);instr_set_opcode(instr, OP_dec);instr_set_num_opnds(dcontext, instr, 1, 1);instr_set_dst(instr, 0, opnd_create_reg(REG_EDX));instr_set_src(instr, 0, opnd_create_reg(REG_EDX));

Page 135: DynamoRIO Tutorial Apr2010

135DynamoRIO Tutorial at CGO 24 April 2010

Linear Control Flow

• Both basic blocks and traces are linear

• Instruction sequences are all single-entrance, multiple-exit

• Greatly simplifies analysis algorithms

Page 136: DynamoRIO Tutorial Apr2010

136DynamoRIO Tutorial at CGO 24 April 2010

64-Bit Versus 32-Bit

• 32-bit build of DynamoRIO only handles 32-bit code• 64-bit build of DynamoRIO decodes/encodes both 32-bit and

64-bit code– Current release does not support executing applications that mix

the two• IR is universal: covers both 32-bit and 64-bit

– Abstracts away underlying mode

Page 137: DynamoRIO Tutorial Apr2010

137DynamoRIO Tutorial at CGO 24 April 2010

64-Bit Thread and Instruction Modes

• When going to or from the IR, the thread mode and instruction mode determine how instrs are interpreted

• When decoding, current thread’s mode is used– Default is 64-bit for 64-bit DynamoRIO– Can be changed with set_x86_mode()

• When encoding, that instruction’s mode is used– When created, set to mode of current thread– Can be changed with instr_set_x86_mode()

Page 138: DynamoRIO Tutorial Apr2010

138DynamoRIO Tutorial at CGO 24 April 2010

64-Bit Clients

• Define X86_64 before including header files when building a 64-bit client

• Convenience macros for printf formats, etc. are provided– E.g.:

• printf(“Pointer is ”PFX“\n”, p);• Use “X” macros for cross-platform registers

– REG_XAX is REG_EAX when compiled 32-bit, and REG_RAX when compiled 64-bit

Page 139: DynamoRIO Tutorial Apr2010

139DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API: Code Manipulation

• Processor information• State preservation

– Eflags, arith flags, floating-point state, MMX/SSE state– Spill slots, TLS

• Clean calls to C code• Dynamic instrumentation

– Replace code in the code cache• Branch instrumentation

– Convenience routines

Page 140: DynamoRIO Tutorial Apr2010

140DynamoRIO Tutorial at CGO 24 April 2010

Processor Information

• Processor type– proc_get_vendor(), proc_get_family(), proc_get_type(),

proc_get_model(), proc_get_stepping(), proc_get_brand_string()• Processor features

– proc_has_feature(), proc_get_all_feature_bits()• Cache information

– proc_get_cache_line_size(), proc_is_cache_aligned(), proc_bump_to_end_of_cache_line(), proc_get_containing_page()

– proc_get_L1_icache_size(), proc_get_L1_dcache_size(), proc_get_L2_cache_size(), proc_get_cache_size_str()

Page 141: DynamoRIO Tutorial Apr2010

141DynamoRIO Tutorial at CGO 24 April 2010

State Preservation

• Spill slots for registers– 3 fast slots, 6/14 slower slots– dr_save_reg(), dr_restore_reg(), and dr_reg_spill_slot_opnd() – from C code: dr_read_saved_reg(), dr_write_saved_reg()

• Dedicated TLS field for thread-local data– dr_insert_read_tls_field(), dr_insert_write_tls_field() – from C code: dr_get_tls_field(), dr_set_tls_field()

• Arithmetic flag preservation– dr_save_arith_flags(), dr_restore_arith_flags()

• Floating-point/MMX/SSE state– dr_insert_save_fpstate(), dr_insert_restore_fpstate()

Page 142: DynamoRIO Tutorial Apr2010

142DynamoRIO Tutorial at CGO 24 April 2010

Thread-Local Storage (TLS)

• Absolute addressing– Thread-private only

• Application stack– Not reliable or transparent

• Stolen register– Performance hit

• Segment– Best solution for thread-shared

Page 143: DynamoRIO Tutorial Apr2010

143DynamoRIO Tutorial at CGO 24 April 2010

Clean Calls

if (instr_is_mbr(instr)) { app_pc address = instr_get_app_pc(instr); uint opcode = instr_get_opcode(instr); instr_t *nxt = instr_get_next(instr); dr_insert_clean_call(drcontext, ilist, nxt, (void *) at_mbr, false/*don't need to save fp state*/, 2 /* 2 parameters */, /* opcode is 1st parameter */ OPND_CREATE_INT32(opcode), /* address is 2nd parameter */ OPND_CREATE_INTPTR(address));}

• Saved interrupted application state can be accessed using dr_get_mcontext() and modified using dr_set_mcontext()

Page 144: DynamoRIO Tutorial Apr2010

144DynamoRIO Tutorial at CGO 24 April 2010

Dynamic Instrumentation

• Thread-shared: flush all code corresponding to application address and then re-instrument when re-executed– Can flush from clean call, and use dr_redirect_execution() since

cannot return to potentially flushed cache fragment• Thread-private: can also replace particular fragment (does not

affect other potential copies of the source app code)– dr_replace_fragment()

Page 145: DynamoRIO Tutorial Apr2010

145DynamoRIO Tutorial at CGO 24 April 2010

Flushing the Cache

• Immediately deleting or replacing individual code cache fragments is available for thread-private caches– Only removes from that thread’s cache

• Two basic types of thread-shared flush:– Non-precise: remove all entry points but let target cache code be

invalidated and freed lazily– Precise/synchronous:

• Suspend the world• Relocate threads inside the target cache code• Invalidate and free the target code immediately

Page 146: DynamoRIO Tutorial Apr2010

146DynamoRIO Tutorial at CGO 24 April 2010

Flushing the Cache

• Thread-shared flush API routines:– dr_unlink_flush_region(): non-precise flush– dr_flush_region(): synchronous flush– dr_delay_flush_region():

• No action until a thread exits code cache on its own• If provide a completion callback, synchronous once triggered• Without a callback, non-precise

Page 147: DynamoRIO Tutorial Apr2010

147DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API Outline

• Building and Deploying• Events• Utilities• Instruction Manipulation• State Translation• Comparison with Pin• Troubleshooting

Page 148: DynamoRIO Tutorial Apr2010

148DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API: Translation

• Translation refers to the mapping of a code cache machine state (program counter, registers, and memory) to its corresponding application state– The program counter always needs to be translated– Registers and memory may also need to be translated

depending on the transformations applied when copying into the code cache

Page 149: DynamoRIO Tutorial Apr2010

149DynamoRIO Tutorial at CGO 24 April 2010

Translation Case 1: Fault

• Exception and signal handlers are passed machine context of the faulting instruction.

• For transparency, that context must be translated from the code cache to the original code location

• Translated location should be where the application would have had the fault or where execution should be resumed

user context

faulting instr.

user context

faulting instr.

Page 150: DynamoRIO Tutorial Apr2010

150DynamoRIO Tutorial at CGO 24 April 2010

Translation Case 2: Relocation

• If one application thread suspends another, or DynamoRIO suspends all threads for a synchronous cache flush:– Need suspended target thread in a safe spot– Not always practical to wait for it to arrive at a safe spot (if in a

system call, e.g.)• DynamoRIO forcibly relocates the thread

– Must translate its state to the proper application state at which to resume execution

Page 151: DynamoRIO Tutorial Apr2010

151DynamoRIO Tutorial at CGO 24 April 2010

Translation Approaches

• Two approaches to program counter translation:– Store mappings generated during fragment building

• High memory overhead (> 20% for some applications, because it prevents internal storage optimizations) even with highly optimized difference-based encoding. Costly for something rarely used.

– Re-create mapping on-demand from original application code• Cache consistency guarantees mean the corresponding application

code is unchanged• Requires idempotent code transformations

• DynamoRIO supports both approaches– The engine mostly uses the on-demand approach, but stored

mappings are occasionally needed

Page 152: DynamoRIO Tutorial Apr2010

152DynamoRIO Tutorial at CGO 24 April 2010

Instruction Translation Field

• Each instruction contains a translation field• Holds the application address that the instruction corresponds

to• Set via instr_set_translation()

Page 153: DynamoRIO Tutorial Apr2010

153DynamoRIO Tutorial at CGO 24 April 2010

Context Translation Via Re-Creation

C1: mov %ebx, %ecxC2: add %eax, (%ecx)C3: cmp $4, (%eax)C4: jle <stub0>C5: jmp <stub1>

A1: mov %ebx, %ecxA2: add %eax, (%ecx)A3: cmp $4, (%eax)A4: jle 710349fb

D1: (A1) mov %ebx, %ecxD2: (A2) add %eax, (%ecx)D3: (A3) cmp $4, (%eax)D4: (A4) jle <stub0>D5: (A4) jmp <stub1>

Page 154: DynamoRIO Tutorial Apr2010

154DynamoRIO Tutorial at CGO 24 April 2010

Meta vs. Non-Meta Instructions

• Non-Meta instructions are treated as application instructions– They must have translations– Control flow changing instructions are modified to retain

DynamoRIO control and result in cache populating• Meta instructions are added instrumentation code

– Not treated as part of the application (e.g., calls run natively)– Cannot fault, so translations not needed

• Meta-may-fault instructions– Can fault, but should not be “interpreted”: won’t modify app code– Fault typically deliberate and handled by client

• Xref instr_set_ok_to_mangle() and instr_set_meta_may_fault()

Page 155: DynamoRIO Tutorial Apr2010

155DynamoRIO Tutorial at CGO 24 April 2010

Client Translation Support

• Instruction lists passed to clients are annotated with translation information– Read via instr_get_translation()– Clients are free to delete instructions, change instructions and

their translations, and add new meta and non-meta instructions (see dr_register_bb_event() for restrictions)

– An idempotent client that restricts itself to deleting app instructions and adding non-faulting meta instructions can ignore translation concerns

– DynamoRIO takes care of instructions added by API routines (insert_clean_call(), etc.)

• Clients can choose between storing or regenerating translations on a fragment by fragment basis.

Page 156: DynamoRIO Tutorial Apr2010

156DynamoRIO Tutorial at CGO 24 April 2010

Client Regenerated Translations

• Client returns DR_EMIT_DEFAULT from its bb or trace event callback

• Client bb & trace event callbacks are re-called when translations are needed with translating==true

• Client must exactly duplicate transformations performed when the block was generated

• Client must set translation field for all added non-meta instructions and all meta-may-fault instructions– This is true even if translating==false since DynamoRIO may

decide it needs to store translations anyway

Page 157: DynamoRIO Tutorial Apr2010

157DynamoRIO Tutorial at CGO 24 April 2010

Client Stored Translations

• Client returns DR_EMIT_STORE_TRANSLATIONS from its bb or trace event callback

• Client must set translation field for all added non-meta instructions and all meta-may-fault instructions

• Client bb or trace hook will not be re-called with translating==true

Page 158: DynamoRIO Tutorial Apr2010

158DynamoRIO Tutorial at CGO 24 April 2010

Register State Translation

• Translation may be needed at a point where some registers are spilled to memory– During indirect branch or RIP-relative mangling, e.g.

• DynamoRIO walks fragment up to translation point, tracking register spills and restores– Special handling for stack pointer around indirect calls and

returns• DynamoRIO tracks client spills and restores implicitly added

by API routines– Clean calls, etc.– Explicit spill/restore (e.g., dr_save_reg()) client’s responsibility

Page 159: DynamoRIO Tutorial Apr2010

159DynamoRIO Tutorial at CGO 24 April 2010

Client Register State Translation

• If a client adds its own register spilling/restoring code or changes register mappings it must register for the restore state event to correct the context

• The same event can also be used to fix up the application’s view of memory

• DynamoRIO does not internally store this kind of translation information ahead of time when the fragment is built– The client must maintain its own data structures

Page 160: DynamoRIO Tutorial Apr2010

160DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API Outline

• Building and Deploying• Events• Utilities• Instruction Manipulation• State Translation• Comparison with Pin• Troubleshooting

Page 161: DynamoRIO Tutorial Apr2010

161DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO versus Pin

• Basic interface is fundamentally different• Pin = insert callout/trampoline only

– Not so different from tools that modify the original code: Dyninst, Vulcan, Detours

– Uses code cache only for transparency• DynamoRIO = arbitrary code stream modifications

– Only feasible with a code cache– Takes full advantage of power of code cache– General IA-32/AMD64 decode/encode/IR support

Page 162: DynamoRIO Tutorial Apr2010

162DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO versus Pin

• Pin = insert callout/trampoline only– Pin tries to inline and optimize– Client has little control or guarantee over final performance

• DynamoRIO = arbitrary code stream modifications– Client has full control over all inserted instrumentation– Result can be significant performance difference

• PiPA Memory Profiler + Cache Simulator: 3.27x speedup w/ DynamoRIO vs 2.6x w/ Pin

Page 163: DynamoRIO Tutorial Apr2010

163DynamoRIO Tutorial at CGO 24 April 2010

Base Performance Comparison (No Tool)

171%

121%

Page 164: DynamoRIO Tutorial Apr2010

164DynamoRIO Tutorial at CGO 24 April 2010

Base Memory Comparison

15MB

2.6MB

44MB

3.5MB

Page 165: DynamoRIO Tutorial Apr2010

165DynamoRIO Tutorial at CGO 24 April 2010

BBCount Pin Tool

static int bbcount;VOID PIN_FAST_ANALYSIS_CALL docount() { bbcount++; }VOID Trace(TRACE trace, VOID *v) { for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) { BBL_InsertCall(bbl, IPOINT_ANYWHERE, AFUNPTR(docount), IARG_FAST_ANALYSIS_CALL, IARG_END); }}int main(int argc, CHAR *argv[]) { PIN_InitSymbols(); PIN_Init(argc, argv); TRACE_AddInstrumentFunction(Trace, 0); PIN_StartProgram(); return 0;}

Page 166: DynamoRIO Tutorial Apr2010

166DynamoRIO Tutorial at CGO 24 April 2010

BBCount DynamoRIO Tool

static int global_count;static dr_emit_flags_tevent_basic_block(void *drcontext, void *tag, instrlist_t *bb, bool for_trace, bool translating) { instr_t *instr, *first = instrlist_first(bb); uint flags; /* Our inc can go anywhere, so find a spot where flags are dead. * Technically this can be unsafe if app reads flags on fault => * stop at instr that can fault, or supply runtime op */ for (instr = first; instr != NULL; instr = instr_get_next(instr)) { flags = instr_get_arith_flags(instr); /* OP_inc doesn't write CF but not worth distinguishing */ if (TESTALL(EFLAGS_WRITE_6, flags) && !TESTANY(EFLAGS_READ_6, flags)) break; } if (instr == NULL) dr_save_arith_flags(drcontext, bb, first, SPILL_SLOT_1); instrlist_meta_preinsert(bb, (instr == NULL) ? first : instr, INSTR_CREATE_inc(drcontext, OPND_CREATE_ABSMEM((byte *)&global_count, OPSZ_4))); if (instr == NULL) dr_restore_arith_flags(drcontext, bb, first, SPILL_SLOT_1); return DR_EMIT_DEFAULT;}DR_EXPORT void dr_init(client_id_t id) { dr_register_bb_event(event_basic_block);}

Page 167: DynamoRIO Tutorial Apr2010

167DynamoRIO Tutorial at CGO 24 April 2010

BBCount: Pin Inlining Importance

569%

231%

Page 168: DynamoRIO Tutorial Apr2010

168DynamoRIO Tutorial at CGO 24 April 2010

BBCount Performance Comparison

226%185%

233%

Page 169: DynamoRIO Tutorial Apr2010

169DynamoRIO Tutorial at CGO 24 April 2010

DynamoRIO API Outline

• Building and Deploying• Events• Utilities• Instruction Manipulation• State Translation• Comparison with Pin• Troubleshooting

Page 170: DynamoRIO Tutorial Apr2010

170DynamoRIO Tutorial at CGO 24 April 2010

Obtaining Help

• Read the documentation– http://dynamorio.org/docs/

• Look at the sample clients– In the documentation– In the release package: samples/

• Ask on the DynamoRIO Users discussion forum/mailing list– http://groups.google.com/group/dynamorio-users

Page 171: DynamoRIO Tutorial Apr2010

171DynamoRIO Tutorial at CGO 24 April 2010

Debugging Clients

• Use the DynamoRIO debug build for asserts– Often point out the problem

• Use logging– -loglevel N– stored in logs/ subdir of DR install dir

• Attach a debugger– gdb or windbg– -msgbox_mask 0xN– -no_hide– windbg: .reload myclient.dll=0xN

• More tips:– http://code.google.com/p/dynamorio/wiki/Debugging

Page 172: DynamoRIO Tutorial Apr2010

172DynamoRIO Tutorial at CGO 24 April 2010

Reporting Bugs

• Search the Issue Tracker off http://dynamorio.org first– http://code.google.com/p/dynamorio/issues/list

• File a new Issue if not found• Follow conventions on wiki

– http://code.google.com/p/dynamorio/wiki/BugReporting – CRASH, APP CRASH, HANG, ASSERT

• Example titles:– CRASH (1.3.1 calc.exe)

vm_area_add_fragment:vmareas.c(4466) – ASSERT (1.3.0 suite/tests/common/segfault)

study_hashtable:fragment.c:1745 ASSERT_NOT_REACHED

Page 173: DynamoRIO Tutorial Apr2010

173DynamoRIO Tutorial at CGO 24 April 2010

Changes From Prior Releases

• Backward compatible with 1.0 (0.9.6) and above– Except configuration and deployment scheme and tools:

switched to file-based scheme to support unprivileged and parallel execution on Windows

• Not backward compatible with 0.9.1-0.9.5

Page 174: DynamoRIO Tutorial Apr2010

Examples: Part 21:30-1:40 Welcome + DynamoRIO History1:40-2:40 DynamoRIO Internals2:40-3:00 Examples, Part 13:00-3:15 Break3:15-4:15 DynamoRIO API4:15-5:15 Examples, Part 25:15-5:30 Feedback

Page 175: DynamoRIO Tutorial Apr2010

175DynamoRIO Tutorial at CGO 24 April 2010 175

More Examples

• Dynamic Optimization– Strength Reduction– Software Prefetching

• Profiling– Memory Reference Trace– PiPA

• Shadow Memory– Umbra

• Dr. Memory

Page 176: DynamoRIO Tutorial Apr2010

176DynamoRIO Tutorial at CGO 24 April 2010 176

Dynamic Optimization Opportunities

• Traditional compiler optimizations– Compiler has limited view: application assembled at runtime– Some shipped products are built without optimizations

• Microarchitecture-specific optimizations– Feature set and relative performance of instructions varies– Combinatorial blowup if done statically

• Adaptive optimizations– Need runtime information: prior profiling runs not always

representative– Execution phase changes during execution

Page 177: DynamoRIO Tutorial Apr2010

177DynamoRIO Tutorial at CGO 24 April 2010 177

Dynamic Optimization in DynamoRIO

• Traces are natural unit for optimization– Focus only on hot code– Cross procedure, file and module boundaries

• Linear control flow– Single-entry, multi-exit simplifies analysis

• Support for adaptive optimization– Can replace traces dynamically

Page 178: DynamoRIO Tutorial Apr2010

178DynamoRIO Tutorial at CGO 24 April 2010 178

DynamoRIO with Trace Optimization

basic block builder trace selectorSTART

dispatch

context switch

BASIC BLOCK CACHE

TRACE CACHE

indirect branch lookup

indirect branch stays on trace?

Page 179: DynamoRIO Tutorial Apr2010

179DynamoRIO Tutorial at CGO 24 April 2010 179

Strength Reduction: inc to add

• Pentium 4– inc is slower add 1 – dec is slower than sub 1

• Pentium 3– inc is faster add 1 – dec is faster than sub 1

• Microarchitecture-specific optimization best performed dynamically

Page 180: DynamoRIO Tutorial Apr2010

180DynamoRIO Tutorial at CGO 24 April 2010 180

Look for inc / dec

EXPORT void dr_init() { if (proc_get_family() == FAMILY_PENTIUM_IV) dr_register_trace_event(event_trace);}static void event_trace(void *drcontext, app_pc tag, instrlist_t *trace, bool xl8) { instr_t *instr, *next_instr; int opcode; for (instr = instrlist_first(bb); instr != NULL; instr = next_instr) { next_instr = instr_get_next(instr); opcode = instr_get_opcode(instr); if (opcode == OP_inc || opcode == OP_dec) replace_inc_with_add(drcontext, instr, trace); } }}static bool replace_inc_with_add(void *drcontext, instr_t *instr, instrlist_t *trace) { instr_t *in; uint eflags; int opcode = instr_get_opcode(instr); bool ok_to_replace = false;

for (in = instr; in != NULL; in = instr_get_next(in)) { eflags = instr_get_arith_flags(in); if ((eflags & EFLAGS_READ_CF) != 0) return false; if ((eflags & EFLAGS_WRITE_CF) != 0) { ok_to_replace = true; break; } if (instr_is_exit_cti(in)) return false; } if (!ok_to_replace) return false; if (opcode == OP_inc) in = INSTR_CREATE_add(drcontext, instr_get_dst(instr, 0), OPND_CREATE_INT8(1)); else in = INSTR_CREATE_sub(drcontext, instr_get_dst(instr, 0), OPND_CREATE_INT8(1)); instr_set_prefixes(in, instr_get_prefixes(instr)); instrlist_replace(trace, instr, in); instr_destroy(drcontext, instr); return true;}

Pentium 4?

Replace with add / sub

Ensure eflags change ok

Page 181: DynamoRIO Tutorial Apr2010

181DynamoRIO Tutorial at CGO 24 April 2010 181

Strength Reduction Results

0.00.10.20.30.40.50.60.70.80.91.01.11.2

amm

p

appl

u

apsi art

equa

ke

mes

a

mgr

id

sixt

rack

swim

wup

wis

e

har.

mea

n

B e nchmark

Nor

mal

ized

Exe

cutio

n Ti

me

base inc2add

2% mean speedup

Page 182: DynamoRIO Tutorial Apr2010

182DynamoRIO Tutorial at CGO 24 April 2010

Software Prefetching

• Ubiquitous Memory Introspection (cgo 2007)– Sampling to select hot traces– Instrument to collect memory references– Analyze to discover reference patterns

• Stride– Insert software prefetching instruction

182

Page 183: DynamoRIO Tutorial Apr2010

183DynamoRIO Tutorial at CGO 24 April 2010 183

Software Prefetching Results

Page 184: DynamoRIO Tutorial Apr2010

184DynamoRIO Tutorial at CGO 24 April 2010 184

More Examples

• Dynamic Optimization– Strength Reduction– Software Prefetching

• Profiling– Memory Reference Trace– PiPA

• Shadow Memory– Umbra

• Dr. Memory

Page 185: DynamoRIO Tutorial Apr2010

185DynamoRIO Tutorial at CGO 24 April 2010

Memory Reference Trace

• Memtrace– Profile format

• <r/w, addr>– Steps

• Thread initialization– Allocate buffer per thread

• Instrumentation– Fill the buffer– Dump to file if buffer is full

• Thread exit– Delete the buffer

– Optimization• Inline buffer filling code• Out-line the clean call invocation code

185

Page 186: DynamoRIO Tutorial Apr2010

186DynamoRIO Tutorial at CGO 24 April 2010

Register Events

DR_EXPORT voiddr_init(client_id_t id){

…mutex = dr_mutex_create();dr_register_exit_event(event_exit);dr_register_thread_init_event(event_thread_init);dr_register_thread_exit_event(event_thread_exit);dr_register_bb_event(event_basic_block);/* out-lined clean call invocation code */code_cache_init();…

}

186

Page 187: DynamoRIO Tutorial Apr2010

187DynamoRIO Tutorial at CGO 24 April 2010

Own Code Cache Initialization

static void code_cache_init(void) { drcontext = dr_get_current_drcontext();

code_cache = dr_nonheap_alloc(PAGE_SIZE, DR_MEMPROT_READ | DR_MEMPROT_WRITE |

DR_MEMPROT_EXEC); ilist = instrlist_create(drcontext);

where = INSTR_CREATE_jmp_ind(drcontext, opnd_create_reg(REG_XCX)); instrlist_meta_append(ilist, where);

dr_insert_clean_call(drcontext, ilist, where, (void *)clean_call, false, 0);end = instrlist_encode(drcontext, ilist, code_cache, false);DR_ASSERT((end - code_cache) < PAGE_SIZE);instrlist_clear_and_destroy(drcontext, ilist);dr_memory_protect(code_cache, PAGE_SIZE, DR_MEMPROT_READ |

DR_MEMPROT_EXEC);}

187

Page 188: DynamoRIO Tutorial Apr2010

188DynamoRIO Tutorial at CGO 24 April 2010

Clean Call

static void clean_call() {…drcontext = dr_get_current_drcontext();

data = dr_get_tls_field(drcontext); mem_ref = (mem_ref_t *)databuf_base; num_refs = (int)((mem_ref_t *)databuf_ptr - mem_ref);

for (i = 0; i < num_refs; i++) {dr_fprintf(datalog, "%c:"PFX"\n", mem_refwrite ? 'w' : 'r', mem_refaddr);

++mem_ref;}datanum_refs += num_refs;…

}

188

Page 189: DynamoRIO Tutorial Apr2010

189DynamoRIO Tutorial at CGO 24 April 2010

Thread initialization & exit

void event_thread_init(void *drcontext) {…data = dr_thread_alloc(drcontext, sizeof(per_thread_t));dr_set_tls_field(drcontext, data);databuf_base = dr_thread_alloc(drcontext, MAX_MEM_BUF_SIZE);…datanum_refs = 0;

}void event_thread_exit(void *drcontext) {

data = dr_get_tls_field(drcontext);dr_mutex_lock(mutex);num_refs += datanum_refs;dr_mutex_unlock(mutex);dr_thread_free(drcontext, databuf_base, MAX_MEM_BUF_SIZE);dr_thread_free(drcontext, data, sizeof(per_thread_t));

}

189

Page 190: DynamoRIO Tutorial Apr2010

190DynamoRIO Tutorial at CGO 24 April 2010

Basic Block Instrumentation

static dr_emit_flags_t event_basic_block(void *drcontext, void *tag, instrlist_t *bb,

bool for_trace, bool translating) { … for (instr = instrlist_first(bb); instr != NULL; instr = instr_get_next(instr)) { if (instr_get_app_pc(instr) == NULL) continue; if (instr_reads_memory(instr)) for (i = 0; i < instr_num_srcs(instr); i++) if (opnd_is_memory_reference(instr_get_src(instr, i))) instrument_mem(drcontext, bb, instr, i, false); if (instr_writes_memory(instr)) for (i = 0; i < instr_num_dsts(instr); i++) if (opnd_is_memory_reference(instr_get_dst(instr, i))) instrument_mem(drcontext, bb, instr, i, true);}

190

Page 191: DynamoRIO Tutorial Apr2010

191DynamoRIO Tutorial at CGO 24 April 2010

instrument_mem (I)

/* get memory reference address into reg1 */opnd1 = opnd_create_reg(reg1);

if (opnd_is_base_disp(ref)) { /* lea [ref] reg */opnd2 = ref;

opnd_set_size(&opnd2, OPSZ_lea); instr = INSTR_CREATE_lea(drcontext, opnd1, opnd2);

} else if(IF_X64(opnd_is_rel_addr(ref) ||) opnd_is_abs_addr(ref)) { /* mov addr reg */

opnd2 = OPND_CREATE_INTPTR(opnd_get_addr(ref)); instr = INSTR_CREATE_mov_imm(drcontext, opnd1, opnd2);

} else {instr = NULL;DR_ASSERT_MSG(false, "Unhandled instructions");

} instrlist_meta_preinsert(ilist, where, instr);

191

Page 192: DynamoRIO Tutorial Apr2010

192DynamoRIO Tutorial at CGO 24 April 2010

instrument_mem (II)

/* Move write/read to write field */ opnd1 = OPND_CREATE_MEM32(reg2, offsetof(mem_ref_t, write)); opnd2 = OPND_CREATE_INT32(write); instr = INSTR_CREATE_mov_imm(drcontext, opnd1, opnd2); instrlist_meta_preinsert(ilist, where, instr); /* Store address in memory ref */ opnd1 = OPND_CREATE_MEMPTR(reg2, offsetof(mem_ref_t, addr)); opnd2 = opnd_create_reg(reg1); instr = INSTR_CREATE_mov_st(drcontext, opnd1, opnd2); instrlist_meta_preinsert(ilist, where, instr);

/* Increment reg value by pointer size using lea instr */ opnd1 = opnd_create_reg(reg2); opnd2 = opnd_create_base_disp(reg2, REG_NULL, 0, sizeof(mem_ref_t), OPSZ_lea); instr = INSTR_CREATE_lea(drcontext, opnd1, opnd2); instrlist_meta_preinsert(ilist, where, instr);

192

Page 193: DynamoRIO Tutorial Apr2010

193DynamoRIO Tutorial at CGO 24 April 2010

instrument_mem (III)

/* jecxz call */call = INSTR_CREATE_label(drcontext);instrlist_meta_preinsert(ilist, where, INSTR_CREATE_jecxz(drcontext, opnd_create_instr(call)));/* jump restore to skip clean call */restore = INSTR_CREATE_label(drcontext);instrlist_meta_preinsert(ilist, where,

INSTR_CREATE_jmp(drcontext, opnd_create_instr(restore)));instrlist_meta_preinsert(ilist, where, call);/* mov restore REG_XCX */instr = INSTR_CREATE_mov_st(drcontext,opnd_create_reg(reg2),opnd_create_instr(restore));instrlist_meta_preinsert(ilist, where, instr);/* jmp code_cache */opnd1 = opnd_create_pc(code_cache);instrlist_meta_preinsert(ilist, where, INSTR_CREATE_jmp(drcontext, opnd1));/* restore %reg */instrlist_meta_preinsert(ilist, where, restore);

193

Page 194: DynamoRIO Tutorial Apr2010

194DynamoRIO Tutorial at CGO 24 April 2010

PiPA

• Pipelined Profiling and Analysis (cgo 2008)– Stages (thread/process)

• Profiling• Reconstruction/extraction• Analysis

– Profiling• Runtime Execution Profile (REP)

– Communication• Double buffer

– Analysis• Parallel cache simulation

194

Page 195: DynamoRIO Tutorial Apr2010

195DynamoRIO Tutorial at CGO 24 April 2010 195

More Examples

• Dynamic Optimization– Strength Reduction

• Profiling– Memory Reference Trace

• Shadow Memory– Umbra

• Dr. Memory

Page 196: DynamoRIO Tutorial Apr2010

196DynamoRIO Tutorial at CGO 24 April 2010 196

Shadow Memory

• Application– Store meta-data to track properties of application memory

• Millions of software watchpoints• Dynamic information flow tracking (taint propagation)• Race detection• Memory usage debugging tool (MemCheck/Dr. Memory)

• Issues– Performance– Multi-thread applications– Flexibility– Platform dependent– Development challenges

Page 197: DynamoRIO Tutorial Apr2010

197DynamoRIO Tutorial at CGO 24 April 2010

Umbra (CGO 2010)

• Design• Implementation• Optimization• Download

– http://people.csail.mit.edu/qin_zhao/umbra/

197

Page 198: DynamoRIO Tutorial Apr2010

198DynamoRIO Tutorial at CGO 24 April 2010

Design

• Address Space– A collection of fixed size units

• 4G (64-bit)• Application, Shadow, Unused

• Translation Table– Translation from application memory unit to

corresponding shadow memory unit

198

App Mem Shd Mem Offset[0x00000000, 0x10000000)

[0x20000000, 0x30000000)

0x20000000

[0x60000000, 0x70000000)

[0x40000000, 0x50000000)

-0x20000000

[0x80000000, 0x90000000)

[0x50000000, 0x60000000)

-0x20000000

App Mem 1

Shd Mem 1

Unused

Unused

App Mem 3

App Mem 2

Shd Mem 2

Shd Mem 3offsetscaleaddraddr appshd +×=

Page 199: DynamoRIO Tutorial Apr2010

199DynamoRIO Tutorial at CGO 24 April 2010

Implementation

• Memory Manager– Monitor and control application memory allocation

• brk, mmap, munmap, mremap• dr_register_pre_syscall_event• dr_register_post_syscall_event

– Allocate shadow memory– Maintain translation table

• Instrumenter– Instrument every memory reference

• Context save• Address calculation• Address translation• Shadow memory update• Context restore

199

Page 200: DynamoRIO Tutorial Apr2010

200DynamoRIO Tutorial at CGO 24 April 2010

Instrument Code Example

Context Save mov %ecx [ECX_SLOT]mov %edx [EDX_SLOT] mov %eax [EAX_SLOT]lahf %ahseto %al

Address Calculation lea [%ebx, 16] %ecx

Address Translation mov 0 %edx… # table lookup codeadd %ecx, table[%edx].offset %ecx

Shadow Memory Update mov 1 [%ecx]

Context Restore add %al 0x7fsahf mov [ECX_SLOT] %ecxmov [EDX_SLOT] %edx mov [EAX_SLOT] %eax

Application memory reference mov 0 [%ebx, 16]

200

Page 201: DynamoRIO Tutorial Apr2010

201DynamoRIO Tutorial at CGO 24 April 2010

Optimization

• Translation Optimization– Thread Local Translation Table– Memoization Check– Reference Check

• Instrumentation Optimization– Context Switch Reduction– Reference Grouping– 3-stage Code Layout

201

Page 202: DynamoRIO Tutorial Apr2010

202DynamoRIO Tutorial at CGO 24 April 2010

Translation Optimization

• Caching

202

Global translation table

App 1

Shd 1

Shd 2

App 2

Page 203: DynamoRIO Tutorial Apr2010

203DynamoRIO Tutorial at CGO 24 April 2010

Translation Optimization

• Thread Local Translation Optimization– Local translation table per thread– Synchronize with global translation table when necessary– Avoid lock contention

203

Global translation table

Thread Local translation table

Thread 1

Thread 2

Page 204: DynamoRIO Tutorial Apr2010

204DynamoRIO Tutorial at CGO 24 April 2010

Translation Optimization

• Memoization Cache– Software cache per thread– Stores frequently used translation entries

• Stack• Units found in last table lookup

204

Global translation table

Thread Local translation table

Memoization Cache

Thread 1

Thread 2

Page 205: DynamoRIO Tutorial Apr2010

205DynamoRIO Tutorial at CGO 24 April 2010

Translation Optimization

• Reference Cache– Software cache per static application memory reference

• Last reference unit tag• Last translation offset

205

Reference cache

Thread 1

Thread 2

Global translation table

Thread Local translation table

Memoization Cache

Page 206: DynamoRIO Tutorial Apr2010

206DynamoRIO Tutorial at CGO 24 April 2010

Instrumentation Optimization

• Context Switch Reduction– Registers liveness analysis

• Reference Grouping– One translation lookup for multiple references

• Stack local variables• Different members of the same object

• 3-stage Code Layout– Inline stub

• Quick inline check code with minimal context switch– Lean procedure

• Simple assembly procedure with partial context switch– Callout

• C function with complete context switch

206

Page 207: DynamoRIO Tutorial Apr2010

207DynamoRIO Tutorial at CGO 24 April 2010

3-stage Code Layout

• Inline stub– Reference cache check– Jump to lean procedure if miss

• Lean procedure– Memoization cache check – Local table lookup– Clean call to call out

• Callout– Global table synchronization– Local table lookup

207

Page 208: DynamoRIO Tutorial Apr2010

208DynamoRIO Tutorial at CGO 24 April 2010

Instrumentation Optimization

208

# reference cache check lea [ref] %r1 %r1 & 0xffffffff00000000 %r1 cmp %r1, ref.tag je .update_shadow_memory# jmp-and-link to lean procedure mov %r1 ref.tag mov .update_ref_cache [ret_pc] jmp lean_procedure.update_ref_cache mov %r1 ref.offset# shadow memory update.update_shadow_memory lea [ref] %r1 add %r1 + ref.offset %r1 mov 1 [%r1]

# memorization check cmp %r1, cache1.tag jne .cache1_miss mov cache1.offset %r1 jmp [ret_pc].cache1_miss cmp %r1, cache2.tag jne .cache2_miss mov cache1.offset %r1 jmp [ret_pc].cache2_miss# table lookup mov %r1 cache2.tag mov %r2 [R2_SLOT] … mov [R2_SLOT] %r2 mov %r1 cache2.offset jmp [ret_pc]

Inline Stub Lean Procedure

Page 209: DynamoRIO Tutorial Apr2010

209DynamoRIO Tutorial at CGO 24 April 2010

Performance Evaluation

209

Page 210: DynamoRIO Tutorial Apr2010

210DynamoRIO Tutorial at CGO 24 April 2010

Umbra Client: Shared Memory Detection

static void instrument_update(void *drcontext, umbra_info_t *umbra_info, mem_ref_t *ref, instrlist_t *ilist, instr_t *where) { … /* test [%reg].tid_map, tid_map*/ opnd1 = OPND_CREATE_MEM32(umbra_inforeg, 0, OPSZ_4); opnd2 = OPND_CREATE_INT32(client_tls_datatid_map); instrlist_meta_preinsert(ilist, where, INSTR_CREATE_test(drcontext, opnd1, opnd2)); /* jnz where */ opnd1 = opnd_create_instr(where); instrlist_meta_preinsert(ilist, where, INSTR_CREATE_jcc(drcontext, OP_jnz, opnd1)); /* or */ opnd1 = OPND_CREATE_MEM32(umbra_inforeg, 0, OPSZ_4); opnd2 = OPND_CREATE_INT32(client_tls_datatid_map | 1); instr = INSTR_CREATE_or(drcontext, opnd1, opnd2); LOCK(instr); instrlist_meta_preinsert(ilist, label, instr);}

210

Page 211: DynamoRIO Tutorial Apr2010

211DynamoRIO Tutorial at CGO 24 April 2010 211

More Examples

• Dynamic Optimization– Strength Reduction

• Profiling– Memory Reference Trace

• Shadow Memory– Umbra

• Dr. Memory

Page 212: DynamoRIO Tutorial Apr2010

212DynamoRIO Tutorial at CGO 24 April 2010

Dr. Memory

• Detects reads of uninitialized memory• Detects heap errors

– Out-of-bounds accesses (underflow, overflow)– Access to freed memory– Invalid frees– Memory leaks

• Detects other accesses to invalid memory– Stack tracking– Thread-local storage slot tracking

• Operates at runtime on unmodified Windows & Linux binaries

Page 213: DynamoRIO Tutorial Apr2010

213DynamoRIO Tutorial at CGO 24 April 2010

Dr. Memory Instrumentation

• Monitor all memory accesses, stack adjustments, and heap allocations

• Shadow each byte of app memory• Each byte’s shadow stores one of 4 values:

– Unaddressable– Uninitialized– Defined at byte level– Defined at bit level escape to extra per-bit shadow values

Page 214: DynamoRIO Tutorial Apr2010

214DynamoRIO Tutorial at CGO 24 April 2010

Dr. Memory

defined

invalid

undefineddefined

Shadow StackStack

Shadow HeapHeap

redzone

malloc

freed

redzone

invalid

invalid

invalid

defined

undefineddefined

Page 215: DynamoRIO Tutorial Apr2010

215DynamoRIO Tutorial at CGO 24 April 2010

Partial-Word Defines But Whole-Word Transfers

• Sub-dword variables are moved around as whole dwords• Cannot raise error when a move reads uninitialized bits• Must propagate on moves and thus must shadow registers

– Propagate shadow values by mirroring app data flow• Check system call reads and propagate system call writes

– Else, false negatives (reads) or positives (writes)• Raise errors instead of propagating at certain points

– Report errors only on “significant” reads

Page 216: DynamoRIO Tutorial Apr2010

216DynamoRIO Tutorial at CGO 24 April 2010

Shadowing Registers

• Use multiple TLS slots– dr_raw_tls_calloc()– Alternative: steal register

• Can read and write w/o spilling• Bring into spilled register to combine w/ other args

– Defined=0, uninitialized=1– Combine via bitwise or

Page 217: DynamoRIO Tutorial Apr2010

217DynamoRIO Tutorial at CGO 24 April 2010

Monitoring Stack Changes

• As stack is extended and contracts again, must update stack shadow as unaddressable vs uninitialized

• Push, pop, or any write to stack pointer• Try to distinguish large alloc/dealloc from stack swap

Page 218: DynamoRIO Tutorial Apr2010

218DynamoRIO Tutorial at CGO 24 April 2010

Kernel-Mediated Stack Changes

• Kernel places data on the stack and removes it again– Windows: APC, callback, and exception– Linux: signals

• Linux signals as an example:– intercept sigaltstack changes– intercept handler registration to instrument handler code– use DR's signal event to record app xsp at interruption point– when see event followed by handler, check which stack and

mark from either interrupted xsp or altstack base to cur xsp as defined (ignoring padding)

– record cur xsp in handler, and use to undo on sigreturn

Page 219: DynamoRIO Tutorial Apr2010

219DynamoRIO Tutorial at CGO 24 April 2010

Types Of Instrumentation

• Clean call– Simplest, but expensive in both time and space

• Shared clean call– Saves space

• Lean procedure– Shared routine with smaller context switch than full clean call

• Inlined– Smallest context switch, but should limit to small sequences of

instrumentation

Page 220: DynamoRIO Tutorial Apr2010

220DynamoRIO Tutorial at CGO 24 April 2010

Non-Code-Cache Code

• Use dr_nonheap_alloc() to allocate space to store code• Generate code using DR’s IR and emit to target space• Mark read-only once emitted via dr_memory_protect()

Page 221: DynamoRIO Tutorial Apr2010

221DynamoRIO Tutorial at CGO 24 April 2010

Jump-and-Link

• Rather than using call+return, avoid stack swap cost by using jump-and-link– Store return address in a register or TLS slot– Direct jump to target– Indirect jump back to source

PRE(bb, inst, INSTR_CREATE_mov_st(drcontext, spill_slot_opnd(drcontext, SPILL_SLOT_2), opnd_create_instr(appinst)));PRE(bb, inst, INSTR_CREATE_jmp(drcontext, opnd_create_pc(shared_slowpath_region)));...PRE(ilist, NULL, INSTR_CREATE_jmp_ind(drcontext, spill_slot_opnd(SPILL_SLOT_2)));

Page 222: DynamoRIO Tutorial Apr2010

222DynamoRIO Tutorial at CGO 24 April 2010

Inter-Instruction Storage

• Spill slots provided by DR are only guaranteed to be live during a single app instr– In practice, live until next selfmod instr

• Allocate own TLS for spill slots– dr_raw_tls_calloc()

• Steal registers across whole bb– Restore before each app read– Update spill slot after each app write– Restore on fault

Page 223: DynamoRIO Tutorial Apr2010

223DynamoRIO Tutorial at CGO 24 April 2010

Using Faults For Faster Common Case Code

• Instead of explicitly checking for rare cases, use faults to handle them and keep common case code path fast

• Signal and exception event and restore state extended event all provide pre- and post-translation contexts and containing fragment information

• Client can return failure for extended restore state event– When can support re-execution of faulting cache instr, but not

re-start translation for relocation

Page 224: DynamoRIO Tutorial Apr2010

224DynamoRIO Tutorial at CGO 24 April 2010

Address Space Iteration

• Repeated calls to dr_query_memory_ex()• Check dr_memory_is_in_client() and

dr_memory_is_dr_internal()• Heap walk

– API on Windows• Initial structures on Windows

– TEB, TLS, etc.– PEB, ProcessParameters, etc.

Page 225: DynamoRIO Tutorial Apr2010

225DynamoRIO Tutorial at CGO 24 April 2010

Intercepting Library Routines

• Common task• Dr. Memory monitors malloc, calloc, realloc, free,

malloc_usable_size, etc.– Alternative is to replace w/ own copies

• Locating entry point– Module API

• Pre-hooks are easy• Post-hooks are hard

– Three techniques, each with its own limitations

Page 226: DynamoRIO Tutorial Apr2010

226DynamoRIO Tutorial at CGO 24 April 2010

Intercepting Library Routines: Technique 1

• CFG analysis at init time– Statically analyze code from entry point and find return

instruction(s)– Post-hook placed at each return instruction

• Complications:– Not always easy or even possible to statically analyze

• Hot/cold and other layout optimizations• Switches or other indirection• Mixed code/data

– Tailcall from hooked routine A to hooked routine B will skip a return in A

– Longjmp or SEH unwind can skip any post-hook

Page 227: DynamoRIO Tutorial Apr2010

227DynamoRIO Tutorial at CGO 24 April 2010

Intercepting Library Routines: Technique 2

• At call site identify target– Direct calls/jmp: easy– Indirect through PLT/IAT: easy– Indirect through register/unknown memory: not always easy– Post-hook is placed at post-call-site– Flush post-call-site if it exists

• Complications:– Indirect targets that are not “statically” analyzable– Same call targets multiple hooked routines– Tailcall from hooked routine A to hooked routine B will skip post-

call of site in A– Longjmp or SEH unwind can skip any post-hook

Page 228: DynamoRIO Tutorial Apr2010

228DynamoRIO Tutorial at CGO 24 April 2010

Intercepting Library Routines: Technique 3

• Inside callee, obtain return address– Flush that address if it already exists– Post-hooks are placed at return address

• Complications:– Same call targets multiple hooked routines

• Store which inside callee– Tailcall from hooked routine A to hooked routine B will skip

return address of B• Identify tailcall and store target B; process B and then A at post-A

– Longjmp or SEH unwind can skip any post-hook• Try to intercept

Page 229: DynamoRIO Tutorial Apr2010

229DynamoRIO Tutorial at CGO 24 April 2010

Modifying Library Routine Parameters

• Use cases for Dr. Memory:– Add redzone to heap allocations– Delay frees

• Simply clobber the actual parameter– Need to know calling convention– Use dr_safe_write() for robustness

Page 230: DynamoRIO Tutorial Apr2010

230DynamoRIO Tutorial at CGO 24 April 2010

Replacing Library Routines

• Dr. Memory replaces libc routines containing optimized code that raises false positives– memcpy, strlen, strchr, etc.

• Simplification: arrange for routines to always be entered in a new bb– Do not request elision or indcall2direct from DR

• Want to interpret replaced routines– DR treats native execution differently: aborts on fault, etc.

• Replace entire bb with jump to replacement routine

Page 231: DynamoRIO Tutorial Apr2010

231DynamoRIO Tutorial at CGO 24 April 2010

State Across Windows Callbacks

• Per-thread state varies in whether should be shared or private across callbacks– Pre-to-post syscall data must be callback-private

• Callback entry:– Ntdll!KiUserCallbackDispatcher

• Callback exit:– NtCallbackReturn system call– Int 0x2b

• Make these DynamoRIO Events? (Issue 241)

Page 232: DynamoRIO Tutorial Apr2010

232DynamoRIO Tutorial at CGO 24 April 2010

Delayed Fragment Deletion

• Due to non-precise flushing we can have a flushed bb made inaccessible but not actually freed for some time

• When keeping state per bb, if a duplicate bb is seen, replace the state and increment a counter ignore_next_delete

• On a deletion event, decrement and ignore unless below 0• Can't tell apart from duplication due to thread-private copies:

but this mechanism handles that if saved info is deterministic and identical for each copy

Page 233: DynamoRIO Tutorial Apr2010

233DynamoRIO Tutorial at CGO 24 April 2010

Callstack Walking

• Use case: error reporting• Technique:

– Start with xbp as frame pointr (fp)– Look for <fp,retaddr> pairs where retaddr = inside a module

• Interesting issues:– When scanning for frame pointer (in frameless func, or at bottom

of stack), querying whether in a module dominates performance– msvcr80!malloc pushes ebx and then ebp, requiring special

handling– When displaying, use retaddr-1 for symbol lookup

Page 234: DynamoRIO Tutorial Apr2010

234DynamoRIO Tutorial at CGO 24 April 2010

Client Files Closed By Application

• Some applications close all file descriptors• Solution:

– Keep table of file descriptors owned by client– Intercept close system call– Turn close into nop when called on client descriptors

Page 235: DynamoRIO Tutorial Apr2010

235DynamoRIO Tutorial at CGO 24 April 2010

Suspending The World

• Use case: Dr. Memory leak check– GC-like memory scan

• Use dr_suspend_all_other_threads() and dr_resume_all_other_threads()

• Cannot hold locks while suspending

Page 236: DynamoRIO Tutorial Apr2010

236DynamoRIO Tutorial at CGO 24 April 2010

Using Nudges

• Daemon apps do not exit• Request results mid-run• Cross-platform

– Signal on Linux– Remote thread on Windows

Page 237: DynamoRIO Tutorial Apr2010

237DynamoRIO Tutorial at CGO 24 April 2010

Tool Packaging

• DynamoRIO is redistributable, so can include a copy with your tool

• Front end to configure and launch app– On Linux use a script that execs drrun– On Windows use drinjectlib.dll

Page 238: DynamoRIO Tutorial Apr2010

Feedback1:30-1:40 Welcome + DynamoRIO History1:40-2:40 DynamoRIO Internals2:40-3:00 Examples, Part 13:00-3:15 Break3:15-4:15 DynamoRIO API4:15-5:15 Examples, Part 25:15-5:30 Feedback

Page 239: DynamoRIO Tutorial Apr2010

239DynamoRIO Tutorial at CGO 24 April 2010

Contributors

• Looking for contributors to DynamoRIO– Work on major features

• Windows 7 support• Auto-inline callouts• Attach to running process• MacOS port

– Add new Extensions– Run nightly test suites– Maintain particular platforms– Etc.

Page 240: DynamoRIO Tutorial Apr2010

240DynamoRIO Tutorial at CGO 24 April 2010

Future Releases

• Better Linux library support (STL, etc.)– Custom loader

• Auto-inline callouts • Persistent and process-shared caches• Attach to a process• Symbol table lookup support on Linux• 64-bit client controlling 32-bit app• Your suggestion here

Page 241: DynamoRIO Tutorial Apr2010

241DynamoRIO Tutorial at CGO 24 April 2010

Feedback

• Questions for you– Feedback on what you want to see in API

• Thank you!