Day 2: Building Process Virtualization Systems Kim Hazelwood ACACES Summer School July 2009.


Day 2: Building Process Virtualization Systems

Kim Hazelwood

ACACES Summer School, July 2009


ACACES 2009 – Process Virtualization

Course Outline

• Day 1 – What is Process Virtualization?

• Day 2 – Building Process Virtualization Systems

• Day 3 – Using Process Virtualization Systems

• Day 4 – Symbiotic Optimization


JIT-Based Process Virtualization

[Figure: block diagram with the stages Application, Transform, Code Cache, Execute, and Profile.]


What are the Challenges?

Performance!

Solutions:

• Code caches – only transform code once

• Trace selection – focus on hot paths

• Branch linking – only perform cache lookup once

• Indirect branch hash tables / chaining

• Memory “management”

Correctness – self-modifying code, munmaps, multithreading

Transparency – context switching, eflags


What is the Overhead?

The latest Pin overhead numbers …

[Figure: bar chart of Pin execution time relative to native (y-axis 100% to 200%) for perlbench, sjeng, xalancbmk, gobmk, gcc, h264ref, omnetpp, bzip2, libquantum, mcf, astar, and hmmer.]


Sources of Overhead

Internal

• Compiling code & exit stubs (region detection, region formation, code generation)

• Managing code (eviction, linking)

• Managing directories and performing lookups

• Maintaining consistency (SMC, DLLs)

External

• User-inserted instrumentation


Improving Performance: Code Caches

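[Figure: flowchart of code cache operation. Start with a hash table lookup on the branch target address. Hit: execute from the code cache until an exit stub returns to the lookup. Miss: interpret and increment a counter (Counter++). If the code is hot, perform region formation and optimization. If there is no room in the code cache, evict code (delete). Insert the new region and update the hash table.]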
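As a rough illustration of the flow on this slide, the following C++ sketch models one trip from hash table lookup to insertion. It is not Pin's implementation: the type `CodeCacheModel`, its fields, the hot threshold, the made-up cache addresses, and the flush-everything eviction policy are all assumptions made for the example, and "interpreting" is reduced to bumping a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical model of the flowchart; addresses stand in for real code.
using Addr = std::uint32_t;

struct CodeCacheModel {
    std::size_t capacity;                        // max cached regions (assumed unit)
    unsigned hotThreshold;                       // executions before code counts as hot
    std::unordered_map<Addr, Addr> table;        // branch target -> code cache address
    std::unordered_map<Addr, unsigned> counter;  // per-target execution counter
    Addr nextSlot = 0x7000;                      // made-up code cache base address
    unsigned hits = 0, interpreted = 0, evictions = 0;

    CodeCacheModel(std::size_t cap, unsigned hot)
        : capacity(cap), hotThreshold(hot) {
        assert(cap > 0 && hot > 0);
    }

    // One pass through the flowchart for a single branch target.
    // Returns true when the target runs from the code cache.
    bool dispatch(Addr target) {
        if (table.count(target)) {               // hash table lookup: hit
            ++hits;
            return true;
        }
        ++interpreted;                           // miss: interpret this time
        if (++counter[target] < hotThreshold) {  // Counter++; code is hot?
            return false;
        }
        if (table.size() >= capacity) {          // room in code cache?
            table.clear();                       // evict code (assumed: flush all)
            ++evictions;
        }
        table[target] = nextSlot;                // insert and update hash table
        nextSlot += 0x100;
        return false;
    }
};
```

With a capacity of 2 and a threshold of 2, a target is interpreted twice, then hits on every later dispatch until a third hot target forces the flush.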


Software-Managed Code Caches

• Store transformed code at run time to amortize overhead of process VMs

• Contain a (potentially altered) copy of application code

[Figure: block diagram with the stages Application, Transform, Code Cache, Execute, and Profile.]


Code Cache Contents

Every application instruction executed is stored in the code cache (at least)

Code Regions

• Altered copies of application code

• Basic blocks and/or traces

Exit stubs

• Swap application and VM state

• Return control to the process VM


Code Regions

Basic Blocks

BBL A: Inst1, Inst2, Inst3, Branch B

Traces

[Figure: blocks A, B, C, and D drawn three ways: as a CFG, as laid out in memory (A, B, C, D), and as a trace.]


Exit Stubs

One exit stub exists for every exit from every trace or basic block

Functionality

Prepare for context switch

Return control to VM dispatch

Details

Each exit stub ≈ 3 instructions

[Figure: a trace of blocks A, B, and D with two exit stubs, “Exit to C” and “Exit to E”.]

Performance: Trace Selection

• Interprocedural path

• Single entry, multiple exit

[Figure: a CFG with blocks A through I that includes a call and a return. Layout in memory: A B C D I, G H, E F. Layout in code cache: the trace (superblock) A B D E G H I with exit stubs “Exit to C” and “Exit to F”.]
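One way to picture how a single-entry, multiple-exit trace such as ABDEGHI could be formed is to follow the hot successor of each block and record every other successor as an exit. The sketch below does that in C++. The selection policy, the `Block` and `Trace` types, the `maxBlocks` cap, and the edge list in `slideCfg` (including which successor is hot) are assumptions inferred from the figure, not the algorithm any particular system uses.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical CFG node: successors, with the hot (most frequent) one first.
struct Block {
    std::vector<char> succs;
};

struct Trace {
    std::string blocks;       // blocks laid out contiguously in the code cache
    std::vector<char> exits;  // off-trace targets, one exit stub each
};

// Follow hot successors from `head` until the path ends, would revisit a
// block already on the trace, or reaches maxBlocks.
Trace formTrace(const std::map<char, Block>& cfg, char head,
                std::size_t maxBlocks) {
    assert(cfg.count(head) && maxBlocks > 0);
    Trace t;
    char cur = head;
    while (true) {
        t.blocks.push_back(cur);
        const std::vector<char>& s = cfg.at(cur).succs;
        if (s.empty()) break;                    // e.g. path ends after a return
        for (std::size_t i = 1; i < s.size(); ++i) {
            t.exits.push_back(s[i]);             // cold successors leave the trace
        }
        char next = s[0];
        bool stop = t.blocks.size() >= maxBlocks ||
                    t.blocks.find(next) != std::string::npos ||
                    !cfg.count(next);
        if (stop) {
            t.exits.push_back(next);             // hot path leaves the trace here
            break;
        }
        cur = next;
    }
    return t;
}

// Assumed edges for the slide's CFG; A->B and E->G are taken to be hot.
std::map<char, Block> slideCfg() {
    return {
        {'A', {{'B', 'C'}}}, {'B', {{'D'}}}, {'C', {{'D'}}},
        {'D', {{'E'}}},      {'E', {{'G', 'F'}}}, {'F', {{'H'}}},
        {'G', {{'H'}}},      {'H', {{'I'}}},      {'I', {{}}},
    };
}
```

Calling `formTrace(slideCfg(), 'A', 16)` yields the blocks ABDEGHI with exits to C and F, matching the code cache layout in the figure. A smaller `maxBlocks` produces shorter traces with an extra exit at the cut point.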


Performance: Cache Linking

[Figure: Trace #1 with exit stubs #1a and #1b, the VM dispatch routine, and Trace #2 and Trace #3.]


Linking Traces

Proactive linking

Lazy linking

[Figure: the CFG for blocks A through I (with a call and a return) alongside three code cache traces: ABDEGHI (Exit to C, Exit to F), FHI (Exit to A), and CDEGHI (Exit to A, Exit to F).]
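The difference between the two linking strategies can be sketched with a toy model in which every exit either points at another cached trace or falls back to the dispatcher. Everything here is hypothetical (`LinkingCache`, `Exit`, `takeExit`, and the `proactive` flag are invented for illustration); real systems patch branch instructions rather than integers. Lazy linking patches an exit the first time it is taken and its target is found; proactive linking patches all matching exits when a trace is inserted.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Exit {
    char target;      // application block this exit leads to
    int linkedTrace;  // index of the linked trace, or -1 to go through dispatch
};

struct CachedTrace {
    std::string blocks;
    std::vector<Exit> exits;
};

struct LinkingCache {
    std::vector<CachedTrace> traces;
    std::map<char, int> headToTrace;  // lookup table: head block -> trace index
    unsigned dispatchLookups = 0;     // how often the VM dispatcher was needed

    int add(const std::string& blocks, const std::vector<char>& exitTargets,
            bool proactive = false) {
        assert(!blocks.empty());
        CachedTrace t;
        t.blocks = blocks;
        for (char c : exitTargets) t.exits.push_back(Exit{c, -1});
        int id = static_cast<int>(traces.size());
        traces.push_back(t);
        headToTrace[blocks[0]] = id;
        if (proactive) {  // link every unlinked exit whose target is now cached
            for (CachedTrace& tr : traces) {
                for (Exit& e : tr.exits) {
                    auto it = headToTrace.find(e.target);
                    if (e.linkedTrace < 0 && it != headToTrace.end()) {
                        e.linkedTrace = it->second;
                    }
                }
            }
        }
        return id;
    }

    // Take an exit; returns the next trace index, or -1 if the target is
    // not cached yet and would have to be compiled first.
    int takeExit(int trace, std::size_t exitIdx) {
        Exit& e = traces.at(static_cast<std::size_t>(trace)).exits.at(exitIdx);
        if (e.linkedTrace >= 0) return e.linkedTrace;  // linked: no VM involved
        ++dispatchLookups;                             // unlinked: back to dispatch
        auto it = headToTrace.find(e.target);
        if (it == headToTrace.end()) return -1;
        e.linkedTrace = it->second;                    // lazy link on first use
        return e.linkedTrace;
    }
};
```

In the lazy variant, an exit costs one dispatcher lookup the first time its target is found and none afterwards. In the proactive variant, the work moves to insertion time and exits that are never taken still get linked.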


Are Links Highly Beneficial?

Benchmark   With Linking   Without Linking   Slowdown
gzip        230 sec        7951 sec          3357%
vpr         333 sec        2474 sec          643%
gcc         206 sec        3284 sec          1494%
mcf         368 sec        2014 sec          447%
crafty      215 sec        3547 sec          1550%
parser      350 sec        6795 sec          1841%
perlbmk     336 sec        6945 sec          1967%
gap         195 sec        4231 sec          2070%
vortex      382 sec        4655 sec          1119%
bzip2       287 sec        4294 sec          1396%
twolf       658 sec        6490 sec          886%
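The slowdown column is consistent with the extra time spent without linking, expressed as a percentage of the time with linking: (without − with) / with. The small C++ helper below (a hypothetical name, written only to check the arithmetic) reproduces the table's figures when rounded to the nearest percent.

```cpp
#include <cassert>
#include <cmath>

// Percent slowdown of a run without linking relative to a run with linking.
long slowdownPercent(double withLinkingSec, double withoutLinkingSec) {
    assert(withLinkingSec > 0.0);
    return std::lround(100.0 * (withoutLinkingSec - withLinkingSec) /
                       withLinkingSec);
}
```

For example, gzip gives 100 × (7951 − 230) / 230 ≈ 3357%, and mcf gives 100 × (2014 − 368) / 368 ≈ 447%.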


Code Cache Visualization


Challenge: Rewriting Instructions

• We must regularly rewrite branches

• No atomic branch write on x86

• Pin uses a neat trick*:

[Figure: three successive states of the branch site: (1) the “old” 5-byte branch; (2) a 2-byte self branch followed by n-2 bytes of the “new” branch; (3) the “new” 5-byte branch.]

* Sundaresan et al. 2006
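The three states in the figure suggest a three-step patch, sketched below on a plain byte buffer. This is an illustration under assumptions, not Pin's code: `Branch5`, `encodeJmp`, and `patchBranch` are invented names, the host is assumed little-endian, and a real implementation would additionally need each 2-byte write to be a single atomic store with whatever alignment and fencing the hardware requires. The byte pair `EB FE` is the x86 short jump to itself, and `E9` followed by a 32-bit displacement is the 5-byte near jump.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using Branch5 = std::array<std::uint8_t, 5>;  // E9 + rel32

// Encode "jmp rel32" placed at address `at` and targeting `target`.
Branch5 encodeJmp(std::uint32_t at, std::uint32_t target) {
    std::uint32_t rel = target - (at + 5);    // relative to the next instruction
    Branch5 b{{0xE9, 0, 0, 0, 0}};
    std::memcpy(b.data() + 1, &rel, 4);       // little-endian host assumed
    return b;
}

// Replace the 5-byte branch at `site` with `newBranch` in three steps so a
// thread arriving mid-update never decodes a half-old, half-new branch.
void patchBranch(std::uint8_t* site, const Branch5& newBranch) {
    assert(site != nullptr);
    const std::uint8_t selfBranch[2] = {0xEB, 0xFE};  // jmp to itself
    // Step 1: park arriving threads on a 2-byte self branch.
    std::memcpy(site, selfBranch, 2);
    // Step 2: write the trailing n-2 bytes of the new branch behind it.
    std::memcpy(site + 2, newBranch.data() + 2, 3);
    // Step 3: overwrite the self branch with the first 2 bytes of the new one.
    std::memcpy(site, newBranch.data(), 2);
}
```

Between steps 1 and 3 a thread reaching the site spins on the self branch, so the trailing bytes can be changed without anyone executing them.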


Challenge: Achieving Transparency

Pretend as though the original program is executing

Original Code:

0x1000 call 0x4000

Translated Code:

0x7000 push 0x1006

0x7006 jmp 0x8000

Code cache address mapping:

SPC (source PC)   TPC (translated PC)

0x1000            0x7000   “caller”

0x4000            0x8000   “callee”

Push 0x1006 (the original return address) on the stack, then jump to the translated copy of 0x4000, which lives at 0x8000
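The call translation on this slide can be sketched as a small function over the SPC-to-TPC map. The names (`translateCall`, `TranslatedInst`) and the 6-byte instruction lengths are assumptions chosen only so that the output matches the slide's addresses; they are not claims about actual x86 encodings or Pin's translator.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Addr = std::uint32_t;

struct TranslatedInst {
    Addr tpc;        // where the instruction lives in the code cache
    std::string op;  // mnemonic, for illustration only
    Addr operand;
};

// Translate "call calleeSpc" located at callSpc. The pushed value is the
// ORIGINAL return address, so the application never sees a code cache address.
std::vector<TranslatedInst> translateCall(const std::map<Addr, Addr>& spcToTpc,
                                          Addr callSpc, Addr callLen,
                                          Addr calleeSpc) {
    assert(spcToTpc.count(callSpc) && spcToTpc.count(calleeSpc));
    const Addr kPushLen = 6;             // assumed, to match the slide's 0x7006
    Addr tpc = spcToTpc.at(callSpc);
    Addr returnSpc = callSpc + callLen;  // what the native call would push
    return {
        {tpc, "push", returnSpc},
        {tpc + kPushLen, "jmp", spcToTpc.at(calleeSpc)},
    };
}
```

With the mapping {0x1000 → 0x7000, 0x4000 → 0x8000} and a call length of 6, the result is `push 0x1006` at 0x7000 followed by `jmp 0x8000` at 0x7006, as on the slide.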


Challenge: Self-Modifying Code

The problem

Code cache must detect SMC and invalidate corresponding cached traces

Solutions

Many proposed … but without HW support, they are very expensive!

• Changing page protection

• Memory diff prior to execution

• On ARM, there is an explicit instruction for SMC!


Self-Modifying Code Handler

(Written by Alex Skaletsky)

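#include "pin.H"
#include <cstring>

// Analysis routine: runs before each execution of the trace and compares the
// application's current code bytes against the copy saved at instrumentation.
void DoSmcCheck(VOID *traceAddr, VOID *traceCopyAddr,
                USIZE traceSize, CONTEXT *ctxP) {
    if (memcmp(traceAddr, traceCopyAddr, traceSize) != 0) {
        // The code changed: drop the stale trace and restart at this point.
        CODECACHE_InvalidateTrace((ADDRINT)traceAddr);
        PIN_ExecuteAt(ctxP);
    }
}

// Instrumentation routine: saves a copy of the trace's original bytes and
// inserts the check ahead of the trace.
void InsertSmcCheck(TRACE trace, VOID *v) {
    . . .
    memcpy(traceCopyAddr, traceAddr, traceSize);
    TRACE_InsertCall(trace, IPOINT_BEFORE, (AFUNPTR)DoSmcCheck,
                     IARG_PTR, traceAddr,
                     IARG_PTR, traceCopyAddr,
                     IARG_UINT32, traceSize,
                     IARG_CONTEXT,
                     IARG_END);
}

int main(int argc, char **argv) {
    PIN_Init(argc, argv);
    TRACE_AddInstrumentFunction(InsertSmcCheck, 0);
    PIN_StartProgram();  // Never returns
    return 0;
}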


Challenge: Parallel Applications

[Figure: Pin (JIT compiler, syscall emulator, signal emulator, dispatcher, code cache) and the Pin Tool (instrumentation code, call-back handlers, analysis code), marked to show which parts run serialized and which run in parallel for threads T1 and T2.]


Challenge: Code Cache Consistency

Cached code must be removed for a variety of reasons:

• Dynamically unloaded code

• Ephemeral/adaptive instrumentation

• Self-modifying code

• Bounded code caches

[Figure: block diagram with the stages EXE, Transform, Code Cache, Execute, and Profile.]


Motivating a Bounded Code Cache

The Perl Benchmark

[Figure: bar chart of performance relative to native (y-axis 100% to 400%) for input1, input2, input3, and total, comparing an unlimited code cache with 2.5 MB, 2.0 MB, 1.5 MB, and 1.0 MB code caches.]

Flushing the Code Cache

• Option 1: All threads have a private code cache (oops, doesn’t scale)

• Option 2: Shared code cache across threads

• If one thread flushes the code cache, other threads may resume in stale memory

[Figure: bar chart of trace memory increase (y-axis 0% to 600%) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp.]


Naïve Flush

Wait for all threads to return to the code cache

Could wait indefinitely!

[Figure: timeline for Thread1, Thread2, and Thread3, each moving between the VM and code cache CC1, then stalling in the VM for the flush delay before continuing in code cache CC2.]


Generational Flush

Allow threads to continue to make progress in a separate area of the code cache

[Figure: timeline for Thread1, Thread2, and Thread3, each moving between the VM and code cache CC1 and then on to code cache CC2, with no VM stalls.]

Requires a high water mark
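A generational flush can be pictured as bookkeeping over cache generations: a flush retires the current generation immediately, new code goes into the next one, and a retired generation is reclaimed once the last thread executing in it returns to the VM. The single-threaded C++ model below is a sketch under those assumptions. `GenerationalCache` and its methods are invented names, no locking is shown, and the high water mark that decides when to flush is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

class GenerationalCache {
public:
    // A thread leaves the VM and starts executing cached code; it is tied
    // to whichever generation is current at that moment.
    int enterCache(int thread) {
        assert(where_.count(thread) == 0);
        where_[thread] = current_;
        ++live_[current_];
        return current_;
    }

    // A thread returns to the VM. A retired generation is reclaimed as soon
    // as its last thread has left.
    void leaveCache(int thread) {
        auto it = where_.find(thread);
        assert(it != where_.end());
        int gen = it->second;
        where_.erase(it);
        if (--live_[gen] == 0 && gen != current_) {
            live_.erase(gen);
            ++reclaimed_;
        }
    }

    // Flush without waiting: retire the current generation and open a new
    // one so that other threads keep making progress.
    void flush() {
        int old = current_++;
        if (live_[old] == 0) {  // nobody inside: reclaim right away
            live_.erase(old);
            ++reclaimed_;
        }
    }

    int current() const { return current_; }
    std::size_t reclaimed() const { return reclaimed_; }

    // Retired generations that still have threads executing in them.
    std::size_t pendingGenerations() const {
        std::size_t n = 0;
        for (const auto& kv : live_) {
            if (kv.first != current_) ++n;
        }
        return n;
    }

private:
    std::map<int, int> where_;  // thread -> generation it is executing in
    std::map<int, int> live_;   // generation -> number of threads inside
    int current_ = 0;
    std::size_t reclaimed_ = 0;
};
```

In contrast to the naïve flush, `flush()` never blocks. The cost is that old and new generations coexist for a while, which is one reason to trigger the flush before the cache is completely full.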

Build-Your-Own Cache Replacement

#include "pin.H"
#include <iostream>

// Callback invoked when the code cache is full: flush everything.
void FlushOnFull() {
    CODECACHE_FlushCache();
    std::cout << "SWOOSH!" << std::endl;
}

int main(int argc, char **argv) {
    PIN_Init(argc, argv);
    CODECACHE_CacheIsFull(FlushOnFull);
    PIN_StartProgram();  // Never returns
    return 0;
}

% pin -cache_size 40960 -t flusher -- /bin/ls

SWOOSH!

SWOOSH!

Eviction Granularities

• Entire Cache

• One Cache Block

• One Trace

• Address Range


A Graphical Front-End


Memory Scalability of the Code Cache

Ensuring scalability also requires carefully configuring the code stored in the cache

Trace Lengths

• First basic block is non-speculative, others are speculative

• Longer traces = fewer entries in the lookup table, but more unexecuted code

• Shorter traces = two off-trace paths at ends of basic blocks with conditional branches = more exit stub code


Effect of Trace Length on Trace Count

[Figure: bar chart of total trace count (y-axis 0 to 16000) for wupwise, swim, mgrid, applu, galgel, equake, apsi, gafort, fma3d, art, and ammp, with trace lengths of 1, 2, 4, 8, 16, and 32 basic blocks.]


Effect of Trace Length on Memory

[Figure: stacked bar chart of code cache footprint in KB (y-axis 0 to 10500) against basic blocks per trace (1, 2, 4, 8, 16, 32), broken down into lookup table, links, exit stubs, and traces.]


Sources of Overhead

Internal

• Compiling code & exit stubs (region detection, region formation, code generation)

• Managing code (eviction, linking)

• Managing directories and performing lookups

• Maintaining consistency (SMC, DLLs)

External

• User-inserted instrumentation


Adding Instrumentation

[Figure: bar chart of execution time relative to native (y-axis 100% to 800%) for Pin and Pin+icount across perlbench, sjeng, xalancbmk, gobmk, gcc, h264ref, omnetpp, bzip2, libquantum, mcf, astar, and hmmer.]


“Normal Pin” Execution Flow

Instrumentation is interleaved with application

[Figure: timelines comparing the uninstrumented application with the “Pinned” instrumented application, whose run time is extended by Pin overhead and instrumentation overhead.]


“SuperPin” Execution Flow

SuperPin creates instrumented slices

[Figure: the uninstrumented application, the SuperPinned application, and its instrumented slices.]


Issues and Design Decisions

Creating slices

• How/when to start a slice

• How/when to end a slice

System calls

Merging results

Execution Timeline

[Figure: timeline across CPU1 to CPU4 showing the original application forking instrumented application slices S1 to S6. Events on the timeline are labeled fork, sleep, “record sigrN, sleep”, resume, “detect sigrN”, and “detect exit”.]


Performance – icount1

% pin -t icount1 -- <benchmark>

[Figure: bar chart comparing Pin and SuperPin (y-axis 0% to 3000%) for ammp, applu, apsi, art, bzip2, crafty, eon, equake, facerec, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, perlbmk, sixtrack, swim, twolf, vortex, vpr, wupwise, and the average (AVG).]


What Did We Learn Today?

Building Process VMs is only half the battle

Robustness, correctness, performance are paramount

Lots of “tricks” are in play

• Code caches, trace selection, etc.

Knowing about these tricks is beneficial

• Lots of research opportunities

• Understanding the inner workings often helps you write better tools


Want More Info?

• Read the seminal Dynamo paper

• See the more recent papers by the Pin, DynamoRIO, Valgrind teams

• Relevant conferences: VEE, CGO, ASPLOS, PLDI, PACT
