380C lecture 19 Where are we & where we are going –Managed languages Dynamic compilation Inlining...

Post on 17-Jan-2016


380C lecture 19

• Where are we & where we are going
  – Managed languages
    • Dynamic compilation
    • Inlining
    • Garbage collection
      – Opportunity to improve data locality on-the-fly
      – Other opportunities?
  – Why you need to care about workloads
  – Alias analysis
  – Dependence analysis
  – Loop transformations
  – EDGE architectures

CS380C Lecture 19

Garbage Collection Advantage: Improving Program Locality

Xianglong Huang (UT), Stephen M Blackburn (ANU), Kathryn S McKinley (UT), J Eliot B Moss (UMass), Zhenlin Wang (MTU), Perry Cheng (IBM)


Today: Advanced Topics

• Generational Garbage Collection
• Copying objects is an opportunity

• Xianglong Huang (UT), Stephen M Blackburn (ANU), Kathryn S McKinley (UT), J Eliot B Moss (UMass), Zhenlin Wang (MTU), Perry Cheng (IBM), “The Garbage Collection Advantage: Improving Program Locality,” OOPSLA 2004.


Motivation

• Memory gap problem
• OO programs become more popular
• OO programs exacerbate the memory gap problem
  – Automatic memory management
  – Pointer data structures
  – Many small methods

Goal: improve OO program locality


Allocation Mechanisms

Bump-Pointer
• Fast (increment & bounds check)
• Contemporaneous object locality
• Can't incrementally free & reuse: must free en masse

Free-List
• Slightly slower (consult list for fit)
• Mystery locality
• Can incrementally free & reuse cells
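The contrast between the two allocators can be sketched in a few lines of Java. This is a minimal illustration, not code from Jikes RVM or MMTk: the class names are hypothetical, addresses are modelled as integer offsets, and the free list is simplified to a single size class of fixed-size cells.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Bump-pointer allocation: an increment plus a bounds check. */
final class BumpPointerAllocator {
    private final int limit;  // size of the region in bytes
    private int cursor = 0;   // next free offset

    BumpPointerAllocator(int regionSize) { this.limit = regionSize; }

    /** Returns the offset of the new object, or -1 if the region is full. */
    int alloc(int size) {
        if (size > limit - cursor) return -1; // bounds check
        int addr = cursor;
        cursor += size;                       // increment
        return addr;
    }

    /** Individual objects cannot be freed; the whole region is released at once. */
    void freeAll() { cursor = 0; }
}

/** Free-list allocation over fixed-size cells (assumption: one size class only). */
final class FreeListAllocator {
    private final Deque<Integer> freeCells = new ArrayDeque<>();

    FreeListAllocator(int cellCount, int cellSize) {
        for (int i = 0; i < cellCount; i++) freeCells.addLast(i * cellSize);
    }

    /** Consults the list for a free cell; returns -1 if none is left. */
    int alloc() {
        Integer cell = freeCells.pollFirst();
        return cell == null ? -1 : cell;
    }

    /** A single cell can be returned and reused immediately. */
    void free(int cellOffset) { freeCells.addFirst(cellOffset); }
}
```

Objects allocated one after another by the bump pointer end up adjacent in memory, which is the contemporaneous locality noted above; where a free-list cell lands depends on whatever happens to be free.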


State-of-the-art throughput: Copying Generational GC

• Requirements
  – write-barrier to track inter-generation pointers
    • remsets, cards
  – copy reserve
• Advantages:
  – Minimizes copying of older objects
  – Compaction of long-lived objects
• Problems:
  – Not very incremental
  – Very youngest objects always copied
  – What order should GC use to copy objects?

[Figure: heap divided into a ‘nursery’ and an ‘older generation’.]
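The write-barrier requirement can be illustrated with a small sketch. It assumes a remembered set of source objects (one of several possible designs; card marking is another), models the generation test with a boolean flag instead of an address-range check, and gives each object a single reference field. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class GenHeapSketch {
    /** A heap object with one reference field; inNursery stands in for an address-range test. */
    static final class Obj {
        final String name;
        final boolean inNursery;
        Obj field;
        Obj(String name, boolean inNursery) { this.name = name; this.inNursery = inNursery; }
    }

    /** Remembered set: older-generation objects that may point into the nursery. */
    final Set<Obj> remset = new LinkedHashSet<>();

    /** Write barrier: every reference store goes through this method. */
    void writeField(Obj source, Obj target) {
        source.field = target;
        if (!source.inNursery && target != null && target.inNursery) {
            remset.add(source); // record the old-to-young pointer
        }
    }

    /** Extra roots for a nursery collection: nursery objects referenced from remembered sources. */
    List<Obj> nurseryRootsFromRemset() {
        List<Obj> roots = new ArrayList<>();
        for (Obj src : remset) {
            if (src.field != null && src.field.inNursery) roots.add(src.field);
        }
        return roots;
    }
}
```

With the remembered set in hand, a nursery collection can find every live nursery object without scanning the older generation.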


Opportunity

• Generational copying garbage collector reorders objects at runtime


Copying of Linked Objects

[Figure: a linked structure of seven objects, numbered 1 to 7, shown with the order in which they are copied under three traversals: breadth first, depth first, and online object reordering.]
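How the traversal order decides which objects end up next to each other can be sketched as follows. The graph here is a hypothetical binary tree of seven objects chosen for illustration; it is not necessarily the structure drawn on the slide. Breadth first uses a FIFO work queue, as a Cheney-style copying collector does; depth first follows each reference as soon as it is found.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CopyOrderSketch {
    /** Breadth-first copy order: objects are processed from a FIFO queue. */
    static List<Integer> breadthFirst(Map<Integer, List<Integer>> refs, int root) {
        List<Integer> order = new ArrayList<>();
        Set<Integer> copied = new HashSet<>();
        Deque<Integer> queue = new ArrayDeque<>();
        queue.addLast(root);
        copied.add(root);
        while (!queue.isEmpty()) {
            int obj = queue.pollFirst();
            order.add(obj);
            for (int child : refs.getOrDefault(obj, List.of())) {
                if (copied.add(child)) queue.addLast(child); // enqueue each object once
            }
        }
        return order;
    }

    /** Depth-first copy order: each object is followed by its first uncopied referent. */
    static List<Integer> depthFirst(Map<Integer, List<Integer>> refs, int root) {
        List<Integer> order = new ArrayList<>();
        visit(refs, root, new HashSet<>(), order);
        return order;
    }

    private static void visit(Map<Integer, List<Integer>> refs, int obj,
                              Set<Integer> copied, List<Integer> order) {
        if (!copied.add(obj)) return; // already copied
        order.add(obj);
        for (int child : refs.getOrDefault(obj, List.of())) visit(refs, child, copied, order);
    }

    /** Hypothetical graph: 1 -> {2, 3}, 2 -> {4, 5}, 3 -> {6, 7}. */
    static Map<Integer, List<Integer>> exampleGraph() {
        return Map.of(1, List.of(2, 3), 2, List.of(4, 5), 3, List.of(6, 7));
    }
}
```

On this graph, breadth first yields 1 2 3 4 5 6 7 and depth first yields 1 2 4 5 3 6 7. Neither order knows which references the program actually follows most often, which is the gap online object reordering aims to fill.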


Outline

• Motivation
• Online Object Reordering (OOR)
• Methodology
• Experimental Results
• Conclusion


Cache Performance Matters

[Figure: _213_javac, total cycles (in billions, axis from 0 to 40) for three cache configurations: 8K DL1, 8K IL1, 128K L2; perfect L2; perfect IL1 and perfect DL1.]


Online Object Reordering

• Where are the cache misses?
• How to identify hot field accesses at runtime?
• How to reorder the objects?


Where Are The Cache Misses?

• Heap structure:

[Figure: heap layout showing VM objects, stack, older generation, and nursery. Not to scale.]


Where Are The Cache Misses?

[Figure: _209_db, total accesses (in millions, axis from 0 to 2000), split into L2 hits and L2 misses, for VM objects, stack, older generation, and nursery.]


Where Are The Cache Misses?

• Two opportunities to reorder objects in the older generation
  – Promote nursery objects
  – Full heap collection


How to Find Hot Fields?

• Runtime info (intercept every read)?
• Compiler analysis?
• Runtime information + compiler analysis

Key: Low overhead estimation


Which Classes Need Reordering?

Step 1: Compiler analysis
  – Excludes cold basic blocks
  – Identifies field accesses

Step 2: JIT adaptive sampling identifies hot methods
  – Marks the field accesses in hot methods as hot

Key: Low overhead estimation
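The two steps can be combined roughly as in the sketch below. It is an illustration under stated assumptions, not the Jikes RVM implementation: the class and method names are hypothetical, methods and fields are identified by plain strings, and "hot" is decided by a simple sample-count threshold.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class HotFieldSketch {
    /** Step 1 output: per method, the fields accessed outside cold basic blocks. */
    private final Map<String, List<String>> accessLists = new HashMap<>();
    /** Step 2 input: how many timer samples landed in each method. */
    private final Map<String, Integer> samples = new HashMap<>();
    private final int hotThreshold; // hypothetical tuning knob

    HotFieldSketch(int hotThreshold) { this.hotThreshold = hotThreshold; }

    /** Called once per compiled method with the result of the compiler analysis. */
    void recordAccessList(String method, List<String> fields) { accessLists.put(method, fields); }

    /** Called by the sampler each time a timer tick finds this method executing. */
    void sample(String method) { samples.merge(method, 1, Integer::sum); }

    /** Fields accessed in hot methods are marked hot. */
    Set<String> hotFields() {
        Set<String> hot = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> e : samples.entrySet()) {
            if (e.getValue() >= hotThreshold) {
                hot.addAll(accessLists.getOrDefault(e.getKey(), List.of()));
            }
        }
        return hot;
    }
}
```

The cost stays low because no individual field read is intercepted: the analysis runs once per compilation and the sampler only counts timer ticks per method.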


Example: Compiler Analysis

Method Foo {
  Class A a;
  try {
    … = a.b;
    …
  } catch (Exception e) {
    … a.c
  }
}

Compiler:
  – Hot BB: collect access info
  – Cold BB: ignore

Access List:
  1. A.b
  2. …
  …


Example: Adaptive Sampling

Method Foo {
  Class A a;
  try {
    … = a.b;
    …
  } catch (Exception e) {
    … a.c
  }
}

Adaptive sampling: Foo is hot

Foo Accesses:
  1. A.b
  2. …
  …

A.b is hot

[Figure: objects A and B, with A's type information listing its fields b and c.]


Copying of Linked Objects

[Figure: online object reordering uses type information to copy the seven linked objects, dividing the copied objects between a hot space and a cold space.]
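One way a collector could act on hot-field advice is sketched below. This is an assumption-laden illustration rather than the algorithm from the paper: referents reached through hot fields are copied right after their parent, while all other referents wait in a FIFO queue. Field names are unqualified strings for brevity, and a single copy order stands in for the separate hot and cold spaces.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class HotFirstCopySketch {
    static final class Obj {
        final int id;
        final Map<String, Obj> fields = new LinkedHashMap<>(); // field name -> referent
        Obj(int id) { this.id = id; }
    }

    /** Returns the ids of reachable objects in the order they would be copied. */
    static List<Integer> copyOrder(Obj root, Set<String> hotFields) {
        List<Integer> order = new ArrayList<>();
        Set<Obj> copied = new HashSet<>();
        Deque<Obj> hotStack = new ArrayDeque<>();  // LIFO: hot referents are copied next
        Deque<Obj> coldQueue = new ArrayDeque<>(); // FIFO: everything else waits
        coldQueue.addLast(root);
        while (!hotStack.isEmpty() || !coldQueue.isEmpty()) {
            Obj obj = !hotStack.isEmpty() ? hotStack.pollFirst() : coldQueue.pollFirst();
            if (!copied.add(obj)) continue; // already copied via another path
            order.add(obj.id);
            for (Map.Entry<String, Obj> f : obj.fields.entrySet()) {
                if (copied.contains(f.getValue())) continue;
                if (hotFields.contains(f.getKey())) hotStack.addFirst(f.getValue());
                else coldQueue.addLast(f.getValue());
            }
        }
        return order;
    }
}
```

For an object 1 with fields a, b, c pointing to objects 2, 3, 4, marking c as hot gives the order 1 4 2 3, so the parent and its hot referent sit side by side. With no hot fields the sketch degenerates to plain breadth-first order.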


OOR System Overview

[Figure: OOR system overview. Components: source code, baseline compiler, executing code, adaptive sampling, hot methods, optimizing compiler, access info database, and GC (copies objects). Edge labels: adds entries, look up, register hot field accesses, advice, affects locality, improves locality. Legend: OOR addition, Jikes RVM component, input/output.]


Outline

• Motivation
• Online Object Reordering
• Methodology
• Experimental Results
• Conclusion


Methodology: Virtual Machine

• Jikes RVM
  – VM written in Java
  – High performance
  – Timer based adaptive sampling
  – Dynamic optimization
• Experiment setup
  – Pseudo-adaptive
  – 2nd iteration [Eeckhout et al.]


Methodology: Memory Management

• Memory Management Toolkit (MMTk):
  – Allocators and garbage collectors
  – Multi-space heap
    • Boot image
    • Large object space (LOS)
    • Immortal space
• Experiment setup
  – Generational copying GC with 4M bounded nursery

Overhead: OOR Analysis Only

Benchmark    Base Execution Time (sec)    w/ only OOR Analysis (sec)    Overhead
jess           4.39                         4.43                         0.84%
jack           5.79                         5.82                         0.57%
raytrace       4.63                         4.61                        -0.59%
mtrt           4.95                         4.99                         0.70%
javac         12.83                        12.70                        -1.05%
compress       8.56                         8.54                         0.20%
pseudojbb     13.39                        13.43                         0.36%
db            18.88                        18.88                        -0.03%
antlr          0.94                         0.91                        -2.90%
hsqldb       160.56                       158.46                        -1.30%
ipsixql       41.62                        42.43                         1.93%
jython        37.71                        37.16                        -1.44%
ps-fun       129.24                       128.04                        -1.03%
Mean                                                                    -0.19%

Detailed Experiments

• Separate application and GC time
• Vary thresholds for method heat
• Vary thresholds for cold basic blocks
• Three architectures
  – x86, AMD, PowerPC
• x86 Performance counter:
  – DL1, trace cache, L2, DTLB, ITLB


Performance javac


Performance db


Performance jython

Any static ordering leaves you vulnerable to pathological cases.


Phase Changes


Related Work

• Evaluate static orderings [Wilson et al.]
  – Large performance variation
• Static profiling [Chilimbi et al., and others]
  – Lack of flexibility
• Instance-based object reordering [Chilimbi et al.]
  – Too expensive


Conclusion

• Static traversal orders have up to 25% variation

• OOR improves or matches best static ordering

• OOR has very low overhead

• Past predicts future


380C

• Where are we & where we are going
  – Managed languages
    • Dynamic compilation
    • Inlining
    • Garbage collection
  – Why you need to care about workloads & methodology
    • Read: Blackburn et al., Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century, Communications of the ACM, 51(8): 83-89, August 2008.
  – Alias analysis
  – Dependence analysis
  – Loop transformations
  – EDGE architectures