OOPSLA 2003 Mostly Concurrent Garbage Collection Revisited Katherine Barabash - IBM Haifa Research...

OOPSLA 2003

Mostly Concurrent Garbage Collection Revisited

Katherine Barabash - IBM Haifa Research Lab, Israel
Yoav Ossia - IBM Haifa Research Lab, Israel
Erez Petrank - Technion, Israel

IBM Labs in Haifa


Outline

The mostly concurrent garbage collection (GC)
Internals of the collector
  Write barrier and card table
  Incremental collection
Our two improvements
  And their implications on performance
Results
Conclusions


Mark Sweep Stop-The-World (STW) Garbage Collection

The basic method
  Mark all objects that are reachable from roots
  Sweep: reclaim all unmarked objects
  Done while Java mutation is suspended (STW)
Pause time: the length of the STW phase
Motivation for the mostly concurrent GC
  Reduce the pause time at an acceptable throughput hit
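The basic method can be sketched in a few lines. This is a minimal illustration on a toy heap, not the collector's code; the names `mark`, `sweep`, `collect`, and the dictionary-based object graph are assumptions made for the example.

```python
def mark(roots, refs):
    """Return the set of objects reachable from roots.

    refs maps an object id to the list of object ids it references.
    """
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        stack.extend(refs.get(obj, []))
    return marked


def sweep(heap, marked):
    """Return the unmarked objects, i.e. the ones to reclaim."""
    return {obj for obj in heap if obj not in marked}


def collect(heap, roots, refs):
    """One stop-the-world cycle: the mutator is assumed suspended throughout."""
    marked = mark(roots, refs)
    garbage = sweep(heap, marked)
    return heap - garbage, garbage


# Toy heap: a -> b is reachable from the root; c -> d is not.
heap = {"a", "b", "c", "d"}
refs = {"a": ["b"], "c": ["d"]}
live, garbage = collect(heap, ["a"], refs)
```

In this sketch the whole of `collect` corresponds to the pause, which is why its length grows with the amount of reachable data.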


Mostly Concurrent GC - The Basic Method

Perform marking concurrently with Java mutation
  Traditionally done by a separate thread
While concurrent marking is active, record changes in objects
  Otherwise…
When marking terminates, do a short STW phase
  Re-trace from
    Roots
    Marked objects that were not traced yet
    Marked and changed objects
  Sweep
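Why changes must be recorded can be seen on a three-object example. The sketch below is an assumed scenario, with hypothetical objects A, B, C and a single interleaving chosen by hand: the mutator moves the only reference to C into an object the tracer has already finished, so C is missed until the final re-trace.

```python
def trace(start, refs, marked):
    """Mark everything reachable from the objects in `start`."""
    stack = list(start)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        stack.extend(refs[obj])


# Toy heap: A and B are roots; C is initially reachable only through B.
refs = {"A": [], "B": ["C"], "C": []}
marked, changed = set(), set()

# Concurrent marking, part 1: the tracer finishes A.
trace(["A"], refs, marked)

# Mutator: move the only reference to C from B into A, recording both writes.
refs["A"].append("C")
changed.add("A")
refs["B"].remove("C")
changed.add("B")

# Concurrent marking, part 2: the tracer reaches B, which no longer points to C.
trace(["B"], refs, marked)
missed_without_record = "C" not in marked

# Final STW phase: re-trace from objects that are both marked and changed.
rescan = [child for obj in sorted(changed & marked) for child in refs[obj]]
trace(rescan, refs, marked)
```

Without the `changed` set, C would stay unmarked and be swept although it is reachable from A.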


Mostly Concurrent GC - Perspective

Related Work
  Steele and Dijkstra et al. - Concurrent GC. 1976
  Baker - Incremental collection. 1977
  Boehm et al. - Mostly concurrent collection. 1991
  Printezis and Detlefs - Mostly concurrent and generational GC. 2000
  Many others…
  Ossia et al. - Parallel, incremental and concurrent GC. 2002
Status
  Collector well accepted both in academic research and industry
  Used in many production JVMs: IBM, Sun, BEA JRockit


Outline


Internals of the collector
  Write barrier and card table
  Incremental collection


The Write Barrier and Object “cleaning” interaction

[Figure: object graph with Blue and Green objects, annotated with the steps below]
Tracer: marks and traces
Java mutator: modifies Blue and Green objects (write barrier on objects)
Tracer: traces rest of graph
Tracer: cleans Blue object


Mostly Concurrent GC – Card Cleaning

Heap is logically divided into cards
A card table is used, with a byte entry for each heap card
A card-marking write barrier
  Whenever a reference field is modified, dirty the card table entry of the modified object
Card cleaning
  Clear the dirty mark
  Retrace all marked objects on the card
Card cleaning can be done concurrently
  While this is done, more cards will be dirtied
  An additional STW card cleaning phase must be done
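A minimal sketch of the card table and its write barrier follows. The card size of 512 bytes, the class name `CardTable`, and the use of integer addresses are assumptions for illustration; the slides do not give these details.

```python
CARD_SIZE = 512  # bytes per card; assumed value, not taken from the slides


class CardTable:
    def __init__(self, heap_size):
        num_cards = (heap_size + CARD_SIZE - 1) // CARD_SIZE
        self.cards = bytearray(num_cards)  # one byte entry per heap card

    def write_barrier(self, obj_addr):
        """Called on every reference-field store into the object at obj_addr."""
        self.cards[obj_addr // CARD_SIZE] = 1

    def dirty_cards(self):
        return [i for i, entry in enumerate(self.cards) if entry]

    def clean(self, card, marked_addrs, retrace):
        """Clear the dirty mark, then retrace all marked objects on the card.

        The mark is cleared first so that a store arriving during the
        retrace dirties the card again instead of being lost.
        """
        self.cards[card] = 0
        lo, hi = card * CARD_SIZE, (card + 1) * CARD_SIZE
        for addr in sorted(marked_addrs):
            if lo <= addr < hi:
                retrace(addr)


table = CardTable(4096)
table.write_barrier(1030)          # object at address 1030 lives on card 2
dirty_before = table.dirty_cards()

retraced = []
table.clean(2, {1024, 1100, 2000}, retraced.append)
dirty_after = table.dirty_cards()
```

The barrier is a single unconditional byte store, which is what makes it cheap enough to leave always active.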


Incremental Mostly Concurrent GC

Marking done by the Java mutator threads
  When allocating
Tracing Rate (TR) - configurable by user
  The ratio of required tracing work to requested allocation size
  For every allocation request of K bytes, trace K * TR bytes of objects
  Allocation rate of application and tracing rate of collector imply the CPU percentage dedicated to the concurrent collection
Starting the concurrent collector
  Must be done on time, so that tracing completes by the time the heap is exhausted
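The arithmetic behind the tracing rate can be sketched as follows. This is a simplified model, assuming all tracing is done by allocating mutators and ignoring card cleaning work and any safety margin; the function names and the 200 MB figure are hypothetical.

```python
MB = 1024 * 1024


def tracing_quota(request_bytes, tracing_rate):
    """Bytes of objects a mutator must trace for one allocation request."""
    return request_bytes * tracing_rate


def kickoff_free_space(bytes_to_trace, tracing_rate):
    """Free space at which concurrent tracing must start (simplified model).

    Tracing bytes_to_trace at tracing_rate bytes per allocated byte takes
    bytes_to_trace / tracing_rate bytes of allocation, so at least that
    much free space must remain at kickoff.
    """
    return bytes_to_trace / tracing_rate


quota = tracing_quota(64, 8)                    # a 64-byte request at TR 8
kickoff_tr8 = kickoff_free_space(200 * MB, 8)   # hypothetical 200 MB to trace
kickoff_tr1 = kickoff_free_space(200 * MB, 1)
```

Under this model a lower tracing rate needs an earlier kickoff, which lengthens the concurrent cycle.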


Concurrent Behavior Patterns

Higher tracing rate implies a shorter concurrent cycle with a smaller CPU share for Java
Numbers below refer to SPECjbb
  STW GC: 100% CPU for Java mutation
  Mostly concurrent, tracing rate 8: 28% CPU for Java mutation
  Mostly concurrent, tracing rate 1: 72% CPU for Java mutation
[Figure: CPU utilization over time for each configuration; legend: Java mutation, Incremental tracing, Parallel STW]


Mostly concurrent GC – Summary of the Base Algorithm

Fast card-marking write barrier always active (JITed)
Kick off concurrent tracing when free space reaches the kickoff point
  Reset the card table
  Trace (incrementally) all objects reachable from roots
  Do a single concurrent card cleaning pass on the card table
Initiate final (short) STW phase
  Trace the roots again for new objects
  Do another card cleaning pass
  Trace all newly marked objects
  Sweep


The Repetitive Work Problem

Observations
  Suppose a card is dirtied while concurrent marking is executing
  Newly reached objects in this card are marked and traced by the collector
  All these objects will later be traced again in the card cleaning phase
  Outcome: repeated tracing
Improvement: Don’t trace through dirty cards


Don’t Trace Through Dirty Cards

If an object resides in a dirty card, omit its tracing
  Only a single tracing will be done, in the concurrent card cleaning phase
Advantages
  Less marking work
  Reduced floating garbage
    More later…
  Reduced cache miss rate
    More later…
  Thus, substantial throughput improvements
Disadvantage
  Increased pause time
[Figure: timelines for Base Method and Don’t trace dirties; legend: Java mutation, Concurrent tracing, STW tracing, Concurrent card cleaning, STW card cleaning]
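The saving in marking work can be illustrated by counting field scans on a toy heap. The sketch assumes 512-byte cards, integer addresses, and a single-threaded run; `trace`, `clean_cards`, and `run` are hypothetical names, and an object on a dirty card is still marked but its fields are left for card cleaning.

```python
CARD_SIZE = 512  # assumed card size


def trace(start, refs, dirty_cards, marked, skip_dirty=True):
    """Mark from `start`; return how many objects had their fields scanned."""
    scanned = 0
    stack = list(start)
    while stack:
        addr = stack.pop()
        if addr in marked:
            continue
        marked.add(addr)
        if skip_dirty and addr // CARD_SIZE in dirty_cards:
            continue  # marked, but scanning is left to card cleaning
        scanned += 1
        stack.extend(refs.get(addr, []))
    return scanned


def clean_cards(dirty_cards, refs, marked):
    """Rescan every marked object on each dirty card, tracing what it reaches."""
    scanned = 0
    for card in sorted(dirty_cards):
        on_card = [a for a in sorted(marked) if a // CARD_SIZE == card]
        for addr in on_card:
            scanned += 1
            scanned += trace(refs.get(addr, []), refs, set(), marked)
    dirty_cards.clear()
    return scanned


def run(skip_dirty):
    refs = {0: [600], 600: [1100]}   # chain of objects on cards 0, 1, 2
    dirty, marked = {1}, set()       # card 1 was dirtied by the mutator
    work = trace([0], refs, dirty, marked, skip_dirty)
    work += clean_cards(dirty, refs, marked)
    return work, marked


base_work, base_marked = run(skip_dirty=False)
improved_work, improved_marked = run(skip_dirty=True)
```

Both variants mark the same objects; the base variant scans the object on the dirty card twice, the improved one scans it once, during card cleaning.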


Timing of Card Dirtying

Observations
  Card (and object) dirtying indicates that previous tracing may have been insufficient
    New objects may be reachable only from the dirtied object
  Dirtying information is needed only if tracing was already done
    Prior to tracing, card dirtying is irrelevant
Improvement: Undirty cards with no traced objects
  Undirtying via scanning
  Undirtying via local allocation caches


Undirtying via Scanning

Undirtying can be done periodically on the whole card table
An indication of “traced” cards is needed
  Mark bit vector
  A “traced” card table (first traced object marks the card as traced)
Method: scan the card table and undirty all cards with no traced objects
  Very effective in undirtying cards (cuts cleaning by 65%)
  Some extra cost of card table scan
  Should be done frequently
    Catch these cards before any marking or tracing occurs!
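The scan itself can be sketched with two parallel byte tables. This is a single-threaded illustration that ignores the synchronization a real collector needs between the scan, the write barrier, and the tracers; the function name `undirty_scan` and the sample tables are assumptions.

```python
def undirty_scan(dirty, traced):
    """Undirty every dirty card on which no object has been traced yet.

    dirty and traced are bytearrays with one entry per card.
    Returns the number of cards that were undirtied.
    """
    undirtied = 0
    for card in range(len(dirty)):
        if dirty[card] and not traced[card]:
            dirty[card] = 0
            undirtied += 1
    return undirtied


dirty = bytearray([1, 1, 0, 1])    # cards 0, 1, 3 are dirty
traced = bytearray([0, 1, 0, 0])   # only card 1 holds a traced object
undirtied = undirty_scan(dirty, traced)
```

Card 1 stays dirty because an object on it was already traced, so its dirty mark still carries information the collector needs.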


Undirtying via Local Allocation Caches

Local allocation caches are used by most modern JVMs
Most cards of active caches are dirty
  Objects usually have write barriers while (and shortly after) initializing
Objects in the active cache are (usually) not traced before the cache is replaced
If no tracing in the active cache is guaranteed, we can undirty its cards
Method: cooperation between allocators and concurrent tracers
  Allocator (when replacing a local cache):
    Undirty all the cache’s inner cards
    Mark all cards as “traceable”
    Take a new cache and mark all its cards as “untraceable”
  Concurrent tracer
    Defer tracing of objects in “untraceable” cards “for a while”
    BTW, this hardly ever happens
Cuts the amount of dirty cards by more than 35%, at no cost
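The allocator and tracer cooperation can be sketched as below. It is a single-threaded illustration under stated assumptions: a cache is given as a list of consecutive card indices, “inner cards” is read as all but the first and last card (boundary cards may be shared with neighbouring memory), and `CardState`, `replace_cache`, and `try_trace` are hypothetical names.

```python
class CardState:
    def __init__(self, num_cards):
        self.dirty = bytearray(num_cards)
        self.untraceable = bytearray(num_cards)


def replace_cache(state, old_cards, new_cards):
    """Allocator side, run when a thread retires its local allocation cache."""
    for card in old_cards[1:-1]:      # inner cards only (assumed reading)
        state.dirty[card] = 0
    for card in old_cards:            # undirty first, then allow tracing
        state.untraceable[card] = 0
    for card in new_cards:
        state.untraceable[card] = 1


def try_trace(state, card, deferred):
    """Tracer side: True if an object on `card` may be traced now."""
    if state.untraceable[card]:
        deferred.append(card)         # revisit later
        return False
    return True


state = CardState(8)
for card in range(4):                 # active cache covers cards 0..3
    state.untraceable[card] = 1
    state.dirty[card] = 1             # dirtied while initializing objects

deferred = []
traced_early = try_trace(state, 2, deferred)
replace_cache(state, [0, 1, 2, 3], [4, 5, 6, 7])
traced_late = try_trace(state, 2, deferred)
```

Undirtying before the cards become traceable is what keeps the step safe: no object on those cards can have been traced while a dirty mark was being dropped.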


Undirty Cards with No Traced Objects

Advantages
  Without “Don’t trace through dirty cards”: less work
  With it, reduces the STW card cleaning significantly
[Figure: timelines for Base Method, Don’t trace dirties, and Don’t trace dirties + Undirty; legend: Java mutation, Concurrent tracing, STW tracing, Concurrent card cleaning, STW card cleaning]


Characteristics of Dirty Cards

We believe that a recently dirtied card is a good indication of more modification of objects in the near future
  Change of references
  Other writing activities
  Indication applies to all the objects in the card
A recently dirtied card is probably hot and active
  But we don’t trace through dirty cards!
  By the time we get to clean them they will probably have become more stable and colder


Reduced Floating Garbage

Floating garbage is created when tracing is done before the object modification
Don’t trace through dirty cards!
  Will probably defer the tracing until the card becomes stable
    Objects are no longer modified
    No floating garbage will be created as a result of this late tracing


Reduced Cache Miss Rate

Reducing the tracing work affects the cache miss rate
  Tracing the object graph intensifies cache capacity misses
But cache coherency misses are also reduced
  A write-barriered card (hot and active) is probably being modified by Java mutators
  If a concurrent tracer scans objects on such a card, it will suffer coherency misses
Don’t trace through dirty cards!
  Deferring the tracing of these objects to the card cleaning phase reduces cache coherency misses
Our improved collector reduces L2 cache miss rate by 6.4%
  Out of which 3.7% is reduction in cache coherency misses


Implementation and Tests

Implementation
  On top of the mostly concurrent collector that is part of the IBM production JVM 1.4.0
Platforms
  Tested on both an IBM 6-way pSeries server and an IBM 4-way Netfinity server
Benchmarks
  The SPECjbb2000 benchmark and the SPECjvm98 benchmark suite
Measurements
  Performance of the base collector vs. the improved version
  The effect of each improvement separately, and more…


Results - Throughput Improvement

SPECjbb. 6-way PPC. Heap size 448 MB
26.7% improvement (in tracing rate 1)
[Figure: Throughput, thousands of TPM vs. warehouses (1 to 12); series: Base, Improved]
[Figure: Throughput Improvement, change (%) vs. warehouses (1 to 12)]


Results - Floating Garbage Reduction

SPECjbb. 6-way PPC. Heap size 448 MB
13.4% improvement in heap residency (in tracing rate 1)
Almost all floating garbage eliminated
[Figure: Heap Residency, MB vs. warehouses (1 to 12); series: Base, Improved, STW]
[Figure: Heap Residency Reduction, change (%) vs. warehouses (1 to 12); series: Improved, Optimal]


Results - Pause Time Reduction

SPECjbb. 6-way PPC. Heap size 448 MB
33.3% improvement in average pause
36.4% improvement in max pause
[Figure: Average Pause Time, milliseconds vs. warehouses (1 to 12); series: Base, Improved]
[Figure: Average Pause Time Reduction, change (%) vs. warehouses (1 to 12)]
[Figure: Max Pause Time, milliseconds vs. warehouses (1 to 12); series: Base, Improved]
[Figure: Max Pause Time Reduction, change (%) vs. warehouses (1 to 12)]


Conclusions

Introducing two improvements to the mostly concurrent GC
  Reduce repetitive GC work (don’t trace through dirty cards)
  Reduce the number of dirty cards (undirty cards with no traced objects)
Substantial improvement of the mostly concurrent GC
  Improved throughput by 26%
  Almost eliminated floating garbage (heap residency reduced by 13%)
  Reduced average pause time by 33%
Additional effects of not tracing into dirty cards
  Reduced floating garbage
  Reduced cache miss rate

The improved algorithm has been incorporated into IBM's production JVM


End


Analyzing the Performance of Lower Tracing Rate

Throughput hit rate
  Relative to MS STW GC
Java utilization
  Relative to MS STW GC
Live rate
  Relative to heap size
  Floating garbage
    Marked objects that become unreachable before the STW phase
    More objects to trace. Less free space (more GCs)
Card cleaning rate
  Relative to total number of cards
  More work. Longer final STW phase
[Figure: SPECjbb characteristics, ratio (%) vs. tracing rate (1, 2, 4, 8); series: Throughput hit rate, Mutator Utilization, Live Rate, Cards Cleaning rate]


Results - Throughput Improvement

SPECjbb. 6-way PPC. Heap size 448 MB
26.7% improvement in tracing rate 1
[Figure: Base Collector, thousands of TPM vs. warehouses (1 to 12); series: Tr 1, Tr 8]
[Figure: Improved Collector, thousands of TPM vs. warehouses (1 to 12); series: Tr 1, Tr 8]


Results - Floating Garbage Reduction

SPECjbb. 6-way PPC. Heap size 448 MB
13.4% improvement in tracing rate 1
[Figure: Base Collector, live set size (MB) vs. warehouses (1 to 12); series: Tr 1, Tr 8]
[Figure: Improved Collector, live set size (MB) vs. warehouses (1 to 12); series: Tr 1, Tr 8]


Results - Average Pause Time

SPECjbb. 6-way PPC. Heap size 448 MB
[Figure: Base Collector, average pause (milliseconds) vs. warehouses (1 to 12); series: Tr 1, Tr 8]
[Figure: Improved Collector, average pause (milliseconds) vs. warehouses (1 to 12); series: Tr 1, Tr 8]


Throughput Improvement for All Tracing Rates

[Figure: throughput change (%) vs. warehouses (1 to 12); series: Tr. rate 1, Tr. rate 2, Tr. rate 4, Tr. rate 8]


Heap Residency Reduction for All Tracing Rates

[Figure: heap residency change (%) vs. warehouses (1 to 12); series: Tr. rate 1, Tr. rate 2, Tr. rate 4, Tr. rate 8]


Reduced floating garbage

Potential floating garbage root: a reachable object that de-references its sub-graph and thus makes it unreachable
  To become a floating garbage root, it must first be traced and then have a write barrier
We believe that a freshly dirtied card is a good indication of more write barriers
Deferring the tracing into a dirty card will defer the tracing to after the write barriers
[Figure: a dirty card containing a floating garbage root and the floating garbage it references]
