OOPSLA 2003
Mostly Concurrent Garbage Collection Revisited
Katherine Barabash - IBM Haifa Research Lab. Israel
Yoav Ossia - IBM Haifa Research Lab. Israel
Erez Petrank - Technion. Israel
IBM Labs in Haifa
Outline
The mostly concurrent garbage collection (GC)
Internals of the collector
  Write barrier and card table
  Incremental collection
Our two improvements
  And their implications on performance
Results
Conclusions
Mark Sweep Stop-The-World (STW) Garbage Collection
The basic method
  Mark all objects that are reachable from the roots
  Sweep - reclaim all unmarked objects
  Done while Java mutation is suspended (STW)
Pause time - the length of the STW phase
Motivation for the mostly concurrent GC
  Reduce the pause time at an acceptable throughput hit
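The basic method can be sketched in a few lines. This is a minimal illustration, not the collector's code: the heap is modelled as a set of object names and `refs` as a dictionary from an object to the objects it references, both of which are assumptions made for the example.

```python
def mark(roots, refs):
    """Return the set of objects reachable from the roots."""
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)                  # mark
        stack.extend(refs.get(obj, []))  # trace the object's references
    return marked


def sweep(heap, marked):
    """Reclaim all unmarked objects; return the surviving objects."""
    return {obj for obj in heap if obj in marked}


# Toy heap: "c" and "d" reference each other but are unreachable from the root.
heap = {"a", "b", "c", "d"}
refs = {"a": ["b"], "b": [], "c": ["d"], "d": ["c"]}
live = sweep(heap, mark(["a"], refs))
```

In a STW collector both functions run while the mutators are suspended, so the pause time grows with the amount of marking work.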
Mostly Concurrent GC - The Basic Method
Perform marking concurrently with Java mutation
  Traditionally done by a separate thread
While concurrent marking is active, record changes in objects
  Otherwise…
When marking terminates, do a short STW phase
  Re-trace from
    Roots
    Marked objects that were not traced yet
    Marked and changed objects
  Sweep
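The following sketch shows one way the "otherwise" case can go wrong, and how re-tracing from marked and changed objects repairs it. The object names, the `changed` set standing in for the write barrier, and the interleaving are all hypothetical; the example is single-threaded and only simulates concurrency by ordering the steps by hand.

```python
def scan(obj, refs, marked, stack):
    """Mark the unmarked children of obj and queue them for scanning."""
    for child in refs[obj]:
        if child not in marked:
            marked.add(child)
            stack.append(child)


def drain(stack, refs, marked):
    """Scan queued objects until no work is left."""
    while stack:
        scan(stack.pop(), refs, marked, stack)


refs = {"A": [], "D": ["C"], "C": []}
marked = {"A", "D"}   # both roots are marked when marking starts
changed = set()       # objects recorded by the write barrier

scan("A", refs, marked, [])   # concurrent tracer scans A first

# Mutator runs: the only reference to C moves from D to A.
refs["A"] = ["C"]
changed.add("A")
refs["D"] = []
changed.add("D")

drain(["D"], refs, marked)    # tracer finishes; it never sees C
missed_concurrently = "C" not in marked

# Short STW phase: re-trace from marked and changed objects.
drain([o for o in sorted(changed) if o in marked], refs, marked)
```

Without the recorded changes, the live object `C` would remain unmarked and would be reclaimed by the sweep.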
Mostly Concurrent GC - Perspective
Related Work
  Steele and Dijkstra et al. - Concurrent GC. 1976
  Baker - Incremental collection. 1977
  Boehm et al. - Mostly concurrent collection. 1991
  Printezis and Detlefs - Mostly concurrent and generational GC. 2000
  Many others…
  Ossia et al. - Parallel, incremental and concurrent GC. 2002
Status
  Collector well accepted both in academic research and in industry
  Used in many production JVMs: IBM, Sun, BEA JRockit
The Write Barrier and Object “cleaning” interaction
Tracer: marks and traces
Java mutator: modifies the blue and green objects (write barrier on objects)
Tracer: traces the rest of the graph
Tracer: cleans the blue object
Mostly Concurrent GC – Card Cleaning
Heap is logically divided into cards
A card table is used, with a byte entry for each heap card
A card-marking write barrier
  Whenever a reference field is modified, dirty the card table entry of the modified object
Card cleaning
  Clear the dirty mark
  Retrace all marked objects on the card
Card cleaning can be done concurrently
  While this is done, more cards will be dirtied
  An additional STW card cleaning phase must be done
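A card table and its write barrier might look roughly like the sketch below. The 512-byte card size, the class and method names, and the use of plain integers as object addresses are assumptions for illustration; the slides do not give these details.

```python
CARD_SIZE = 512  # bytes per card; an assumed value, not taken from the slides


class CardTable:
    def __init__(self, heap_bytes):
        n_cards = (heap_bytes + CARD_SIZE - 1) // CARD_SIZE
        self.cards = bytearray(n_cards)   # one byte entry per heap card

    def card_index(self, address):
        return address // CARD_SIZE

    def write_barrier(self, object_address):
        """Called whenever a reference field of the object is modified."""
        self.cards[self.card_index(object_address)] = 1

    def dirty_cards(self):
        return [i for i, flag in enumerate(self.cards) if flag]

    def clean(self, index, marked_objects, retrace):
        """Clear the dirty mark, then retrace all marked objects on the card."""
        self.cards[index] = 0
        for address in marked_objects:
            if self.card_index(address) == index:
                retrace(address)


table = CardTable(4096)
table.write_barrier(1030)          # the object at address 1030 lives on card 2
dirty_before = table.dirty_cards()
retraced = []
for i in dirty_before:
    table.clean(i, marked_objects=[100, 1030, 1500], retrace=retraced.append)
```

Note that the object at address 1500 is retraced as well, because it shares card 2 with the modified object: dirtiness is tracked per card, not per object.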
Incremental Mostly Concurrent GC
Marking is done by the Java mutator threads
  When allocating
Tracing Rate (TR) - configurable by the user
  The ratio between the required tracing work and the requested allocation size
  For every allocation request of K bytes, trace K * TR bytes of objects
  The allocation rate of the application and the tracing rate of the collector determine the CPU percentage dedicated to the concurrent collection
Starting the concurrent collector
  Must be done in time, so that tracing completes by the time the heap is exhausted
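The arithmetic behind the tracing rate can be made concrete. The sketch below is a simplified estimate under stated assumptions: the amount of data to trace (200 MB here) is a hypothetical figure, rates are constant, and card cleaning work is ignored. The function names are illustrative.

```python
def tracing_quota(alloc_bytes, tracing_rate):
    """Bytes of objects a mutator traces for an allocation of alloc_bytes."""
    return alloc_bytes * tracing_rate


def kickoff_free_space(bytes_to_trace, tracing_rate):
    """Free space at which tracing must start to finish before exhaustion.

    Tracing bytes_to_trace takes bytes_to_trace / tracing_rate bytes of
    allocation, so at least that much free space must remain at kickoff.
    """
    return bytes_to_trace / tracing_rate


MB = 2**20
quota = tracing_quota(1024, 8)                 # K = 1024 bytes, TR = 8
kickoff_tr8 = kickoff_free_space(200 * MB, 8)  # 25 MB of free space suffices
kickoff_tr1 = kickoff_free_space(200 * MB, 1)  # 200 MB of free space needed
```

Under this model a lower tracing rate forces an earlier kickoff, which lengthens the concurrent cycle.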
Concurrent Behavior Patterns
Higher tracing rate implies a shorter concurrent cycle with a smaller CPU share for Java
Numbers below refer to SPECjbb
  STW GC: 100% CPU for Java mutation
  Mostly concurrent, tracing rate 8: 28% CPU for Java mutation
  Mostly concurrent, tracing rate 1: 72% CPU for Java mutation
[Figure: CPU utilization over time, showing Java mutation, incremental tracing and parallel STW]
Mostly concurrent GC – Summary of the Base Algorithm
Fast card-marking write barrier, always active (JITed)
Kick off concurrent tracing when free space reaches the kickoff point
  Reset the card table
  Trace (incrementally) all objects reachable from roots
  Do a single concurrent card cleaning pass on the card table
Initiate the final (short) STW phase
  Trace the roots again for new objects
  Do another card cleaning pass
  Trace all newly marked objects
  Sweep
The Repetitive Work Problem
Observations
  Suppose a card is dirtied while concurrent marking is executing
  Newly reached objects in this card are marked and traced by the collector
  All these objects will later be traced again in the card cleaning phase
  Outcome: repeated tracing
Improvement: Don’t trace through dirty cards
Don’t Trace Through Dirty Cards
If an object resides in a dirty card, omit its tracing
  Only a single tracing will be done, in the concurrent card cleaning phase
Advantages
  Less marking work
  Reduced floating garbage
    More later…
  Reduced cache miss rate
    More later…
  Thus, substantial throughput improvements
Disadvantage
  Increased pause time
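The saving in marking work can be illustrated by counting object scans with and without the improvement. This is a toy model under assumptions: objects are integer addresses, the card size is 512 bytes, the set of dirty cards is fixed, and no mutation happens during the run, so it shows the repeated work but not the pause-time cost.

```python
CARD_SIZE = 512  # assumed card size in bytes


def on_dirty(addr, dirty):
    return addr // CARD_SIZE in dirty


def trace(stack, refs, marked, dirty, skip_dirty):
    """Mark and scan from the addresses in stack; return objects scanned."""
    scanned = 0
    while stack:
        addr = stack.pop()
        if addr in marked:
            continue
        marked.add(addr)
        if skip_dirty and on_dirty(addr, dirty):
            continue  # marked only; card cleaning will scan it once
        scanned += 1
        stack.extend(refs.get(addr, []))
    return scanned


def clean_cards(refs, marked, dirty):
    """Rescan marked objects on dirty cards, then trace what they reach."""
    pending = [a for a in sorted(marked) if on_dirty(a, dirty)]
    stack = [child for a in pending for child in refs.get(a, [])]
    return len(pending) + trace(stack, refs, marked, set(), False)


def total_work(refs, roots, dirty, skip_dirty):
    marked = set()
    scanned = trace(list(roots), refs, marked, dirty, skip_dirty)
    scanned += clean_cards(refs, marked, dirty)
    return marked, scanned


# A chain 0 -> 600 -> 700 -> 2000; addresses 600 and 700 sit on dirty card 1.
refs = {0: [600], 600: [700], 700: [2000], 2000: []}
base_marked, base_scans = total_work(refs, [0], {1}, skip_dirty=False)
new_marked, new_scans = total_work(refs, [0], {1}, skip_dirty=True)
```

Both runs mark the same four objects, but the base method scans the two objects on the dirty card twice (6 scans in total), while the improved method scans every object once (4 scans).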
[Figure: timelines of Java mutation, concurrent tracing, concurrent card cleaning, STW tracing and STW card cleaning, for the base method and for “Don’t trace dirties”]
Timing of Card Dirtying
Observations
  Card (and object) dirtying indicates that previous tracing may have been insufficient
    New objects may be reachable only from the dirtied object
  Dirtying information is needed only if tracing was already done
    Prior to tracing, card dirtying is irrelevant
Improvement: Undirty cards with no traced objects
  Undirtying via scanning
  Undirtying via local allocation caches
Undirtying via Scanning
Undirtying can be done periodically on the whole card table
An indication of “traced” cards is needed
  Mark bit vector
  A “traced” card table (first traced object marks the card as traced)
Method: scan the card table and undirty all cards with no traced objects
  Very effective in undirtying cards (cuts cleaning by 65%)
  Some extra cost of the card table scan
  Should be done frequently
    Catch these cards before any marking or tracing occurs!
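The scan itself is simple. The sketch below assumes the “traced” card table variant, with two parallel byte arrays (`dirty` and `traced`, names chosen for this example) and no concurrent tracers running during the scan; a real collector would have to synchronize the scan with tracing.

```python
def undirty_by_scanning(dirty, traced):
    """Clear the dirty flag of every card that holds no traced objects.

    dirty and traced are parallel bytearrays with one byte per card.
    Returns the number of cards that were undirtied.
    """
    cleared = 0
    for i in range(len(dirty)):
        if dirty[i] and not traced[i]:
            dirty[i] = 0
            cleared += 1
    return cleared


dirty = bytearray([1, 1, 0, 1])
traced = bytearray([0, 1, 0, 0])
cleared = undirty_by_scanning(dirty, traced)
```

Card 1 stays dirty because it already contains a traced object, so its dirtying information is still needed.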
Undirtying via Local Allocation Caches
Local allocation caches are used by most modern JVMs
Most cards of active caches are dirty
  Objects usually have write barriers while (and shortly after) initializing
Objects in the active cache are (usually) not traced before the cache is replaced
If no tracing in the active cache is guaranteed, we can undirty its cards
Method: cooperation between allocators and concurrent tracers
  Allocator (when replacing a local cache):
    Undirty all the cache’s inner cards
    Mark all cards as “traceable”
    Take a new cache and mark all its cards as “untraceable”
  Concurrent tracer
    Defer tracing of objects in “untraceable” cards “for a while”
    BTW, this hardly ever happens
Cuts the amount of dirty cards by more than 35%, at no cost
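The allocator and tracer cooperation could be sketched as follows. Several details are assumptions rather than statements from the slides: the 512-byte card size, the reading of “inner cards” as cards lying entirely inside the cache, the function names, and the single-threaded setting.

```python
CARD_SIZE = 512  # assumed card size in bytes


class Cards:
    def __init__(self, n_cards):
        self.dirty = bytearray(n_cards)
        self.untraceable = bytearray(n_cards)


def cards_touched(start, end):
    """Indices of all cards overlapping the address range [start, end)."""
    return range(start // CARD_SIZE, (end - 1) // CARD_SIZE + 1)


def inner_cards(start, end):
    """Indices of cards lying entirely inside [start, end)."""
    first = -(-start // CARD_SIZE)   # ceiling division
    return range(first, end // CARD_SIZE)


def install_cache(cards, start, end):
    """Allocator takes a new cache: its cards become untraceable."""
    for i in cards_touched(start, end):
        cards.untraceable[i] = 1


def retire_cache(cards, start, end):
    """Allocator replaces a cache: undirty inner cards, allow tracing."""
    for i in inner_cards(start, end):
        cards.dirty[i] = 0
    for i in cards_touched(start, end):
        cards.untraceable[i] = 0


def may_trace(cards, addr):
    """Tracer check: objects in untraceable cards are deferred."""
    return not cards.untraceable[addr // CARD_SIZE]


cards = Cards(8)
install_cache(cards, 1000, 3000)
for addr in (1000, 1100, 2100, 2900):   # initializing writes dirty the cards
    cards.dirty[addr // CARD_SIZE] = 1
deferred = not may_trace(cards, 2100)
retire_cache(cards, 1000, 3000)
```

After the cache is retired, only the two boundary cards (1 and 5) remain dirty, since they may be shared with objects outside the cache.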
Undirty Cards with No Traced Objects
Advantages
  Without “Don’t trace through dirty cards”: less work
  With it, reduces the STW card cleaning significantly
[Figure: timelines of Java mutation, concurrent tracing, concurrent card cleaning, STW tracing and STW card cleaning, for the base method, “Don’t trace dirties”, and “Don’t trace dirties + Undirty”]
Characteristics of Dirty Cards
We believe that a recently dirtied card is a good indication of more modifications of objects in the near future
  Change of references
  Other writing activities
  The indication applies to all the objects in the card
A recently dirtied card is probably hot and active
  But we don’t trace through dirty cards!
  By the time we get to clean them, they will probably have become more stable and colder
Reduced Floating Garbage
Floating garbage is created when tracing is done before the object modification
Don’t trace through dirty cards!
  Will probably defer the tracing until the card gets stable
  Objects are no longer modified
  No floating garbage will be created as a result of this late tracing
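The effect of tracing order on floating garbage can be shown on a three-object graph. The graph, the object names and the single mutation are hypothetical, and the example is sequential: it only contrasts tracing before the write with tracing after it.

```python
def reachable(root, refs):
    """Return the set of objects reachable from root."""
    seen, stack = set(), [root]
    while stack:
        obj = stack.pop()
        if obj not in seen:
            seen.add(obj)
            stack.extend(refs[obj])
    return seen


def floating_garbage(trace_before_write):
    """Return the marked objects that are unreachable after the write."""
    refs = {"r": ["x"], "x": ["y"], "y": []}
    marked = set()
    if trace_before_write:
        marked = reachable("r", refs)   # early tracing marks x and y
    refs["r"] = []                      # mutator drops the reference to x
    if not trace_before_write:
        marked = reachable("r", refs)   # deferred tracing sees the final graph
    return marked - reachable("r", refs)


early = floating_garbage(trace_before_write=True)
deferred = floating_garbage(trace_before_write=False)
```

Tracing first leaves `x` and `y` marked although they are dead, so they survive the cycle; deferring the tracing until after the write leaves nothing floating.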
Reduced Cache Miss Rate
Reducing the tracing work affects the cache miss rate
  As tracing the object graph intensifies cache capacity misses
But cache coherency misses are also reduced
  A write-barriered card (hot and active) is probably being modified by Java mutators
  If a concurrent tracer scans objects on such a card, it will suffer coherency misses
  Don’t trace through dirty cards!
  Deferring the tracing of these objects to the card cleaning phase reduces cache coherency misses
Our improved collector reduces the L2 cache miss rate by 6.4%
  Out of which 3.7% is a reduction in cache coherency misses
Implementation and Tests
Implementation
  On top of the mostly concurrent collector that is part of the IBM production JVM 1.4.0
Platforms
  Tested on both an IBM 6-way pSeries server and an IBM 4-way Netfinity server
Benchmarks
  The SPECjbb2000 benchmark and the SPECjvm98 benchmark suite
Measurements
  Performance of the base collector vs. the improved version
  The effect of each improvement separately, and more…
Results - Throughput Improvement
SPECjbb. 6-way PPC. Heap size 448 MB
26.7% improvement (in tracing rate 1)
[Figure: Throughput. Thousands of TPM vs. warehouses (1 to 12), Base and Improved]
[Figure: Throughput Improvement. Change (%) vs. warehouses (1 to 12)]
Results - Floating Garbage Reduction
SPECjbb. 6-way PPC. Heap size 448 MB
13.4% improvement in heap residency (in tracing rate 1)
Almost all floating garbage eliminated
[Figure: Heap Residency. MB vs. warehouses (1 to 12), Base, Improved and STW]
[Figure: Heap Residency Reduction. Change (%) vs. warehouses (1 to 12), Improved and Optimal]
Results - Pause Time Reduction
SPECjbb. 6-way PPC. Heap size 448 MB
33.3% improvement in average pause
36.4% improvement in max pause
[Figure: Average Pause Time. Milliseconds vs. warehouses (1 to 12), Base and Improved]
[Figure: Average Pause Time Reduction. Change (%) vs. warehouses (1 to 12)]
[Figure: Max Pause Time. Milliseconds vs. warehouses (1 to 12), Base and Improved]
[Figure: Max Pause Time Reduction. Change (%) vs. warehouses (1 to 12)]
Conclusions
Introducing two improvements to the mostly concurrent GC
  Reduces repetitive GC work (don’t trace through dirty cards)
  Reduces the number of dirty cards (undirty cards with no traced objects)
Substantial improvement of the mostly concurrent GC
  Improved throughput by 26%
  Almost eliminated floating garbage (heap residency reduced by 13%)
  Reduced average pause time by 33%
Additional effects of not tracing into dirty cards
  Reduced floating garbage
  Reduced cache miss rate
The improved algorithm has been incorporated into IBM's production JVM
End
Analyzing the Performance of Lower Tracing Rate
Throughput hit rate
  Relative to MS STW GC
Java utilization
  Relative to MS STW GC
Live rate
  Relative to heap size
  Floating garbage
    Marked objects that become unreachable before the STW phase
    More objects to trace. Less free space (more GCs)
Card cleaning rate
  Relative to total number of cards
  More work. Longer final STW phase
[Figure: SPECjbb characteristics. Ratio (%) vs. tracing rate (1, 2, 4, 8) for throughput hit rate, mutator utilization, live rate and card cleaning rate]
Results - Throughput Improvement
SPECjbb. 6-way PPC. Heap size 448 MB
26.7% improvement in tracing rate 1
[Figure: Base Collector. Thousands of TPM vs. warehouses (1 to 12), Tr 1 and Tr 8]
[Figure: Improved Collector. Thousands of TPM vs. warehouses (1 to 12), Tr 1 and Tr 8]
Results - Floating Garbage Reduction
SPECjbb. 6-way PPC. Heap size 448 MB
13.4% improvement in tracing rate 1
[Figure: Base Collector. Live set size (MB) vs. warehouses (1 to 12), Tr 1 and Tr 8]
[Figure: Improved Collector. Live set size (MB) vs. warehouses (1 to 12), Tr 1 and Tr 8]
Results - Average Pause Time
SPECjbb. 6-way PPC. Heap size 448 MB
[Figure: Base Collector. Milliseconds vs. warehouses (1 to 12), Tr 1 and Tr 8]
[Figure: Improved Collector. Milliseconds vs. warehouses (1 to 12), Tr 1 and Tr 8]
Throughput Improvement for All Tracing Rates
[Figure: Change (%) vs. warehouses (1 to 12), for tracing rates 1, 2, 4 and 8]
Heap Residency Reduction for All Tracing Rates
[Figure: Change (%) vs. warehouses (1 to 12), for tracing rates 1, 2, 4 and 8]
Reduced floating garbage
Potential floating garbage root: a reachable object that de-references its sub-graph and thus makes it unreachable
  To become a floating garbage root, it must first be traced and then have a write barrier
We believe that a freshly dirtied card is a good indication of more write barriers
Deferring the tracing into a dirty card will defer the tracing until after the write barriers
[Figure labels: Dirty card, Floating Garbage Root, Floating Garbage]