On-the-Fly Garbage Collection Using Sliding Views
Erez Petrank, Technion – Israel Institute of Technology
Joint work with Yossi Levanoni, Hezi Azatchi, and Harel Paz
Erez Petrank GC via Sliding Views 2
Garbage Collection
The user allocates space dynamically; the garbage collector automatically frees the space when it is “no longer needed”.
Usually “no longer needed” = unreachable by a path of pointers from program local references (roots).
The programmer does not have to decide when to free an object. (No memory leaks, no dereferencing of freed objects.)
Built into Java, C#.
Garbage Collection
Two Classic Approaches
Reference counting [Collins 1960]: keep a reference count for each object, reclaim objects with count 0.
Tracing [McCarthy 1960]: trace reachable objects, reclaim objects not traced.
Traditional Wisdom
Good Problematic
What (was) Bad about RC ?
Does not reclaim cycles
A heavy overhead on pointer modifications.
Traditional belief: “Cannot be used efficiently with parallel processing”
[Figure: two objects, A and B]
What’s Good about RC?
Reference counting work is proportional to work on creations and modifications.
Can tracing deal with tomorrow’s huge heaps?
Reference counting has good locality.
The Challenge:
RC overhead on pointer modification seems too expensive.
RC seems impossible to “parallelize”.
Garbage Collection Today
Today’s advanced environments: multiprocessors + large memories
Dealing with multiprocessors
Single-threaded stop the world
Concurrent collection
Parallel collection
Terminology (stop the world, parallel, concurrent, …)
Stop-the-World
Parallel (STW)
Concurrent
On-the-Fly
[Figure legend: program, GC]
Benefits & Costs
[Figure: informal pause times (200 ms, 2 ms, 20 ms) and throughput loss (10-20%) across the four collector types below]
Stop-the-World
Parallel (STW)
Concurrent
On-the-Fly
This Talk
Introduction: RC and tracing, coping with SMPs.
RC introduction and parallelization problem.
Main focus: a novel concurrent reference counting algorithm (suitable for Java).
Concurrent made on-the-fly based on “sliding views”.
Extensions: cycle collection, mark and sweep, generations, age-oriented.
Implementation and measurements on Jikes.
Extremely short pauses, good throughput.
Basic Reference Counting
Each object has an RC field; new objects get o.rc := 1.
When p that points to o1 is modified to point to o2, execute: o2.rc++, o1.rc--.
Then, if o1.rc == 0:
  Delete o1.
  Decrement o.rc for all children o of o1.
  Recursively delete objects whose rc is decremented to 0.
[Figure: pointer p and objects o1 and o2]
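The steps on this slide can be sketched in C as follows. This is only an illustrative sketch, not the authors' implementation: the names `Object`, `rc_new`, `rc_release`, `rc_update`, the fixed fan-out `MAX_CHILDREN`, and the `live_objects` counter are all assumptions introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_CHILDREN 2   /* hypothetical fixed fan-out, for illustration only */

typedef struct Object {
    int rc;                               /* reference count */
    struct Object *child[MAX_CHILDREN];   /* outgoing pointers */
} Object;

int live_objects = 0;   /* bookkeeping so the effect is observable */

/* Allocate an object; as on the slide, a new object starts with rc = 1. */
Object *rc_new(void) {
    Object *o = calloc(1, sizeof *o);
    assert(o != NULL);
    o->rc = 1;
    live_objects++;
    return o;
}

/* Decrement o's count; at zero, release its children recursively and free it. */
void rc_release(Object *o) {
    if (o == NULL) return;
    assert(o->rc > 0);
    if (--o->rc == 0) {
        for (int i = 0; i < MAX_CHILDREN; i++)
            rc_release(o->child[i]);
        free(o);
        live_objects--;
    }
}

/* Write barrier: make *p point to o2 instead of its old target o1. */
void rc_update(Object **p, Object *o2) {
    Object *o1 = *p;
    if (o2 != NULL) o2->rc++;   /* increment first, so storing the same target is safe */
    *p = o2;
    rc_release(o1);             /* may trigger recursive deletion */
}
```

Note that every pointer store pays for two counter updates here, which is the overhead the following slides set out to reduce.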
An Important Term:
A write barrier is a piece of code executed with each pointer update.
“p ← o2” implies:
  Read p; (see o1)
  p ← o2;
  o2.rc++; o1.rc--;
[Figure: pointer p and objects o1 and o2]
Deferred Reference Counting
Problem: overhead on updating program variables (locals) is too high.
Solution [Deutsch & Bobrow 76]:
  Don’t update rc for local variables (roots).
  “Once in a while”: collect all objects with o.rc = 0 that are not referenced from local variables.
Deferred RC reduces overhead by 80%. Used in most modern RC systems.
Still, the “heap” write barrier is too costly.
Multithreaded RC?
Traditional wisdom: the write barrier must be synchronized!
Problem 1: ref-count updates must be atomic.
Fortunately, this can be easily solved: each thread logs required updates in a local buffer and the collector applies all the updates during GC (as a single thread).
Problem 2: parallel updates confuse counters:
  Thread 1: Read A.next; (see B) A.next ← C; B.rc--; C.rc++
  Thread 2: Read A.next; (see B) A.next ← D; B.rc--; D.rc++
[Figure: object A with its next pointer, and objects B, C, and D]
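To make Problem 2 concrete, the sketch below replays the unlucky interleaving deterministically in a single thread: both barriers read the slot before either writes it. The types and function names (`Node`, `BarrierState`, `barrier_read`, `barrier_write`, `replay_race`) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    int rc;
    struct Node *next;
} Node;

/* One thread's unsynchronized write barrier, split into its two steps. */
typedef struct {
    Node *seen_old;   /* slot value observed in the read step */
} BarrierState;

void barrier_read(BarrierState *t, Node **slot) {
    t->seen_old = *slot;
}

void barrier_write(BarrierState *t, Node **slot, Node *new_target) {
    assert(t->seen_old != NULL && new_target != NULL);
    *slot = new_target;
    t->seen_old->rc--;    /* decrement what this thread believes it overwrote */
    new_target->rc++;
}

/* Replay: both threads read A.next, then both write. Returns the final A.next. */
Node *replay_race(Node *a, Node *c, Node *d) {
    BarrierState t1, t2;
    barrier_read(&t1, &a->next);       /* thread 1 sees B */
    barrier_read(&t2, &a->next);       /* thread 2 also sees B */
    barrier_write(&t1, &a->next, c);   /* A.next = C */
    barrier_write(&t2, &a->next, d);   /* A.next = D, overwriting C */
    return a->next;
}
```

Starting from B.rc = 1 and C.rc = D.rc = 0, this interleaving leaves B decremented twice (B.rc = -1) and C with a count of 1 although no pointer refers to it, which is the counter confusion the slide describes.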
Known Multithreaded RC
[DeTreville 1990, Bacon et al 2001]:
  Compare & swap for each pointer modification.
  Thread records its updates in a buffer.
To Summarize Problems…
Write barrier overhead is high, even with deferred RC.
Using RC with multithreading seems to bear a high synchronization cost: a lock or “compare & swap” with each pointer update.
Reducing RC Overhead:
We start by looking at the “parent’s point of view”.
We are counting rc for the child, but rc changes when a parent’s pointer is modified.
[Figure: a Parent object pointing to a Child object]
An Observation
Consider a pointer p that takes the following values between GCs: O0, O1, O2, …, On.
All RC algorithms perform 2n operations: O0.rc--; O1.rc++; O1.rc--; O2.rc++; O2.rc--; … ; On.rc++;
But only two operations are needed: O0.rc-- and On.rc++.
[Figure: pointer p pointing in turn to O0, O1, O2, O3, O4, …, On]
Use of Observation
Only the first modification of each pointer is logged.
Timeline:
  Garbage Collection
  p ← O1; (record p’s previous value O0)
  p ← O2; (do nothing)
  …
  p ← On; (do nothing)
  Garbage Collection: for each modified slot p:
    Read p to get On, read records to get O0.
    O0.rc--, On.rc++
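A minimal sketch of this idea for a single slot, assuming a per-slot dirty flag and log (the deck actually logs per object). `Obj`, `Slot`, `slot_store`, and `slot_collect` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Obj { int rc; } Obj;

/* A pointer slot plus the hypothetical per-slot log kept between collections. */
typedef struct {
    Obj *value;        /* current target */
    Obj *logged_old;   /* value before the first modification since the last GC */
    bool dirty;        /* set once the first modification has been logged */
} Slot;

/* Mutator side: only the first store since the last collection is logged. */
void slot_store(Slot *p, Obj *new_target) {
    if (!p->dirty) {
        p->logged_old = p->value;   /* record O0 */
        p->dirty = true;
    }
    p->value = new_target;          /* later stores do nothing extra */
}

/* Collector side: apply the net updates; returns how many rc operations ran. */
int slot_collect(Slot *p) {
    int ops = 0;
    if (!p->dirty) return ops;
    if (p->logged_old != NULL) { p->logged_old->rc--; ops++; }   /* O0.rc-- */
    if (p->value != NULL)      { p->value->rc++;      ops++; }   /* On.rc++ */
    p->dirty = false;
    p->logged_old = NULL;
    return ops;
}
```

However many stores hit the slot between two collections, `slot_collect` performs at most two counter updates, and the intermediate targets O1 … On-1 are never touched.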
Some Technical Remarks
When a pointer is first modified, it is marked “dirty” and its previous value is logged.
We actually log each object that gets modified (and not just a single pointer).
  Reason 1: we don’t want a dirty bit per pointer.
  Reason 2: an object’s pointers tend to be modified together.
Only non-null pointer fields are logged.
New objects are “born dirty”.
Effects of Optimization
• RC work significantly reduced:
• The number of logging & counter updates is reduced by a factor of 100-1000 for typical Java benchmarks!
Elimination of RC Updates
Benchmark    No. of stores   No. of “first” stores   Ratio of “first” stores
jbb          71,011,357      264,115                 1/269
Compress     64,905          51                      1/1273
Db           33,124,780      30,696                  1/1079
Jack         135,174,775     1,546                   1/87435
Javac        22,042,028      535,296                 1/41
Jess         26,258,107      27,333                  1/961
Mpegaudio    5,517,795       51                      1/108192
Effects of Optimization (continued)
• Write barrier overhead dramatically reduced.
• The vast majority of the write barriers run a single “if”.
• Last but not least: the task has changed! We need to record the first update.
Reducing Synch. Overhead
Our second contribution: a carefully designed write barrier (and an observation) does not require any sync. operation.
The write barrier

    void Update(Object **slot, Object *new) {
        Object *old = *slot;
        if (!IsDirty(slot)) {
            log(slot, old);     /* record the slot and its previous value */
            SetDirty(slot);
        }
        *slot = new;
    }

Observation:
If two threads:
  1. invoke the write barrier in parallel, and
  2. both log an old value,
then both record the same old value.
Running Write Barrier Concurrently

Thread 1:

    void Update(Object **slot, Object *new) {
        Object *old = *slot;
        if (!IsDirty(slot)) {
            /* If we got here, Thread 2 has not yet set the dirty bit, */
            /* and thus has not yet modified the slot.                 */
            log(slot, old);
            SetDirty(slot);
        }
        *slot = new;
    }

Thread 2:

    void Update(Object **slot, Object *new) {
        Object *old = *slot;
        if (!IsDirty(slot)) {
            /* If we got here, Thread 1 has not yet set the dirty bit, */
            /* and thus has not yet modified the slot.                 */
            log(slot, old);
            SetDirty(slot);
        }
        *slot = new;
    }
Concurrent Algorithm:
Use write barrier with program threads.
To collect:
  Stop all threads.
  Scan roots (local variables).
  Get the buffers with modified slots.
  Clear all dirty bits.
  Resume threads.
  For each modified slot:
    decrement rc for old value (written in buffer),
    increment rc for current value (“read heap”).
  Reclaim non-local objects with rc 0.
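The collection steps above can be sketched as a single-threaded simulation. This is an assumption-laden illustration, not the authors' collector: there are no real threads, each object has a single pointer field, the heap is a small fixed array, and all names (`Obj`, `LogEntry`, `gc_alloc`, `gc_store`, `gc_collect`) are hypothetical. In particular, because nothing runs concurrently here, reading the "current value" from the heap is always safe, which is not true of the real algorithm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEAP_SIZE 8
#define LOG_SIZE  32

typedef struct Obj {
    int rc;            /* counts heap references only (roots are deferred) */
    bool dirty;        /* set once the object has been logged since the last GC */
    bool allocated;
    struct Obj *next;  /* a single pointer field keeps the sketch small */
} Obj;

typedef struct { Obj *obj; Obj *old; } LogEntry;

Obj heap[HEAP_SIZE];
LogEntry log_buf[LOG_SIZE];
int log_len = 0;

/* New objects are "born dirty": logged with no previous value. */
Obj *gc_alloc(void) {
    for (int i = 0; i < HEAP_SIZE; i++) {
        if (!heap[i].allocated) {
            heap[i] = (Obj){ .rc = 0, .dirty = true, .allocated = true, .next = NULL };
            assert(log_len < LOG_SIZE);
            log_buf[log_len++] = (LogEntry){ &heap[i], NULL };
            return &heap[i];
        }
    }
    return NULL;   /* heap full */
}

/* Write barrier: log the old value on the first store only. */
void gc_store(Obj *o, Obj *new_target) {
    if (!o->dirty) {
        assert(log_len < LOG_SIZE);
        log_buf[log_len++] = (LogEntry){ o, o->next };
        o->dirty = true;
    }
    o->next = new_target;
}

static bool is_root(const Obj *o, Obj **roots, int nroots) {
    for (int i = 0; i < nroots; i++)
        if (roots[i] == o) return true;
    return false;
}

/* Free o if it is dead, then follow children whose count drops to 0. */
static void reclaim(Obj *o, Obj **roots, int nroots) {
    while (o != NULL && o->allocated && o->rc == 0 && !is_root(o, roots, nroots)) {
        Obj *child = o->next;
        o->allocated = false;
        o->next = NULL;
        if (child != NULL) child->rc--;
        o = child;
    }
}

/* One collection; returns the number of objects still allocated. */
int gc_collect(Obj **roots, int nroots) {
    LogEntry taken[LOG_SIZE];
    int n = log_len;
    for (int i = 0; i < n; i++) {          /* "threads stopped": take buffer, clear bits */
        taken[i] = log_buf[i];
        taken[i].obj->dirty = false;
    }
    log_len = 0;
    for (int i = 0; i < n; i++) {          /* "threads resumed": adjust counts */
        if (taken[i].old != NULL) taken[i].old->rc--;
        if (taken[i].obj->next != NULL) taken[i].obj->next->rc++;
    }
    for (int i = 0; i < HEAP_SIZE; i++)    /* reclaim non-root objects with rc 0 */
        reclaim(&heap[i], roots, nroots);
    int live = 0;
    for (int i = 0; i < HEAP_SIZE; i++)
        if (heap[i].allocated) live++;
    return live;
}
```

The roots are passed in explicitly to stand in for the root scan; an object with rc 0 survives a collection only if it appears in that array.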
Timeline
Stop threads.
Scan roots; get buffers; erase dirty bits.
Resume threads.
Decrement values in read buffers.
Increment “current” values.
Collect dead objects.
Unmodified current values are in the heap. Modified are in new buffers.
Goal 1: clear dirty bits during program run.
Goal 2: stop one thread at a time.
The Sliding Views “Framework”
Develop a concurrent algorithm.
  There is a short time in which all the threads are stopped simultaneously to perform some task.
Avoid stopping the threads together. Instead, stop one thread at a time.
Tricky part: “fix” the problems created by this modification.
Idea borrowed from the Distributed Computing community [Lamport].
Graphically
[Figure: two plots of heap address against time. “A Snapshot” is marked at a single time t; “A Sliding View” is marked with times t1 and t2.]
Fixing Correctness
The way to do this in our algorithm is to use snooping:
  While collecting the roots, record objects that get a new pointer.
  Do not reclaim these objects.
No details…
Cycles Collection
Our initial solution: use a tracing algorithm infrequently.
More about this tracing collector and about cycle collectors later…
Performance Measurements
Implementation for Java on the Jikes Research JVM.
Compared collectors:
  Jikes parallel stop-the-world (STW)
  Jikes concurrent RC (Jikes concurrent)
Benchmarks:
  SPECjbb2000: a server benchmark; simulates business-like transactions.
  SPECjvm98: a client benchmark; a suite of mostly single-threaded benchmarks.
Pause Times vs. STW
            jess    db      javac   mpeg    jack   mtrt2  jbb-1  jbb-2   jbb-3
LevPet      1.3     0.67    1.68    0.59    0.97   0.89   0.8    0.61    1.06
Jikes STW   260.67  188.33  643.33  205.67  225    376    322    416.67  511.33
Pause Times vs. Jikes Concurrent
                   jess   db     javac  mpeg   jack   mtrt2  jbb-1  jbb-2  jbb-3
Jikes Concurrent   2.77   1.84   2.81   0.8    1.66   1.8    1.79   2.6    3.15
LevPet             1.3    0.67   1.68   0.59   0.97   0.89   0.8    0.61   1.06
SPECjbb2000 Throughput
[Figure: SPECjbb2000, LevPet vs. Jikes Concurrent. Throughput ratio (axis 0.8 to 2) against heap sizes 256 to 704, one series each for jbb1 through jbb8.]
SPECjvm98 Throughput
[Figure: SPECjvm98, Jikes concurrent / LevPet. Ratio (axis 0.8 to 1.6) against x-axis values 24 to 96, one series each for jess, db, javac, mpeg, jack, and mtrt.]
SPECjbb2000 Throughput
[Figure: LP / parallel tracing. Ratio (axis 0.6 to 1.1) against x-axis values 256 to 704, eight series.]
A Glimpse into Subsequent Work: SPECjbb2000 Throughput
[Figure: Tracing / RC. Ratio (axis 0.5 to 1.1) against x-axis values 256 to 704, eight series.]
Subsequent Work
Cycle Collection [CC’05]
A Mark and Sweep Collector [OOPSLA’03]
A Generational Collector [CC’03]
An Age-Oriented Collector [CC’05]
Related Work
It’s not clear where to start…
RC, concurrent, generational, etc…
Some more relevant work was mentioned.
Conclusions
A study of concurrent garbage collection with a focus on RC.
Novel techniques obtaining short pauses, high efficiency.
The best approach: age-oriented collection with concurrent RC for old and concurrent tracing for young.
Implementation and measurements on Jikes demonstrate non-obtrusiveness and high efficiency.
Project Building Blocks
A novel reference counting algorithm.
State-of-the-art cycle collection.
Generational: RC (for old) and tracing (for young).
A concurrent tracing collector.
An age-oriented collector: fitting generations with concurrent collectors.