Parallel GC (Chapter 14)
Eleanor Ainy, December 16th 2014
Outline of Today’s Talk
How to use parallelism in each of the four components of tracing GC:
• Marking
• Copying
• Sweeping
• Compaction
So far …
Multiple mutator threads, but only one collector thread.
Poor use of resources!
The assumption remains: no mutators run in parallel with the collector!
Introduction
Parallel vs. Non-Parallel Collection
(Figure: timelines of mutator work and collection cycles 1 and 2, with and without parallel collection.)
The Goal
To reduce:
• the time overhead of garbage collection
• pause times, in the case of stop-the-world collection
Parallel GC Challenges
• Ensure there is sufficient work to be done; otherwise parallelism is not worth it!
• Load balancing: distribute work and other resources in a way that minimizes the coordination needed.
• Synchronization: needed both for correctness and to avoid repeating work.
More on Load Balancing
Static Partitioning
• Some processors will probably have more work to do than others.
• Some processors will exhaust their resources before others do.
Dynamic Load Balancing
• Sometimes it is possible to obtain a good estimate, in advance, of the amount of work to be done.
• More often it is not. Solution:
(1) Over-partition the work into more tasks than there are threads.
(2) Have each thread compete to claim one task at a time to execute.
Advantages:
(1) More resilient to changes in the number of processors available.
(2) If one task takes longer to execute, the other threads can execute any further work.
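The claim-one-task-at-a-time scheme can be sketched in Python. This is a minimal sketch, not part of any real collector: the shared queue, the thread count, and the callable tasks are illustrative assumptions.

```python
import queue
import threading

def run_overpartitioned(tasks, n_threads):
    """Over-partition the work into many small tasks; each thread
    competes to claim one task at a time from a shared queue."""
    work = queue.Queue()
    for t in tasks:
        work.put(t)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = work.get_nowait()   # claim exactly one task
            except queue.Empty:
                return                     # no work left: this thread is done
            r = task()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because idle threads immediately claim the next available task, a slow task delays only the thread running it, which is exactly the resilience advantage listed above.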
Why not divide the work into the smallest possible independent tasks?
Because the coordination cost would be too expensive! Synchronization guarantees correctness and avoids unnecessary work, but it has time and space overheads.
Algorithms therefore try to minimize the synchronization needed, for instance by using thread-local data structures.
Processor-Centric vs. Memory-Centric
Processor-centric algorithms:
• threads acquire work that varies in size
• threads steal work from other threads
• pay little regard to the location of objects
Memory-centric algorithms:
• take location into greater account
• operate on contiguous blocks of heap memory
• acquire/release work from/to shared pools of fixed-size work buffers
Algorithms’ Abstraction
Assumption: each collector thread executes the following loop (in most cases):

    while not terminated()
        acquireWork()
        performWork()
        generateWork()
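As a minimal illustration, this abstract loop can be phrased in Python, with a plain list standing in for the work pool (an assumption of the sketch; the algorithms in this talk each refine all three steps):

```python
def collector_thread(worklist):
    """Run the abstract collector loop: acquire a task, perform it,
    and feed any work it generates back into the pool."""
    while worklist:                  # not terminated()
        task = worklist.pop()        # acquireWork()
        generated = task()           # performWork()
        worklist.extend(generated)   # generateWork()
```

Here a "task" is any callable that returns a list of follow-on tasks; marking, for example, turns each popped object into tasks for its unmarked children.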
Outline of Today’s Talk
How to use parallelism in each of the four components of tracing GC:
• Marking
• Copying
• Sweeping
• Compaction
Marking consists of …
1) Acquiring an object from a work list
2) Testing and setting marks
3) Generating further marking work, by adding the object's children to the work list
Parallel Marking
Important Note
All known parallel marking algorithms are processor-centric!
When Is Synchronization Required?
No synchronization: if the work list is thread-local.
Example: when an object's mark is represented by a bit in its own header.
Synchronization needed: otherwise, the thread must acquire work atomically from some other thread's work list or from some global list.
Example: when marks are stored in a shared bitmap.
Endo et al [1997] Parallel Mark Sweep Algorithm
N: the total number of threads. Each marker thread has its own:
• local mark stack
• stealable work queue

    shared stealableWorkQueue[N]
    me ← myThreadId

    acquireWork():
        if not isEmpty(myMarkStack)
            return
        stealFromMyself()
        if isEmpty(myMarkStack)
            stealFromOthers()
An idle thread acquires work by first examining its own queue, and then other threads' queues.

    stealFromMyself():
        lock(stealableWorkQueue[me])
        n ← size(stealableWorkQueue[me]) / 2
        transfer(stealableWorkQueue[me], n, myMarkStack)
        unlock(stealableWorkQueue[me])
    stealFromOthers():
        for each j in Threads
            if not locked(stealableWorkQueue[j])
                if lock(stealableWorkQueue[j])
                    n ← size(stealableWorkQueue[j]) / 2
                    transfer(stealableWorkQueue[j], n, myMarkStack)
                    unlock(stealableWorkQueue[j])
                    return
    performWork():
        while pop(myMarkStack, ref)
            for each fld in Pointers(ref)
                child ← *fld
                if child ≠ null && not isMarked(child)
                    setMarked(child)
                    push(myMarkStack, child)
(Figure: threads A and B, each with a mark stack and a stealable queue, both reaching the same child object C1.)
Notice: it is possible for threads to mark the same child object.
Each thread checks its own stealable queue; if it is empty, the thread transfers its whole mark stack (apart from local roots) to the queue.

    generateWork():
        if isEmpty(stealableWorkQueue[me])
            n ← size(myMarkStack)
            lock(stealableWorkQueue[me])
            transfer(myMarkStack, n, stealableWorkQueue[me])
            unlock(stealableWorkQueue[me])
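Putting the pieces together, here is a single-threaded Python sketch of these per-thread structures. It is only a sketch: the heap is modelled as a dict from object to children, the `budget` parameter is an assumption added so that work actually flows through the stealable queue, and stealFromOthers is omitted since there is just one thread.

```python
import threading

class EndoMarker:
    """Single-thread sketch of Endo et al's structures: a private
    mark stack plus a lockable stealable work queue."""

    def __init__(self, graph, roots):
        self.graph = graph                   # obj -> list of children
        self.marked = set(roots)
        self.mark_stack = list(roots)
        self.stealable = []                  # stealableWorkQueue[me]
        self.lock = threading.Lock()         # guards the stealable queue

    def acquire_work(self):
        # If the stack is non-empty, return; otherwise pull about
        # half of the stealable queue back (stealFromMyself).
        if self.mark_stack:
            return
        with self.lock:
            n = (len(self.stealable) + 1) // 2   # at least one item
            for _ in range(n):
                self.mark_stack.append(self.stealable.pop())

    def perform_work(self, budget=2):
        # Pop an object, mark its unmarked children, push them.
        # Bounded so generate_work gets a chance to export work.
        for _ in range(budget):
            if not self.mark_stack:
                return
            ref = self.mark_stack.pop()
            for child in self.graph[ref]:
                if child not in self.marked:
                    self.marked.add(child)
                    self.mark_stack.append(child)

    def generate_work(self):
        # If the stealable queue is empty, move the stack onto it.
        with self.lock:
            if not self.stealable:
                self.stealable.extend(self.mark_stack)
                del self.mark_stack[:]

    def mark_all(self):
        while self.mark_stack or self.stealable:
            self.acquire_work()
            self.perform_work()
            self.generate_work()
        return self.marked
```

Running it on a small diamond-shaped graph marks every reachable object exactly as the slides describe, with work shuttling between the stack and the stealable queue.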
Parallel Marking With a Bitmap
The collector tests the bit and, only if it is not already set, attempts to set it atomically, retrying if the set fails.

    setMarked(ref):
        bitPosition ← markBit(ref)
        loop
            oldByte ← markByte(ref)
            if isMarked(oldByte, bitPosition)
                return
            newByte ← mark(oldByte, bitPosition)
            if CompareAndSet(&markByte(ref), oldByte, newByte)
                return

    CompareAndSet(x, old, new):
        atomic
            curr ← *x
            if curr = old
                *x ← new
                return true
            return false
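A Python sketch of CAS-based bitmap marking. Python exposes no hardware CAS, so compare-and-set is emulated with a lock here (an assumption of the sketch), but the test-then-retry loop mirrors setMarked.

```python
import threading

class MarkBitmap:
    """Mark bits packed into a bytearray, one bit per object slot."""

    def __init__(self, n_objects):
        self.bits = bytearray((n_objects + 7) // 8)
        self._lock = threading.Lock()       # stands in for hardware CAS

    def _compare_and_set(self, idx, old, new):
        with self._lock:
            if self.bits[idx] == old:
                self.bits[idx] = new
                return True
            return False

    def set_marked(self, slot):
        """Return True iff this caller won the race to set the bit."""
        idx, bit = slot // 8, 1 << (slot % 8)
        while True:
            old = self.bits[idx]
            if old & bit:
                return False                # already marked by someone
            if self._compare_and_set(idx, old, old | bit):
                return True                 # we set it atomically
            # CAS failed: another thread changed the byte; retry.

    def is_marked(self, slot):
        return bool(self.bits[slot // 8] & (1 << (slot % 8)))
```

The boolean result makes the race visible: of two threads racing on the same slot, exactly one sees True, matching the "test, then attempt atomically, retry on failure" discipline of the slide.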
Termination Detection (reminder from the previous lecture):
• A separate thread for termination detection.
• Symmetric detection: every thread can play the role of the detector.
    shared jobs[N] ← initial work assignments
    shared busy[N] ← [true, …]
    shared jobsMoved ← false
    shared allDone ← false
    me ← myThreadId
    worker():
        loop
            while not isEmpty(jobs[me])
                job ← dequeue(jobs[me])
                perform job
            if another thread j exists whose jobs set appears relatively large
                some ← stealJobs(j)
                enqueue(jobs[me], some)
                continue
            busy[me] ← false
            while no thread has jobs to steal && not allDone
                /* do nothing: wait for work or termination */
            if allDone
                return
            busy[me] ← true
    stealJobs(j):
        some ← atomicallyRemoveJobs(jobs[j])
        if not isEmpty(some)
            jobsMoved ← true
        return some
    detect():
        anyActive ← true
        while anyActive
            anyActive ← (∃i) busy[i]
            anyActive ← anyActive || jobsMoved
            jobsMoved ← false
        allDone ← true
Running Example
(Figure: threads A and B, each with a mark stack and an empty stealable queue.)
Initially the queues are empty. acquireWork: if the stack is non-empty, it simply returns.
Running Example
performWork pops an object, marks its children, and pushes them.
(Figure: thread B's stack and queue as objects O1 to O4 are popped and pushed.)
Running Example
generateWork moves all the objects from the stack to the queue!
(Figure: objects O2 and O3 move from thread B's stack to its stealable queue.)
Running Example
acquireWork: if the stack is empty, it moves half of the queue to the stack.
(Figure: entries move from thread B's queue back onto its stack.)
Running Example
acquireWork: if the thread's own queue is also empty, it steals from other queues. This continues until there is no more work left (which the detector will detect!).
(Figure: thread B steals from thread A's queue.)
Flood et al [2001] Parallel Mark Sweep Algorithm
N: the total number of threads.
• Each thread has its own stealable deque (double-ended queue).
• The deques are of fixed size, to avoid allocation during collection; this is what can cause overflow.
• All threads share a global overflow set, implemented as a list of lists.

    shared overflowSet
    shared deque[N]
    me ← myThreadId

    acquireWork():
        if not isEmpty(deque[me])
            return
        n ← dequeFixedSize / 2
        if extractFromOverflowSet(n)
            return
        stealFromOthers()
• Each Java class structure holds the head of a list of overflow objects of that type, linked through the class-pointer field in their headers.
• An object's type field can be restored when it is removed from the overflow set (stop-the-world collection makes it safe to reuse the type field in this way).
Idle threads acquire work by first trying to fill half of their deque from the overflow set, before stealing from other deques.

    extractFromOverflowSet(n):
        return transfer(overflowSet, n, deque[me])    /* true if anything was transferred */
Idle threads steal work from the top of other threads' deques using remove.

    stealFromOthers():
        for each j in Threads
            ref ← remove(deque[j])
            if ref ≠ null
                push(deque[me], ref)
                return

remove requires synchronization!
    performWork():
        loop
            ref ← pop(deque[me])
            if ref = null
                return
            for each fld in Pointers(ref)
                child ← *fld
                if child ≠ null && not isMarked(child)
                    setMarked(child)
                    if not push(deque[me], child)      /* deque full */
                        n ← size(deque[me]) / 2
                        transfer(deque[me], n, overflowSet)

pop requires synchronization only to claim the last element of the deque.
push does not require synchronization.
Work is generated inside performWork, by pushing to the deque or transferring to the overflow set.

    generateWork():
        /* nop */
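The fixed-size deque with spill-to-overflow can be sketched in Python. This is a sketch under stated assumptions: the overflow set is a plain list rather than class-linked object lists, and the real steal end would need synchronization.

```python
from collections import deque

class MarkDeque:
    """Fixed-size work deque in the style of Flood et al: the owner
    pushes and pops at one end; thieves remove from the other. When
    a push would overflow, half of the deque is moved to a shared
    overflow set."""

    def __init__(self, capacity, overflow_set):
        self.d = deque()
        self.capacity = capacity
        self.overflow = overflow_set

    def push(self, ref):
        if len(self.d) == self.capacity:
            # Spill the oldest half to the shared overflow set.
            for _ in range(self.capacity // 2):
                self.overflow.append(self.d.popleft())
        self.d.append(ref)

    def pop(self):
        # Owner's end; in the real algorithm this is mostly
        # synchronization-free.
        return self.d.pop() if self.d else None

    def steal(self):
        # Thieves' end; requires synchronization in reality.
        return self.d.popleft() if self.d else None
```

With capacity 4, pushing a fifth item spills the two oldest entries to the overflow set, so owner pops stay cheap while stale work migrates to where idle threads look for it.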
Termination Detection
• A variation of the symmetric detection we saw in the previous lecture.
• A status word holds one bit per thread (active/inactive).
Running Example
(Figure: threads A and B, each with a non-empty deque.)
Initially the deques are non-empty. acquireWork: if the deque is non-empty, it simply returns.
Running Example
performWork: pop an object, mark it, and push its children.
(Figure: objects O1 to O7 being popped and pushed on the deques.)
Running Example
performWork: if a push causes overflow, half of the deque is copied to the overflow set.
(Figure: thread B's deque overflows; objects are moved to the overflow set, linked through their class structures A and B.)
Running Example
performWork: the overflow set in this case:
(Figure: the class A and class B structures each head a list of the overflowed objects of that type.)
Running Example
acquireWork: if the deque is empty, the thread takes work from the overflow set. If that fails, it removes work from other threads' deques.
(Figure: object O9 is removed from thread A's deque into thread B's.)
Mark Stacks With Work Stealing: Disadvantages
• The technique is best employed when the number of threads is known in advance.
• It may be difficult for a thread:
  • to choose the best queue from which to steal
  • to detect termination
Wu and Li [2007] Parallel Tracing With Channels
• Threads exchange marking tasks through single-writer, single-reader channels.
• In a system of N threads, each thread has an array of N-1 queues.
• Notation: the input channel from thread i to thread j is written i → j. This is also an output channel of thread i.

    shared channel[N, N]
    me ← myThreadId
If the thread's stack is empty, it takes a task from some input channel k → me.

    acquireWork():
        if not isEmpty(myMarkStack)
            return
        for each k in Threads
            if not isEmpty(channel[k, me])
                ref ← remove(channel[k, me])
                push(myMarkStack, ref)
                return
Threads first try to add new tasks (children to be marked) to other threads' input channels (their own output channels).

    performWork():
        loop
            if isEmpty(myMarkStack)
                return
            ref ← pop(myMarkStack)
            for each fld in Pointers(ref)
                child ← *fld
                if child ≠ null && not isMarked(child)
                    if not generateWork(child)
                        push(myMarkStack, child)
• When a thread generates a new task, it first checks whether any other thread k needs work.
• If so, it adds the task to the output channel me → k.
• Otherwise, it pushes the task onto its own stack.

    generateWork(ref):
        for each k in Threads
            if needsWork(k) && not isFull(channel[me, k])
                add(channel[me, k], ref)
                return true
        return false
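The channel matrix can be sketched in Python. The `needs_work` flags are an assumption standing in for the real "does thread k need work?" test; everything else follows the channel[i, j] shape of the pseudocode.

```python
from collections import deque

class ChannelTracer:
    """Sketch of the Wu and Li channel matrix: channel[i][j] carries
    tasks from thread i to thread j. Each channel has exactly one
    writer and one reader, so no atomic operations are needed on it."""

    def __init__(self, n_threads, channel_size=2):
        self.n = n_threads
        self.size = channel_size     # small channels scaled best
        self.channel = [[deque() for _ in range(n_threads)]
                        for _ in range(n_threads)]
        self.needs_work = [False] * n_threads

    def generate_work(self, me, ref):
        # Offer the new task to some thread that needs work.
        for k in range(self.n):
            if (k != me and self.needs_work[k]
                    and len(self.channel[me][k]) < self.size):
                self.channel[me][k].append(ref)
                return True
        return False                 # caller keeps ref on its own stack

    def acquire_work(self, me):
        # Take a task from some input channel k -> me.
        for k in range(self.n):
            if self.channel[k][me]:
                return self.channel[k][me].popleft()
        return None
```

Because each deque is written by exactly one thread and read by exactly one other, neither operation needs a compare-and-swap, which is the technique's main selling point.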
Advantages:
• No expensive atomic operations!
• Performs better on servers with many processors.
• Keeps all threads busy.

On a machine with 16 Intel Xeon processors, channel queues of size one or two were found to scale best.
Outline of Today’s Talk
How to use parallelism in each of the four components of tracing GC:
• Marking
• Copying
• Sweeping
• Compaction
Copying Is Different From Marking …
It is essential that an object be copied only once; by contrast, if an object is marked twice, the correctness of the program is usually unaffected.
Parallel Copying
Processor-Centric Techniques: Cheng and Blelloch [2001] Parallel Copying
Each copying thread is given its own stack, and transfers work between its local stack and a shared stack.
k: the size of a local stack

    shared sharedStack
    myCopyStack[k]
    sp ← 0    /* local stack pointer */
Using rooms, they allow multiple threads to:
• pop elements from the shared stack in parallel
• push elements to the shared stack in parallel
but never to pop and push in parallel!

    shared gate ← OPEN
    shared popClients     /* number of clients in the pop room */
    shared pushClients    /* number of clients in the push room */
    while not terminated()
        enterRoom()                    /* enter the pop room */
        for i ← 1 to k
            if isLocalStackEmpty()
                acquireWork()
            if isLocalStackEmpty()
                break
            performWork()
        transitionRooms()
        generateWork()
        if exitRoom()                  /* exit the push room */
            terminate()

    acquireWork():
        sharedPop()

    performWork():
        ref ← localPop()
        scan(ref)

    generateWork():
        sharedPush()

    isLocalStackEmpty():
        return sp = 0
    localPush(ref):
        myCopyStack[sp++] ← ref

    localPop():
        return myCopyStack[--sp]

(Figure: the local stack and its stack pointer sp during localPop and localPush.)
    sharedPop():
        cursor ← FetchAndAdd(&sharedStack, 1)
        if cursor ≥ stackLimit                 /* stack was empty: undo */
            FetchAndAdd(&sharedStack, -1)
        else
            myCopyStack[sp++] ← cursor[0]

    FetchAndAdd(x, v):
        atomic
            old ← *x
            *x ← old + v
            return old
    sharedPush():
        cursor ← FetchAndAdd(&sharedStack, -sp) - sp
        for i ← 0 to sp-1
            cursor[i] ← myCopyStack[i]
        sp ← 0
    enterRoom():
        while gate ≠ OPEN
            /* do nothing: wait */
        FetchAndAdd(&popClients, 1)
        while gate ≠ OPEN
            FetchAndAdd(&popClients, -1)     /* failure: return to previous state */
            while gate ≠ OPEN
                /* do nothing: wait */
            FetchAndAdd(&popClients, 1)      /* try again */
    transitionRooms():        /* move from the pop room to the push room */
        gate ← CLOSED         /* close the gate to the pop room */
        FetchAndAdd(&pushClients, 1)
        FetchAndAdd(&popClients, -1)
        while popClients > 0
            /* do nothing: wait until none are popping */
    exitRoom():
        pushers ← FetchAndAdd(&pushClients, -1) - 1
        if pushers = 0                /* last to leave the push room */
            gate ← OPEN
        if isEmpty(sharedStack)       /* no work left */
            return true
        else
            return false
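The room discipline itself can be sketched in Python. This sketch replaces the gate-and-counter spin loops with a condition variable (an assumption made for brevity); what it preserves is the invariant that poppers and pushers never occupy the shared stack at the same time.

```python
import threading

class Rooms:
    """Sketch of the two-room protocol: any number of threads may be
    in the pop room, or any number in the push room, but never both
    rooms at once."""

    def __init__(self):
        self.cv = threading.Condition()
        self.pop_clients = 0
        self.push_clients = 0

    def enter_pop_room(self):
        with self.cv:
            while self.push_clients > 0:      # gate closed while pushing
                self.cv.wait()
            self.pop_clients += 1

    def transition_rooms(self):
        # Leave the pop room, become a pusher, then wait for the pop
        # room to drain before any pushing may proceed.
        with self.cv:
            self.pop_clients -= 1
            self.push_clients += 1
            self.cv.notify_all()
            while self.pop_clients > 0:
                self.cv.wait()

    def exit_push_room(self):
        with self.cv:
            self.push_clients -= 1
            if self.push_clients == 0:        # last pusher reopens the gate
                self.cv.notify_all()
```

A single thread's collection round is `enter_pop_room()`, then `transition_rooms()`, then `exit_push_room()`, mirroring the main loop on the earlier slide.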
Problem: any processor waiting to enter the push room must wait until all processors in the pop room have finished their work!
Possible solution: do the work outside the rooms. This increases the likelihood that the pop room is empty, so threads will be able to enter the push room more quickly.
Memory-Centric Techniques: Block-Structured Heaps
• Divide the heap into small, fixed-size chunks.
• Each thread receives its own chunks to scan, and its own chunks into which to copy survivors.
• Once a thread's copy chunk is full, it is transferred to a global pool, where idle threads compete to scan it, and the thread obtains a new empty chunk for itself.
Mechanisms Used to Ensure Good Load Balancing:
• The chunks acquired were small (256 words).
• To avoid fragmentation, big-bag-of-pages allocation was used for small objects.
• Larger objects and chunks were allocated from the shared heap, using a lock.
• Load was also balanced at a finer granularity:
• each chunk was divided into smaller blocks (32 words).
• After scanning a slot, the thread checks whether it has reached a block boundary.
• If so, and the next object is smaller than a block:
  • the thread advances its scan pointer to the start of its current copy block;
  • this reduces contention: the thread does not have to compete to acquire a new scan block;
  • un-scanned blocks in that area are given to the global pool.
• If the object is larger than a block but smaller than a chunk, the scan pointer is advanced to the start of the current copy chunk.
• If the object is large, the thread simply continues to scan it.
Block States and Transitions:
(Figure: block state transition diagram.)
State Transition Logic:
(Figure: state transition logic.)
Outline of Today’s Talk
How to use parallelism in each of the four components of tracing GC:
• Marking
• Copying
• Sweeping
• Compaction
Parallel Sweeping
Simple Strategies

1) Statically partition the heap into contiguous blocks, one for each thread to sweep.
2) Over-partition the heap and have threads compete for blocks to sweep onto a free-list.

Problem: the shared free-list becomes a bottleneck!
Solution: give each processor its own free-list.
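The second strategy with per-processor free-lists can be sketched as follows; the heap layout, mark pattern, and the lock-based claim (standing in for an atomic fetch-and-add) are illustrative assumptions:

```python
import threading

BLOCKS, WORDS = 16, 8
# Hypothetical mark state: heap[b][w] is True if word w of block b is live.
heap = [[(b + w) % 3 == 0 for w in range(WORDS)] for b in range(BLOCKS)]

next_block = [0]                  # shared index that threads compete for
claim_lock = threading.Lock()     # stands in for an atomic fetch-and-add
global_free, merge_lock = [], threading.Lock()

def sweeper():
    local_free = []               # per-processor free-list: no contention
    while True:
        with claim_lock:          # compete for the next block to sweep
            b = next_block[0]
            next_block[0] += 1
        if b >= BLOCKS:
            break
        for w in range(WORDS):
            if not heap[b][w]:    # unmarked word: reclaim it locally
                local_free.append((b, w))
    with merge_lock:              # a single merge into the global list
        global_free.extend(local_free)

threads = [threading.Thread(target=sweeper) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```

Each thread touches the shared free-list only once, at the end, so the bottleneck disappears.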
Parallel Sweeping
Endo et al [1997]: Lazy Sweeping

• A naturally parallel solution to sweeping partially full blocks.
• In the sweep phase, empty blocks must be identified and returned to the block allocator, while keeping contention low.
• Endo et al gave each thread several consecutive blocks to process locally.
• They used bitmap marking, with the bitmaps held in block headers, to determine whether a block is empty.
• Empty blocks are added to a local free-block list; partially full blocks are added to a local reclaim list for subsequent lazy sweeping.
• Once a processor finishes with its sweep set, it merges its local list with the global free-block list.
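The classification step might look like the following sketch; the Block class, bitmap contents, and list names are made up for illustration:

```python
# Classifying blocks by the mark bitmap held in each block header,
# in the spirit of Endo et al's lazy sweeping. All data is hypothetical.
class Block:
    def __init__(self, bitmap):
        self.bitmap = bitmap          # one mark bit per granule

sweep_set = [Block([0] * 8),                      # wholly empty
             Block([0, 1, 0, 0, 1, 0, 0, 0]),     # partially full
             Block([0] * 8)]                      # wholly empty

free_blocks, reclaim_list = [], []    # this thread's local lists
for blk in sweep_set:                 # consecutive blocks, processed locally
    if any(blk.bitmap):
        reclaim_list.append(blk)      # sweep lazily, on demand, later
    else:
        free_blocks.append(blk)       # return to the block allocator

# When the sweep set is finished, free_blocks would be merged into the
# global free-block list under a single lock acquisition.
```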
Parallel Compaction
Flood et al [2001]: Parallel Mark-Compact

Observation: uniprocessor compaction algorithms typically slide all live data to one end of the heap space. If multiple threads do so in parallel, one thread can overwrite live data before another thread has moved it!

(Diagram: threads 1 and 2 each slide their objects toward the same end of the heap; one thread's copy clobbers data the other thread has not yet moved.)
Suggested Solution:
• Divide the heap space into several regions, one for each compacting thread.
• To reduce fragmentation, threads also alternate the direction in which they move objects in even- and odd-numbered regions.
The algorithm runs in 4 phases:
1) Parallel marking.
2) Calculating forwarding addresses.
3) Updating references.
4) Moving objects.
Phase 2 – Calculating Forwarding Addresses:
• Over-partition the space into M = 4N units of roughly the same size, where N is the number of threads.
• Threads compete to claim units, and each thread counts the volume of live data in its units.
• According to these volumes, the space is partitioned into N regions that contain approximately the same amount of live data.
• Threads then compete to claim units and install forwarding addresses for each live object in their units.

(Example: M = 12 units with live volumes 3, 6, 13, 7, 10, 5, 7, 5, 12, 4, 8, 9 are grouped into N = 3 regions holding roughly 30, 29 and 30 words of live data.)
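The partitioning step can be sketched with the example unit volumes from the slide; the exact greedy cut rule below is an assumption, since any rule producing roughly equal regions of contiguous units would do:

```python
# Partition M over-partitioned units into N contiguous regions with
# approximately equal live volumes (Flood et al, phase 2). The cut rule
# (close region k once the running total passes k/N of the total) is an
# illustrative choice.
unit_live = [3, 6, 13, 7, 10, 5, 7, 5, 12, 4, 8, 9]  # live data per unit
N = 3                                                # compacting threads
total = sum(unit_live)                               # 89 in this example

regions, current = [], []
cum, k = 0, 1
for i, vol in enumerate(unit_live):
    current.append(i)
    cum += vol
    if cum >= k * total / N and len(regions) < N - 1:
        regions.append(current)        # close region k at this unit boundary
        current, k = [], k + 1
regions.append(current)                # the last region takes what remains

region_volumes = [sum(unit_live[i] for i in r) for r in regions]
```

The regions stay contiguous (units are never reordered), which is what lets each thread later slide its region's objects independently.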
Phase 3 – Updating References:
• Updating references to point to objects' new locations requires scanning:
  • objects stored in mutator threads' stacks and in the young generation, which might contain references into the heap space being compacted;
  • live objects in the heap space itself (the old generation).
• Threads compete to claim old-generation units to scan, while a single thread scans the young generation.

Phase 4 – Moving Objects:
• Each thread is in charge of one region.
• Good load balancing is guaranteed because the regions contain roughly equal volumes of live data.
Disadvantages:
1) The algorithm makes 3 passes over the heap, while other compacting algorithms make fewer passes.
2) Rather than compacting all live data to one end of the heap, the algorithm compacts into N regions, leaving (N + 1)/2 gaps for allocation. If a large number of threads is used, it becomes difficult for mutators to allocate very large objects.
Parallel Compaction
Abuaiadh et al [2004]: Parallel Mark-Compact

1) Address the 3 passes problem:
• Calculate, rather than store, forwarding addresses, using the mark bitmap and an offset vector that holds the new address of the first live object in each block.
• Constructing the offset vector requires one pass over the mark-bit vector.
• Only a single pass over the heap is then needed to move objects and update references, using these two vectors.
1) Address the 3 passes problem:
• Bits in the mark-bit vector indicate the start and end of each live object.
• Each word in the offset vector holds the address to which the first live object in its corresponding block will be moved.
• Forwarding addresses are not stored but are calculated when needed from the offset and mark-bit vectors.
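A simplified sketch of calculating forwarding addresses on demand; it uses one mark bit per heap word (the real scheme marks the first and last words of each object), and the block size and mark pattern are made up:

```python
BLOCK = 8   # words per block (illustrative)

# Mark-bit vector over two blocks: 1 marks a live word.
marks = [0, 1, 1, 0, 0, 1, 0, 0,   1, 1, 1, 0, 0, 0, 0, 1]

# Offset vector: built in one pass over the mark-bit vector, it records
# the address to which the first live word of each block will move.
offsets, new_addr = [], 0
for b in range(len(marks) // BLOCK):
    offsets.append(new_addr)
    new_addr += sum(marks[b * BLOCK:(b + 1) * BLOCK])

def forward(old):
    """Calculate (rather than load) a live word's forwarding address:
    its block's offset plus the live words that precede it in the block."""
    b = old // BLOCK
    return offsets[b] + sum(marks[b * BLOCK:old])
```

No per-object forwarding pointer is ever written, which is what removes one of the heap passes.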
2) Address the small gaps problem:
• Over-partition the heap into fairly large areas.
• Threads race to claim the next area to compact, using an atomic operation to increment a global area index.
• If a thread succeeds, it has obtained an area to compact; if it fails, it tries to claim the next area.
2) Address the small gaps problem:
• A table holds a pointer to the beginning of the free space in each area.
• After winning an area to compact, a thread races to obtain an area into which it can move objects. It claims a target area by trying to write null into its corresponding table slot.
• Threads never try to compact from, or into, an area whose table entry is null.
• Objects are never moved from a lower-numbered to a higher-numbered area.
• Progress is guaranteed, since a thread can always compact an area into itself.
• Once a thread has finished with an area, it updates the area's free pointer. If an area is full, its free-space pointer remains null.
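Claiming a target area might be sketched like this; the pointer values are illustrative, and the lock stands in for the atomic compare-and-swap a real collector would use:

```python
import threading

# Hypothetical free-pointer table, one entry per area. A null (None)
# entry means the area is claimed or full and must not be compacted
# from or into.
free_ptr = [200, 1000, 1800]
table_lock = threading.Lock()   # stands in for an atomic compare-and-swap

def claim_target(area):
    """Try to claim `area` as a compaction target; return its free
    pointer on success, or None if it is already claimed (or full)."""
    with table_lock:
        ptr = free_ptr[area]
        if ptr is None:
            return None          # lost the race: try another area
        free_ptr[area] = None    # writing null marks the area as claimed
        return ptr

def release(area, new_free):
    """When finished, restore the area's updated free pointer
    (or leave it null if the area filled up completely)."""
    free_ptr[area] = new_free
```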
2) Address the small gaps problem:

(Diagram: a free-pointer table with entries 200, 1000 and 1800 for three areas. A thread claims an area as a compaction target by writing NULL into its table slot, slides objects A–E into it, and finally restores the area's updated free pointer, now 400.)
2) Address the small gaps problem:
Abuaiadh et al explored two ways in which objects can be moved:
a. Slide objects one at a time.
b. To reduce compaction time, slide only complete blocks (256 bytes); free space within each block is not squeezed out.
Discussion
• What is the tradeoff in the choice of chunk size in parallel copying?
• What issues can parallel copying with no synchronization cause? For example, if an object is copied twice by two different threads, what can be the consequences?
Something Extra
https://www.youtube.com/watch?v=YhKZe22tZlc
Conclusions & Summary
• There should be enough work to justify parallel collection.
• Synchronization costs need to be taken into account.
• Loads need to be balanced between the multiple threads.
• We saw different algorithms for marking, sweeping, copying and compaction that take all these challenges into account.
• A key difference between marking and copying: marking an object twice is not so bad, but copying an object twice can harm correctness.