Hoard: A Scalable Memory Allocator for Multithreaded Applications
Transcript of Hoard: A Scalable Memory Allocator for Multithreaded Applications
Hoard: A Scalable Memory Allocator for Multithreaded Applications
Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson
Presented by Ivan Jibaja
(Some slides adapted from Emery Berger’s presentation)
1
Outline
• Motivation
• Problems in allocator design
  – False sharing
  – Fragmentation
• Existing approaches
• Hoard design
• Experimental evaluation
2
Motivation
• Parallel multithreaded programs are prevalent
  – Web servers, search engines, DB managers, etc.
  – Run on CMP/SMP for high performance
• Memory allocation is a bottleneck
  – Prevents scaling with the number of processors
3
Desired allocator attributes on a multiprocessor system
• Speed
  – Competitive with uniprocessor allocators on 1 CPU
• Scalability
  – Performance linear in the number of processors
• Fragmentation (= max allocated / max in use)
  – High fragmentation → poor data locality → paging
• False sharing avoidance
4
The problem of false sharing
• Programs cause false sharing
  – Allocate a number of objects in a cache line, then pass the objects to different threads
• Allocators cause false sharing!
  – Actively: malloc satisfies requests from different threads out of the same cache line
  – Passively: free allows a future malloc to produce false sharing

[Figure: processor 1 runs x1 = malloc(s) and processor 2 runs x2 = malloc(s); both objects land on a single cache line → thrash… thrash…]
5
The problem of fragmentation
• Blowup:
  – Increase in memory consumption when the allocator reclaims memory freed by the program but fails to use it for future requests
  – Mainly a problem of concurrent allocators
  – Unbounded (worst case) or bounded (O(P))
6
Example: Pure Private Heaps Allocator
• Pure private heaps: one heap per processor
  – malloc gets memory from the processor's heap or the system
  – free puts memory on the processor's heap
• Avoids heap contention
• Examples: STL, Cilk
[Figure: malloc/free timelines for processors 1 and 2 (x1–x4); the legend distinguishes blocks allocated by heap 1 from blocks sitting free on heap 2.]
7
How to Break Pure Private Heaps: Fragmentation
• Pure private heaps: memory consumption can grow without bound!
• Producer-consumer:
  – processor 1 allocates
  – processor 2 frees
  – memory is always unavailable to the producer
[Figure: processor 1 repeatedly allocates (x1, x2, x3, …) while processor 2 frees each block; the freed memory accumulates on processor 2's heap.]
8
Example II: Private Heaps with Ownership
• free puts memory back on the originating processor's heap
• Avoids unbounded memory consumption
• Examples: ptmalloc, LKmalloc
[Figure: malloc/free timelines for processors 1 and 2; each free returns the block to the heap that originally allocated it.]
9
How to Break Private Heaps with Ownership: Fragmentation
• Memory consumption can blow up by a factor of P
• Round-robin producer-consumer: processor i allocates, processor i+1 frees
• The program only ever needs K blocks (here 1), but the allocator ends up holding P·K blocks (here 3)
[Figure: processors 1–3 each allocate one block (x1, x2, x3); each block is freed by the next processor and returned to, and stranded on, its owner's heap.]
10
Existing approaches
11
Uniprocessor Allocators on Multiprocessors
• Fragmentation: Excellent
  – Very low for most programs [Wilson & Johnstone]
• Speed & Scalability: Poor
  – Heap contention: a single lock protects the heap
• Can exacerbate false sharing
  – Different processors can share cache lines
12
Existing Multiprocessor Allocators
• Speed:
  – One concurrent heap (e.g., a concurrent B-tree): O(log(#size-classes)) cost per memory operation; too many locks/atomic updates
  – Fast allocators therefore use multiple heaps
• Scalability:
  – Allocator-induced false sharing
  – Other bottlenecks (e.g., the nextHeap global in Ptmalloc)
• Fragmentation:
  – P-fold increase or even unbounded
13
Hoard as the solution
14
Hoard Overview
• P per-processor heaps & 1 global heap
• Each thread accesses only its local heap & the global heap
• Manages memory in page-sized superblocks of same-sized objects (LIFO free list)
  – Avoids false sharing by not carving up cache lines
  – Avoids heap contention – local heaps allocate & free small blocks from their superblocks
• Avoids blowup by moving superblocks to the global heap when the fraction of free memory exceeds some threshold
15
Superblock management
• Emptiness threshold: (ui ≥ (1−f)·ai) ∨ (ui ≥ ai − K·S), with f = ¼ and K = 0
• Multiple heaps → avoids actively induced false sharing
• Block coalescing → avoids passively induced false sharing
• Superblocks transferred are usually (nearly) empty, and transfers are infrequent
16
Hoard pseudo-code

malloc(sz)
1. If sz > S/2, allocate the superblock from the OS and return it.
2. i ← hash(current thread)
3. Lock heap i
4. Scan heap i’s list of superblocks from most full to least full (for the size class of sz)
5. If there is no superblock with free space {
6.   Check heap 0 (the global heap) for a superblock
7.   If there is none {
8.     Allocate S bytes as superblock s & set owner to heap i
9.   } Else {
10.    Transfer the superblock s to heap i
11.    u0 ← u0 − s.u;  ui ← ui + s.u
12.    a0 ← a0 − S;  ai ← ai + S
13.  }
14. }
15. ui ← ui + sz;  s.u ← s.u + sz
16. Unlock heap i
17. Return a block from the superblock

free(ptr)
1. If the block is “large”,
2.   free the superblock to the OS and return
3. Find the superblock s this block comes from
4. Lock s
5. Lock heap i, the superblock’s owner
6. Deallocate the block from the superblock
7. ui ← ui − block size
8. s.u ← s.u − block size
9. If (i = 0), unlock heap i and superblock s, and return
10. If (ui < ai − K·S) and (ui < (1−f)·ai) {
11.   Transfer a mostly-empty superblock s1 to heap 0 (the global heap)
12.   u0 ← u0 + s1.u;  ui ← ui − s1.u
13.   a0 ← a0 + S;  ai ← ai − S
14. }
15. Unlock heap i and superblock s
17
Heap contention
• Per-processor heap contention
  – One thread allocates, multiple threads free: inherently unscalable
  – Pairs of producer/consumer threads: malloc/free calls serialized; at most a 2X slowdown (undesirable but scalable)
  – Empirically, only a small fraction of memory is freed by another thread → contention expected to be low
18
Heap contention (2)
• Global heap contention
  – Measure the number of global-heap lock acquisitions as an upper bound
  – Growing phase: each thread makes at most k/(f·S/s) acquisitions for k malloc's
  – Shrinking phase: the pathological case is a program that frees (1−f) of each superblock, then frees every block in a superblock one at a time
  – Empirically: no excessive shrinking, and gradual growth of memory usage → low overall contention
19
Experimental Evaluation
• Dedicated 14-processor Sun Enterprise
  – 400 MHz UltraSPARC
  – 2 GB RAM, 4 MB L2 cache
  – Solaris 7
  – Superblock size S = 8K, f = ¼
• Comparison between
  – Hoard
  – Ptmalloc (GNU libC, multiple heaps with ownership)
  – Mtmalloc (Solaris multithreaded allocator)
  – Solaris (default system allocator)
20
Benchmarks
21
Speed
22
Size classes need to be handled more cleverly
Scalability - threadtest
23
• t threads allocate/deallocate 100,000/t 8-byte objects
• 278% faster than Ptmalloc on 14 cpus
Scalability – Larson
24
• “Bleeding” (memory allocated by one thread and freed by another) is typical in server applications
• Mainly stays within the empty fraction during execution
• 18X faster than the next best allocator on 14 cpus
Scalability - BEMengine
25
• Few times below the empty fraction → low synchronization
False sharing behavior
26
• Active-false: each thread allocates a small object, writes it a few times, then frees it
• Passive-false: allocate objects, hand them to threads that free them, then emulate Active-false
• Both illustrate the effects of contention in the cache-coherence mechanism
Fragmentation results
27
• A large number of size classes remain live for the duration of the program, scattered across blocks
• Within 20% of Lea’s allocator
Hoard Conclusions
• Speed: Excellent
  – As fast as a uniprocessor allocator on one processor
  – Amortized O(1) cost; 1 lock for malloc, 2 for free
• Scalability: Excellent
  – Scales linearly with the number of processors
  – Avoids false sharing
• Fragmentation: Very good
  – Worst case is provably close to ideal
  – Actual observed fragmentation is low
28
Discussion Points
• If we had to re-evaluate Hoard today, which benchmarks would we use?
• Are there any changes needed to make it work with languages like Java?
29