Slide 1: Memory Management for High-Performance Applications
Emery Berger
University of Massachusetts, Amherst
Slide 2: High-Performance Applications
- Web servers, search engines, scientific codes
- Written in C or C++ (still…)
- Run on one server box or a cluster of them
[Diagram: several server boxes, each with multiple CPUs, RAM, and a RAID drive]
[Diagram: the software stack (compiler, runtime system, operating system) atop hardware]
- Needs support at every level
Slide 3: New Applications, Old Memory Managers
- Applications and hardware have changed: multiprocessors are now commonplace; applications are object-oriented and multithreaded
- Increased pressure on the memory manager (malloc, free)
- But memory managers have not kept up: inadequate support for modern applications
Slide 4: Current Memory Managers Limit Scalability
[Chart: speedup vs. number of processors (1 to 14); the ideal line climbs linearly while the actual line collapses]
- As we add processors, the program slows down
- Caused by heap contention
- Larson server benchmark on a 14-processor Sun
Slide 5: The Problem
- Current memory managers are inadequate for high-performance applications on modern architectures
- They limit scalability, application performance, and robustness
Slide 6: This Talk
- Building memory managers: the Heap Layers framework [PLDI 2001]
- Problems with current memory managers: contention, false sharing, space
- Solution: a provably scalable memory manager, Hoard [ASPLOS-IX]
- An extended memory manager for servers: Reap [OOPSLA 2002]
Slide 7: Implementing Memory Managers
- Memory managers must be space efficient and very fast
- Heavily optimized code: hand-unrolled loops, macros, monolithic functions
- Hard to write, reuse, or extend
Slide 8: Building Modular Memory Managers
- Classes: overhead, rigid hierarchy
- Mixins: no overhead, flexible hierarchy
Slide 9: A Heap Layer
A heap layer is a mixin with malloc and free methods:

template <class SuperHeap>
class GreenHeapLayer : public SuperHeap {…};

[Diagram: GreenHeapLayer stacked on RedHeapLayer]
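To make the idiom concrete, here is a minimal sketch of a complete layer in this style: an allocation-counting mixin composed over a malloc-based bottom heap. The names (MallocHeap, CountingHeapLayer) are illustrative, not classes from the Heap Layers distribution.

#include <cstddef>
#include <cstdlib>

// Bottom layer: obtains memory from the system allocator.
class MallocHeap {
public:
    void* malloc(std::size_t sz) { return std::malloc(sz); }
    void free(void* p)           { std::free(p); }
};

// A mixin layer: adds allocation counting to any superheap.
template <class SuperHeap>
class CountingHeapLayer : public SuperHeap {
public:
    void* malloc(std::size_t sz) {
        ++count_;                       // record the request
        return SuperHeap::malloc(sz);   // delegate to the superheap
    }
    std::size_t allocations() const { return count_; }
private:
    std::size_t count_ = 0;
};

// Layers compose by instantiating the template chain:
using CountingMallocHeap = CountingHeapLayer<MallocHeap>;

Because the superheap is a template parameter rather than a fixed base class, the same mixin can be layered over any heap, which is what makes the hierarchy flexible at no runtime cost.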
Slide 10: Example: A Thread-Safe Heap Layer
- LockedHeap protects its superheap with a lock
- LockedMallocHeap = LockedHeap layered over mallocHeap
[Diagram: LockedHeap stacked on mallocHeap]
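A hedged sketch of such a layer, assuming a std::mutex for the lock (the real Heap Layers LockedHeap may differ, e.g., in its choice of lock):

#include <cstddef>
#include <mutex>

// A locking mixin: serializes access to whatever superheap it wraps.
template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
    void* malloc(std::size_t sz) {
        std::lock_guard<std::mutex> guard(lock_);
        return SuperHeap::malloc(sz);
    }
    void free(void* p) {
        std::lock_guard<std::mutex> guard(lock_);
        SuperHeap::free(p);
    }
private:
    std::mutex lock_;
};

// Composition, as on the slide:
// using LockedMallocHeap = LockedHeap<MallocHeap>;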
Slide 11: Empirical Results
- Heap Layers vs. the originals: KingsleyHeap vs. the BSD allocator; LeaHeap vs. DLmalloc 2.7
- Competitive runtime and memory efficiency
[Chart: runtime normalized to the Lea allocator across cfrac, espresso, lindsay, LRUsim, perl, roboop, and their average, for Kingsley, KingsleyHeap, Lea, and LeaHeap]
[Chart: space normalized to the Lea allocator, same benchmarks and allocators]
Slide 12: Overview
- Building memory managers: the Heap Layers framework
- Problems with memory managers: contention, space, false sharing
- Solution: a provably scalable allocator, Hoard
- An extended memory manager for servers: Reap
Slide 13: Problems with General-Purpose Memory Managers
Previous work for multiprocessors:
- Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]: impractical
- Multiple heaps [Larson 98, Gloger 99]: reduce contention but, as we show, cause other problems: a P-fold or even unbounded increase in space, and allocator-induced false sharing
Slide 14: Multiple-Heap Allocator: Pure Private Heaps
One heap per processor:
- malloc gets memory from its local heap
- free puts memory on its local heap
- Used by STL, Cilk, and ad hoc allocators
[Diagram: an interleaved trace on processors 0 and 1 (x1 = malloc(1), x2 = malloc(1), free(x1), free(x2), x3 = malloc(1), x4 = malloc(1), free(x3), free(x4)); key: in use by processor 0 vs. free on heap 1]
Slide 15: Problem: Unbounded Memory Consumption
- Producer-consumer: processor 0 allocates, processor 1 frees
- With pure private heaps, the freed memory lands on processor 1's heap, where processor 0 can never reuse it
- Unbounded memory consumption: crash!
[Diagram: processor 0 calls x1 = malloc(1), x2 = malloc(1), x3 = malloc(1), …; processor 1 calls free(x1), free(x2), free(x3), …]
Slide 16: Multiple-Heap Allocator: Private Heaps with Ownership
- free returns memory to the original (owning) heap
- Bounded memory consumption: no crash!
- Used by Ptmalloc (Linux) and LKmalloc
[Diagram: the producer-consumer trace again (x1 = malloc(1), free(x1), x2 = malloc(1), free(x2)); freed memory flows back to the allocating processor's heap]
Slide 17: Problem: P-fold Memory Blowup
- Occurs in practice: round-robin producer-consumer, where processor i mod P allocates and processor (i+1) mod P frees
- Footprint = 1 (2GB), but space = 3 (6GB): exceeds the 32-bit address space, crash!
[Diagram: processors 0, 1, and 2; x1 = malloc(1), x2 = malloc(1), x3 = malloc(1), with free(x1), free(x2), free(x3) each issued by the next processor around]
Slide 18: Problem: Allocator-Induced False Sharing
- False sharing: non-shared objects end up on the same cache line
- The bane of parallel applications; extensively studied
- All of these allocators cause false sharing!
[Diagram: CPU 0 and CPU 1, each with its own cache, connected by a bus; x1 = malloc(1) on processor 0 and x2 = malloc(1) on processor 1 land on one cache line, which thrashes back and forth between the caches]
Slide 19: So What Do We Do Now?
Where do we put free memory?
- On a central heap: heap contention
- On our own heap (pure private heaps): unbounded memory consumption
- On the original heap (private heaps with ownership): P-fold blowup
And how do we avoid false sharing?
Slide 20: Overview
- Building memory managers: the Heap Layers framework
- Problems with memory managers: contention, space, false sharing
- Solution: a provably scalable allocator, Hoard
- An extended memory manager for servers: Reap
Slide 21: Hoard: Key Insights
- Bound local memory consumption: explicitly track utilization and move free memory to a global heap; this provably bounds memory consumption
- Manage memory in large chunks: avoids false sharing and reduces heap contention
Slide 22: Overview of Hoard
- Manage memory in page-sized heap blocks: avoids false sharing
- Allocate from the local heap block: avoids heap contention
- On low utilization, move the heap block to the global heap: avoids space blowup
[Diagram: processors 0 through P-1, each with local heap blocks, above a shared global heap]
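A minimal sketch of that eviction rule, assuming an illustrative emptiness threshold and much-simplified bookkeeping (Hoard's real invariant, per-size-class superblocks, and locking are more involved):

#include <cstddef>
#include <vector>

// Simplified bookkeeping: one page-sized block of same-sized objects.
struct HeapBlock {
    std::size_t inUse    = 0;     // bytes currently allocated from this block
    std::size_t capacity = 4096;  // page-sized
};

struct GlobalHeap {
    std::vector<HeapBlock*> blocks;   // reusable by any processor
};

struct LocalHeap {
    std::vector<HeapBlock*> blocks;   // owned by one processor
    std::size_t inUse    = 0;         // total bytes allocated
    std::size_t capacity = 0;         // total bytes held
};

constexpr double kEmptyFraction = 0.25;   // illustrative threshold

// Called after a free: if the local heap's utilization has dropped far
// enough, hand one mostly-empty block back to the global heap so other
// processors can reuse it. This is what bounds the space blowup.
void maybeRelease(LocalHeap& local, GlobalHeap& global) {
    if (local.inUse >= kEmptyFraction * local.capacity) return;
    for (std::size_t i = 0; i < local.blocks.size(); ++i) {
        HeapBlock* b = local.blocks[i];
        if (b->inUse < kEmptyFraction * b->capacity) {
            local.blocks.erase(local.blocks.begin() + i);
            local.capacity -= b->capacity;
            global.blocks.push_back(b);   // now visible to every processor
            return;
        }
    }
}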
Slide 23: Summary of Analytical Results
Space consumption: near-optimal worst case
- Hoard: O(n log(M/m) + P), where P « n
- Optimal: O(n log(M/m)) [Robson 70], ≈ bin-packing
- Private heaps with ownership: O(P n log(M/m))
Provably low synchronization
(n = memory required, M = largest object size, m = smallest object size, P = number of processors)
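Reading the three bounds as one chain (a hedged restatement in the slide's notation; since P « n, Hoard's additive P term keeps it within a constant factor of optimal):

\[
\underbrace{O\big(n \log (M/m)\big)}_{\text{optimal}}
\;\le\;
\underbrace{O\big(n \log (M/m) + P\big)}_{\text{Hoard}}
\;\ll\;
\underbrace{O\big(P \, n \log (M/m)\big)}_{\text{private heaps with ownership}}
\]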
Slide 24: Empirical Results
- Measured runtime on a 14-processor Sun
- Allocators: Solaris (the system allocator), Ptmalloc (GNU libc), mtmalloc (Sun's "MT-hot" allocator)
- Micro-benchmarks: Threadtest (no sharing), Larson (sharing, server-style), Cache-scratch (mostly reads and writes; tests for false sharing)
- Experience with real applications is similar
Slide 25: Runtime Performance: threadtest
speedup(x, P) = runtime(Solaris allocator, one processor) / runtime(x on P processors)
- Many threads, no sharing
- Hoard achieves linear speedup
Slide 26: Runtime Performance: Larson
- Many threads, sharing (server-style)
- Hoard achieves linear speedup
Slide 27: Runtime Performance: False Sharing
- Many threads, mostly reads and writes of heap data
- Hoard achieves linear speedup
Slide 28: Hoard in the "Real World"
- Open-source code at www.hoard.org; 13,000 downloads; runs on Solaris, Linux, Windows, IRIX, …
- Widely used in industry: AOL, British Telecom, Novell, Philips
- Reports of 2x-10x, "impressive" improvements in performance
- Used in a search server, telecom billing systems, scene rendering, real-time messaging middleware, a text-to-speech engine, telephony, and a JVM
- A scalable, general-purpose memory manager
Slide 29: Overview
- Building memory managers: the Heap Layers framework
- Problems with memory managers: contention, space, false sharing
- Solution: a provably scalable allocator, Hoard
- An extended memory manager for servers: Reap
Slide 30: Custom Memory Allocation
- Programmers often replace malloc/free: to attempt to increase performance, to provide extra functionality (e.g., for servers), and rarely to reduce space
- Our empirical study of custom allocators [OOPSLA 2002]: the Lea allocator is often as fast or faster; custom allocation is ineffective, except for regions
Slide 31: Overview of Regions
- Separate allocation areas; deletion only en masse
- API: regioncreate(r), regionmalloc(r, sz), regiondelete(r)
+ Fast: pointer-bumping allocation, deletion of whole chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion, too much space
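The API above maps onto a very small implementation. A hedged sketch (chunk size, alignment handling, and oversized requests are simplified assumptions):

#include <cstddef>
#include <cstdlib>

// Each region owns a chain of fixed-size chunks; allocation bumps a
// pointer within the newest chunk, and deletion frees the whole chain.
struct Region {
    struct Chunk {
        Chunk*      next;
        std::size_t used;
        char        data[4096];
    };
    Chunk* head = nullptr;
};

void regioncreate(Region& r) { r.head = nullptr; }

void* regionmalloc(Region& r, std::size_t sz) {
    // (Alignment and requests larger than a chunk are ignored for brevity.)
    if (!r.head || r.head->used + sz > sizeof(r.head->data)) {
        Region::Chunk* c =
            static_cast<Region::Chunk*>(std::malloc(sizeof(Region::Chunk)));
        c->next = r.head;
        c->used = 0;
        r.head  = c;
    }
    void* p = r.head->data + r.head->used;
    r.head->used += sz;               // pointer-bumping: no per-object free
    return p;
}

void regiondelete(Region& r) {
    // One call frees all memory: release every chunk in the chain.
    for (Region::Chunk* c = r.head; c != nullptr; ) {
        Region::Chunk* n = c->next;
        std::free(c);
        c = n;
    }
    r.head = nullptr;
}

Everything allocated from r lives until regiondelete(r), which is both the convenience and the risk the slide lists.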
Slide 32: Why Regions?
- Apparently faster and more space-efficient
- Servers need memory management support: avoid resource leaks; tear down memory associated with terminated connections or transactions
- The current approach (e.g., Apache): regions
Slide 33: Drawbacks of Regions
- Can't reclaim memory within regions: a problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs (unbounded memory consumption)
- The current situation for Apache: vulnerable to denial of service; limits the runtime of connections; limits module programming
Slide 34: Reap: A Hybrid Allocator
- Reap = region + heap: adds individual object deletion and a heap
- API: reapcreate(r), reapmalloc(r, sz), reapfree(r, p), reapdelete(r)
+ Fast, with cheap deletion
+ Adapts to how it is used (region style or heap style)
+ Can reduce memory consumption
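A hedged sketch of the idea, using the slide's API names over an assumed implementation (the real Reap adapts its policy dynamically and supports reapdelete(); this stub only shows the region-plus-free-list mechanism):

#include <cstddef>
#include <cstdlib>

// A reap: bump allocation like a region, plus a free list so that
// reapfree() can recycle individual objects heap-style.
struct Reap {
    struct Header { std::size_t size; Header* next; };  // precedes each object
    char*   bump  = nullptr;   // next free byte in the current chunk
    char*   limit = nullptr;   // end of the current chunk
    Header* freed = nullptr;   // recycled objects
    // (Chunk bookkeeping for reapdelete(), alignment, and error
    //  handling are omitted for brevity.)
};

void* reapmalloc(Reap& r, std::size_t sz) {
    // Heap style first: reuse a freed object if one is big enough.
    for (Reap::Header** h = &r.freed; *h != nullptr; h = &(*h)->next) {
        if ((*h)->size >= sz) {
            Reap::Header* hit = *h;
            *h = hit->next;
            return hit + 1;                 // payload follows the header
        }
    }
    // Region style otherwise: bump-allocate a header plus payload.
    std::size_t need = sizeof(Reap::Header) + sz;
    if (r.bump == nullptr || r.bump + need > r.limit) {
        std::size_t chunk = need > 4096 ? need : 4096;
        r.bump  = static_cast<char*>(std::malloc(chunk));
        r.limit = r.bump + chunk;
    }
    auto* h = reinterpret_cast<Reap::Header*>(r.bump);
    r.bump += need;
    h->size = sz;
    return h + 1;
}

void reapfree(Reap& r, void* p) {
    auto* h = reinterpret_cast<Reap::Header*>(p) - 1;
    h->next = r.freed;                      // individual deletion: recycle
    r.freed = h;
}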
Slide 35: Using Reap as Regions
[Chart: normalized runtime on the region-based benchmarks lcc and mudlle for the original (region) allocator, Win32, DLmalloc, WinHeap, Vmalloc, and Reap; one bar is cut off at 4.08]
- Reap's performance nearly matches regions
Slide 36: Reap: Best of Both Worlds
- Combining new/delete with regions is usually impossible: incompatible APIs, and code is hard to rewrite
- Using Reap, we incorporated new/delete code into Apache: "mod_bc" (an arbitrary-precision calculator), changing 20 lines (out of 8000)
- Benchmark: computing the 1000th prime. With Reap: 240K; without Reap: 7.4MB
Slide 37: Open Questions
- A Grand Unified Memory Manager? Hoard + Reap; integration with garbage collection
- Effective custom allocators? Exploit sizes, lifetimes, locality, and sharing
- Challenges of newer architectures: NUMA, SMT/CMP, 64-bit, predication
Slide 38: Current Work: Robust Performance
- Currently there is no VM-GC communication: BAD interactions under memory pressure
- Our approach (with Eliot Moss and Scott Kaplan): Cooperative Robust Automatic Memory Management
[Diagram: the garbage collector/allocator and the virtual memory manager exchange information: the LRU queue and memory pressure flow one way, empty pages the other, for reduced impact]
Slide 39: Current Work: Predictable VMM
- Recent work on scheduling for QoS (e.g., proportional-share), but under memory pressure the VMM is effectively the scheduler: paged-out processes may never recover, and intermittent processes may wait a long time
- Scheduler-faithful virtual memory (with Scott Kaplan and Prashant Shenoy): based on page value rather than order
Slide 40: Conclusion
Memory management for high-performance applications:
- Heap Layers framework [PLDI 2001]: reusable components, no runtime cost
- Hoard scalable memory manager [ASPLOS-IX]: high-performance, provably scalable and space-efficient
- Reap hybrid memory manager [OOPSLA 2002]: provides speed and robustness for server applications
- Current work: robust memory management for multiprogramming
Slide 41: The Obligatory URL Slide
http://www.cs.umass.edu/~emery
Slide 42: If You Can Read This, I Went Too Far
(backup slides follow)
Slide 43: Hoard: Under the Hood
[Diagram: Hoard composed from heap layers. A MallocOrFreeHeap sits atop a SelectSizeHeap, which selects a heap based on size. Small objects go through a FreeToHeapBlock layer to a PerProcessorHeap built from LockedHeap and HeapBlockManager layers: malloc from the local heap, free to the owning heap block. Empty heap blocks are managed by a LockedHeap/HeapBlockManager pair over a SuperblockHeap, which gets memory from and returns it to the global heap. Large objects (> 4K) go to a SystemHeap.]
Slide 44: Custom Memory Allocation
- Very common practice: Apache, gcc, lcc, STL, database servers, …
- Language-level support in C++: replace new/delete, bypassing the general-purpose allocator
- Reduces runtime: often. Expands functionality: sometimes. Reduces space: rarely
- "Use custom allocators"
Slide 45: Drawbacks of Custom Allocators
Avoiding the memory manager means:
- More code to maintain and debug
- Can't use memory debuggers
- Not modular or robust: mix memory from the custom and general-purpose allocators → crash!
- An increased burden on programmers
Slide 46: Overview
- Introduction: perceived benefits and drawbacks
- Three main kinds of custom allocators
- Comparison with general-purpose allocators
- Advantages and drawbacks of regions
- Reaps: a generalization of regions & heaps
Slide 47: (I) Per-Class Allocators
Recycle freed objects from a per-class free list:

a = new Class1; b = new Class1; c = new Class1;
delete a; delete b; delete c;
a = new Class1; b = new Class1; c = new Class1;

[Diagram: the Class1 free list threading through the freed objects a, b, c]
+ Fast: linked-list operations
+ Simple: identical semantics, C++ language support
- Possibly space-inefficient
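As a concrete illustration, a minimal per-class allocator can be written with C++'s class-level operator new/delete; the payload and sizes below are hypothetical:

#include <cstddef>
#include <cstdlib>

class Class1 {
public:
    static void* operator new(std::size_t sz) {
        if (freeList_) {                    // fast path: pop a recycled object
            FreeNode* n = freeList_;
            freeList_ = n->next;
            return n;
        }
        return std::malloc(sz);             // slow path: general-purpose heap
    }
    static void operator delete(void* p) {
        auto* n = static_cast<FreeNode*>(p);
        n->next = freeList_;                // push onto the class's free list
        freeList_ = n;
    }
private:
    struct FreeNode { FreeNode* next; };
    inline static FreeNode* freeList_ = nullptr;
    double payload_[4];                     // hypothetical object data
};

Freed instances stay on the list forever, which is exactly the "possibly space-inefficient" drawback noted above.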
Slide 48: (II) Custom Patterns
- Tailor-made to fit allocation patterns
- Example: 197.parser (a natural-language parser) allocates from a fixed char[MEMORY_LIMIT] array:

a = xalloc(8); b = xalloc(16); c = xalloc(8);
xfree(b); xfree(c);
d = xalloc(8);

[Diagram: a, b, c placed consecutively in the array; after the frees, d reuses b's slot; end_of_array marker]
+ Fast: pointer-bumping allocation
- Brittle: fixed memory size, requires stack-like lifetimes
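A hedged sketch of this pattern (the limit and the rollback behavior of xfree() are assumptions about the general pattern, not a reconstruction of 197.parser's actual code):

#include <cstddef>

constexpr std::size_t MEMORY_LIMIT = 1 << 20;  // fixed memory size (brittle)
static char pool[MEMORY_LIMIT];
static std::size_t top = 0;   // high-water mark: everything below is live

void* xalloc(std::size_t sz) {
    void* p = pool + top;     // pointer-bumping allocation
    top += sz;                // (no overflow check: brittle by design)
    return p;
}

void xfree(void* p) {
    // Only works for stack-like lifetimes: rolls the arena back,
    // implicitly freeing everything allocated after p as well.
    top = static_cast<std::size_t>(static_cast<char*>(p) - pool);
}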
Slide 49: (III) Regions
- Separate allocation areas; deletion only en masse
- API: regioncreate(r), regionmalloc(r, sz), regiondelete(r)
+ Fast: pointer-bumping allocation, deletion of whole chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion, too much space
Slide 50: Overview
- Introduction: perceived benefits and drawbacks
- Three main kinds of custom allocators
- Comparison with general-purpose allocators
- Advantages and drawbacks of regions
- Reaps: a generalization of regions & heaps
Slide 51: Custom Allocators Are Faster…
[Chart: normalized runtime on the custom-allocation benchmarks 197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, and mudlle, with non-region, region, and overall averages; series: Custom, Win32]
Slide 52: Not So Fast…
[Chart: the same benchmarks with DLmalloc added as a third series: Custom, Win32, DLmalloc]
- With DLmalloc included, the general-purpose Lea allocator is competitive with most custom allocators
Slide 53: The Lea Allocator (DLmalloc 2.7.0)
- Optimized for common allocation patterns
- Per-size quicklists, ≈ per-class allocation
- Deferred coalescing (combining adjacent free objects)
- A highly optimized fast path
- Space-efficient
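A hedged sketch of the quicklist idea (DLmalloc's real bins, chunk headers, and coalescing policy differ; here sizes are caller-supplied to keep the sketch short):

#include <cstddef>
#include <cstdlib>

constexpr std::size_t kBins = 32;
constexpr std::size_t kGranularity = 8;   // bin i serves sizes up to i*8

struct FreeNode { FreeNode* next; };
static FreeNode* quicklist[kBins];        // one list per size class

// Assumes sz > 0; real DLmalloc records sizes in chunk headers instead
// of taking them from the caller.
void* quick_malloc(std::size_t sz) {
    std::size_t bin = (sz + kGranularity - 1) / kGranularity;
    if (bin < kBins && quicklist[bin]) {  // fast path: pop a recycled chunk
        FreeNode* n = quicklist[bin];
        quicklist[bin] = n->next;
        return n;
    }
    return std::malloc(bin * kGranularity);  // slow path
}

void quick_free(void* p, std::size_t sz) {
    std::size_t bin = (sz + kGranularity - 1) / kGranularity;
    if (bin < kBins) {                    // deferred coalescing: just push
        auto* n = static_cast<FreeNode*>(p);
        n->next = quicklist[bin];
        quicklist[bin] = n;
    } else {
        std::free(p);                     // large chunks go straight back
    }
}

Deferred coalescing means a free is usually just a push and a matching malloc just a pop; combining adjacent free objects happens later, off the fast path.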
Slide 54: Space Consumption Results
[Chart: normalized space on the custom-allocation benchmarks, with region, non-region, and overall averages; series: Original, DLmalloc]
Slide 55: Overview
- Introduction: perceived benefits and drawbacks
- Three main kinds of custom allocators
- Comparison with general-purpose allocators
- Advantages and drawbacks of regions
- Reaps: a generalization of regions & heaps
Slide 56: Why Regions?
- Apparently faster and more space-efficient
- Servers need memory management support: avoid resource leaks; tear down memory associated with terminated connections or transactions
- The current approach (e.g., Apache): regions
Slide 57: Drawbacks of Regions
- Can't reclaim memory within regions: a problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs (unbounded memory consumption)
- The current situation for Apache: vulnerable to denial of service; limits the runtime of connections; limits module programming
Slide 58: Reap: A Hybrid Allocator
- Reap = region + heap: adds individual object deletion and a heap
- API: reapcreate(r), reapmalloc(r, sz), reapfree(r, p), reapdelete(r)
+ Fast, with cheap deletion
+ Adapts to how it is used (region style or heap style)
+ Can reduce memory consumption
Slide 59: Using Reap as Regions
[Chart: normalized runtime on the region-based benchmarks lcc and mudlle for the original (region) allocator, Win32, DLmalloc, WinHeap, Vmalloc, and Reap; one bar is cut off at 4.08]
- Reap's performance nearly matches regions
Slide 60: Reap: Best of Both Worlds
- Combining new/delete with regions is usually impossible: incompatible APIs, and code is hard to rewrite
- Using Reap, we incorporated new/delete code into Apache: "mod_bc" (an arbitrary-precision calculator), changing 20 lines (out of 8000)
- Benchmark: computing the 1000th prime. With Reap: 240K; without Reap: 7.4MB
Slide 61: Conclusion
- Empirical study of custom allocators: the Lea allocator is often as fast or faster; custom allocation is ineffective, except for regions
- Reaps nearly match region performance without regions' drawbacks
- Take-home message: stop using custom memory allocators!
Slide 62: Software
http://www.cs.umass.edu/~emery
(part of the Heap Layers distribution)