Optimizing RAM-latency Dominated Applications
Transcript of Optimizing RAM-latency Dominated Applications
Yandong Mao, Cody Cutler, Robert Morris (MIT CSAIL)
RAM-latency may dominate performance
• RAM-latency dominated applications
  – follow long pointer chains
  – working set >> on-chip cache
• Many cache misses -> stalling on RAM fetches
• Example: Garbage Collector
  – Identify live objects by following inter-object pointers
  – Spend much of its time stalling to follow pointers, due to RAM latency
Addressing RAM-latency bottleneck?
• View RAM as we view disk
• High latency
• A similar set of optimization techniques
  – Batching
  – Sorting
  – Access I/O in parallel and asynchronously
Outline
• Hardware Background
• Three techniques to address RAM-latency
  – Linearization: Garbage Collector
  – Interleaving: Masstree
  – Parallelization: Masstree
• Discussion
• Conclusion
Three Relevant Hardware Features
[Figure: RAM controller with three channels (0, 1, 2), each serving multiple DIMMs]
1. Fetch RAM before needed
   – Hardware prefetcher: sequential or strided access patterns
   – Software prefetch
   – Out-of-order execution
2. Parallel accesses to different channels
3. Row buffer cache inside each memory channel
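Software prefetch (feature 1) is available directly from C/C++ as the GCC/Clang `__builtin_prefetch` builtin. A minimal sketch of the idea, assuming a hypothetical pointer-chained `Node` type (the names here are illustrative, not from the talk):

```cpp
#include <cstddef>

// Hypothetical node in a pointer-chained structure.
struct Node { int value; Node* next; };

// Sum a linked list while hinting the fetch of the next node.
// __builtin_prefetch issues a non-blocking load, so the next node may
// already be in cache by the time the loop dereferences it.
long sum_with_prefetch(Node* head) {
    long total = 0;
    for (Node* n = head; n != nullptr; n = n->next) {
        if (n->next != nullptr)
            __builtin_prefetch(n->next);  // overlap fetch with current work
        total += n->value;
    }
    return total;
}
```

A single dependent chain like this gains little, since each prefetch target is only known one step ahead; the interleaving technique later in the talk is what creates enough independent work to hide the latency.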
Per-array row buffer cache (Intel Xeon X5690)
[Figure: DRAM array of data rows (4096 bytes each) plus one row buffer]
• Each channel has many arrays like the one shown
• Each array has an additional row: the row buffer
• Memory access: check the row buffer first; reload it on a miss
• Hit in row buffer: 2x-5x faster than a miss!
• Sequential access: 3.5x higher throughput than random access!
Linearizing memory accesses for the Garbage Collector
• Garbage Collector goals
  – Find live objects (tracing)
    • start from roots (stack, global variables)
    • follow object pointers of live objects
  – Reclaim space of unreachable objects
• Bottleneck of tracing: RAM latency
  – Pointer addresses are unpredictable and non-sequential
  – Each access -> cache miss -> stall for a RAM fetch
Observation
• Arrange objects in tracing order during garbage collection
  – Subsequent tracing then accesses memory in sequential order
• Takes advantage of two hardware features
  – Hardware prefetchers: prefetch into cache
  – Higher row buffer hit rate
Benchmark and result
• Time to trace 1.8 GB of live data
• HSQLDB 2.2.9: an RDBMS engine in Java
• Compacting collector of the Hotspot JVM from OpenJDK 7u6
  – Uses copy collection to reorder objects into tracing order
• Result: tracing in sequential order is 1.3x faster than in random order
• Future work
  – a better linearizing algorithm than copy collection (which uses twice the memory!)
  – measure application-level performance improvement
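Copy collection that reorders objects into tracing order can be sketched as a Cheney-style breadth-first copy: live objects are appended to a fresh arena in the order the trace discovers them, so the next trace walks memory in allocation (near-sequential) order. The names `Obj`, `linearize`, and `arena` are illustrative, not the Hotspot collector's:

```cpp
#include <deque>
#include <unordered_map>
#include <vector>

// Hypothetical heap object: a value plus references to other objects.
struct Obj { int value; std::vector<Obj*> refs; };

// Copy everything reachable from root into 'arena' in breadth-first
// tracing order, redirecting pointers to the new copies.
// std::deque keeps references stable across push_back.
void linearize(Obj* root, std::deque<Obj>& arena) {
    std::unordered_map<Obj*, Obj*> forward;   // old address -> new address
    arena.push_back(*root);
    forward[root] = &arena.back();
    // 'scan' walks the copied region; any reference to an object not
    // yet copied appends it -- Cheney's two-finger scan.
    for (std::size_t scan = 0; scan < arena.size(); ++scan) {
        for (Obj*& ref : arena[scan].refs) {
            auto it = forward.find(ref);
            if (it == forward.end()) {
                arena.push_back(*ref);
                it = forward.emplace(ref, &arena.back()).first;
            }
            ref = it->second;                 // point at the copy
        }
    }
}
```

The double-memory cost mentioned above is visible here: the arena coexists with the old heap until copying finishes.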
Interleaving on Masstree
• Not always possible to linearize memory access
• Masstree: a high-performance in-memory key-value store for multi-core
  – All cores share a single B+tree
  – Each core runs a dedicated worker thread
  – Scales well on multi-core
• Focus on single-threaded Masstree for now
Single-threaded Masstree is RAM-latency dominated
• Careful design to avoid RAM fetches
  – trie of B+trees; inline key fragments and children in tree nodes
  – Access one fat B+tree node in one RAM latency
• Still RAM-latency dominated!
  – Each key lookup follows a random path
  – O(log N) RAM latencies (hundreds of cycles each) per lookup
  – A million lookups per second
Batch and interleave tree lookups
• Batch key lookups
• Interleave computation and RAM fetch using software prefetch
[Figure: B+tree rooted at E; child node B contains key A, child node F contains key X]
1. Find the child containing A in E; prefetch(B)
2. Find the child containing X in E; prefetch(F)
3. Find the child containing A in B (B is already in cache!); prefetch(A)
4. Find the child containing X in F (F is already in cache!); prefetch(X)
• Perform a batch of lookups without stalling on RAM fetches!
• As long as computation (inspecting a batch of nodes) > RAM latency
• 30% improvement with a batch of five
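The interleaved batch lookup above can be sketched on a toy binary search tree standing in for Masstree's fat B+tree nodes. Each round advances every pending lookup one level and prefetches its next node, so the RAM fetch for lookup i overlaps the computation on lookups i+1, i+2, ... (`TNode` and `batch_lookup` are illustrative names, not Masstree's code):

```cpp
#include <vector>
#include <cstddef>

// Toy binary search tree node (stand-in for a fat B+tree node).
struct TNode { int key; TNode* left; TNode* right; };

// Advance all lookups in lockstep, one tree level per round,
// prefetching each next node before the following round touches it.
std::vector<bool> batch_lookup(TNode* root, const std::vector<int>& keys) {
    std::vector<TNode*> cur(keys.size(), root);
    std::vector<bool> found(keys.size(), false);
    bool active = true;
    while (active) {
        active = false;
        for (std::size_t i = 0; i < keys.size(); ++i) {
            TNode* n = cur[i];
            if (n == nullptr || found[i]) continue;   // finished lookup
            if (n->key == keys[i]) { found[i] = true; continue; }
            TNode* next = keys[i] < n->key ? n->left : n->right;
            if (next != nullptr)
                __builtin_prefetch(next);  // overlaps the other lookups' work
            cur[i] = next;
            active = active || next != nullptr;
        }
    }
    return found;
}
```

As the slide notes, this only hides latency while inspecting the rest of the batch takes longer than one RAM fetch, which is why a batch of about five helps.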
Parallelizing Masstree
• Interesting observation
  – applications are limited by RAM latency, not by CPU
  – but adding more cores helps!
• Reason
  – RAM is a parallel system
  – More cores keep RAM busier
• Compared with the interleaving technique
  – Same effect: keep RAM busier
  – Difference: from one core vs. from multiple cores
Parallelization improves performance by issuing more RAM loads
[Figure: Masstree throughput (millions of gets/second) and RAM loads (millions/second) vs. number of hyper-threads, 1-12]
Interleaving and Parallelization can be complementary
[Figure: throughput of interleaved Masstree (millions of gets/second) and improvement over Masstree (%) vs. number of hyper-threads, 1-12]
• Beats Masstree by 12-30%
• Improvement decreases with more cores: parallelization alone can saturate RAM
Discussion
• Applicability
  – Lessons
    • Interleaving seems more general than linearization
      – could it be applied to the Garbage Collector?
    • Interleaving is more difficult than parallelization
      – requires batching and concurrency control
  – Challenges in automatic interleaving
    • Need to identify and resolve conflicting accesses
    • Difficult or impossible without the programmer's help
Discussion
• Interleaving on certain data structures
  – Data structures and potential applications
    • B+tree: Masstree
      – do other applications use in-memory B+trees?
    • Hashtable: Memcached
      – A single hashtable
      – Multi-get API: natural batching and interleaving
      – Preliminary result: an interleaved hashtable improves throughput by 1.3x
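The multi-get interleaving can be sketched on a toy open-addressing hashtable (illustrative only, not Memcached's actual structure): hash every key in the batch and prefetch its bucket first, then probe, so the bucket loads for all keys are in flight at once.

```cpp
#include <vector>
#include <cstddef>

// Toy open-addressing slot (illustrative, not Memcached's layout).
struct Slot { int key; int value; bool used; };

// Multi-get with interleaving: phase 1 issues a prefetch per bucket,
// phase 2 probes. Keys are assumed non-negative in this sketch.
std::vector<int> multi_get(const std::vector<Slot>& table,
                           const std::vector<int>& keys) {
    std::vector<std::size_t> idx(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        idx[i] = static_cast<std::size_t>(keys[i]) % table.size();
        __builtin_prefetch(&table[idx[i]]);   // issue all bucket loads up front
    }
    std::vector<int> out(keys.size(), -1);    // -1 means "not found"
    for (std::size_t i = 0; i < keys.size(); ++i) {
        std::size_t j = idx[i];
        // linear probing, bounded by table size to avoid spinning forever
        for (std::size_t probes = 0; probes < table.size() && table[j].used;
             ++probes, j = (j + 1) % table.size()) {
            if (table[j].key == keys[i]) { out[i] = table[j].value; break; }
        }
    }
    return out;
}
```

The multi-get API supplies the batch for free, which is why the talk calls the hashtable case "natural batching and interleaving".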
Discussion
• Profiling tools
  – Linux perf
    • Look at the most expensive function
    • Inspect it manually
  – May be misleading
    • computation-limited or RAM-latency limited?
  – A RAM-stall-based tool?
Related Work
• PALM [Jason11]: a B+tree using the same interleaving technique
• RAM parallelization at different levels: regulation considered harmful [Park13]
Conclusion
• Identified a class of applications dominated by RAM latency
• Three techniques to address the RAM-latency bottleneck, applied to two applications
• Could you improve your program similarly?
Questions?
Backup: trie of B+trees
• Trie: a tree where each level is indexed by a fixed-length key fragment
• … -> B+tree indexed by k[0:7] -> B+tree indexed by k[8:15] -> …