Optimizing RAM-latency Dominated Applications
Transcript of Optimizing RAM-latency Dominated Applications
Yandong Mao, Cody Cutler, Robert Morris (MIT CSAIL)
RAM-latency may dominate performance
• RAM-latency dominated applications
  – follow long pointer chains
  – working set >> on-chip cache
• Many cache misses -> stalling on RAM fetches
• Example: Garbage Collector
  – Identify live objects by following inter-object pointers
  – Spend much of its time stalling to follow pointers, due to RAM latency
Addressing RAM-latency bottleneck?
• View RAM as we view disk
• High latency
• A similar set of optimization techniques
  – Batching
  – Sorting
  – Access I/O in parallel and asynchronously
Outline
• Hardware Background
• Three techniques to address RAM-latency
  – Linearization: Garbage Collector
  – Interleaving: Masstree
  – Parallelization: Masstree
• Discussion
• Conclusion
Three Relevant Hardware Features
[Figure: RAM controller with three channels (0, 1, 2), each serving multiple DIMMs]
1. Fetch RAM before needed
   – Hardware prefetcher: sequential or strided access patterns
   – Software prefetch
   – Out-of-order execution
2. Parallel accesses to different channels
3. Row buffer cache inside each memory channel
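Software prefetch (feature 1) is available directly from C/C++ as the GCC/Clang `__builtin_prefetch` builtin. A minimal sketch of the idea, assuming a hypothetical pointer-chained `Node` type (the names here are illustrative, not from the talk):

```cpp
#include <cstddef>

// Hypothetical node in a pointer-chained structure.
struct Node { int value; Node* next; };

// Sum a linked list while hinting the fetch of the next node.
// __builtin_prefetch issues a non-blocking load, so the next node may
// already be in cache by the time the loop dereferences it.
long sum_with_prefetch(Node* head) {
    long total = 0;
    for (Node* n = head; n != nullptr; n = n->next) {
        if (n->next != nullptr)
            __builtin_prefetch(n->next);  // overlap fetch with current work
        total += n->value;
    }
    return total;
}
```

A single dependent chain like this gains little, since each prefetch target is only known one step ahead; the interleaving technique later in the talk is what creates enough independent work to hide the latency.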
Per-array row buffer cache (Intel Xeon X5690)
[Figure: DRAM array of data rows (4096 bytes each) plus one row buffer]
• Each channel has many arrays like the one shown
• Each array has an additional row: the row buffer
• Memory access: check the row buffer first; reload it on a miss
• Hit in row buffer: 2x-5x faster than a miss!
• Sequential access: 3.5x higher throughput than random access!
Linearizing memory accesses for the Garbage Collector
• Garbage Collector goals
  – Find live objects (tracing)
    • start from roots (stack, global variables)
    • follow object pointers of live objects
  – Reclaim space of unreachable objects
• Bottleneck of tracing: RAM latency
  – Pointer addresses are unpredictable and non-sequential
  – Each access -> cache miss -> stall for a RAM fetch
Observation
• Arrange objects in tracing order during garbage collection
  – Subsequent tracing then accesses memory in sequential order
• Takes advantage of two hardware features
  – Hardware prefetchers: prefetch into cache
  – Higher row buffer hit rate
Benchmark and result
• Time to trace 1.8 GB of live data
• HSQLDB 2.2.9: an RDBMS engine in Java
• Compacting collector of the Hotspot JVM from OpenJDK 7u6
  – Uses copy collection to reorder objects into tracing order
• Result: tracing in sequential order is 1.3x faster than in random order
• Future work
  – a better linearizing algorithm than copy collection (which uses twice the memory!)
  – measure application-level performance improvement
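Copy collection that reorders objects into tracing order can be sketched as a Cheney-style breadth-first copy: live objects are appended to a fresh arena in the order the trace discovers them, so the next trace walks memory in allocation (near-sequential) order. The names `Obj`, `linearize`, and `arena` are illustrative, not the Hotspot collector's:

```cpp
#include <deque>
#include <unordered_map>
#include <vector>

// Hypothetical heap object: a value plus references to other objects.
struct Obj { int value; std::vector<Obj*> refs; };

// Copy everything reachable from root into 'arena' in breadth-first
// tracing order, redirecting pointers to the new copies.
// std::deque keeps references stable across push_back.
void linearize(Obj* root, std::deque<Obj>& arena) {
    std::unordered_map<Obj*, Obj*> forward;   // old address -> new address
    arena.push_back(*root);
    forward[root] = &arena.back();
    // 'scan' walks the copied region; any reference to an object not
    // yet copied appends it -- Cheney's two-finger scan.
    for (std::size_t scan = 0; scan < arena.size(); ++scan) {
        for (Obj*& ref : arena[scan].refs) {
            auto it = forward.find(ref);
            if (it == forward.end()) {
                arena.push_back(*ref);
                it = forward.emplace(ref, &arena.back()).first;
            }
            ref = it->second;                 // point at the copy
        }
    }
}
```

The double-memory cost mentioned above is visible here: the arena coexists with the old heap until copying finishes.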
Interleaving on Masstree
• Not always possible to linearize memory access
• Masstree: a high-performance in-memory key-value store for multi-core
  – All cores share a single B+tree
  – Each core runs a dedicated worker thread
  – Scales well on multi-core
• Focus on single-threaded Masstree for now
Single-threaded Masstree is RAM-latency dominated
• Careful design to avoid RAM fetches
  – trie of B+trees; inline key fragments and children in tree nodes
  – Access one fat B+tree node in one RAM latency
• Still RAM-latency dominated!
  – Each key lookup follows a random path
  – O(log N) RAM latencies (hundreds of cycles each) per lookup
  – A million lookups per second
Batch and interleave tree lookups
• Batch key lookups
• Interleave computation and RAM fetch using software prefetch
[Figure: B+tree rooted at E; child node B contains key A, child node F contains key X]
1. Find the child containing A in E; prefetch(B)
2. Find the child containing X in E; prefetch(F)
3. Find the child containing A in B (B is already in cache!); prefetch(A)
4. Find the child containing X in F (F is already in cache!); prefetch(X)
• Perform a batch of lookups without stalling on RAM fetches!
• As long as computation (inspecting a batch of nodes) > RAM latency
• 30% improvement with a batch of five
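The interleaved batch lookup above can be sketched on a toy binary search tree standing in for Masstree's fat B+tree nodes. Each round advances every pending lookup one level and prefetches its next node, so the RAM fetch for lookup i overlaps the computation on lookups i+1, i+2, ... (`TNode` and `batch_lookup` are illustrative names, not Masstree's code):

```cpp
#include <vector>
#include <cstddef>

// Toy binary search tree node (stand-in for a fat B+tree node).
struct TNode { int key; TNode* left; TNode* right; };

// Advance all lookups in lockstep, one tree level per round,
// prefetching each next node before the following round touches it.
std::vector<bool> batch_lookup(TNode* root, const std::vector<int>& keys) {
    std::vector<TNode*> cur(keys.size(), root);
    std::vector<bool> found(keys.size(), false);
    bool active = true;
    while (active) {
        active = false;
        for (std::size_t i = 0; i < keys.size(); ++i) {
            TNode* n = cur[i];
            if (n == nullptr || found[i]) continue;   // finished lookup
            if (n->key == keys[i]) { found[i] = true; continue; }
            TNode* next = keys[i] < n->key ? n->left : n->right;
            if (next != nullptr)
                __builtin_prefetch(next);  // overlaps the other lookups' work
            cur[i] = next;
            active = active || next != nullptr;
        }
    }
    return found;
}
```

As the slide notes, this only hides latency while inspecting the rest of the batch takes longer than one RAM fetch, which is why a batch of about five helps.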
Parallelizing Masstree
• Interesting observation
  – applications are limited by RAM latency, not by CPU
  – but adding more cores helps!
• Reason
  – RAM is a parallel system
  – More cores keep RAM busier
• Compared with the interleaving technique
  – Same effect: keep RAM busier
  – Difference: from one core vs. from multiple cores
Parallelization improves performance by issuing more RAM loads
[Figure: Masstree throughput (millions of gets/second) and RAM loads (millions/second) vs. number of hyper-threads, 1-12]
Interleaving and Parallelization can be complementary
[Figure: throughput of interleaved Masstree (millions of gets/second) and improvement over Masstree (%) vs. number of hyper-threads, 1-12]
• Beats Masstree by 12-30%
• Improvement decreases with more cores: parallelization alone can saturate RAM
Discussion
• Applicability
  – Lessons
    • Interleaving seems more general than linearization
      – could it be applied to the Garbage Collector?
    • Interleaving is more difficult than parallelization
      – requires batching and concurrency control
  – Challenges in automatic interleaving
    • Need to identify and resolve conflicting accesses
    • Difficult or impossible without the programmer's help
Discussion
• Interleaving on certain data structures
  – Data structures and potential applications
    • B+tree: Masstree
      – do other applications use in-memory B+trees?
    • Hashtable: Memcached
      – A single hashtable
      – Multi-get API: natural batching and interleaving
      – Preliminary result: an interleaved hashtable improves throughput by 1.3x
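The multi-get interleaving can be sketched on a toy open-addressing hashtable (illustrative only, not Memcached's actual structure): hash every key in the batch and prefetch its bucket first, then probe, so the bucket loads for all keys are in flight at once.

```cpp
#include <vector>
#include <cstddef>

// Toy open-addressing slot (illustrative, not Memcached's layout).
struct Slot { int key; int value; bool used; };

// Multi-get with interleaving: phase 1 issues a prefetch per bucket,
// phase 2 probes. Keys are assumed non-negative in this sketch.
std::vector<int> multi_get(const std::vector<Slot>& table,
                           const std::vector<int>& keys) {
    std::vector<std::size_t> idx(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        idx[i] = static_cast<std::size_t>(keys[i]) % table.size();
        __builtin_prefetch(&table[idx[i]]);   // issue all bucket loads up front
    }
    std::vector<int> out(keys.size(), -1);    // -1 means "not found"
    for (std::size_t i = 0; i < keys.size(); ++i) {
        std::size_t j = idx[i];
        // linear probing, bounded by table size to avoid spinning forever
        for (std::size_t probes = 0; probes < table.size() && table[j].used;
             ++probes, j = (j + 1) % table.size()) {
            if (table[j].key == keys[i]) { out[i] = table[j].value; break; }
        }
    }
    return out;
}
```

The multi-get API supplies the batch for free, which is why the talk calls the hashtable case "natural batching and interleaving".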
Discussion
• Profiling tools
  – Linux perf
    • Look at the most expensive function
    • Inspect it manually
  – May be misleading
    • computation-limited or RAM-latency limited?
  – A RAM-stall-based tool?
Related Work
• PALM [Jason11]: a B+tree using the same interleaving technique
• RAM parallelization at different levels: regulation considered harmful [Park13]
Conclusion
• Identified a class of applications dominated by RAM latency
• Three techniques to address the RAM-latency bottleneck, applied to two applications
• Could you improve your program similarly?
Questions?
Backup: trie of B+trees
• Trie: a tree where each level is indexed by a fixed-length key fragment
• … -> B+tree indexed by k[0:7] -> B+tree indexed by k[8:15] -> …