MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
Hyeontaek Lim (1), Dongsu Han (2), David G. Andersen (1), Michael Kaminsky (3)
(1) Carnegie Mellon University, (2) KAIST, (3) Intel Labs
Goal: Fast In-Memory Key-Value Store
• Improve per-node performance (op/sec/node)
• Less expensive
• Easier hotspot mitigation
• Lower latency for multi-key queries
• Target: small key-value items (fit in single packet)
• Non-goals: cluster architecture, durability
Q: How Good (or Bad) are Current Systems?
• Workload: YCSB [SoCC 2010]
• Single-key operations
• In-memory storage
• Logging turned off in our experiments
• End-to-end performance over the network
• Single server node
End-to-End Performance Comparison
Throughput (M operations/sec), single server node:

System    | 95% GET, ORIG | 95% GET, OPT | 50% GET, OPT
Memcached | 1.4           | 1.3          | 0.7
RAMCloud  | 0.1           | 5.8          | 1.0
MemC3     | 4.4           | 5.7          | 0.9
Masstree  | 8.9           | 16.5         | 6.5
MICA      | n/a           | 65.6         | 70.4

ORIG: published results (logging on for RAMCloud/Masstree).
OPT: using Intel DPDK (kernel bypass I/O); no logging.
50% GET is a write-intensive workload.
End-to-End Performance Comparison (continued)
Performance collapses under heavy writes: on the write-intensive 50% GET workload, no prior system exceeds 6.5 Mops.
End-to-End Performance Comparison (continued)
MICA delivers 13.5x the throughput of the next-best system, and 4x the maximum packets/sec attainable using UDP.
MICA Approach
• MICA: Redesigning in-memory key-value storage
• Applies new SW architecture and data structures to general-purpose HW in a holistic way
Server node data path: Client -> NIC -> CPU -> Memory
1. Parallel data access
2. Request direction
3. Key-value data structures (cache & store)
Parallel Data Access
• Modern CPUs have many cores (8, 15, …)
• How to exploit CPU parallelism efficiently?
Parallel Data Access Schemes
Concurrent Read / Concurrent Write (all cores share all partitions):
+ Good load distribution
- Limited CPU scalability (e.g., synchronization)
- Cross-NUMA latency

Exclusive Read / Exclusive Write (each partition accessed by one core):
+ Good CPU scalability
- Potentially low performance under skewed workloads
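The exclusive-access scheme can be sketched as a static key-to-partition mapping; the CRC32 hash, modulo mapping, and partition count below are illustrative assumptions, not necessarily MICA's exact scheme:

```python
import zlib

NUM_PARTITIONS = 16  # e.g., one partition per CPU core

def partition_of(key: bytes) -> int:
    """Map a key to the single partition (core) that may access it."""
    return zlib.crc32(key) % NUM_PARTITIONS

# Because the mapping is deterministic, a partition's core needs no
# locks on the data path: it is the only core that ever touches it.
assert partition_of(b"foo") == partition_of(b"foo")
```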
In MICA, Exclusive Outperforms Concurrent
Throughput (Mops), end-to-end performance with kernel bypass I/O:

Workload         | Concurrent Access | Exclusive Access
Uniform, 50% GET | 35.1              | 76.9
Uniform, 95% GET | 69.3              | 76.3
Skewed, 50% GET  | 21.6              | 70.4
Skewed, 95% GET  | 69.4              | 65.6
Request Direction
• Sending requests to appropriate CPU cores for better data access locality
• Exclusive access benefits from correct delivery
• Each request must be sent to the corresponding partition’s core
Request Direction Schemes
Flow-based Affinity (classification using the 5-tuple):
+ Good locality for flows (e.g., HTTP over TCP)
- Suboptimal for small key-value processing

Object-based Affinity (classification depends on request content, i.e., the key):
+ Good locality for key access
- Client assist or special HW support needed for efficiency
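Object-based affinity can be sketched from the client side: the client hashes the key and steers the request to the owning core, for example by choosing a per-core destination UDP port that NIC hardware flow steering maps to that core's receive queue. The port layout and hash here are illustrative assumptions:

```python
import zlib

BASE_PORT = 8000  # hypothetical: port BASE_PORT + i -> core i's RX queue
NUM_CORES = 16

def dest_port_for(key: bytes) -> int:
    """Client-side request direction: pick the port for the key's core."""
    core = zlib.crc32(key) % NUM_CORES
    return BASE_PORT + core

# All requests for the same key reach the same core, matching the
# exclusive-access partitioning without any server-side dispatching.
```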
Crucial to Use NIC HW for Request Direction
Throughput (Mops), using exclusive access for parallel data access:

Workload | Software-only request direction | Client-assisted HW-based request direction
Uniform  | 33.9                            | 76.9
Skewed   | 28.1                            | 70.4
Key-Value Data Structures
• Significant impact on key-value processing speed
• New design required for very high op/sec for both read and write
• “Cache” and “store” modes
MICA’s “Cache” Data Structures
• Each partition has:
• Circular log (for memory allocation)
• Lossy concurrent hash index (for fast item access)
• Exploit Memcached-like cache semantics
• Lost data can be recovered from the backing store (though not for free)
• Favor fast processing
• Provide good memory efficiency & item eviction
Circular Log
• Allocates space for key-value items of any length
• Conventional logs + Circular queues
• Simple garbage collection/free space defragmentation
• New items are appended at the tail (the log size is fixed)
• Insufficient space for a new item? Evict the oldest item at the head (FIFO)
• LRU is supported by reinserting recently accessed items
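A minimal sketch of the circular log's append/evict behavior, assuming items are (key, value) byte strings; a Python deque with a byte budget stands in for the contiguous memory region the real structure uses:

```python
from collections import deque

class CircularLog:
    def __init__(self, capacity: int):
        self.capacity = capacity  # fixed log size, in bytes
        self.used = 0
        self.items = deque()      # head = oldest item, tail = newest

    def append(self, key: bytes, value: bytes) -> None:
        size = len(key) + len(value)
        # Insufficient space for the new item? Evict at the head (FIFO).
        while self.items and self.used + size > self.capacity:
            _, _, old_size = self.items.popleft()
            self.used -= old_size
        self.items.append((key, value, size))
        self.used += size

    def get(self, key: bytes):
        for k, v, _ in self.items:
            if k == key:
                return v
        return None  # evicted or never stored (cache semantics)
```

Reinserting an item at the tail whenever it is accessed would give the LRU behavior the slide mentions.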
Lossy Concurrent Hash Index
• Indexes key-value items stored in the circular log
• Set-associative table
• Full bucket? Evict oldest entry from it
• Fast indexing of new key-value items
hash(Key) selects one of the buckets (bucket 0 ... bucket N-1) in the set-associative hash index; each bucket entry points to a (Key, Val) item stored in the circular log.
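The lossy index can be sketched as a set-associative table of (key hash, log offset) entries; the 8-way associativity and table size below are illustrative assumptions:

```python
ASSOCIATIVITY = 8
NUM_BUCKETS = 1024

buckets = [[] for _ in range(NUM_BUCKETS)]  # each bucket: oldest entry first

def index_insert(key_hash: int, log_offset: int) -> None:
    bucket = buckets[key_hash % NUM_BUCKETS]
    if len(bucket) == ASSOCIATIVITY:
        bucket.pop(0)                 # full bucket: evict the oldest entry
    bucket.append((key_hash, log_offset))

def index_lookup(key_hash: int):
    for h, off in buckets[key_hash % NUM_BUCKETS]:
        if h == key_hash:
            return off
    return None                       # lossy: the entry may have been evicted
```

A lookup can miss an item that still sits in the log because its index entry was evicted; cache semantics tolerate this, since the client can refetch the item from the backing store.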
MICA’s “Store” Data Structures
• Required to preserve stored items
• Achieves similar performance by trading off some memory efficiency
• Circular log -> Segregated fits
• Lossy index -> Lossless index (with bulk chaining)
• See our paper for details
Evaluation
• Going back to end-to-end evaluation…
• Throughput & latency characteristics
Throughput Comparison
Throughput (Mops), end-to-end performance with kernel bypass I/O:

System    | Uniform, 50% GET | Uniform, 95% GET | Skewed, 50% GET | Skewed, 95% GET
Memcached | 0.7              | 1.3              | 0.7             | 1.3
MemC3     | 0.9              | 5.4              | 0.9             | 5.7
RAMCloud  | 0.9              | 6.1              | 1.0             | 5.8
Masstree  | 5.7              | 12.2             | 6.5             | 16.5
MICA      | 76.9             | 76.3             | 70.4            | 65.6

The prior systems are bad at high write ratios; MICA performs similarly regardless of skew or write ratio, leaving a large performance gap.
Throughput-Latency on Ethernet
Average latency (μs) vs. throughput (Mops); original Memcached uses standard socket I/O, and both systems use UDP. Original Memcached saturates below 0.3 Mops, while MICA reaches 70+ Mops: a 200x+ throughput improvement.
MICA
• Redesigning in-memory key-value storage
• 65.6+ Mops/node even for heavy skew/write
• Source code: github.com/efficient/mica