Transcript of "Redrawing the Boundary Between Software and Storage for Fast Non-Volatile Memories" · …nvsl.ucsd.edu/assets/talks/2012-05-21-DaMoN-release.pdf
1
Redrawing the Boundary Between Software and Storage for Fast Non-Volatile Memories
Steven Swanson, Director, Non-Volatile Systems Laboratory
Computer Science and Engineering, University of California, San Diego
2
Humanity processed 9 zettabytes in 2008*: 9,000,000,000,000,000,000,000 bytes. Welcome to the Data Age!
*http://hmi.ucsd.edu
3
Solid State Memories
• NAND flash – Ubiquitous, cheap – Sort of slow, idiosyncratic
• Phase change, spin-torque MRAMs, etc. – Not (yet) ready for prime time – DRAM-like speed – DRAM- or flash-like density
• DRAM + battery + flash – Fast, available today – Reliability?
4
Integrating Them Into Systems
[Diagram: a CPU with DRAM, plus NVMs attached either over DDR or over PCIe.]
PCIe-attached NVMs – Low latency – High bandwidth – Flexible interface – Flexible topology – Scalable
DDR-attached NVMs – Lowest latency – Highest bandwidth – Rigid interface – Restricted topology – Limited scalability
5
[Chart: bandwidth vs. 1/latency, both relative to disk, on log-log axes. Points: Hard Drives (2006), PCIe-Flash (2007), PCIe-PCM (2010), PCIe-Flash (2012), PCIe-PCM (2013?), DDR Fast NVM (2016?). Cumulative gains of 5917x and 7200x to date, each continuing at roughly 2.4x/yr.]
6
Software Latency Will Dominate
[Chart: software's contribution to latency across successive storage technologies, rising from 0.3% on disk through 19.3%, 21.9%, 10.4%, 21.9%, 70.0%, and 88.4% to 94.09% for the fastest NVMs.]
7
Software Energy Will Dominate
[Chart: software's contribution to energy per IOP – Hard Drive (2006): 0.4%; SSD (2012): 97.0%; PCIe-Flash (2012): 75.8%; PCIe-PCM (2013): 78.6%.]
8
Software’s New Role in Storage
For disks: adding software could make the system faster.
For NVMs: adding software makes it slower.
• Reduce software interactions • Remove disk-centric features • Redesign IO software (OS, FS, languages, and apps)
9
Two Case Studies
[Diagram: a CPU with DRAM and NVM on two paths. DDR path – NV-Heaps: NVMs over DDR (programming model, file system). PCIe path – Moneta: SSDs for fast NVMs (Moneta driver IO stack replacing the block driver IO stack beneath the file system and database).]
10
Moneta Internals: Performing a Write
[Diagram: four 16 GB NVM banks on a 4 GB/s ring, with ring control, transfer buffers, DMA control, a scoreboard, tag status registers, and a request queue; the host communicates via PIO and DMA.]
11
The Moneta Prototype
• FPGA-based implementation
• DDR2 DRAM emulates PCM – Configurable memory latency – 48 ns reads, 150 ns writes – 64 GB across 8 controllers
• PCIe: 2 GB/s, full duplex
12
Optimizing Software for Moneta
• Optimizations – Remove the IO scheduler – Redesigned HW interface – Remove locks
• SW latency drops from 13.4 µs to 5 µs – 62% reduction in latency – Increased concurrency
• Bandwidth increases by up to 10x
[Chart: latency (µs) breakdown for Moneta 0.1 vs. Moneta 1.0, stacked by component: PCM, ring, DMA, wait, interrupt, issue, copy, schedule, OS/user.]
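One of the listed optimizations is removing locks from the issue path. A minimal sketch of how a driver can hand out request tags without a lock, using a compare-and-swap loop over a bitmask, is below; the 64-tag scoreboard and the class names are illustrative assumptions, not Moneta's actual interface.

```cpp
#include <atomic>
#include <cstdint>
#include <optional>

// Hypothetical lock-free tag allocator: many threads can claim request
// tags concurrently without serializing on a driver-wide lock.
class TagAllocator {
public:
    // Claim a free tag (a clear bit) with a CAS loop; no locks taken.
    std::optional<unsigned> alloc() {
        uint64_t cur = used_.load(std::memory_order_relaxed);
        while (cur != ~0ULL) {                  // some tag is still free
            unsigned tag = 0;
            while ((cur >> tag) & 1) ++tag;     // lowest clear bit
            if (used_.compare_exchange_weak(cur, cur | (1ULL << tag),
                                            std::memory_order_acquire))
                return tag;
            // on failure, cur was reloaded; retry
        }
        return std::nullopt;                    // all 64 tags in flight
    }
    void release(unsigned tag) {
        used_.fetch_and(~(1ULL << tag), std::memory_order_release);
    }
private:
    std::atomic<uint64_t> used_{0};
};
```

Because each issue takes one CAS instead of a lock acquire/release, independent requests no longer contend, which is one source of the increased concurrency the slide mentions.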
13
The App Gap
[Chart: log-scale speedup vs. DISK-RAID for Moneta, FusionIO, SSD-RAID, and DISK-RAID on XDD 4KB RW, Btree, HashTable, Bio, and PTF workloads.]
• We must close the App Gap for NVM success
14
Where is the App Gap (Part I)?
[Chart: sustained bandwidth (MB/s) for 4 KB accesses, Moneta 1.0 alone vs. Moneta 1.0 + file system. Application classes: databases, web apps, scientific apps, file servers. Stack: driver, file system, IO stack.]
15
Eliminating FS and OS overheads
• Separate protection mechanisms from policy – Move protection checks to hardware – Allow applications to access Moneta directly
• 41% of the latency goes to supporting protection and sharing
[Chart: latency (µs) breakdown for Moneta 0.1, Moneta 1.0, and Moneta-D, stacked by component: PCM, ring, DMA, wait, interrupt, issue, copy, schedule, OS/user, file system.]
[Diagram: with Moneta 1.0, every application goes through the file system, IO stack, and shared driver; with Moneta-D, each application has its own driver and the permission check moves into the device.]
• Protection overheads are almost eliminated
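The mechanism/policy split above can be sketched in software: the kernel installs permission records once, and a model of the device checks each user-issued request against them, so the common case needs no system call. The record layout and function names are assumptions for illustration, not Moneta-D's real interface.

```cpp
#include <cstdint>
#include <vector>

// An extent of the device that one process has been granted.
struct Extent { uint64_t start, len; bool writable; };

// Stands in for the permission table the hardware would hold on-device.
class PermTable {
public:
    void install(const Extent& e) { extents_.push_back(e); } // kernel-only path
    bool check(uint64_t off, uint64_t len, bool write) const {
        for (const auto& e : extents_)
            if (off >= e.start && off + len <= e.start + e.len &&
                (!write || e.writable))
                return true;        // request falls inside a granted extent
        return false;               // real hardware would signal an error
    }
private:
    std::vector<Extent> extents_;
};

// User-space issue path: the permission check happens on the device side,
// so no kernel crossing is needed per access.
bool issue_write(const PermTable& hw, uint64_t off, uint64_t len) {
    return hw.check(off, len, /*write=*/true); // then DMA the data directly
}
```

The kernel's role shrinks to installing and revoking extents; the per-IO cost of protection moves off the software path, matching the "almost eliminated" claim above.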
17
Where is the App Gap (Part II)?
[Diagram: the Moneta-D stack (per-application driver, user-space IO, hardware permission check), now extended with hardware logging/atomicity support.]
• Write-ahead logging – e.g., ARIES
• Extra IO operations – Wasted software latency/power – Wasted IO performance
• Multi-part atomic writes – Apps can create and commit write groups – Write data resides in a redo log – The application can read and change the log contents
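The write-group idea above can be sketched as follows: writes accumulate in a redo log that the application may still read and modify, and commit applies them as one atomic group. The API names and the map-as-storage model are hypothetical, not Moneta's interface.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Toy model of durable storage: offset -> bytes.
using Storage = std::map<uint64_t, std::vector<uint8_t>>;

class WriteGroup {
public:
    // Stage a write; nothing reaches storage yet.
    void log_write(uint64_t off, std::vector<uint8_t> data) {
        redo_[off] = std::move(data);
    }
    // The app can inspect and even edit the logged data before commit.
    std::vector<uint8_t>* read_log(uint64_t off) {
        auto it = redo_.find(off);
        return it == redo_.end() ? nullptr : &it->second;
    }
    // Apply the whole group; the real hardware would make this atomic.
    void commit(Storage& dev) {
        for (auto& [off, data] : redo_) dev[off] = data;
        redo_.clear();
    }
private:
    Storage redo_;
};
```

Compared with software write-ahead logging, the extra IO operations disappear: the log write is the only write, and commit is a device-side operation rather than a second pass over the data.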
18
Bandwidth Comparison
2 to 3.8X improvement
[Charts: sustained bandwidth (MB/s) vs. access size (0.5 KB to 512 KB). Left: normal writes vs. WAL writes. Right: normal writes, AtomicWrite, and WAL writes.]
19
ARIES – Write-Ahead Logging for Databases on Disks
• ARIES provides useful database features – Atomicity – Durability – High concurrency – Efficient recovery (we want to keep these)
• But it makes disk-centric design decisions – Undo and redo logs to avoid random synchronous writes and provide IO scheduling flexibility – Page-based storage management to match disk semantics (rethink these)
20
ARIES' Disk-Centric Design Decisions

Design decision | Advantages | Downsides
No-force | Eliminates synchronous random writes. | Requires synchronous writes to the redo log.
Steal | Longer sequential writes. | Requires an undo log.
Pages | Matches disk and OS page-based IO model. | Page granularity is not always ideal.
21
MARS: Rethinking ARIES for Moneta
• No-force → Force updates on commit, but offload the force to Moneta's memory controllers.
• Steal → Get rid of undo logging; update redo log contents instead.
• Pages → Software can choose object or page granularity.
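The log-traffic consequences of these decisions can be made concrete with a toy accounting model. It counts log writes only, under the simplifying assumption of one undo and one redo record per update; it models the policies, not either system's real implementation.

```cpp
// Per-transaction log traffic under each design.
struct LogTraffic { int undo_writes = 0; int redo_writes = 0; };

// ARIES with steal + no-force: every update needs an undo record (a
// dirty page may be stolen to disk early) and a redo record (commit
// does not force the data pages themselves).
LogTraffic aries_tx(int updates) {
    LogTraffic t;
    for (int i = 0; i < updates; ++i) {
        ++t.undo_writes;
        ++t.redo_writes;
    }
    return t;
}

// MARS: the editable redo log is the only log; commit forces it, and
// the memory controllers apply it in hardware, so no undo log exists.
LogTraffic mars_tx(int updates) {
    LogTraffic t;
    for (int i = 0; i < updates; ++i)
        ++t.redo_writes;
    return t;
}
```

Under this model MARS halves the log writes per transaction, and the remaining redo writes are cheap on an NVM that no longer penalizes random synchronous IO.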
22
ARIES vs. MARS
[Chart: 4 KB swap transactions per second vs. thread count (1, 2, 4, 8, 16) for MARS and ARIES.]
23
Software’s Role in Moneta
• The basic Moneta hardware is (relatively) simple
• Software requirements drove the HW enhancements
• Reduce, Remove, Redesign – Simplify the kernel IO stack – Eliminate the OS and FS overheads for most accesses – Rethink write-ahead logging for new memories
24
Two Case Studies
[Diagram: the roadmap repeated – NV-Heaps: NVMs over DDR; Moneta: SSDs for fast NVMs.]
25
Disks and Files Are Passé
[Diagram: applications → system calls → file system → block driver → DDR NVM]
DDR NVM is a new kind of disk – Use conventional file operations to access data – Files are bags of bytes
DDR NVM is a new kind of memory – Use VM to access data – Create classes – Use pointers/references – Leverage strong types – But make it persistent
26
The Dangers of Non-Volatile Programming
    Dlist Insert(Foo x, Dlist head) {
        Dlist r;
        r = new Dlist;
        r.data = x;
        r.next = head;
        r.prev = null;
        if (head) {
            head.prev = r;
        }
        return r;
    }
What if the system crashes?
What if x and head are in different files?
27
NV-Heaps: Safe, Fast, Persistent Objects
• Container for NV data structures – Generic – Self-contained
• Access via loads and stores – Memory-like performance – Familiar semantics
• Strong safety guarantees – ACID transactions – Pointer safety
[Diagram: an NV-heap with transaction handling] [ASPLOS 2011]
28
A Type System for NV-Heaps
1. Refine each type with a heap index
2. Type rules ensure that the heaps match
    Dlist<h> Insert(Foo<h> x, Dlist<h> head) {
        atomic<h> {
            Dlist<h> r;
            r = new Dlist in heapof(r);
            r.data = x;
            r.next = head;
            r.prev = null in heapof(r);
            if (head) {
                head.prev = r;
            }
        }
        return r;
    }
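The heap-index refinement can be approximated in plain C++ by making the heap a compile-time tag on every pointer type, so that linking nodes from different heaps fails to type-check. This is a simplified sketch of the idea only: the real NV-heap types also make these operations transactional and persistent.

```cpp
// Illustrative heap tags: each stands for a distinct NV-heap.
struct HeapA {}; struct HeapB {};

template <typename H>
struct Dlist {
    int data = 0;
    Dlist<H>* next = nullptr;   // intra-heap pointers carry the same tag H
    Dlist<H>* prev = nullptr;
};

// Both arguments must come from the same heap H, as in the slide's rule.
template <typename H>
Dlist<H>* insert(int x, Dlist<H>* head) {
    auto* r = new Dlist<H>;
    r->data = x;
    r->next = head;
    r->prev = nullptr;
    if (head) head->prev = r;
    return r;
}
// Assigning a Dlist<HeapB>* into a Dlist<HeapA> node's next field would
// not compile: no conversion exists between the two pointer types, which
// is exactly the cross-heap pointer the type rules are meant to forbid.
```

This catches at compile time the bug the previous slide asks about ("what if x and head are in different files?"), with no runtime check on the hot path.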
29
Application: A Persistent Key-Value Store
[Chart: operations per second for Berkeley DB on a RAM disk, NV-heaps, and volatile Memcached in DRAM; NV-heaps run 39x faster than Berkeley DB and only 8% slower than the volatile version.]
NV-heaps are much faster than the block-based systems. NV-heaps make the cost of persistence very low.
30
NV-Heaps Wrap Up
• The hardware is here already.
• NV-heaps can change how programmers think about persistent state. – Careful system design can avoid the pitfalls – And the performance gains are huge
31
Other Efforts
• PCIe SSDs – FusionIO et al. – Intel NVM Express prototypes [FAST '12] – Texas Memory Systems SSDs
• Atomic writes – Transactional Flash [OSDI '08] – FusionIO AtomicWrite [HPCA '11]
• NV-Heaps-related work – Slow but safe: Stasis [OSDI '06], orthogonally persistent Java [SIGMOD '96] – Fast but unsafe: Recoverable Virtual Memory [SOSP '93], Rio Vista [SOSP '97] – Fast and safe: Mnemosyne [ASPLOS '11]
• Many others…
33
How Can We Elegantly Enrich Storage Semantics?
• Atomic writes help, but… • Each application has its own needs • Legacy system architectures • Novel data structures
• Can we build flexible semantics for storage? • Like shaders in graphics cards?
34
How do NVMs affect the data center?
• How should we organize a petabyte of NVMs? – Centralized? Distributed? Hybrid?
• NVM performance with large-scale system overheads – Replication – Network latency – Thick software layers (e.g., OpenStack, HadoopFS)
35
Software Costs Will Remain High
[Chart: software's contribution to latency – 0.3%, 19.3%, 21.9%, 10.4%, 21.9% for the devices shown earlier, then 42.2%, 11.1%, and 11.8% as hardware interfaces improve.]
PCIe and DDR performance will continue to improve.
PCIe and DDR power efficiency will continue to improve.
Thick software stacks will push software costs higher.
Applications will demand richer, more sophisticated interfaces.
36
The App Gap Remains
[Chart: log-scale speedup vs. DISK-RAID for Moneta, FusionIO, SSD-RAID, and DISK-RAID on XDD 4KB RW, Btree, HashTable, Bio, and PTF.]
37
Solid State Storage In the Data Age
• Harnessing the data we collect requires vastly faster storage systems
• Fast, non-volatile memories can provide it, but…
• NVMs will force software's role to change – Software will become a drag on IO performance – Reduce, Remove, Redesign
• Leveraging NVMs quickly requires a coordinated research effort that spans system layers
• No aspect of the system is safe!