Evolutionary Technology, Revolutionary Implications
Persistent Memory
Pankaj Mehra, VP and Senior Fellow
Mar 31, 2017
©2016 Western Digital Corporation or affiliates. All rights reserved. Confidential.
SanDisk Confidential - Office of CTO
We always overestimate the change that will occur in the next two years and underestimate the change that will occur in the next ten.
Don't let yourself be lulled into inaction.
Bill Gates
During Persistent Memory's first go round, I made the first mistake and my company made the second.
Data Growth in Transactions & Analytics
ESS Technology
Other Major Sources of Data
Machine Learning and Video Analytics turn video data into a data warehouse (license plates, brands, cats too)
Logging, and not just transactions
TLOG (transaction log), ALOG (application log), ELOG (log everything)
The root of all data collection
TRANSACTION LOGGING
Business Critical Tx in Operational Data Stores
Paid transactions ($0.10/tx) → Free transactions** ($0)
**Blockchain (FSI, pharma, …) for Distributed Ledger
APPLICATION LOGGING
SIEM (ArcSight), Kissmetrics (SaaS), and Google Analytics spur a wave of app logging
5 EB in MSFT Cosmos!
LOG EVERYTHING
The user is the product
Every read becomes a write
PBs/day pour in from phones, fixed cameras, cars (GM), travelers, …
What is Persistent Memory?
Durable memory that is synchronously accessed but whose metadata are managed like file storage
• First public description: P. Mehra @ Ohio State University (Oct 10, ‘02)
• Mehra-Fineberg (‘04) showed that RDMA-attached persistent memory improves the performance, availability and scalability of OLTP
2004: Persistent Memory based Write Aside Buffer replicated byte-grain log writes in under 10 µsec
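The write-aside idea can be sketched in a few lines. This is a conceptual model, not the Tandem implementation: the class and record formats are illustrative assumptions; the point is that a log write is acknowledged as soon as it lands in a persistent buffer, and draining to slower storage happens off the commit path.

```python
from collections import deque

class WriteAsideBuffer:
    """Conceptual sketch (not the original firmware): byte-grain log
    writes are acknowledged once copied into a persistent buffer;
    movement to slower storage is asynchronous."""

    def __init__(self):
        self.persistent_buffer = deque()   # stands in for the remote PMU
        self.backing_store = []            # stands in for disk

    def append(self, record: bytes) -> None:
        # Synchronous part: one copy into persistent memory, then ack.
        self.persistent_buffer.append(record)

    def drain(self) -> int:
        # Asynchronous part: move buffered records to backing storage.
        n = 0
        while self.persistent_buffer:
            self.backing_store.append(self.persistent_buffer.popleft())
            n += 1
        return n

wab = WriteAsideBuffer()
wab.append(b"tx1: debit A 100")
wab.append(b"tx1: credit B 100")
assert wab.drain() == 2
```

The commit latency seen by the application is the `append` path alone, which is why sub-10-µs replicated log writes were achievable once the buffer itself was network-attached persistent memory.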
2005 @ HP TechCon, we questioned decades-old conventional wisdom around write-ahead logging
a.k.a. write-behind logging, recently rediscovered in PelotonDB by Arulraj and Pavlo [accepted to appear @ VLDB’17]
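A toy contrast makes the inversion concrete. This sketch is an assumption-laden simplification of write-behind logging, not PelotonDB's engine: with the table itself in persistent memory, updates go to the data first, and the log carries only a small commit record rather than before/after images.

```python
class WriteBehindTable:
    """Toy sketch of write-behind logging on NVM (illustrative, not
    PelotonDB's implementation): in-place persistent updates first,
    then a tiny commit mark, instead of redo/undo images."""

    def __init__(self):
        self.table = {}        # stands in for tuples in persistent memory
        self.commit_log = []   # write-behind log: just committed txn ids

    def execute(self, txn_id, writes):
        for key, value in writes.items():
            self.table[key] = value        # in-place persistent update first
        self.commit_log.append(txn_id)     # then the write-behind commit mark

    def committed(self):
        # Recovery needs no redo pass: committed data is already in
        # place; updates without a commit mark are the ones to discard.
        return set(self.commit_log)

db = WriteBehindTable()
db.execute("t1", {"A": 1, "B": 2})
assert db.table == {"A": 1, "B": 2}
assert db.committed() == {"t1"}
```

The payoff is byte-grain persistence doing the heavy lifting: log volume shrinks to commit metadata, and restart skips replay of data changes.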
... and because we knew that persistence and replication go hand-in-hand ...
Network                                              Read latency   Read bandwidth   Write latency   Write bandwidth
InfiniBand (4x, 12-port switch, HP rx5670, HP-UX)    14.7 µs        337 MB/sec       9.9 µs          337 MB/sec
ServerNet (NSK S86000 host, “Sequoia” PMU)           14.5 µs        26.5 MB/sec      14.2 µs         32.8 MB/sec
Communication-link-attached persistent memory device
US 20040148360 A1
ABSTRACT
A system is described that includes a network attached persistent memory unit. The system includes a processor node for initiating persistent memory operations (e.g., read/write). The processor unit references its address operations relative to a persistent memory virtual address space that corresponds to a persistent memory physical address space. A network interface is used to communicate with the persistent memory unit wherein the persistent memory unit has its own network interface. The processor node and the persistent memory unit communicate over a communication link such as a network (e.g., SAN). The persistent memory unit is configured to translate between the persistent memory virtual address space known to the processor nodes and a persistent memory physical address space known only to the persistent memory unit. In other embodiments, multiple address spaces are provided wherein the persistent memory unit provides translation from these spaces to a persistent memory physical address space.
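The abstract's translation scheme can be sketched as follows. The page-granular table, class name, and layout here are illustrative assumptions, not the patent's mechanism: the essential property is that clients address a persistent-memory virtual space while only the unit knows the physical one.

```python
# Sketch of the PMU's address translation (illustrative only): clients
# issue reads/writes against virtual addresses; the PMU maps them to a
# physical space it alone knows, here via a page-granular table.
PAGE = 4096

class PersistentMemoryUnit:
    def __init__(self):
        self.phys = bytearray(16 * PAGE)       # physical persistent memory
        self.page_table = {0: 3, 1: 7, 2: 1}   # virtual page -> physical page

    def translate(self, vaddr):
        vpage, off = divmod(vaddr, PAGE)
        return self.page_table[vpage] * PAGE + off

    def write(self, vaddr, data):
        p = self.translate(vaddr)
        self.phys[p:p + len(data)] = data

    def read(self, vaddr, n):
        p = self.translate(vaddr)
        return bytes(self.phys[p:p + n])

pmu = PersistentMemoryUnit()
pmu.write(PAGE + 16, b"log record")
assert pmu.read(PAGE + 16, 10) == b"log record"
assert pmu.translate(PAGE + 16) == 7 * PAGE + 16
```

Keeping translation inside the unit is what lets it remap or migrate physical pages without any client seeing an address change.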
Early Persistent Memory prototypes on Tandem NSK
•Lab prototype based on NonStop Sequoia I/O board
•Device firmware based on Fibre Channel HBA firmware
•Device attached to S-Series servers via MSEB or other ServerNet port (precursor to InfiniBand and RDMA)
Persistent Memory was 95% software!
• 3 Software Components
  – Access
    • Library supported privileged NSK processes (such as DP2)
    • Fast, synchronous writes and reads, direct to device
    • Implemented pointer chasing, mirroring, …
  – Manager
    • Presented a named volume but hid device location
    • Implemented using standard process pairs
  – Device
    • Relatively simple hardware, sometimes entirely special-purpose software
… even so, the access path was 95% hardware!
Network Persistent Memory Unit Operation
Things to remember, Things to ponder
• A persistent memory filesystem should support
  – Memory-like access
  – Storage-like management
• Offer an integrated solution to replication and persistence
  – That places persistent bits outside the fault zone of the last CPU that wrote them
• Exploit byte-grain persistence deeply
  – Write-behind logging
  – Image-based service replication
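The pairing of memory-like access with storage-like management can be demonstrated even on a conventional system. A minimal sketch, with the caveat that an ordinary temp file stands in for a DAX-mounted persistent-memory region (on real pmem the flush would not go through page-cache writeback):

```python
import mmap
import os
import tempfile

# Storage-like management: the region has a name and an allocated size.
path = os.path.join(tempfile.mkdtemp(), "pmem_region")
with open(path, "wb") as f:
    f.truncate(4096)

# Memory-like access: map it and use byte-grain stores, no write() syscall.
with open(path, "r+b") as f:
    region = mmap.mmap(f.fileno(), 4096)
    region[0:5] = b"hello"   # a plain store into the mapped bytes
    region.flush()           # make the update durable
    region.close()

# The same bytes are later discoverable through the storage namespace.
with open(path, "rb") as f:
    assert f.read(5) == b"hello"
```

The filesystem supplies naming, permissions, and space management; the mapping supplies load/store access, which is exactly the split the bullet list above calls for.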
• What’s new and truly different?
• What are the other significant developments since 2006?
• What will persistent memory do to the memory hierarchy this time around?
[Chart: Storage-Class Memory landscape, access time (sec) by technology. Storage: HDD, NAND. Volatile memory: DRAM, SRAM. Emerging non-volatile memories filling the SCM gap: STT-MRAM, PCM, embedded NVM, (SNDK) ReRAM, 3D XPoint, CBRAM (Micron).]
Emerging memories
STT-MRAM will beat SRAM on cost while providing competitive performance plus non-volatility, making it the memory of choice for IoT
3D XPoint will bifurcate into performance and cost optimized flavors
NAND will continue dropping in cost/bit, widening the gulf for SCM to fill
Source: Chris Petti, SanDisk
Memory-Storage Hierarchy
• Memory = Precious Resource
• But Huge Penalty leaving Memory
• Storage = Continuum of Memory Hierarchy
• Storage = Permanence & Capacity
• SCM (ReRAM) Changes Hierarchy
  – Permanent, Fast, Vast = 10s of TBytes per node
• Will be used as Memory and Storage
[Diagram: order-of-magnitude latency by tier. Processor cache ~10^1 ns; DRAM ~10^2 ns (“memory-like” load/store access, often a 64-byte cache line); SCM (ReRAM) ~10^3 ns; Flash ~10^5 ns; HDD ~10^7 ns (“storage-like”, block-based). The +10^4 ns penalty for leaving the memory hierarchy is a ~25,000-instruction gap.]
SW Stack: Promise of NV Main Memory
Freeing up CPU cycles for real work
[Diagram: I/O stack vs. NV main memory. Today (NAND, TBytes of ~100 µs media): each touch traverses OS → file → block → NVMe/VSL → hardware, costing ~25k instructions plus context swaps and cache pollution, ~8 µs of transaction overhead, bounded by I/O latency and IOPs. NV main memory (ReRAM, 10’s of TBytes of <<1 µs media): massive datasets are mmap’ed, and a touch costs 1 instruction, <<1 µs, with a clean cache. Industry simulations = ~50x work / server.]
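A rough model shows how per-touch savings compound into server-level gains. The touch counts and real-work time below are assumptions for illustration, not the inputs to the industry simulations the slide cites:

```python
# Rough model behind the ~50x claim (our parameters are assumptions):
# if each record touch costs ~8 us through the I/O stack versus ~0.1 us
# on NV main memory, throughput scales with inverse per-txn time.
touches_per_txn = 4       # assumption: storage touches per transaction
real_work_us = 0.5        # assumption: non-storage work per transaction

io_stack_txn_us = touches_per_txn * 8.0 + real_work_us   # 32.5 us
nvmm_txn_us = touches_per_txn * 0.1 + real_work_us       # 0.9 us

speedup = io_stack_txn_us / nvmm_txn_us
assert 30 < speedup < 60   # lands in the ~50x ballpark
```

The more storage-bound the workload, the closer the gain tracks the raw per-touch ratio; compute-heavy workloads see proportionally less.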
Persistent Memory: Assumptions & Projections
• Reduction in context switches
  – NVMe and other network learnings
  – Hardware-only access path to persistent memory
• Cost, amount of persistent memory
  – PCIe CBDRAM bufs → NVDIMM-F → NVDIMM-N → SCM and NVDIMM-P
• Remote persistent memory
  – RDMA PMUs (2002) → GenZ (2016)
• Other key developments
  – Atomic writes to flash (2014), in-place update engines (2015), explicit placement of data (2015+)
Scale    | Metric                               | Values, 2012 → 2021
2 msec   | Context switches                     | 7, 3, 2, 1, 0
log10 B  | Amount of persistent memory          | 3, 9, 9, 10, 12, 15
log10 B  | In-memory data                       | 10, 10, 11, 11, 11, 12, 13, 13, 14, 15
ns       | Append to persistent log             | 100,000; 15,000; 5,000; 300
ns       | Persist transactional data mutation  | 5,000,000; 1,000,000; 60
         | Atomic-persistent-write / log-append | ?, WAL, WAL, WAL, WAL, WBL, WBL, WBL, WBL, WBL, WBL
ns       | Append to persistent log [remote]    | 1,000,000; 19,000; 2,500; 900; 750; 400
         | Cost of persistent memory            | D+N, N, D+N, 0.5D, 0.35D
Standards Based (Remote) LD-ST Persistent Memory
© 2016 Gen-Z
• Common protocol for many PHYs, topologies
• Splits memory controller for pipelining
• Light headers (better than RDMA)
• Achieves 90% peak BW at 64B message size!
• Rich set of x-ISA memory ops (atomic, flush)
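The bandwidth claim implies a bound on header size. This is back-of-envelope framing arithmetic under our own simplifying assumption (efficiency = payload over payload plus overhead), not the actual Gen-Z wire format:

```python
# Why light headers matter: wire efficiency at a 64-byte payload is
# payload / (payload + overhead). Hitting 90% of peak bandwidth at
# 64 B implies roughly 7 bytes or less of per-message overhead.
PAYLOAD = 64

def efficiency(overhead_bytes):
    return PAYLOAD / (PAYLOAD + overhead_bytes)

assert efficiency(7) > 0.90    # light, Gen-Z-class headers clear the bar
assert efficiency(16) < 0.90   # heavier, RDMA-like framing falls short
```

Since cache-line-sized (64 B) messages dominate load/store traffic, small fixed overhead is the difference between memory-class and network-class efficiency.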
[Diagram: protocol and transport layers feeding a split memory controller, which drives a DRAM DIMM (DDR), an SCM DIMM, and an NVDIMM-P.]
NVDIMM-P Protocol Overview
NVDIMM-P is a new device class that shares the DDR bus with DRAM; non-determinism is enabled with new commands on the same bus.
• New command overlaid on DDR4/5 pins; the protocol is a separate command layer/spec
• Supports non-deterministic reads, OOO completion, etc.
• Data transferred per synchronous DRAM constraints so that DRAM and NVM can share the bus
• Intel DDR-T (to our best knowledge) is also an add-on protocol, like NVDIMM-P, but proprietary and integrated w/ the Intel CPU/MC
Source: Dave Landsman and WD Standards Team
What Databases are Doing
Content credit: Arulraj VLDB paper
• Existing DB: NVM-aware engine
• Transactions: NVM-aware DB, WBL
• Analytics: flatter memory hierarchy
What Applications are Doing
Content credit: McGuffy SPAA paper
Reduced-write, low-depth, work-efficient parallel algorithms for NVM
[Diagram: machine model for asymmetric NVM. A CPU with a small memory (read cost 1, write cost 1) and a large memory (read cost 1, write cost ω).]
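The asymmetric model rewards trading writes for reads. A toy sketch of the accounting (the cost ratio and operation counts are assumptions for illustration, not the paper's analysis):

```python
# Toy version of the asymmetric large-memory model: reads cost 1,
# writes cost w. Compare a materializing strategy against a
# recomputing, reduced-write strategy.
def cost(reads, writes, w):
    return reads + w * writes

w = 10           # assumption: an NVM write costs ~10x a read
n = 1_000_000    # problem size

# Strategy A: materialize an intermediate array (extra n writes).
a = cost(reads=2 * n, writes=2 * n, w=w)
# Strategy B: recompute on the fly, trading extra reads for fewer writes.
b = cost(reads=3 * n, writes=n, w=w)

assert b < a   # fewer writes wins once w is large enough
```

This is the design pressure behind "reduced-write, work-efficient" algorithms: classical analyses that count reads and writes equally pick the wrong strategy when ω is large.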
What compilers still do
21
• Lifting of loads is still extremely helpful
• Also helps vectorization / SIMD
• But pushing out stores?
• What about other compiler optimization strategies in the face of
  – Asymmetric write costs
  – Long latency of reads
• These things make a bigger difference to code (1000x, for instance) than many things we do here
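Both transformations can be shown hand-applied. The counters below are our own stand-ins for LD/ST operations, and the example is a sketch of the idea rather than what any particular compiler emits:

```python
# Hand-applied versions of two transformations: hoisting a loop-
# invariant load, and sinking/coalescing stores so persistent memory
# sees one write instead of n.
def scale_naive(dst, src, factor_cell):
    loads = stores = 0
    for i in range(len(src)):
        f = factor_cell[0]; loads += 1   # factor reloaded every iteration
        dst[0] = dst[0] + src[i] * f     # read-modify-write each step
        loads += 1; stores += 1          # src load + NVM store per iteration
    return loads, stores

def scale_optimized(dst, src, factor_cell):
    loads = stores = 0
    f = factor_cell[0]; loads += 1       # load hoisted out of the loop
    acc = 0                              # accumulate in a register/cache
    for i in range(len(src)):
        acc += src[i] * f; loads += 1
    dst[0] = acc; stores += 1            # store sunk: one NVM write total
    return loads, stores

src, cell = list(range(100)), [3]
l1, s1 = scale_naive([0], src, cell)
l2, s2 = scale_optimized([0], src, cell)
assert s1 == 100 and s2 == 1 and l2 < l1
```

With asymmetric write costs, the store-sinking half, which compilers rarely prioritize today, becomes as valuable as the familiar load hoisting.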
Persistent Memory: Assumptions and Projections
(Category / Where we are / Assumptions, “how we got here” / Projections, “where are we going?”)

File Systems
• Where we are: Pmem.io; named extents
• How we got here: NVMFS/ACM support
• Projections: Scale-out filesystems with persistent named objects that can be discovered and memory-mapped very efficiently

Memory Managers
• Where we are: Persistent pointers; pipelined fences, writebacks
• How we got here: Mostly mapping and checkpointing bolted on to conventional volatile memory management and traditional file systems
• Projections: In-place update semantic displaces I/O semantic in applications, not just in databases; judicious placement of data structures and fields between DRAM and SCM gives way to better optimizers

End-to-end low-latency and high-throughput paths (local)
• Where we are: DDR3/4 NVDIMM-N
• How we got here: NVDIMM-F (memory channel storage); PCIe BBDRAM, CBDRAM (I/O channel memory)
• Projections: DDR4 NVDIMM-P; DDR5 NVDIMM-P

End-to-end low-latency and high-throughput paths (remote)
• Where we are: RDMA to NVDIMM or PCIe BARs
• How we got here: NPMU/PMM
• Projections: GenZ, PMOF; Memory Nodes
SCM disrupts high-perf NAND and big-memory DRAM across enterprise segments

DRAM
• Hyperscale server: Containers; in-memory database; active data (GB tier)
• Enterprise server: VMs; in-memory database; active data (GB tier)
• Enterprise storage: I/O buffers; recovery buffers

Storage-Class Memory
• Hyperscale server: In-memory pub-sub; indexes; distributed logging; real-time analytics (TB tier)
• Enterprise server: Mmap’ed files (TB tier); swap/backing store; tmp data (VDI, analytics); transaction logging
• Enterprise storage: Fabric buffers; metadata logs; hot data

NAND
• Hyperscale server: Data warehouse; data lakes (PB tier); batch analytics
• Enterprise server: Indexes; filesystems; object store
• Enterprise storage: Staging/buffering; metadata & index

HDDs
• Hyperscale server: Object storage; cold tier (EB tier)
• Enterprise server: PB (copy data) tier
• Enterprise storage: PB (archival) tier
Converged Memory-Storage Markets (2021)

Compute Tier
• HPC: High-DWPD SSDs for burst buffer (~5 PB/20K)
• Hyperscale server: In-memory computing & in-memory caching (n TB SCM)
• Enterprise server: IMDB for analytics, SDM (TBs of SCM)
• Enterprise storage, converged: SDS$ (TBs high-DWPD SSD); RAID$, WB in SCM (SW optimization) (TB)

Archive Tier
• HPC: HDD → capacity flash (EB)
• Hyperscale server: HDD →? capacity flash (m EB)
• Enterprise server: Active backup, CDM (n PB of 90-10 HDD-SSD)
• Enterprise storage, converged: High-capacity SSDs (PB AFAs and HCI)
Memory Centric Computing
Shipping computation to the data
Works best when simple expressions are computed against a large number of data records
Power: reduction in data-movement count and distance
Performance: parallelism, bandwidth, and latency
Cost: low gate-count embedded cores with future open ISA and tools
[Diagram: CPU with near memory; far memory paired with far compute; data-near compute at the device.]
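Shipping computation to the data can be sketched as predicate pushdown. Everything here is an assumed stand-in: the record schema, the threshold, and the in-process function playing the role of a device-side filter:

```python
# Sketch of data-near compute: instead of pulling every record across
# the fabric and filtering at the CPU, send a simple predicate to run
# next to the memory and return only the matches.
records = [{"plate": f"P{i:04d}", "speed": 40 + (i % 70)}
           for i in range(10_000)]

def near_data_filter(data, predicate):
    # Runs "at the device": only matching records cross the link.
    return [r for r in data if predicate(r)]

matches = near_data_filter(records, lambda r: r["speed"] > 105)

# Data moved is proportional to matches, not to the full dataset.
assert 0 < len(matches) < len(records) // 10
```

The win tracks selectivity: the fewer records that satisfy the simple expression, the more movement (and power) the pushdown saves, which is exactly the regime the bullet above describes.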
Will memory disaggregate?
Needs efficient memory-semantic fabrics and a major software shift