Moving to PCI Express* Based Solid-State Drive with NVM Express
Jack Zhang Sr. SSD Application Engineer, Intel Corporation
SSDS002
2
Agenda
• Why PCI Express* (PCIe) for SSDs? – PCIe SSD in Client – PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs? – Overview NVMe – Driver ecosystem update – NVMe technology developments
• Deploying PCIe SSD with NVMe
4
More than ten exabytes of NAND-based compute SSDs shipped in 2013
Solid-State Drive Market Growth
[chart: SSD Capacity Growth by Market Segment (PB/MGB), 2011-2017, Client and Enterprise]
Source: Forward Insights Q4'13
5
PCI Express* Bandwidth
PCI Express* (PCIe) provides a scalable, high bandwidth interconnect,
unleashing SSD performance possibilities
Source: www.pcisig.com, www.sata-io.org www.usb.org
7
SSD Technology Evolution
[diagram: storage stack on the motherboard (file system, software, queue, SAS/SATA translation) compared with a native NVMe path over PCIe]
PCI Express* (PCIe) removes controller latency; NVM Express (NVMe) reduces software latency
8
Enterprise SSD Interface Trends
PCI Express* Interface SSD Grows Faster
PCI Express* SSD starts ramping this year
Source: Forward Insights*
9
Why PCI Express* for SSDs?
Added PCI Express* SSD Benefits
• Even better performance
• Increased Data Center CPU I/O: 40 PCI Express lanes per CPU
• Even lower latency
• No external IOC means lower power (~10W) & cost (~$15)
10
Agenda
• Why PCI Express* (PCIe) for SSDs? – PCIe SSD in Client – PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs? – Overview NVMe – Driver ecosystem update – NVMe technology developments
• Deploying PCIe SSD with NVMe
11
Client PCI Express* SSD Considerations
• Form factors?
• Attach to CPU or PCH?
• PCI Express* x2 or x4?
• Path to NVM Express?
• What about battery life?
• Thermal concerns?
Trending well, but hurdles remain
12
Card-based PCI Express* SSD Options
M.2 Socket 2
M.2 Socket 3
SATA Yes, Shared Yes, Shared
PCIe x2 PCIe x4 No Yes Comms Support? Yes No
Ref Clock Required Required Max “Up to” Performance 2 GB/s 4 GB/s
Bottom Line Flexibility Performance
Host Socket 2 Host Socket 3
Device w/ B&M Slots
22x80mm DS recommended for capacity 22x42mm SS recommended for size & weight
M.2 defines: single or double sided SSDs in 5 lengths, and
2 SSD host sockets
13
Card-based PCI Express* SSD Options
M.2 Socket 2
M.2 Socket 3
SATA Yes, Shared Yes, Shared
PCIe x2 PCIe x4 No Yes Comms Support? Yes No
Ref Clock Required Required Max “Up to” Performance 2 GB/s 4 GB/s
Bottom Line Flexibility Performance
Host Socket 2 Host Socket 3
Device w/ B&M Slots
22x80mm DS recommended for capacity 22x42mm SS recommended for size & weight
M.2 defines: single or double sided SSDs in 5 lengths, and
2 SSD host sockets
Industry alignment for M.2 length will lower costs and accelerate transitions
14
PCI Express* SSD Connector Options

                           SATA Express*        SFF-8639
SATA*                      Yes                  Yes
PCIe                       x2                   x2 or x4
Host mux                   Yes                  No
Ref clock                  Optional             Required
EMI                        SRIS                 Shielding
Height                     7mm                  15mm
Max "up to" performance    2 GB/s               4 GB/s
Bottom line                Flexibility & cost   Performance

SATA Express*: flexibility for HDD
SFF-8639: best performance
Separate Refclk Independent SSC (SRIS) removes clocks from cables, reducing emissions & costs of shielding
Alignments on connectors for PCI Express* SSDs will lower costs and accelerate transitions
Use an M.2 interface without cables for x4 PCI Express* performance, and lower cost
16
Many Options to Connect PCI Express* SSDs
• SSD can attach to Processor (Gen 3.0) or Chipset (Gen 2.0 today, Gen 3.0 in future)
• SSD uses PCIe x1, x2 or x4
• Driver interface can be AHCI or NVM Express
Chipset-attached PCI Express* Gen 2.0 x2 SSDs provide ~2x SATA 6Gbps performance today
PCI Express* Gen 3.0 x4 SSDs with NVM Express provide even better SSD performance tomorrow
21
Intel® Rapid Storage Technology (Intel® RST) 13.x
Intel® RST driver support for PCI Express* storage coming in 2014
PCI Express* storage + the Intel® RST driver delivers power, performance and responsiveness across innovative form factors in 2014 platforms: detachables, convertibles, all-in-ones, mainstream & performance
22
Client SATA* vs. PCI Express* SSD Power Management

Activity                    Device State     SATA/AHCI State   SATA I/O Ready   Power Example   PCIe Link State   Time to Register Read   PCIe I/O Ready
Active                      D0/D1/D2         Active            NA               ~500mW          L0                NA                      ~60 µs
Light Active                D0/D1/D2         Partial           10 µs            ~450mW          L1.2              < 150 µs                ~5ms
Idle                        D0/D1/D2         Slumber           10 ms            ~350mW          L1.2              < 150 µs                ~5ms
Pervasive Idle / Lid down   D3_hot           DevSlp            50 - 200 ms      ~15mW           L1.2 (~5mW)       < 500 µs                ~100ms
Off                         D3_cold / RTD3   off               < 1 s            0W              L3                ~100ms                  ~300 ms

Autonomous transition (PCIe link states)
D3_cold/off, L1.2, autonomous transitions & two-step resume improve PCI Express* SSD battery life
23
Client PCI Express* (PCIe) SSD Peak Power Challenges
• Max Power: 100% Sequential Writes
• SATA*: ~3.5W @ ~400MB/s
• x2 PCIe 2.0: up to 2x (7W)
• x4 PCIe 3.0: up to ~15W2
[chart: SATA 128K sequential write power (watts), compressible data, QD=32, for five drives and their average, showing average and max power1]
1. Data collected using Agilent* DC Power Analyzer N6705B. System configuration: Intel® Core™ i7-3960X (15MB L3 Cache, 3.3GHz) on Intel Desktop Board DX79SI, AMD* Radeon HD 6990 and driver 8.881.0.0, BIOS SIX791OJ.86A.0193.2011.0809.1137, Intel INF 9.1.2.1007, Memory 16GB (4X4GB) Triple-channel Samsung DDR3-1600, Microsoft* Windows* 7 MSAHCI storage driver, Microsoft Windows 7 Ultimate 64-bit Build 7600 with SP1, Various SSDs. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. For more information go to http://www.intel.com/performance 2. M.2 Socket 3 has nine 3.3V supply pins, each capable of 0.5A for a total power capability of 14.85W
Attention needed for power supply, thermals, and benchmarking
Source: Intel
[diagram: M.2 SSD mounted on the motherboard with thermal interface material]
24
Client PCI Express* SSD Accelerators
• The client ecosystem is ready: Implement PCI Express* SSDs now!
• Use 42mm & 80mm length M.2 for client PCIe SSD
• Implement L1.2 and extend RTD3 software support for optimal battery life
• Use careful power supply & thermal design
• High performance desktop and workstations can consider SFF-8639 data center SSDs for PCI Express* x4 performance today
Drive PCI Express* client adoption with specification alignment and careful design
25
Agenda
• Why PCI Express* (PCIe) for SSDs? – PCIe SSD in Client – PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs? – Overview NVMe – Driver ecosystem update – NVMe technology developments
• Deploying PCIe SSD with NVMe
26
2.5” Enterprise SFF-8639 PCI Express* SSDs
The path to mainstream: innovators begin shipping 2.5” enterprise PCI Express* SSDs!
Image sources: Samsung*, Micron*, and Dell*
27
Datacenter PCI Express* SSD Considerations
• Form factor?
• Implementation options?
• Hot plug or remove?
• Traditional RAID?
• Thermal/peak power?
• Management?
Developments are on the way
28
PCI Express* Enterprise SSD Form Factor
• SFF-8639 supports 4 pluggable device types
• Host slots can be designed to accept more than one type of device
• Use PRSNT#, IfDet#, and DualPortEn# pins for device Presence Detect and device type decoding
SFF-8639 enables multi-capable hosts
29
SFF-8639 Connection Topologies
• Interconnect standards currently in process
• 2 & 3 connector designs
• "Beyond the scope of this specification" is a common phrase for standards currently in development
Source: "PCI Express SFF-8639 Module Specification", Rev. 0.3
Meeting PCI Express* 3.0 jitter budgets for 3-connector designs is non-trivial. Consider active signal conditioning to accelerate adoption.
30
Solution Example – 5 Connectors
PCI Express* (PCIe) signal retimers & switches are available from multiple sources
Images: Dell* Poweredge* R720* PCIe drive interconnect. Contact PLX* or IDT* for more information on retimers or switches
[diagram: 5-connector PCIe drive interconnect path using a retimer or switch]
Active signal conditioning enables SFF-8639 solutions with more connectors
31
Hot-Plug Use Cases
• Hot Add & Remove are software managed events
• During boot, the system must prepare for hot-plug: – Configure PCI Express* Slot Capability registers – Enable and register for hot plug events to higher level
storage software (e.g., RAID or tiering software) – Pre-allocate slot resources (Bus IDs, interrupts, memory
regions) using ACPI* tables
Existing BIOS and Windows*/Linux* OS are prepared to support PCI Express* Hot-Plug today
32
Surprise Hot-Remove
• Random device failure or operator error can result in surprise removal during I/O
• Storage controller driver and the software stack are required to be robust for such cases
• Storage controller driver must check for Master Abort
  – On all reads to the device, the driver checks the register for FFFF_FFFFh
  – If data is FFFF_FFFFh, then the driver reads another register expected to have a value that includes zeroes to verify the device is still present
• Time order of removal notification is unknown (e.g., storage controller driver via Master Abort, PCI bus driver via Presence Change interrupt, or RAID software may signal removal first)
Surprise Hot-Remove requires careful software design
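To make the Master Abort check above concrete, here is a minimal C sketch of the pattern a storage controller driver might use; mmio_read32() and the register offsets are illustrative assumptions, not taken from any specific driver.

/*
 * Sketch of the surprise hot-remove check described above.
 * mmio_read32() and the register offsets are placeholders; a real driver
 * would use its own MMIO accessors and register map.
 */
#include <stdbool.h>
#include <stdint.h>

extern uint32_t mmio_read32(volatile void *bar, uint32_t offset); /* driver's 32-bit MMIO read */

#define REG_OF_INTEREST  0x1000u  /* hypothetical register being read */
#define REG_KNOWN_ZEROES 0x0000u  /* hypothetical register guaranteed to contain zero bits */

/* Returns true if the device still appears to be present. */
static bool device_still_present(volatile void *bar, uint32_t *value)
{
    *value = mmio_read32(bar, REG_OF_INTEREST);

    /* A Master Abort on a PCIe read returns all-ones data. */
    if (*value != 0xFFFFFFFFu)
        return true;

    /*
     * The data may legitimately be all ones, so read a second register that
     * is expected to contain zero bits; if that also reads all ones, the
     * device has been surprise-removed.
     */
    return mmio_read32(bar, REG_KNOWN_ZEROES) != 0xFFFFFFFFu;
}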
33
RAID for PCI Express* SSDs?
• Software RAID is a hardware-redundancy solution to enable Highly Available (HA) systems today with PCI Express* (PCIe) SSDs
• Multiple copies of application images (redundant resources)
• Open cloud infrastructure that supports data redundancy with software implementations, such as Ceph* object storage
[diagram: storage pool with data striped across Row A and Row B, and data replicated]
Hardware RAID for PCIe SSDs is still under development
34
Data Center PCI Express* (PCIe) SSD Peak Power Challenges
• Max Power: 100% Sequential Writes
• Larger capacities have higher concurrency and consume the most power (up to 25W!2)
• Power varies >40% depending on capacity and workload
• Consider UL touch safety standards when planning airflow designs or slot power limits3
1. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. For more information go to http://www.intel.com/performance 2. PCI Express* “Enterprise SSD Form Factor” specification requires 2.5” SSD maximum continuous power of <25W 3. See PCI Express* Base Specification, Revision 3.0, Section 6.9 for more details on Slot Power Limit Control
Attention needed for power supply, thermals, and SAFETY
Source: Intel
[chart: modeled PCI Express* SSD power1 (W) vs. capacity (large, small) for 100% sequential write, 50/50 sequential read/write, 70/30 sequential read/write, and 100% sequential read workloads]
35
PCI Express* SSDs Enclosure Management
• The SSD Form Factor Specification (www.ssdformfactor.org) defines hot-plug indicator uses and Out-of-Band management
• The PCI Express* Base Specification Rev. 3.0 defines enclosure indicators and registers intended for Hot-Plug management support (registers: Device Capabilities, Slot Capabilities, Slot Control, Slot Status)
• SFF-8485 standard defines SGPIO enclosure management interface
Standardize PCI Express* SSD enclosure management
36
Data Center PCI Express*(PCIe) SSD Accelerators
• The data center ecosystem is capable: Implement PCI Express* SSDs now!
• Prove out system implementations by designing in 2.5” PCIe SSDs
• Understand Hot-Plug capabilities of your device, system and OS
• Design thermal solutions with safety in mind
• Collaborate on PCI Express SSD enclosure management standards
Drive PCI Express* data center adoption through education, collaboration, and careful software design
37
Agenda
• Why PCI Express* (PCIe) for SSDs? – PCIe SSD in Client – PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs? – Overview NVMe – Driver ecosystem update – NVMe technology developments
• Deploying PCIe SSD with NVMe
38
PCI Express* for Data Center/Enterprise SSDs
• PCI Express* (PCIe) is a great interface for SSDs
  – Stunning performance: 1 GB/s per lane (PCIe Gen3 x1)
  – With PCIe scalability: 8 GB/s per device (PCIe Gen3 x8) or more
  – Lower latency: Platform+Adapter from 10 µsec down to 3 µsec
  – Lower power: no external SAS IOC saves 7-10 W
  – Lower cost: no external SAS IOC saves ~$15
  – PCIe lanes off the CPU: 40 Gen3 (80 in dual socket)
• HOWEVER, there is NO standard driver
[logos: Fusion-io*, Micron*, LSI*, Virident*, Marvell*, Intel, OCZ*]
PCIe SSDs are emerging in Data Center/Enterprise, co-existing with SAS & SATA depending on application
39
Next Generation NVM Technology
Family and defining switching characteristics:
• Phase Change Memory: energy (heat) converts material between crystalline (conductive) and amorphous (resistive) phases
• Magnetic Tunnel Junction (MTJ): switching of a magnetic resistive layer by spin-polarized electrons
• Electrochemical Cells (ECM): formation / dissolution of a "nano-bridge" by electrochemistry
• Binary Oxide Filament Cells: reversible filament formation by oxidation-reduction
• Interfacial Switching: oxygen vacancy drift/diffusion induced barrier modulation
Scalable Resistive Memory Element: Resistive RAM NVM options
[diagram: cross-point array in backend layers, ~4λ² cell, with wordlines, memory element, and selector device]
Many candidate next generation NVM technologies offer a ~1000x speed-up over NAND.
40
Fully Exploiting Next Generation NVM
• With Next Generation NVM, the NVM is no longer the bottleneck
  – Need optimized platform storage interconnect
  – Need optimized software storage access methods
NVM Express is the interface architected for NAND today and next generation NVM
41
Agenda
• Why PCI Express* (PCIe) for SSDs? – PCIe SSD in Client – PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs? – Overview NVMe – Driver ecosystem update – NVMe technology developments
• Deploying PCIe SSD with NVMe
42
Technical Basics
• All parameters for a 4KB command in a single 64B command
• Supports deep queues (64K commands per queue, up to 64K queues)
• Supports MSI-X and interrupt steering
• Streamlined & simple command set (13 required commands)
• Optional features to address target segment (Client, Enterprise, etc.)
  – Enterprise: End-to-end data protection, reservations, etc.
  – Client: Autonomous power state transitions, etc.
• Designed to scale for next generation NVM, agnostic to NVM type used
http://www.nvmexpress.org/
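As a concrete illustration of the single 64B command mentioned above, here is a sketch of the submission queue entry layout as I read it from the NVMe 1.1 specification; the struct and field names are illustrative shorthand, not taken from any particular driver.

#include <assert.h>
#include <stdint.h>

/* 64-byte NVMe submission queue entry (field names are illustrative). */
struct nvme_sqe {
    uint8_t  opcode;        /* CDW0: command opcode                        */
    uint8_t  flags;         /* CDW0: fused operation, PRP/SGL selector     */
    uint16_t command_id;    /* CDW0: ID used to match the completion       */
    uint32_t nsid;          /* namespace identifier                        */
    uint64_t reserved;
    uint64_t metadata_ptr;  /* MPTR                                        */
    uint64_t prp1;          /* DPTR: first PRP entry (or SGL segment)      */
    uint64_t prp2;          /* DPTR: second PRP entry                      */
    uint32_t cdw10;         /* command-specific dwords 10-15; for          */
    uint32_t cdw11;         /* Read/Write they carry the starting LBA      */
    uint32_t cdw12;         /* and the number of logical blocks            */
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
};

/* All parameters of a 4KB read or write fit in this one 64-byte command. */
static_assert(sizeof(struct nvme_sqe) == 64, "SQE must be 64 bytes");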
43
Queuing Interface Command Submission & Processing
[diagram: submission and completion queues in host memory, NVMe controller submission queue tail doorbell and completion queue head doorbell, head/tail pointers, steps 1-8: queue command, ring doorbell (new tail), fetch command, process command, queue completion, generate interrupt, process completion, ring doorbell (new head)]
Command Submission
1. Host writes command to Submission Queue
2. Host writes updated Submission Queue tail pointer to doorbell
Command Processing
3. Controller fetches command
4. Controller processes command
44
Queuing Interface Command Completion
[diagram: same queue and doorbell structure as the previous slide]
Command Completion
5. Controller writes completion to Completion Queue
6. Controller generates MSI-X interrupt
7. Host processes completion
8. Host writes updated Completion Queue head pointer to doorbell
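The eight steps above can be sketched in a few lines of C. The structures, the doorbell pointers, and the fixed queue depth here are illustrative assumptions; a real driver maps the doorbell registers from BAR0 using the stride reported by the controller and adds memory barriers before each doorbell write.

#include <stdint.h>

struct nvme_sqe { uint32_t cdw[16]; };           /* 64B command (see earlier sketch)  */
struct nvme_cqe { uint32_t dw0, dw1;             /* 16B completion entry              */
                  uint16_t sq_head, sq_id, cid, status; };

struct queue_pair {
    struct nvme_sqe          *sq;                /* submission queue in host memory   */
    volatile struct nvme_cqe *cq;                /* completion queue in host memory   */
    volatile uint32_t *sq_tail_db;               /* controller's SQ tail doorbell     */
    volatile uint32_t *cq_head_db;               /* controller's CQ head doorbell     */
    uint16_t sq_tail, cq_head, depth;
    uint8_t  cq_phase;                           /* expected phase bit, starts at 1   */
};

/* Steps 1-2: queue the command, then ring the submission queue tail doorbell. */
static void submit(struct queue_pair *q, const struct nvme_sqe *cmd)
{
    q->sq[q->sq_tail] = *cmd;                    /* 1. host writes command            */
    q->sq_tail = (uint16_t)((q->sq_tail + 1) % q->depth);
    *q->sq_tail_db = q->sq_tail;                 /* 2. host rings doorbell (new tail) */
}

/* Steps 7-8: consume one completion (typically after the MSI-X interrupt, step 6). */
static int reap(struct queue_pair *q)
{
    volatile struct nvme_cqe *cqe = &q->cq[q->cq_head];

    if ((cqe->status & 1) != q->cq_phase)        /* phase bit says: nothing new yet   */
        return -1;

    int cid = cqe->cid;                          /* 7. host processes the completion  */
    q->cq_head = (uint16_t)((q->cq_head + 1) % q->depth);
    if (q->cq_head == 0)
        q->cq_phase ^= 1;                        /* phase flips on queue wrap-around  */
    *q->cq_head_db = q->cq_head;                 /* 8. host rings doorbell (new head) */
    return cid;
}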
45
Simple Command Set – Optimized for NVM
Admin Commands
Create I/O Submission Queue
Delete I/O Submission Queue
Create I/O Completion Queue
Delete I/O Completion Queue
Get Log Page
Identify
Abort
Set Features
Get Features
Asynchronous Event Request
Firmware Activate (optional)
Firmware Image Download (optional)
Format NVM (optional)
Security Send (optional)
Security Receive (optional)

NVM I/O Commands
Read
Write
Flush
Write Uncorrectable (optional)
Compare (optional)
Dataset Management (optional)
Write Zeros (optional)
Reservation Register (optional)
Reservation Report (optional)
Reservation Acquire (optional)
Reservation Release (optional)
Only 10 Admin and 3 I/O commands required
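For reference, a sketch of those required opcodes with the values the NVMe 1.1 specification assigns to them (the enum names are illustrative):

/* Required Admin and NVM I/O opcodes per NVMe 1.1 (names are illustrative). */
enum nvme_admin_opcode {
    NVME_ADMIN_DELETE_SQ    = 0x00,
    NVME_ADMIN_CREATE_SQ    = 0x01,
    NVME_ADMIN_GET_LOG_PAGE = 0x02,
    NVME_ADMIN_DELETE_CQ    = 0x04,
    NVME_ADMIN_CREATE_CQ    = 0x05,
    NVME_ADMIN_IDENTIFY     = 0x06,
    NVME_ADMIN_ABORT        = 0x08,
    NVME_ADMIN_SET_FEATURES = 0x09,
    NVME_ADMIN_GET_FEATURES = 0x0A,
    NVME_ADMIN_ASYNC_EVENT  = 0x0C,
};

enum nvme_io_opcode {
    NVME_IO_FLUSH = 0x00,
    NVME_IO_WRITE = 0x01,
    NVME_IO_READ  = 0x02,
};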
46
Agenda
• Why PCI Express* (PCIe) for SSDs? – PCIe SSD in Client – PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs? – Overview NVMe – Driver ecosystem update – NVMe technology developments
• Deploying PCIe SSD with NVMe
47
Driver Development on Major OSes
• Windows*: Windows* 8.1 and Windows Server* 2012 R2 include a native driver; open source driver in collaboration with OFA
• Linux*: stable OS driver since Linux* kernel 3.10
• Unix: FreeBSD driver upstream
• Solaris*: Solaris driver will ship in S12
• VMware*: vmklinux driver certified release in 1H 2014
• UEFI: open source driver available on SourceForge
Native OS drivers already available, with more coming!
48
Windows* Open Source Driver Update
Release 1, Q2 2012:
• 64-bit support on Windows* 7 and Windows Server 2008 R2
• Mandatory features
Release 1.1, Q4 2012:
• Added 64-bit support for Windows 8
• Public IOCTLs and Windows 8 Storport updates
Release 1.2, Aug 2013:
• Added 64-bit support on Windows Server 2012
• Signed executable drivers
Release 1.3, March 2014:
• Hibernation on boot drive
• NUMA group support in core enumeration
Release 1.4, Oct 2014:
• WHQL certification
• Drive Trace feature, WVI command processing
• Migrate to VS2013, WDK 8.1
Four major open source releases since 2012. Contributors include Huawei*, PMC-Sierra*, Intel, LSI* & SanDisk*
https://www.openfabrics.org/resources/developer-tools/nvme-windows-development.html
49
Linux* Driver Update
Recent Features
• Stable since Linux* 3.10; latest driver in 3.14
• Surprise hot plug/remove
• Dynamic partitioning
• Deallocate (i.e., Trim support)
• 4KB sector support (in addition to 512B)
• MSI support (in addition to MSI-X)
• Disk I/O statistics
Linux OS distributors' support
• RHEL 6.5 and Ubuntu 13.10 have native drivers
• RHEL 7.0, Ubuntu 14.04 LTS and SLES 12 will have the latest native drivers
• SuSE is testing an external driver package for SLES11 SP3
Works in progress: power management, end-to-end data protection, sysfs manageability & NUMA
/dev/nvme0n1
50
FreeBSD Driver Update
• NVM Express* (NVMe) support is upstream in the head and stable/9 branches
• FreeBSD 9.2 released in September is the first official release with NVMe support
FreeBSD NVMe Modules
• nvme: core NVMe driver
• nvd: NVMe/block layer shim
• nvmecontrol: user space utility, including firmware update
51
Solaris* Driver Update
• Current status from the Oracle* team
  – Fully compliant with the 1.0e spec
  – Direct block interfaces bypassing the complex SCSI code path
  – NUMA-optimized queue/interrupt allocation
  – Supports x86 and SPARC platforms
  – A command line tool to monitor and configure the controller
  – Delivered to S12 and S11 Update 2
• Future development plans
  – Boot & install on SPARC and x86
  – Surprise removal support
  – Shared hosts and multi-pathing
52
VMware Driver Update
• vmklinux-based driver development is complete
  – First release in mid-Oct 2013
  – Public release will be 1H 2014
• A native VMware* NVMe driver is available for end user evaluations
• VMware’s I/O Vendor Partner Program (IOVP) offers members a comprehensive set of tools, resources and processes needed to develop, certify and release software modules, including device drivers and utility libraries for VMware ESXi
53
UEFI Driver Update
• The UEFI 2.4 specification available at www.UEFI.org contains updates for NVM Express* (NVMe)
• An open source version of an NVMe driver for UEFI is available at nvmexpress.org/resources
“AMI is working with vendors of NVMe devices and plans for full BIOS support of NVMe in 2014.”
Sandip Datta Roy, VP BIOS R&D, AMI
NVMe boot support with UEFI will start percolating into releases from Independent BIOS Vendors in 2014
54
Agenda
• Why PCI Express* (PCIe) for SSDs? – PCIe SSD in Client – PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs? – Overview NVMe – Driver ecosystem update – NVMe technology developments
• Deploying PCIe SSD with NVMe
55
NVM Express Organization: Architected for Performance
NVMe Promoters (“Board of Directors”)
Technical Workgroup
• Queueing interface, Admin command set, NVMe I/O command set, driver-based management
• Current spec version: NVMe 1.1
Management Interface Workgroup
• In-Band (PCIe) and Out-of-Band (SMBus) PCIe SSD management
• First specification will be Q3 2014
56
NVM Express 1.1 Overview
• The NVM Express 1.1 specification, published in October of 2012, adds additional optional client and Enterprise features
Power Optimizations
• Autonomous Power State Transitions
Command Enhancements
• Scatter Gather List support
• Active Namespace Reporting
• Persistent Features Across Power States
• Write Zeros Command
Multi-path Support
• Reservations
• Unique Identifier per Namespace
• Subsystem Reset
60
Multi-path Support
• Multi-path includes the traditional dual port model
• With PCI Express*, it extends further with switches
61
Reservations
• In some multi-host environments, like Windows* clusters, reservations may be used to coordinate host access
• NVMe 1.1 includes a simplified reservations mechanism that is compatible with implementations that use SCSI reservations
• What is a reservation? Enables two or more hosts to coordinate access to a shared namespace.
  – A reservation may allow Host A and Host B access, but disallow Host C
[diagram: an NVM Subsystem exposing a shared namespace (NSID 1) through four NVM Express controllers; Host A attaches via Controllers 1 and 2 (Host ID = A), Host B via Controller 3 (Host ID = B), Host C via Controller 4 (Host ID = C)]
62
Power Optimizations
• NVMe 1.1 added the Autonomous Power State Transition feature for client power focused implementations
• Without software intervention, the NVMe controller transitions to a lower power state after a certain idle period
  – Idle period prior to transition programmed by software
Example Power States
Power State   Operational?   Max Power   Entrance Latency   Exit Latency
0             Yes            4 W         10 µs              10 µs
1             No             10 mW       10 ms              5 ms
2             No             1 mW        15 ms              30 ms
[diagram: Power State 0 → Power State 1 after 50 ms idle → Power State 2 after 500 ms idle]
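To illustrate how the example above might be programmed, here is a sketch that builds an Autonomous Power State Transition table for Set Features (Feature ID 0Ch); the entry layout (target power state in bits 7:3, idle time in milliseconds in bits 31:8) reflects my reading of the NVMe 1.1 specification and should be verified against it, and the helper name is illustrative.

#include <stdint.h>
#include <string.h>

#define APST_ENTRIES 32

/* Build the 256-byte APST table for: PS0 -> PS1 after 50 ms idle,
 * PS1 -> PS2 after 500 ms idle (matching the example states above). */
static void build_apst_table(uint64_t table[APST_ENTRIES])
{
    memset(table, 0, APST_ENTRIES * sizeof(uint64_t));
    table[0] = ((uint64_t)1 << 3) | ((uint64_t)50  << 8); /* while in PS0 */
    table[1] = ((uint64_t)2 << 3) | ((uint64_t)500 << 8); /* while in PS1 */
    /* The table is then passed as the data buffer of a Set Features
     * command (Feature ID 0x0C) with the APST enable bit set in CDW11. */
}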
63
Continuing to Advance NVM Express
• NVM Express continues to add features to meet the needs of client and Enterprise market segments as they evolve
• The Workgroup is defining features for the next revision of the specification, expected ~ middle of 2014
Features for Next Revision:
• Namespace Management
• Management Interface
• Live Firmware Update
• Power Optimizations
• Enhanced Status Reporting
• Events for Namespace Changes
…
Get involved – join the NVMe Workgroup
nvmexpress.org
64
Agenda
• Why PCI Express* (PCIe) for SSDs? – PCIe SSD in Client – PCIe SSD in Data Center
• Why NVM Express (NVMe) for PCIe SSDs? – Overview NVMe – Driver ecosystem update – NVMe technology developments
• Deploying PCIe SSD with NVMe
65
Considerations of PCI Express* SSD with NVM Express, NVMe SSD
• NVMe driver assistant?
• S.M.A.R.T/Management?
• Performance scalability?
• PCIe SSD vs SATA SSDs?
• PCIe SSD grades?
• Software optimizations?
NVMe SSDs are on the way to the data center
66
PCI Express* SSD vs Multiple SATA* SSDs
SATA SSD advantages
• Mature hardware RAID/adapters for management of SSDs
• Mature technology/ecosystem for SSDs
• Cost & performance balance
Quick performance comparison
• Random WRITE IOPS: 6 x S3700 = one PCIe SSD 1.6T (4 lanes, Gen3)
• Random READ IOPS: ~8 x S3700 = 1 x PCIe SSD
Mixed use of PCIe and SATA SSDs
• A hot-pluggable 2.5” PCIe SSD has the same maintenance advantage as a SATA SSD
• TCO: balance performance and cost
Performance of 6~8 Intel S3700 SSDs is close to 1x PCIe SSD
4K random workloads (IOPS)
Measurements made on Hanlan Creek (Intel S5520HC) system with two Intel Xeon X5560@ 2.93GHz and 12GB (per CPU) Mem running RHEL6.4 O/S, Intel S3700 SATA Gen3 SSDs are connected to LSI* HBA 9211, NVMe SSD is under development, data collected by FIO* tool
[chart: 4K random workload IOPS at 100% read, 50% read, and 0% read for 6x 800GB Intel S3700 vs. 1x NVMe 1600GB]
67
Example: PCIe/SATA SSDs in one system
1U 4x 2.5” PCIe SSDs + 4xSATA SSDs
68
Selections of PCI Express* SSD with NVM Express, NVMe SSD
• High Endurance Technology (HET) PCIe SSD: for applications with intensive random write workloads, typically a high percentage of small-block random writes, such as critical databases, OLTP…
• Middle-tier PCIe SSD: for applications that need random write performance and endurance, but much less than a HET PCIe SSD; typical workloads are <70% random writes.
• Low-cost PCIe SSD: same read performance as above, but ~1/10th of the HET write performance and endurance; for applications with highly read-intensive workloads, such as search engines.
Application determines cost and performance
69
Optimizations of PCI Express* SSD with NVM Express, NVMe SSD
NVMe Administration
• Controller capability/identify
• NVMe features
• Asynchronous Event
• NVMe logs
• Optional IO Command: Data Set Management (Trim)
NVMe IO
• Threaded structure
• Understand the number of CPU logical cores in your system
• Write multi-threaded application programs
• No need for handling rq_affinity
Write NVMe-friendly applications
76
Optimizations of PCI Express* SSD with NVM Express (cont.)
IOPS performance
• Choose a higher number of threads (< min(number of system CPU cores, SSD controller maximum allocated queues))
• Choose a low queue depth for each thread (asynchronous IO)
• Avoid using a single thread with a much higher queue depth (QD), especially for small transfer blocks
• Example: for 4K random read on one drive in a system with 8 CPU cores, use 8 threads with queue depth (QD) = 16 per thread instead of a single thread with QD = 128.
Latency
• Lower QD for better latency
• For intensive random writes, there is a sweet spot of threads & QD for balancing performance and latency
• Example: 4K random write in an 8-core system, threads = 8, sweet-spot QD is 4 to 6.
Sequential vs Random workload
• Multi-threaded sequential workloads may become random workloads at the SSD side
Use Multi-Threads with Low Queue Depth
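As a sketch of that "many threads, low queue depth" pattern on Linux with libaio (build with -laio -lpthread): the device path /dev/nvme0n1, the 1 GiB target region, and the fixed I/O count are illustrative, while the 8 threads x QD=16 x 4K random read shape mirrors the example above.

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define THREADS 8
#define QDEPTH  16
#define BLKSZ   4096
#define SPAN    (1ULL << 30)                 /* 1 GiB region to read from */
#define IOS_PER_THREAD 100000L

static void *worker(void *arg)
{
    (void)arg;
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return NULL; }

    io_context_t ctx = 0;
    io_queue_init(QDEPTH, &ctx);

    struct iocb cbs[QDEPTH], *ptrs[QDEPTH];
    struct io_event events[QDEPTH];

    /* Prepare QDEPTH random 4K reads and keep that many in flight. */
    for (int i = 0; i < QDEPTH; i++) {
        void *buf;
        posix_memalign(&buf, BLKSZ, BLKSZ);
        off_t off = (off_t)(rand() % (SPAN / BLKSZ)) * BLKSZ;
        io_prep_pread(&cbs[i], fd, buf, BLKSZ, off);
        ptrs[i] = &cbs[i];
    }
    io_submit(ctx, QDEPTH, ptrs);

    for (long done = 0; done < IOS_PER_THREAD; ) {
        int n = io_getevents(ctx, 1, QDEPTH, events, NULL);
        for (int i = 0; i < n; i++, done++) {
            struct iocb *cb = events[i].obj;     /* resubmit each read as it completes */
            off_t off = (off_t)(rand() % (SPAN / BLKSZ)) * BLKSZ;
            io_prep_pread(cb, fd, cb->u.c.buf, BLKSZ, off);
            io_submit(ctx, 1, &cb);
        }
    }
    io_destroy(ctx);
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);
    return 0;
}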
77
NVM Express (NVMe) Driver beyond NVMe Specification
NVMe Linux driver is open source
[diagram: LBA ranges (LBA0-LBA255, LBA256-LBA511, LBA512-LBA767, LBA768-LBA1023, etc.) striped alternately between Core 0 and Core 1]
• Driver Assisted Striping
  – Dual-core NVMe controller: each core maintains a separate NAND array and striped LBA ranges (like RAID 0)
  – Driver can enforce that all commands fall within a (KB-sized) stripe, ensuring maximum performance
• Contribute to the NVMe driver
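A small sketch of the stripe-boundary rule described above, splitting an I/O so that no single command crosses a stripe; the 128 KB stripe size and both function names are hypothetical, since the slide does not give the actual stripe size.

#include <stdint.h>

#define STRIPE_BYTES (128u * 1024u)   /* hypothetical stripe size */

/* Placeholder for queuing a single NVMe read/write covering [offset, offset+len). */
extern void submit_cmd(uint64_t offset, uint32_t len);

/* Issue one command per stripe-aligned chunk so every command stays in one stripe. */
static void submit_striped(uint64_t offset, uint64_t len)
{
    while (len > 0) {
        uint64_t to_boundary = STRIPE_BYTES - (offset % STRIPE_BYTES);
        uint32_t chunk = (uint32_t)(len < to_boundary ? len : to_boundary);

        submit_cmd(offset, chunk);    /* never crosses a stripe boundary */
        offset += chunk;
        len    -= chunk;
    }
}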
78
S.M.A.R.T and Management
• Use PCIe in-band commands to get the SSD SMART log (NVMe log): statistical data, status, warnings, temperature, endurance indicator
• Use Out-of-Band SMBus to access the VPD EEPROM and vendor information
• Use the Out-of-Band SMBus temperature sensor for closed-loop thermal controls (fan speed)
NVMe Standardizes S.M.A.R.T. on PCIe SSD
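As an example of the in-band path, here is a sketch that reads the 512-byte SMART / Health Information log (log page 02h) through the Linux NVMe driver's admin passthrough ioctl; the header, struct and ioctl names are from the kernel uapi of that era (NVME_IOCTL_ADMIN_CMD in linux/nvme.h) and may differ across kernel versions, so treat the details as assumptions to verify.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme.h>        /* struct nvme_admin_cmd, NVME_IOCTL_ADMIN_CMD */

int main(void)
{
    unsigned char smart[512];
    int fd = open("/dev/nvme0", O_RDWR);                   /* controller device node */
    if (fd < 0) { perror("open"); return 1; }

    struct nvme_admin_cmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = 0x02;                                   /* Get Log Page          */
    cmd.nsid     = 0xFFFFFFFF;                             /* controller-wide log   */
    cmd.addr     = (unsigned long)smart;
    cmd.data_len = sizeof(smart);
    cmd.cdw10    = ((sizeof(smart) / 4 - 1) << 16) | 0x02; /* NUMD | SMART log ID   */

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) != 0) { perror("ioctl"); return 1; }

    /* SMART log byte offsets per the NVMe spec: composite temperature (Kelvin)
     * at bytes 2:1, percentage used at byte 5. */
    unsigned temp_k = smart[1] | (smart[2] << 8);
    printf("temperature: %u K, percentage used: %u%%\n", temp_k, smart[5]);

    close(fd);
    return 0;
}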
79
Scalability of Multi-PCI Express* SSDs with NVM Express
Performance on 4 PCIe SSDs = Performance on 1 PCIe SSD X 4 Advantage of NVM Express threaded and MSI-X structure!
[charts: 100% random read and 100% random write bandwidth (GB/s) at 4K, 8K, 16K and 64K transfer sizes for 1x, 2x and 4x NVMe 1600GB SSDs, showing linear scaling]
Measurements made on Intel system with two Intel Xeon™ CPU E5-2680 v2@ 2.80GHz and 32GB Mem running RHEL6.5 O/S, NVMe SSD is under development, data collected by FIO* tool, numJob=30, queue depth (QD)=4 (read), QD=1 (write), libaio. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
80
PCI Express* SSD with NVM Express (NVMe SSD) deployments
Source: Geoffrey Moore, Crossing the Chasm
SSDs are a disruptive technology, approaching “The Chasm” Adoption success relies on clear benefit, simplification, and ease of use
81
Summary
• PCI Express* SSD enables lower latency and further alleviates the IO bottleneck
• NVM Express is the interface architected for PCI Express* SSD, NAND Flash of today and next generation NVM of tomorrow
• Promote and adopt PCIe SSDs with NVMe as mainstream technology and get ready for the next generation of NVM
82
Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Intel, Xeon, Look Inside and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others. Copyright ©2014 Intel Corporation.
83
Risk Factors The above statements and any others in this document that refer to plans and expectations for the first quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as “anticipates,” “expects,” “intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be the important factors that could cause actual results to differ materially from the company’s expectations. Demand could be different from Intel's expectations due to factors including changes in business and economic conditions; customer acceptance of Intel’s and competitors’ products; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Uncertainty in global economic and financial conditions poses a risk that consumers and businesses may defer purchases in response to negative financial events, which could negatively affect product demand and other related matters. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; and Intel’s ability to respond quickly to technological developments and to incorporate new features into its products. The gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; start-up costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; product manufacturing quality/yields; and impairments of long-lived assets, including manufacturing, assembly/test and intangible assets. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Expenses, particularly certain marketing and compensation expenses, as well as restructuring and asset impairment charges, vary depending on the level of demand for Intel's products and the level of revenue and profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. 
Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent reports on Form 10-Q, Form 10-K and earnings release.
Rev. 1/16/14