The Consequences of Infinite Storage Bandwidth: Allen Samuels, SanDisk
Transcript of The Consequences of Infinite Storage Bandwidth: Allen Samuels, SanDisk
May 5, 2016 1
Allen Samuels
The Consequences of Infinite Storage Bandwidth
Engineering Fellow, Systems and Software Solutions
May 5, 2016
May 5, 2016 2
Disclaimer
During the presentation today, we may make forward-looking statements.
Any statement that refers to expectations, projections, or other characterizations of future events or circumstances is a forward-looking statement, including those relating to industry predictions and trends, future products and their projected availability, and evolution of product capacities. Actual results may differ materially from those expressed in these forward-looking statements due to a number of risks and uncertainties, including among others: industry predictions may not occur as expected, products may not become available as expected, and products may not evolve as expected; and the factors detailed under the caption “Risk Factors” and elsewhere in the documents we file from time to time with the SEC, including, but not limited to, our annual report on Form 10-K for the year ended January 3, 2016. This presentation contains information from third parties, which reflect their projections as of the date of issuance. We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof or the date of issuance by a third party.
May 5, 2016 4
Log scale
• Use DRAM Bandwidth as a proxy for CPU throughput
• Reasonable approximation for DMA-heavy and/or poor-cache-hit workloads (e.g., storage)
Big difference in slope!
Data is for informational purposes only and may contain errors
Network, Storage and DRAM Trends
May 5, 2016 5
Linear scale
Infinite Storage Bandwidth
• Same data as last slide, but for the Log-impaired
• Storage Bandwidth is not literally infinite
• But the ratio of Network and Storage to CPU throughput is widening very quickly
Data is for informational purposes only and may contain errors
Network, Storage and DRAM Trends
May 5, 2016 6
[Chart: SSDs / CPU Socket vs. Year, 1990-2025]
Data is for informational purposes only and may contain errors
May 5, 2016 7
[Chart: SSDs / CPU Socket @ 20% Max BW vs. Year, 1995-2025]
Data is for informational purposes only and may contain errors
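A back-of-envelope way to think about charts like these is to compare one socket's DRAM bandwidth with the bandwidth of the drives hanging off it. The sketch below uses purely illustrative numbers (the assumed 75 GB/s per socket and 3 GB/s per SSD are placeholders, and the slides' exact definitions may differ), so it gives a feel for the arithmetic rather than a reproduction of the plotted values.

    # Back-of-envelope sketch of the CPU/DRAM bottleneck the charts point at.
    # Both figures below are illustrative assumptions, not values from the slides.

    dram_bw_per_socket_gbs = 75.0   # assumed usable DRAM bandwidth of one CPU socket
    ssd_bw_gbs = 3.0                # assumed bandwidth of one fast SSD

    # How many such SSDs it takes to saturate one socket's DRAM bandwidth,
    # i.e. the point past which adding drives adds no usable bandwidth.
    drives_to_saturate = dram_bw_per_socket_gbs / ssd_bw_gbs
    print(f"~{drives_to_saturate:.0f} SSDs saturate one socket under these assumptions")

    # If each drive is only driven at 20% of its maximum bandwidth, the same
    # socket can feed proportionally more drives.
    print(f"~{drives_to_saturate / 0.20:.0f} SSDs at 20% of max bandwidth per drive")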
May 5, 2016 9
New Denser Server Form Factors
– Blades
– Sleds
Good short term solutions
Let’s Get Small!
May 5, 2016 10
Storage Cost = Media + Access + Management
Shared nothing architecture conflates access and management
Storage costs will become dominated by Management cost
Storage costs become CPU/DRAM costs
Effects Of The CPU/DRAM Bottleneck
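To make the shift concrete, here is a toy illustration of the cost identity above. All per-TB dollar figures are invented for the sketch; only the trend matters: as media and access costs fall faster than the cost of the CPU/DRAM that manages the data, the management term comes to dominate.

    # Toy illustration of Storage Cost = Media + Access + Management.
    # Dollar figures are made up; only the relative trend is the point.

    def storage_cost(media, access, management):
        """Total per-TB cost as the sum of the three components (hypothetical $)."""
        return media + access + management

    # Hypothetical per-TB costs now vs. a few years out: media and access keep
    # falling, management (the CPU/DRAM running the storage software) falls slowly.
    today = {"media": 100.0, "access": 20.0, "management": 30.0}
    later = {"media": 25.0, "access": 10.0, "management": 25.0}

    for label, c in (("today", today), ("a few years out", later)):
        total = storage_cost(**c)
        share = c["management"] / total
        print(f"{label}: ${total:.0f}/TB total, management share {share:.0%}")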
May 5, 2016 11
Move management to upper layers where CPU can be right-sized by client
What kind of media access do I want?
– Simple enough functionality to be done directly in drive hardware – NO CPU
– Allow direct access throughout the compute cluster over a network
– Just enough machinery to enable coarse-grained sharing
Embracing The CPU/DRAM Bottleneck
In short, you really want a SAN!
– Or more technically, Fabric Connected Storage
May 5, 2016 12
Not Your Father’s SAN
Three problems with current SAN
– Fibre channel transport
– SCSI access protocol
– Drive oriented storage allocation
All of these want to be updated
– Fibre channel is brittle and costly
– SCSI initiators have long code paths catering to seldom-used configurations
– Robust sub-drive storage allocation is needed
May 5, 2016 13
SAN 2.0
NVMe over Fabrics
1.0 Spec is out for review, hopefully done in May
Simple enough for direct hardware execution of data path ops
Minimal initiator code path lengths improve performance
Namespaces allow sub-drive allocations
Not mature enough for enterprise deployment – yet
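As a purely hypothetical sketch (not NVMe or NVMe-oF code), the model below shows the kind of coarse-grained, sub-drive allocation that namespaces enable: one fabric-attached drive carved into namespaces that different hosts own, with only minimal bookkeeping. The class and field names are invented for illustration.

    # Hypothetical model of sub-drive allocation via namespaces; not an NVMe API.
    from dataclasses import dataclass, field

    @dataclass
    class Namespace:
        nsid: int
        size_gb: int
        owner: str          # host identifier that claimed the namespace (illustrative)

    @dataclass
    class FabricDrive:
        capacity_gb: int
        namespaces: list = field(default_factory=list)

        def free_gb(self) -> int:
            return self.capacity_gb - sum(ns.size_gb for ns in self.namespaces)

        def create_namespace(self, size_gb: int, owner: str) -> Namespace:
            # Just enough machinery for coarse-grained sharing: a capacity check
            # and a new namespace record, nothing more.
            if size_gb > self.free_gb():
                raise ValueError("not enough free capacity on the drive")
            ns = Namespace(nsid=len(self.namespaces) + 1, size_gb=size_gb, owner=owner)
            self.namespaces.append(ns)
            return ns

    # One 8 TB drive shared by two hosts, each getting its own namespace.
    drive = FabricDrive(capacity_gb=8000)
    drive.create_namespace(2000, owner="host-a")
    drive.create_namespace(3000, owner="host-b")
    print(f"free capacity left: {drive.free_gb()} GB")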
May 5, 2016 14
SAN 2.0
What storage network?
– Current candidates are FC, InfiniBand, and Ethernet
Ethernet has best economics – if you can make it work
RoCE is easy on the edge, but hard on the interior
– Only controlled environments have shown multi-switch scalability
– General scalability in a multi-vendor environment likely to be difficult
– Wonderful for intra-rack storage networking
iWARP is hard on the edge, but easy on the interior
– Scarcity of implementations inhibits deployment
Storage over IP will see limited cross rack deployment until this is resolved
May 5, 2016 15
Implementations using off-the-shelf (OTS) components are in progress
Server side implementations look pretty conventional too
4-5 MIOPS have been shown
Seems like 10 MIOPS isn’t unreasonable to expect
First Generation Of SAN 2.0
[Diagram: NIC, CPU/DRAM, and PCIe SSD in the data path]
May 5, 2016 16
Soon, NICs will forward NVMe operations to local PCIe devices
CPU removed from the software part of the data path
CPU is still needed for the hardware part of the data path
IOPS improve, BW is unchanged
Significant CPU freed for application processing
Getting closer to the wall!
Second Generation SAN 2.0
May 5, 2016 17
New generation of combined SSD controller and NIC
– Rethink of interfaces eliminates DRAM buffering
Network goes right into the drive
No CPU to be found
Works well with rack scale architecture
Third Generation SAN 2.0, Imagined
May 5, 2016 18
Disaggregated / Rack Scale Architecture
– Fabric connected
– Independently scale compute, networking and storage
Let’s Get Really Small
May 5, 2016 19
Call To Action
Fabric-connected storage isn’t well managed by existing FOSS
Lots of upper layer management software is available
– OpenStack, Ceph, Gluster, Cassandra, MongoDB, SheepDog, etc.
Lower layer cluster management still primitive
May 5, 2016 20
What’s It All Mean?
New form factors are in everybody's future
The coming avalanche of storage bandwidth wants to be free
– Not imprisoned by a CPU
Rack Scale Architecture allows new Storage/Compute configs
Storage will be increasingly “Software Defined” as the HW evolves
May 5, 2016 22
Software-Defined All-Flash Storage: The disaggregated model for scale
Old Model
– Monolithic, large upfront investments, and fork-lift upgrades
– Proprietary storage OS
– Costly: $$$$$
New SD-AFS Model
– Disaggregate storage, compute, and software for better scaling and costs
– Best-in-class solution components
– Open source software, no vendor lock-in
– Cost-efficient: $
May 5, 2016 23
InfiniFlash™ Storage Platform
Capacity: 512TB raw, all flash!
All-Flash 3U JBOD of Flash (JBOF): up to 64 x 8TB SAS drive cards; 4TB cards also available soon
Scalable Raw Performance
• 2M IOPS, 1-3ms latency, 12-15 GB/s throughput
8TB Flash-Card Innovations
• Enterprise-grade power-fail safety
• Alerts & monitoring
• Latching integrated & monitored
• Directly samples air temperature
• Form factor enables lowest-cost SSD
Operational Efficiency & Resilience
• Hot-swappable architecture, easy FRU
• Low power: typical workload 400-500W; 150W (idle) to 750W (max)
• MTBF 1.5+ million hours
• Hot swappable: fans, SAS expander boards, power supplies, flash cards
Host Connectivity
• Connect up to 8 servers through 8 SAS ports, multi-path enabled
[Image: Flash Drive Card]
May 5, 2016 24
InfiniFlash IF500 All-Flash Storage System: Block and Object Storage Powered by Ceph
Ultra-dense High Capacity Flash storage
– 512TB in 3U, Scale-out software for PB scale capacity
Highly scalable performance
– Industry leading IOPS/TB
Cinder, Glance and Swift storage
– Add/remove server & capacity on-demand
Enterprise-Class storage features
– Automatic rebalancing
– Hot Software upgrade
– Snapshots, replication, thin provisioning
– Fully hot swappable, redundant
Ceph Optimized for SanDisk flash
– Tuned & Hardened for InfiniFlash
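Because the IF500's block storage is standard Ceph underneath, a client can exercise features like thin provisioning and snapshots through the ordinary python-rbd bindings. The sketch below is a minimal illustration assuming a reachable cluster; the config path, pool name, and image name are placeholders.

    import rados
    import rbd

    # Connect to the Ceph cluster (config path and pool name are placeholders).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    try:
        # Create a 100 GB RBD image; RBD images are thin provisioned, so no
        # space is consumed until data is actually written.
        rbd.RBD().create(ioctx, 'demo-volume', 100 * 1024**3)

        # Take a snapshot of the new image.
        image = rbd.Image(ioctx, 'demo-volume')
        try:
            image.create_snap('demo-snap')
        finally:
            image.close()
    finally:
        ioctx.close()
        cluster.shutdown()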
May 5, 2016 25
InfiniFlash SW + HW Advantage
Software tuned for hardware
• Ceph modifications for flash
• Both Ceph and the host OS tuned for InfiniFlash
• SW defects that impact flash identified & mitigated
Hardware configured for software
• Right balance of CPU, RAM, and storage
• Rack-level designs for optimal performance & cost
Software designed for all systems does not work well with any system
Ceph has over 50 tuning parameters that together yield a 5x-6x performance improvement (illustrated in the sketch below)
Fixed-CPU/RAM hyperconverged nodes do not work well for all workloads
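As an illustration of the kind of tuning the bullet above refers to, the sketch below prints a handful of flash-oriented OSD and journal overrides in ceph.conf form. The option names are real Ceph settings of that era, but the values are placeholders, not SanDisk's tuned configuration, and the full tuned set is far larger.

    # Illustrative only: a few of the many Ceph OSD/filestore options that
    # flash-oriented tuning typically touches. Values are placeholders.

    flash_osd_overrides = {
        "osd_op_num_shards": 8,                 # more parallelism in the OSD op queue
        "osd_op_num_threads_per_shard": 2,
        "filestore_op_threads": 8,              # more concurrent filestore operations
        "filestore_max_sync_interval": 10,      # seconds between filestore syncs
        "journal_max_write_entries": 1000,
        "journal_max_write_bytes": 1048576000,  # ~1 GB per journal write burst
    }

    # Emit the overrides as a ceph.conf [osd] section.
    print("[osd]")
    for option, value in flash_osd_overrides.items():
        print(f"{option} = {value}")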
May 5, 2016 26
InfiniFlash for OpenStack with Disaggregation
Compute & storage disaggregation enables optimal resource utilization
– Allows for the higher CPU usage required by OSDs for small-block workloads
– Allows for the higher bandwidth provisioning required for large-object workloads
Independent scaling of compute and storage
– Higher storage capacity needs don't force you to add more compute, and vice versa
– Leads to optimal ROI for PB-scale OpenStack deployments
[Diagram: a compute farm (QEMU/KVM with librbd for Nova with Cinder & Glance, web servers with RGW for the Swift object store, and iSCSI targets with krbd exporting LUNs) connected over SAS to a storage farm of InfiniFlash enclosures (HSEB A / HSEB B) running the OSDs]
May 5, 2016 27
IF500 - Enhancing Ceph for Enterprise Consumption
IF500 provides usability and performance utilities without sacrificing Open Source principles
• SanDisk Ceph distro ensures packaging with stable, production-ready code of consistent quality
• All Ceph performance improvements developed by SanDisk are contributed back to the community
SanDisk Distribution or Community Distribution
Out-of-the Box configurations tuned for performance with Flash
Sizing & planning tool
InfiniFlash drive management integrated into Ceph management (Coming Soon)
Ceph installer specifically built for InfiniFlash
High-performance iSCSI storage
Better diagnostics with log collection tool
Enterprise-hardened SW + HW QA
May 5, 2016 28
InfiniFlash Performance Advantage
900K random-read IOPS with 384TB of storage
Flash performance unleashed
• Out-of-the-box configurations tuned for performance with flash
• Read & write data-path changes for flash
• 3x-12x block performance improvement, depending on workload
• Almost linear performance scaling with the addition of InfiniFlash nodes
• Write performance WIP with NV-RAM journals
• Measured with 3 InfiniFlash nodes with 128TB each
• Average latency with 4K blocks is ~2ms; 99.9th-percentile latency is under 10ms
• For lower block sizes, performance is CPU-bound at the storage node
• Maximum bandwidth of 12.2GB/s measured at 64KB blocks
May 5, 2016 29
InfiniFlash Ceph Performance Advantage
Single InfiniFlash unit Performance
– 1 x 512TB InfiniFlash unit connected with 8 nodes
– 4K RR IOPS: ~1 million IOPS, 85% of bare-metal performance
• Corresponding Bare metal IF100 IOPS is 1.1 million
– All 8 hosts CPU saturated for 4K Random read.
• More performance potential with higher CPU cycles
– With 64k IO size we are able to utilize full IF150 bandwidth of over 12GB/s.
– librbd and krbd performance are comparable.
– Write Performance is on 3x copy configuration. The more common 2x copy will result in 33% improvement.
Random Write (LIBRBD)
– 4k random write: 54k IOPS
– 64k random write: 34k IOPS
– 256k random write: 11.3k IOPS
Random Read Block Performance (LIBRBD)
– 4k: 1,123,175 IOPS
– 64k: 349,247 IOPS
– 256k: 87,369 IOPS
[Chart also plots the corresponding bandwidth in GBps for each block size]
May 5, 2016 30
InfiniFlash Ceph Performance Advantage
Linear Scaling with 2 InfiniFlash units
– 2 x 512TB InfiniFlash unit connected with 16 nodes
– 1.8M 4K IOPS – 80% of the bare metal performance
– Performance scales almost linearly, nearly doubling the performance of a single IF150 with Ceph
– Write performance is 2x with the 16-node cluster compared with the 8-node cluster.
Random Read (LIBRBD)
– 4k RR: 1800k IOPS, 7194 MB/s
– 64k RR: 225k IOPS, 14412 MB/s
– 256k RR: 53k IOPS, 13366 MB/s
May 5, 2016 31
InfiniFlash OS – Hardened Enterprise Class Ceph
Hardened and tested for Hyperscale deployments and workloads
Platform focused testing enables us to deliver a complete and hardened storage solution
Single Vendor support for both Hardware & Software
Enterprise Level Hardening
Testing at Scale
Failure Testing
9,000 hours of cumulative IO tests
1,100+ unique test cases
1,000 hours of Cluster Rebalancing tests
1,000 hours of IO on iSCSI
Over 100 server node clusters
Over 4PB of Flash Storage
2,000 Cycle Node Reboot
1,000 times Node Abrupt Power Cycle
1,000 times Storage Failure
1,000 times Network Failure
IO for 250 hours at a stretch
May 5, 2016 32
IF500 Reference Configurations
Model: Entry / Mid / High
InfiniFlash: 128TB / 256TB / 512TB
Servers [1]: 2 x Dell R630 2U / 4 x Dell R630 2U / 4 x Dell R630 2U [2]
Processor per server: dual-socket Intel Xeon E5-2690 v3 (all configurations)
Memory per server: 128GB RAM (all configurations)
HBA per server: (1) LSI 9300-8e PCIe 12Gbps (all configurations)
Network per server: (1) Mellanox ConnectX-3 dual-port 40GbE (all configurations)
Boot drives per server: (2) SATA 120GB SSD (all configurations)
[1] For larger-block or less CPU-intensive workloads, the OSD node could use a single-socket server. Dell servers can be substituted with other vendor servers that match the specs.
[2] For small-block workloads, 8 servers are recommended.
May 5, 2016 33
InfiniFlash TCO Advantage
[Chart: 3-year TCO comparison* (TCA plus 3-year opex, $0-$80M) across Traditional ObjStore on HDD, IF500 ObjStore w/ 3 full replicas on flash, IF500 w/ EC - all flash, and IF500 - flash primary & HDD copies]
[Chart: Total racks required for the same four configurations]
Reduce the replica count with higher reliability of flash
- 2 copies on InfiniFlash vs. 3 copies on HDD
InfiniFlash disaggregated architecture reduces compute usage, thereby reducing HW & SW costs
- Flash allows the use of erasure coded storage pool without performance limitations
- Protection equivalent to 2x storage with only 1.2x raw storage (see the sketch below)
Power, real estate, and maintenance cost savings over a 5-year TCO
* TCO analysis based on a US customer’s OPEX & Cost data for a 100PB deployment
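The 1.2x figure in the erasure-coding bullet follows from simple k+m arithmetic; the k=10, m=2 profile below is an assumed example, not necessarily the IF500's actual EC profile.

    # Raw-storage overhead of a k+m erasure-coded pool: k data chunks plus m
    # coding chunks are stored for every k chunks of user data.
    def ec_overhead(k: int, m: int) -> float:
        return (k + m) / k

    k, m = 10, 2  # assumed example profile
    print(f"EC {k}+{m}: {ec_overhead(k, m):.1f}x raw storage, tolerates {m} lost chunks")
    print("2x replication: 2.0x raw storage, tolerates 1 lost copy")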