2015 Storage Developer Conference. © Microsoft Corporation. All Rights Reserved.
Concepts on Moving from a SAS-Connected JBOD to an Ethernet-Connected JBOD (EBOD)

Jim Pinkerton, Microsoft Corporation
9/22/2015
Re-Thinking Software Defined Storage: Conceptual Model Definition

[Diagram: Compute/Application Nodes (mem, CPU) connect over a front-end fabric to Storage Processing Nodes (mem, CPU) that present reliable, available storage via file or block access; behind them sit flaky non-volatile devices (disk, flash, tape, other), either direct-attached (SATA, NVMe, CPU memory interface) or shared (SAS, Fibre Channel, Ethernet).]

• Three "entities":
  • Compute Node
  • Storage Node
  • Flaky Storage Devices
• Front-end fabric: Ethernet, InfiniBand, Fibre Channel
• Back-end fabric: Direct Attached or Shared Storage
Yesterday's Storage Architecture: Still Highly Profitable

[Diagram: a compute farm of Compute/Application Nodes connects over a fabric (Ethernet, InfiniBand, Fibre Channel) to a traditional SAN/NAS box: Storage Processing Nodes (mem, CPU) in front of flaky non-volatile devices (disk, flash, tape, other), attached directly (SATA, NVMe, CPU memory interface) or as shared storage (SAS, Fibre Channel, Ethernet), presented as reliable, available storage.]

• The "appliance" vendor tests it as a unit
• Moved to general-purpose CPU & OS
Today: Software Defined Storage (SDS) "Converged" Deployments
The rise of componentization of storage – but an interop nightmare for the user

[Diagram: a compute farm of Compute/Application Nodes connects over a fabric (Ethernet, InfiniBand, Fibre Channel) to Storage Processing Nodes running the Storage Service, backed by JBODs of flaky non-volatile devices, direct-attached (SATA, NVMe, CPU memory interface) or shared (SAS, Fibre Channel, Ethernet), presented as reliable, available storage.]

• Software ships independent of hardware
• The customer is the first to test it as a system
Today: Software Defined Storage (SDS) "Hyper-Converged" (H-C) Deployments
H-C appliances are a dream for the customer; H-C $/GB storage is expensive

[Diagram: the Storage Service runs on the same compute-farm nodes as the Compute/Application workload, connected over Ethernet, InfiniBand, or Fibre Channel fabrics, with unreliable non-volatile storage direct-attached (SATA, NVMe, CPU memory interface) or shared (SAS, Fibre Channel, Ethernet).]

• Typically sold as an appliance (the OEM owns end-to-end testing)
An Example: Microsoft's Cloud Platform System (CPS)

[Diagram: a compute farm of Compute/Application Nodes connects over Ethernet to Storage Processing Nodes running the Storage Service, backed by shared-SAS JBODs of flaky non-volatile devices (SSD, HDD), presented as reliable, available storage.]

• "Azure-aligned innovation"
• "Appliance-like experience"
• "Single throat to choke"
• Package deal; shipped as of Oct 2014
SDS with DAS

[Diagram 1: a compute farm of Compute/Application Nodes connects over Ethernet to Storage Processing Nodes running the Storage Service, with unreliable NV storage direct-attached (SATA, NVMe, CPU memory interface), presented as reliable, available storage.]

[Diagram 2 ("Calabria"): the Storage Service is split into a Storage FE Service on front-end Storage Processing Nodes and a Storage BE Service on back-end Storage Processing Nodes that export unreliable shared block storage over Ethernet; devices are direct-attached (SATA, SAS, NVMe, CPU memory interface).]

With DAS, rebuild & replication traffic moves to Ethernet.
Software Defined Storage Topologies

[Diagram: three topologies side by side – Hyper-Converged, Converged, and a possible EBOD future. Each shows a compute farm of Compute/Application Nodes, a Storage FE Service and Storage BE Service on Storage Processing Nodes connected over Ethernet fabrics, and unreliable NV storage direct-attached via SATA/SAS, NVMe, or the CPU memory interface; the third topology places the back end on an EBOD. Legend: boxes mark physical host boundaries, blue marks the workload on a physical node (file or block access, device access, data mover), and arrows can cross physical nodes.]
EBOD Works for a Variety of Access Protocols & Topologies

• SMB3 "block"
• Lustre – object store
• Ceph – object store
• NVMe Fabric?
• T10 Objects

[Diagram: a compute farm of Compute/Application Nodes connects over Ethernet to Storage Processing Nodes running a Storage MDS Service, which connect over Ethernet to a Storage BE Service on EBODs in front of direct-attached (SATA, SAS, NVMe, CPU memory interface) unreliable NV storage.]
I have a problem… shared SAS interop

Example storage cluster (Microsoft's CPS):

[Diagram: four cluster nodes, each running an SMB3 Service, Scale-Out File Server, File System, and Storage Spaces over CSV v2 + CSV Cache, with Failover Clustering across the nodes. Each node has two 10 GbE RDMA ports and a 4-port SAS HBA, cross-connected to four SAS JBOD chassis. Each chassis has two SAS expanders (2 in/out ports each) and dual-port SAS disks, grouped into shared Storage Pool(s).]

NOTE: Each JBOD chassis expander has 4 SAS links – 2 links connect to the nodes, and 2 links provide MPIO connectivity in case of a SAS expander failure.

SAS shared fabric:
• Multi-initiator
• Multi-path

My nightmare experience:
• Disk multi-path interop
• Expander multi-path interop
• HBA distributed failures

There is a good reason why customers prefer appliances!
To Share or Not to Share?

Shared SAS:
• Customer deployment can have serious bugs
• Failure of an FE node: the JBOD fails over to another node
• Failure of a JBOD: all data is replicated

Non-Shared SAS (or SATA, NVMe, or SCM):
• Customer deployment is more straightforward
• Failure of an FE node: the EBOD fails over to another node
• Failure of an EBOD: all data is replicated
• New Ethernet traffic (see the sketch below):
  • Triple replica (3x increase in bandwidth on Ethernet)
  • Rebuild traffic
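A tiny illustrative calculation of that new Ethernet traffic: the 3x replica factor is from the slide, while the write rate is a made-up example.

```python
# Sketch only: with non-shared DAS/EBOD, replica traffic that used to stay
# on the shared SAS fabric now crosses Ethernet.

def replica_traffic_gbps(app_write_gbps: float, replicas: int = 3) -> float:
    """Ethernet bandwidth consumed by replicated writes: every application
    write is sent to `replicas` copies, so the write stream is amplified
    on the wire by that factor."""
    return app_write_gbps * replicas

# Hypothetical workload: 4 Gb/s of application writes, triple replica.
print(replica_traffic_gbps(4.0))  # -> 12.0 Gb/s on Ethernet (3x)
```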
Hyper-Scale Cloud Tension – Fault Domain Rebuild Time

• Rebuild time is a function of:
  • The number and size of disks behind a node
  • The speed of the network and how much of it you are willing to use
• Storage cost reduction is driving higher drive counts behind a node (>30 drives)
  • This raises network cost, because the rebuild must still complete in constant time
• Conclusions:
  • The required network speed offsets the benefits of greater density
  • The fault domain for storage is too big

TB behind one node   % BW utilization   Net speed (Gb/s)   # nodes in rebuild   Minutes for rebuild
180                  25                 40                 120                  20
1080                 25                 40                 120                  120
1080                 50                 100                120                  24

Large drive counts require extreme network bandwidth.
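A minimal sketch of the arithmetic behind the table above, assuming the rebuild is spread evenly across the participating nodes (the formula is inferred from the rows shown, not stated on the slide):

```python
def rebuild_minutes(tb_behind_node: float, bw_util: float,
                    net_speed_gbps: float, nodes_in_rebuild: int) -> float:
    """Time to re-replicate the data behind one failed node.

    Each participating node contributes `bw_util` of its `net_speed_gbps`
    link, so aggregate rebuild bandwidth is nodes * utilization * speed.
    """
    aggregate_gbps = nodes_in_rebuild * bw_util * net_speed_gbps
    seconds = (tb_behind_node * 8 * 1000) / aggregate_gbps  # TB -> gigabits
    return seconds / 60

print(round(rebuild_minutes(180,  0.25,  40, 120)))   # ~20 minutes
print(round(rebuild_minutes(1080, 0.25,  40, 120)))   # ~120 minutes
print(round(rebuild_minutes(1080, 0.50, 100, 120)))   # ~24 minutes
```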
Private Cloud Tension – Not Enough Disks

• Goal is an entry point at 4 nodes (or fewer)
• If the same 30-disk JBOD is used:
  • Loss of one node implies loss of 30 disks (180 TB)
  • To recover from node loss, 25% of capacity must sit idle for a single-node failure, 50% for dual-node fault tolerance (see the sketch below)
• Conclusion: the fault domain is too large
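A small sketch of that capacity reasoning, assuming data is spread evenly across identical nodes:

```python
def idle_fraction(total_nodes: int, node_failures_tolerated: int) -> float:
    """Fraction of total capacity that must stay free so the data behind
    the failed node(s) can be rebuilt onto the surviving nodes."""
    return node_failures_tolerated / total_nodes

print(idle_fraction(4, 1))  # 0.25 -> 25% idle for a single-node failure
print(idle_fraction(4, 2))  # 0.50 -> 50% idle for dual-node fault tolerance
```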
Goals in Refactoring SDS

• Optimize workloads for the class of CPU
• Back end is a "data mover" (EBOD) – see the sketch below
  • Primarily movement of data and background tasks (data integrity, caching, …)
  • Little processing power
• Front end is a general-purpose CPU
  • Still needs accelerators for data movement, data integrity, encryption, etc.
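To make "data mover with little processing power" concrete, here is a minimal, purely hypothetical sketch of an EBOD-style back-end read service. The wire format, port, and function names are invented for illustration; a real EBOD back end would speak SMB3, NVMe over Fabrics, or similar, ideally over RDMA.

```python
import socket
import struct

# Toy request header: device index (I), byte offset (Q), length (I).
REQUEST = struct.Struct("!IQI")

def serve_block_reads(device_paths, port=9999):
    """Minimal "data mover": accept a read request and stream raw bytes
    back from a local block device. No caching, integrity, or placement
    logic lives here -- that intelligence stays on the front-end nodes."""
    with socket.create_server(("", port)) as srv:
        while True:
            conn, _ = srv.accept()
            with conn:
                header = conn.recv(REQUEST.size)
                if len(header) < REQUEST.size:
                    continue  # malformed request; drop the connection
                dev, offset, length = REQUEST.unpack(header)
                with open(device_paths[dev], "rb") as f:
                    f.seek(offset)
                    conn.sendall(f.read(length))

# Hypothetical usage: serve_block_reads(["/dev/sdb", "/dev/sdc"])
```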
EBOD Goals

• Reduce storage costs
  • Right-size the configuration for front-end and back-end workloads
  • Reduce the size of the fault domain (and with it rebuild traffic and network bandwidth requirements)
  • Small private clouds
• Move the storage fabric to Ethernet
  • Build on the more robust ecosystem of DAS
  • Keep the topology for the storage device simple
EBOD Design Points

• Ethernet-connected JBOD
  • High-end box (NVMe, Storage Class Memory)
  • Volume box (some NVMe/SSD, HDD)
  • Capacity box (primarily HDD, some NVMe/SSD)
• What if we used a "small core" CPU?
  • Fewer disks per box, because the CPU is cheaper
EBOD Volume Box

• Enable an Ethernet-connected JBOD with a low disk count at *very* low cost
  • Critical to hit a price point similar to existing SAS JBODs that are integrated into the chassis
• Export just raw disks, to keep the CPU as simple as possible and SDS as close to the hardware as possible
  • Needed for Storage Class Memory
• Enable front-end nodes (big core) to create reliable/available storage
Comparing Storage Node Design Points

[Diagram: current hyper-scale design vs. EBOD. Current hyper-scale: a large-core-CPU storage node with a NIC and SAS HBA attached to a JBOD with multiple expanders and 60 disks. EBOD: a large-core-CPU storage node with a NIC, connected through an Ethernet switch and cable to three EBODs, each with a SOC, NIC, expander, and 20 disks.]

The HBA is replaced with:
• A NIC on the storage node
• A switch port to the storage node
• A switch port to the EBOD

The expander is replaced with:
• A SOC
• A NIC
• Smaller expanders
• A switch in the chassis?
EBOD Volume Concept

• CPU and memory cost-optimized for EBOD
• Dual-attach >=10 GbE SOC with integrated RDMA NIC
• SATA/SAS/PCIe connectivity to ~20 devices
• Universal connector (SFF-8639?)
• Management
  • Out-of-band management through a BMC (DMTF Redfish? – see the sketch below)
  • In-band management with SCSI Enclosure Services

[Diagram: EBOD enclosure with 4 NVMe devices and 16 HDDs, dual-port >=10 GbE, and a management port.]
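For the out-of-band path, a hedged sketch of polling an EBOD's BMC through DMTF Redfish. The BMC address and credentials are placeholders; /redfish/v1/Chassis is part of the standard Redfish data model, but exactly which resources an EBOD enclosure would expose is vendor-specific.

```python
import requests

BMC = "https://192.0.2.10"       # placeholder BMC address
AUTH = ("admin", "password")     # placeholder credentials

# Walk the Redfish Chassis collection and report enclosure health.
collection = requests.get(f"{BMC}/redfish/v1/Chassis",
                          auth=AUTH, verify=False).json()
for member in collection.get("Members", []):
    chassis = requests.get(BMC + member["@odata.id"],
                           auth=AUTH, verify=False).json()
    print(chassis.get("Id"), chassis.get("Status", {}).get("Health"))
```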
Volume EBOD Proof Point

• Intel Avoton microserver
• PCIe Gen 2
• Chelsio 10 GbE NIC
• SAS HBA
• SAS SSDs
EBOD Avoton Microserver POC: 8K Random Reads, IOPS

• Max remote performance: ~159K IOPS with RDMA, ~122K IOPS without RDMA
• Greater performance gain from RDMA at lower IOPS; at higher queue depths the RDMA gain shrinks to ~30%
• Without RDMA, CPU utilization caps throughput at 122K IOPS (28 outstanding IOs across 7 SSDs) – the CPU is the bottleneck
• Remote access is bottlenecked on network speed

RDMA performance gain by queue depth:

Outstanding IO/disk:   1     2     4     8     16    32    64
RDMA perf. gain:       55%   47%   41%   35%   32%   31%   31%

[Chart: 8K random reads (100%) – K IOPS (8 KB) vs. outstanding IO per disk, for local IOPS, remote IOPS with RDMA, and remote IOPS without RDMA, with the RDMA performance gain overlaid.]

Configuration:
• Intel Avoton microserver
• PCIe Gen 2
• Chelsio 10 GbE NIC
• SAS HBA
• SAS SSDs
EBOD Avoton Microserver POC: 8K Random Reads, Latency

• With RDMA: at least a 65% improvement in IO latency (~75% at higher IOPS)
• Remote latency dips at 4 outstanding IOs/disk (7 disks), then rises again at 8 outstanding IOs/disk

[Chart: 8K random 100% reads – latency (msec) vs. outstanding IO per disk, for local latency, RDMA remote latency, and no-RDMA remote latency.]
EBOD Performance Concept

• Goal is the highest-performance storage, so a big CPU is a small part of total cost
• A high-speed CPU reduces latency
• Dual-attach >=40 GbE SOC with integrated RDMA NIC
• PCIe connectivity to ~20 devices
• Possibly all-NVMe attach, or Storage Class Memory

[Diagram: EBOD enclosure with 16 NVMe drives or Storage Class Memory, dual-port >=40 GbE, and a management port (DMTF Redfish?).]

See the "SMB3.1.1 Update" SDC talk for early dual 100 GbE results.
Summary

• EBOD enables shared storage on Ethernet using DAS storage
  • DAS storage has better interop within the ecosystem
• The price point of EBOD must be carefully managed
• A microserver CPU is viable for a broad spectrum of performance
  • An integrated SOC solution is preferred
• The low price point of the EBOD CPU/memory enables a smaller fault domain (fewer disks behind the microserver)
Q&A
Outstanding Technical Issues

• Enclosure level
  • How to manage the storage enclosure? SCSI Enclosure Services (SES)?
  • Base management? DMTF Redfish or IPMI?
  • How to provision raw storage?
  • Liveness of drives/enclosure
  • How to fence individual drives?
  • Security model
• Advanced features in EBOD?
  • Low latency, integrity checking, caching, …