InfiniBand/RDMA for Storage - SRP vs. iSER
Sebastian Riemer
Linux Kernel Developer – Storage
23.05.2013
Structure
● RDMA Basics
● RDMA Hardware
  ● InfiniBand, iWARP, RoCE
● RDMA Software + Network Protocols
● SRP vs. iSER
RDMA for Storage 2/28 23.05.2013
RDMA Basics
Remote Direct Memory Access (RDMA)
Latency
e.g. 4k sync. reads, status/information requests, ...
RDMA MTU
● RDMA MTU: 256, 512, 1024, 2048, 4096 Bytes
● Larger MTU: higher throughput, but also higher transfer latency
● Max. MTU is settable
● Active MTU is determined
● InfiniBand: RDMA MTU is native
● iWARP/RoCE: RDMA MTU must fit into Ethernet MTU: 1500 → 1024 Bytes
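The rule that the RDMA MTU must fit into the Ethernet MTU can be sketched as a small shell helper. This is only an illustration: `rdma_mtu_for_eth` is a hypothetical function, not part of any RDMA tool, and the optional header overhead argument is an assumption that defaults to 0.

```shell
#!/bin/sh
# Pick the largest valid RDMA MTU that fits into a given Ethernet MTU.
# rdma_mtu_for_eth is a hypothetical helper for illustration only.
rdma_mtu_for_eth() {
    eth_mtu=$1
    overhead=${2:-0}   # assumed per-packet header overhead in bytes (default: ignored)
    best=0
    for mtu in 256 512 1024 2048 4096; do
        # Keep the largest candidate that still fits into the Ethernet MTU
        if [ $((mtu + overhead)) -le "$eth_mtu" ]; then
            best=$mtu
        fi
    done
    echo "$best"
}

rdma_mtu_for_eth 1500   # prints 1024
rdma_mtu_for_eth 9000   # prints 4096
```

With the standard Ethernet MTU of 1500 the next valid RDMA MTU, 2048, no longer fits, which is why the result drops to 1024.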
RDMA Hardware
InfiniBand (IB)
● Switched fabric interconnect
● Arbitrary topologies: Fat Tree, Mesh, Lash, ...
● Point-to-point bidirectional serial links
● Used in HPC and Enterprise Data Centers
● QDR 10 Gbit/s, FDR 14 Gbit/s per lane
● Lanes: 4
● Low end-to-end latency < 2 µs (1 GbE: 35 µs)
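The per-lane rates and the lane count multiply to the link rate. The following sketch works through that arithmetic; the helper names are hypothetical, and the line encodings (8b/10b for QDR, 64b/66b for FDR) are added here as background, they are not stated on the slide. Integer shell arithmetic rounds down.

```shell
#!/bin/sh
# Raw link rate in Gbit/s: per-lane signalling rate times number of lanes.
link_rate() {
    echo $(( $1 * $2 ))
}

# Usable data rate after line encoding: raw * payload_bits / total_bits.
data_rate() {
    echo $(( $1 * $2 / $3 ))
}

qdr=$(link_rate 10 4)    # QDR, 4 lanes
fdr=$(link_rate 14 4)    # FDR, 4 lanes
echo "QDR 4x: ${qdr} Gbit/s raw, $(data_rate "$qdr" 8 10) Gbit/s data"
echo "FDR 4x: ${fdr} Gbit/s raw, $(data_rate "$fdr" 64 66) Gbit/s data"
```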
InfiniBand (IB)
● Subnet Manager (SM)
● LID (16 bit) and GID (128 bit) addressing
● GID = 64 bit subnet prefix + 64 bit GUID
● Max. 128 partitions (like VLANs)
● QoS, reliability and scalability
● Credit-based flow control → no packet loss
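The GID composition can be illustrated with plain string concatenation. This is a minimal sketch: `make_gid` is a hypothetical helper, and the subnet prefix and port GUID are simply the two halves of the target GID used in the SRP connection example of this talk.

```shell
#!/bin/sh
# Build a 128 bit GID (32 hex digits) from a 64 bit subnet prefix
# and a 64 bit GUID (16 hex digits each). Hypothetical helper.
make_gid() {
    prefix=$1
    guid=$2
    if [ ${#prefix} -ne 16 ] || [ ${#guid} -ne 16 ]; then
        echo "prefix and GUID need 16 hex digits each" >&2
        return 1
    fi
    printf '%s%s\n' "$prefix" "$guid"
}

make_gid fe80000000000000 0002c903004ed0b3
# prints fe800000000000000002c903004ed0b3
```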
InfiniBand Congestion
● Congestion Control (CC) is not ready yet
● CC = tell SM to tell others to reduce their speed
● Reduce MTU, set QoS, set IO limits, multipath
[Figure: a port is BLOCKED with NO CREDITS and tells the SM; a master SM and a slave SM are shown]
Host Channel Adapters (HCA)
● IB counterpart of NICs
● Communicate via a Queue Pair (QP) consisting of Send Queue (SQ) and Receive Queue (RQ)
● Reliable/Unreliable, Connected/Unconnected (Datagram)
● Support for atomic operations
● Error counters in HW
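On Linux the hardware error counters are exposed as one file per counter under `/sys/class/infiniband/<hca>/ports/<port>/counters/`. The sketch below prints only the counters that are not zero; `show_nonzero_counters` is a hypothetical helper name, and the directory is passed in as an argument so the function can be tried on any directory of number files.

```shell
#!/bin/sh
# Print "name=value" for every counter file in a directory whose value
# is not 0. Hypothetical helper; the directory layout follows the
# InfiniBand sysfs counters directory (one file per counter).
show_nonzero_counters() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip if the glob matched nothing
        val=$(cat "$f")
        if [ "$val" != "0" ]; then
            echo "$(basename "$f")=$val"
        fi
    done
    return 0
}
```

A call for the first port of a Mellanox HCA would look like `show_nonzero_counters /sys/class/infiniband/mlx4_0/ports/1/counters`, assuming the device is named `mlx4_0`.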
Host Channel Adapters (HCA)
● Mellanox QDR: ConnectX-2 VPI (driver: mlx4_ib)
● QLogic/Intel QDR: 7300 Series (driver: qib)
better for the DC/cloud
Internet Wide Area RDMA Protocol (iWARP)
● RDMA Network Interface Card (RNIC)
● Connection-oriented (TCP), only RDMA technology routable through the Internet
● Reliable Connected (RC) only
● Latency, bandwidth: >= 3 µs, usually 10 Gbit/s
● Vendors: Chelsio (driver cxgb3/4), Intel NetEffect (driver nes)
RDMA over Converged Ethernet (RoCE)
● Limited to a single Ethernet broadcast domain
● InfiniBand frame encapsulation (IBoE)
● GID is composed of MAC address + reserved
● Better suited under congestion
● Scaling issues in big data center setups
● Latency, bandwidth: < 2 µs, 10/40 Gbit/s
● Vendors: Mellanox (driver mlx4_en), Emulex (driver ocrdma)
RDMA Software + Network Protocols
OpenFabrics Enterprise Distribution (OFED)
● Approx. 30 SW packages
● Upstream version: 3.5
● IB Verbs: Hardware/OS abstraction layer
● One IB verbs user-space driver per RDMA HW
● IB Subnet Management (e.g. opensm)
● Communication Management (CM)
● Performance and diagnosis tools + utilities
RDMA Network Protocols
● IP over InfiniBand (IPoIB)
● iSCSI Extensions for RDMA (iSER)
● SCSI RDMA Protocol (SRP)
● Network File Systems (NFS-RDMA)
● Distributed File Systems (GlusterFS, Lustre)
SRP vs. iSER
iSCSI Extensions for RDMA (iSER)
Target
● user: STGT
● kernel: Solaris COMSTAR, (LIO isert, kernel 3.10)

● Mellanox pushes iSER and STGT
● No advanced features with STGT like live resizing
● ProfitBricks chose Solaris for ZFS and iSER
● LIO isert is too new
iSCSI Extensions for RDMA (iSER)
open-iscsi Initiator
● user: iscsid
● kernel: ib_iser, libiscsi, scsi_transport_iscsi, (ib_ipoib)

● Complexity
● Multiple maintainers
● Major IPoIB bugs
● IP-based DDoS reconnect
● Mellanox is mainly improving performance
● Too unstable for IB
SCSI RDMA Protocol (SRP)
Target
● user: none
● kernel: SCST ib_srpt, Solaris COMSTAR, (LIO ib_srpt)

● Very committed SCST maintainers Bart and Vlad (Bart Van Assche, Vladislav Bolkhovitin)
● ProfitBricks chose SCST due to ZFS and iSER issues
● LIO SRP unstable/unusable
SCSI RDMA Protocol (SRP)
Initiator
● user: (srp-tools)
● kernel: ib_srp, scsi_transport_srp

● Simplicity: RDMA-only, kernel-only possible
● Inactive maintainer
● No fast IO failing, no continuous reconnect
● Losing SCSI disks
● Bart + Mellanox are active
● Bart's work doesn't fit us
ProfitBricks Choices
● Simplicity = Stability → SRP without srp-tools
● Help improve SCST
● Improved SRP initiator ourselves
  ● Just fast IO failing + automatic reconnect
  ● Never lose SCSI devices automatically
● Published SRP initiator fixes
● Implement RDMA into QEMU for performance
SRP Fixes
● From Bart: https://github.com/bvanassche/ib_srp-backport
● From ProfitBricks: https://github.com/sriemer/ib_srp
● Bart also has performance patches + backport
● Bart uses the srp-tools + losing SCSI devices
● Gradually finding compromises
Establish an SRP connection
THCA_GUID="0002c903004ed0b2"                    # target HCA GUID
TGID_P1="fe800000000000000002c903004ed0b3"      # target GID of port 1
PKEY="ffff"
IHCA="mlx4_0"                                   # initiator HCA
IHCA_P1="1"                                     # initiator HCA port
SRP="id_ext=${THCA_GUID},ioc_guid=${THCA_GUID},dgid=${TGID_P1},pkey=${PKEY},service_id=${THCA_GUID}"
echo "${SRP}" > /sys/class/infiniband_srp/srp-${IHCA}-${IHCA_P1}/add_target
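The login string can also be assembled by a small helper that checks the field lengths before anything is written to sysfs. This is a sketch under assumptions: `srp_target_string` is a hypothetical function, and using the target HCA GUID for `id_ext`, `ioc_guid` and `service_id` simply mirrors the example from the slide.

```shell
#!/bin/sh
# Build the add_target string for ib_srp from a target HCA GUID and a
# target port GID. Hypothetical helper; it only validates string lengths.
srp_target_string() {
    guid=$1
    dgid=$2
    pkey=${3:-ffff}    # default partition key, as in the slide
    if [ ${#guid} -ne 16 ]; then
        echo "GUID must have 16 hex digits" >&2
        return 1
    fi
    if [ ${#dgid} -ne 32 ]; then
        echo "GID must have 32 hex digits" >&2
        return 1
    fi
    echo "id_ext=${guid},ioc_guid=${guid},dgid=${dgid},pkey=${pkey},service_id=${guid}"
}

srp_target_string 0002c903004ed0b2 fe800000000000000002c903004ed0b3
```

The output can then be redirected into the `add_target` file of the initiator port, for example `/sys/class/infiniband_srp/srp-mlx4_0-1/add_target`.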
InfiniBand/RDMA Links/Information
● InfiniBand Trade Association (IB specification, doc, www.infinibandta.com)
● OpenFabrics Alliance (OFA, OFED providers, www.openfabrics.org)
● Mellanox Technologies (www.mellanox.com)
● [email protected] mailing list
● LinkedIn group "InfiniBand Technologists"
Questions?
● [email protected]
● www.profitbricks.com
Bonus: How to do replication right?
[Figure: three replication setups, each attached via SRP/iSER/iSCSI; other labels in the figure: IP, ClusterManager]
● Primary/Secondary: WRONG! Store & forward writes! Slow!
● Primary/Primary: WRONG! Complex, error-prone!
● Two LUNs, mirrored on the initiator side, e.g. SW RAID-1: RIGHT! Simple and fast!
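On Linux, the SW RAID-1 variant could be set up with `mdadm` on the initiator, mirroring the two imported LUNs. The slide does not name a tool, so `mdadm` is an assumption here, and `/dev/md0`, `/dev/sdb` and `/dev/sdc` are hypothetical device names. The sketch only prints the command instead of running it.

```shell
#!/bin/sh
# Dry run: print the mdadm command that would mirror two LUNs.
# mirror_cmd is a hypothetical helper; all device names are placeholders.
mirror_cmd() {
    md=$1
    lun_a=$2
    lun_b=$3
    if [ "$lun_a" = "$lun_b" ]; then
        echo "need two different LUNs" >&2
        return 1
    fi
    echo "mdadm --create $md --level=1 --raid-devices=2 $lun_a $lun_b"
}

mirror_cmd /dev/md0 /dev/sdb /dev/sdc
```

Each write then goes from the initiator to both LUNs directly, so no storage server has to forward it to a peer.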