Persistent Memory over Fabrics An Application-centric view
Paul Grun Cray, Inc
OpenFabrics Alliance Vice Chair
© 2017 SNIA Persistent Memory Summit. All Rights Reserved.
Agenda
- OpenFabrics Alliance Intro
- OpenFabrics Software
- Introducing OFI – the OpenFabrics Interfaces Project
- OFI Framework Overview – Framework, Providers
- Delving into Data Storage / Data Access
- Three Use Cases
OpenFabrics Alliance
The OpenFabrics Alliance (OFA) is an open source-based organization that develops, tests, licenses, supports and distributes OpenFabrics Software (OFS). The Alliance’s mission is to develop and promote software that enables maximum application efficiency by delivering wire-speed messaging, ultra-low latencies and maximum bandwidth directly to applications with minimal CPU overhead.
https://openfabrics.org/index.php/organization.html
OpenFabrics Alliance – selected statistics
- Founded in 2004
- Leadership
  - Susan Coulter, LANL – Chair
  - Paul Grun, Cray Inc. – Vice Chair
  - Bill Lee, Mellanox Inc. – Treasurer
  - Chris Beggio, Sandia – Secretary (acting)
  - 14 active Directors/Promoters (Intel, IBM, HPE, NetApp, Oracle, Unisys, Nat'l Labs…)
- Major Activities
  1. Develop and support open source network stacks for high performance networking – OpenFabrics Software (OFS)
  2. Interoperability program (in concert with the University of New Hampshire InterOperability Lab)
  3. Annual Workshop – March 27-31, Austin TX
- Technical Working Groups
  - OFIWG, DS/DA, EWG, OFVWG, IWG
  - https://openfabrics.org/index.php/working-groups.html (archives, presentations, all publicly available)
Today's focus: developing and supporting OpenFabrics Software (OFS)
OpenFabrics Software (OFS)
[Diagram: OFS today – the verbs API (uverbs/kverbs) running over InfiniBand, RoCE (IB transport over ENET), and iWARP (ENET) physical wires]
OpenFabrics Software
- Open source APIs and software for advanced networks
- Emerged along with the nascent InfiniBand industry in 2004
- Soon thereafter expanded to include other verbs-based networks such as RoCE and iWARP
- Wildly successful, to the point that RDMA technology is now being integrated upstream. Clearly, people like RDMA.
- OFED distribution and support: managed by the Enterprise Working Group (EWG)
- Verbs development: managed by the OpenFabrics Verbs Working Group (OFVWG)
Historically, network APIs have been developed ad hoc as part of the development of a new network. To wit: today's Verbs API is the implementation of the verbs semantics specified in the InfiniBand Architecture. But what if a network API were developed that catered specifically to the needs of its consumers, and the network beneath it were allowed to develop organically? What would the resulting API look like?
© 2017 SNIA Persistent Memory Summit. All Rights Reserved.
Introducing the OpenFabrics Interfaces Project
- OpenFabrics Interfaces Project (OFI)
  – Proposed jointly by Cray and Intel, August 2013
  – Chartered by the OpenFabrics Alliance, w/ Cray and Intel as co-chairs
- Objectives
  - 'Transport neutral' network APIs
  - 'Application centric', driven by application requirements
“Transport Neutral, Application-Centric”
OFI Charter Develop, test, and distribute:
1. An extensible, open source framework that provides access to high-performance fabric interfaces and services
2. Extensible, open source interfaces aligned with ULP and application needs for high-performance fabric services
OFIWG will not create specifications, but will work with standards bodies to create interoperability as needed
OpenFabrics Software (OFS)
[Diagram: OpenFabrics Software with OFI added – alongside the verbs API (uverbs/kverbs over IB, RoCE, iWARP), libfabric and kfabric sit atop the OFI provider layer, with each provider driving its own fabric]
Result:
1. An extensible interface driven by application requirements
2. Support for multiple fabrics
3. Exposes an interface written in the language of the application
OFI Framework
[Diagram: OFI framework – an application (e.g. MPI) calls the OFI API; the framework exposes a set of service interfaces, implemented by Provider A, Provider B, Provider C…, each driving its own NIC]
OFI consists of two major components:
- a set of defined APIs (and some functions) – libfabric, kfabric – a series of interfaces for accessing fabric services
- a set of wire-specific 'providers' – 'OFI providers'
Think of a provider as an implementation of the API on a given wire; each provider exports the services defined by the API.
OpenFabrics Interfaces - Providers
[Diagram: libfabric/kfabric atop the OFI provider layer – sockets provider (ENET), verbs provider (IB), verbs provider for RoCEv2/iWARP (ENET), usnic provider (ENET), GNI provider (ARIES), …]
All providers expose the same interfaces; none of them expose details of the underlying fabric.
Current providers: sockets, verbs, usnic, gni, mlx, psm, psm2, udp, bgq
OFI Framework – a bit more detail
[Diagram: OFI services in more detail – network consumers sit on the framework, which defines Control Services (Discovery), Communication Services (Connection Management, Address Vectors), Completion Services (Event & Completion Queues, Counters), and Data Transfer Services (Msg Queues, RMA, Tag Matching, Atomics, Triggered Operations); the provider implements each service and drives the NIC]
OFI Project Overview – work group structure

Application-centric design means that the working groups are driven by use cases: Data Storage / Data Access, Distributed and Parallel Computing…
- OFI WG
  - Distributed and Parallel Computing: msg passing (MPI applications); shared memory (PGAS languages)
  - Data Analysis: structured data, unstructured data
  - Legacy Apps: sockets apps, IP apps
- DS/DA WG
  - Data Storage: object, file, block; storage class memory
  - Data Access: remote persistent memory
Two API flavors: kernel mode – 'kfabric'; user mode – 'libfabric'
Since the topic today is Persistent Memory…
Data Storage, Data Access?
[Diagram: data storage / data access path – a user or kernel app on the storage client issues POSIX I/O or load/store/memcopy operations, which travel through a provider and NIC to a server exposing a file system over NVM devices (SSDs…) and non-volatile memory (NVDIMMs – storage class memory, persistent memory)]
Reminder: libfabric – user mode library; kfabric – kernel modules
Data Storage / Data Access Working Group
- DS – object, file, block; storage class memory
- DA – persistent memory
Key Use Cases:
1. Lustre
2. NVMe
3. Persistent Memory
Data Storage – Lustre example
[Diagram: Lustre clients (C) and servers (S – OSSs, MDSs) speak the Lustre RPC protocol over LNET; beneath LNET, an LND (xxxlnd) drives RDMA over each fabric, and a new network type leaves a gap (??) where an LND is needed]
Adding a new network type may require creating a new LND.
Data Storage – Lustre LND Architecture
drawing courtesy of Intel
Data Storage – NVMe over fabrics
[Diagram: today's parallel software stacks – a PSM-based stack (PSM2 over OPA), an MXM-based stack (MXM over IB), and OpenFabrics Software (uverbs/kverbs over IB, RoCE, iWARP) each provide their own path to the wire]
NVMe is a block storage protocol designed for operation with storage class memory devices
Data Storage – NVMe/F today
NVMe/F is an extension allowing access to an NVMe device over a fabric. NVMe/F leverages the characteristics of verbs to good effect for verbs-based fabrics.
[Diagram: a kernel application reaches NVMe/F through VFS / Block I/O / Network FS / LNET, then kverbs to an HCA, NIC, or RNIC (IB, RoCE, iWARP); the legacy SCSI path runs iSCSI over sockets/IP to a NIC]
Data Storage – NVMe over future fabrics
[Diagram: NVMe over future fabrics – NVMe/F atop kfabric, with OFI providers (including a verbs provider) each driving their own fabric, alongside the existing kverbs path (IB, RoCE, iWARP)]
Data Storage – Future Lustre LND Architecture
drawing courtesy of Intel
Data Storage – enhanced APIs
[Diagram: the same stack with kfabric added – the kernel application reaches NVMe or SCSI through VFS / Block I/O / Network FS / LNET, then out via kverbs (HCA, NIC, RNIC – IB, RoCE, iWARP), sockets/iSCSI (NIC), or kfabric → provider → fabric-specific device]
- kfabric as a second native API for NVMe
- kfabric as a possible 'LND' for LNET
A kfabric verbs provider exists today (not upstream), meaning that kfabric can be run over a conventional verbs-based fabric (RoCE, iWARP) with no modifications. Native kfabric providers have yet to emerge.
23
© 2017 SNIA Persistent Memory Summit. All Rights Reserved.
A look at Persistent Memory
Applications tolerate long delays for storage, but assume very low latency for memory.
- Storage systems are generally asynchronous, target driven, optimized for cost
- Memory systems are synchronous, and highly optimized to deliver the lowest imaginable latency with no CPU stalls
Persistent memory over fabrics is somewhere in between: much faster than storage, but not as fast as local memory (usually).
How to treat PM over fabrics?
- Build tremendously fast remote memory networks, or
- Find ways to hide the latency, making it look for the most part like memory, but cheaper and remote
[Diagram: tolerance for latency – local memory (pS) → remote PM (uS) → remote storage (mS)]
Data Access – PM operations
[Diagram: PM operations – a user app on the PM client writes through a provider and NIC; the PM server forwards to its NVM device(s) over the I/O bus or memory bus]
For 'normal' fabrics, the responder returns an ACK when the data has been received by the endpoint.
- It may not yet be globally visible, and/or
- It may not yet be persistent
An efficient mechanism is needed for indicating when the data is globally visible, and when it is persistent.
Data Access – key fabric requirements
[Diagram: the requester (consumer) issues fi_send / fi_rma from app memory toward the responder's persistent memory; as data moves down the responder's I/O pipeline, three distinct acknowledgements are possible – ack-a: data is received, ack-b: data is visible, ack-c: data is persistent – and the requester's completion can be tied to any of them]
Caution: OFA does not define wire protocols; it defines the semantics seen by the consumer.
Objectives:
1. Give the client app control over commits to persistent memory
2. Return to the client distinct indications of global visibility and persistence
3. Optimize completion protocols to avoid round trips
Possible converged I/O stack
[Diagram: possible converged I/O stack –
- local I/O: kernel application → VFS / Block Layer → NVMe / SCSI → SSD (PCIe) or HBA
- local byte-addressable: user app → byte access → NVDIMM (memory bus)
- remote I/O: kernel application → VFS / Block I/O / Network FS / LNET → SRP, iSER, NVMe/F, NFSoRDMA, SMB Direct, LNET/LND,… over kverbs (HCA, NIC, RNIC – RoCE, iWarp), sockets/iSCSI (NIC), or kfabric* → provider → fabric-specific device
- remote byte-addressable: user app → byte access → libfabric → provider
(* Verbs exists as a kfabric provider today)]
Discussion - Between Two Ferns
Doug Voigt – SNIA NVM PM TWG Chair; Paul Grun – OFA Vice Chair, co-chair OFIWG, DS/DA
- Are we going in the right direction? Are we aligned? Perfectly aligned? Almost aligned?
- Are we focusing on the right use cases?
- How do we go forward from here?
Thank You
Backup
kfabric API – very similar to the existing libfabric
Consumer APIs
• kfi_getinfo() kfi_fabric() kfi_domain() kfi_endpoint() kfi_cq_open() kfi_ep_bind()
• kfi_listen() kfi_accept() kfi_connect() kfi_send() kfi_recv() kfi_read() kfi_write()
• kfi_cq_read() kfi_cq_sread() kfi_eq_read() kfi_eq_sread() kfi_close() …
Provider APIs
• kfi_provider_register() – during kfi provider module load, a call to kfi_provider_register() supplies the kfi API with a dispatch vector for the kfi_* calls.
• kfi_provider_deregister() – during kfi provider module unload/cleanup, kfi_provider_deregister() destroys the kfi_* runtime linkage for the specific provider (ref counted).
kfabric API
http://downloads.openfabrics.org/WorkGroups/ofiwg/dsda_kfabric_architecture/
Data Storage – Current Lustre LND Architecture
[Diagram: current Lustre LND architecture – ptlrpc → LNET → socklnd (sockets/TCP/IP → Ethernet NIC), gnilnd (Aries → Gemini), o2iblnd (OFED → IB/Enet → RoCEv2 NIC, IB HCA); a new network type has no LND (???)]
Adding a new network type requires introducing a new LND
Data Storage – Current Lustre LND Architecture
[Diagram: ptlrpc → LNET → socklnd, gnilnd, o2iblnd as before, plus a newlnd driving the new fabric and new NIC]
One option is to define a new LND for each new network type
Data Storage – Current Lustre LND Architecture
[Diagram: ptlrpc → LNET ported to kfabric – socklnd and gnilnd remain, while a kfabric verbs provider drives the RoCEv2 NIC / IB HCA and a kfabric new-net provider drives the new NIC; no new LND required]
Or, backport LNET to kfabric, allowing use of any kfabric provider
Data Access – PM completion semantics
By analogy, storage protocols include an 'ending status' phase, signaling to the client that the data has been safely stored. No such client/server model exists for a remote persistent memory device; hence, there is no convenient way to know whether data has been safely stored within the persistence domain.
- I/O transaction: synchronization is achieved by the underlying I/O protocol (I/O request → data transfer → I/O response; the response marks the sync point)
- Persistent memory operation: synchronization occurs in hardware (memory request → acknowledge)