
Persistent Memory Over Fabrics

Paul Grun, Cray Inc
Stephen Bates, Eideticom
Rob Davis, Mellanox Technologies


© 2018 SNIA Persistent Memory Summit. All Rights Reserved.

Agenda

• Persistent Memory as viewed by a consumer, and some guidance to the fabric community

• Implications for building a Remote Persistent Memory service

• Approaches to prototyping


Scope for Today’s Discussion

• Usage
  • Persistent Memory as a target of memory operations
  • Persistent Memory as a target of I/O operations (out of scope*)

• Locality
  • A PM device accessed over a network
  • A local PM device attached to an I/O bus or a memory channel (out of scope*)

*out of scope for this session


The Consumer’s View
Some guidance to the fabric community

Paul Grun, Cray Inc


What the Consumer Would Like

The means to take full advantage of Remote Persistent Memory

• Treat it like memory as much as possible, while still
• Taking advantage of its persistence characteristics

Who are these consumers, and how will they use this new technology?


Use Case: High Availability, Replication

[Figure panels: What it looks like; How it works]

Usage: replicate data that is stored in local PM across a fabric and store it in remote PM


Use Case: Remote Persistent Memory

Usage: Expand on-node memory capacity, while taking advantage of persistence (or not). Disaggregate memory from compute.

[Figure. What it looks like: apps, each with DDR and a NIC, attached to a remote memory service backed by three PM devices. How it works: labels are user, Remote PM, put, completion.]


Use Case: Shared Persistent Memory

Usage: Information is shared among the elements of a distributed application. Persistence can be used to guard against node failure.

[Figure. What it looks like: apps, each with a NIC, attached to a Remote Shared Memory Service backed by PM. How it works: labels are user, Remote PM, user, put, get, completion, notice.]


Objective for Fabric Developers


Present Remote Persistent Memory to the consumer as much like local memory as practicable

Yes, but what does that mean?


Memory vs I/O

I/O: An extent (block) of data is transferred between memory and a storage device. On the storage device, the block is identified by an abstract, protocol-specific identifier (e.g. an LBA). Uses asynchronous I/O techniques.

Memory operations: Data is moved between a CPU register and a memory location. Memory location is identified by a real or virtual memory address. Fast and synchronous, while avoiding CPU stalls.
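The contrast between the two models can be sketched in a few lines of Python. Everything in this sketch is a hypothetical toy: the names `BlockDevice` and `MemoryRegion`, the 512-byte block size, and the byte counts it returns are illustrative assumptions, not an API from the talk.

```python
BLOCK_SIZE = 512  # assumed block size for the toy block device


class BlockDevice:
    """I/O style: whole blocks, addressed by an abstract identifier (an LBA)."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]

    def write_block(self, lba, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("block I/O moves whole blocks only")
        self.blocks[lba] = bytes(data)

    def read_block(self, lba):
        return self.blocks[lba]


class MemoryRegion:
    """Memory style: any number of bytes, addressed by a memory address."""

    def __init__(self, size):
        self.mem = bytearray(size)

    def put(self, addr, data):
        if addr < 0 or addr + len(data) > len(self.mem):
            raise IndexError("address outside the region")
        self.mem[addr:addr + len(data)] = data

    def get(self, addr, length):
        return bytes(self.mem[addr:addr + length])


def update_counter_block_style(dev, lba, offset, value):
    """Changing 8 bytes on a block device is a read-modify-write of a block."""
    block = bytearray(dev.read_block(lba))
    block[offset:offset + 8] = value.to_bytes(8, "little")
    dev.write_block(lba, block)
    return BLOCK_SIZE * 2  # bytes moved: one block read plus one block written


def update_counter_memory_style(region, addr, value):
    """The same update expressed as a single 8-byte put to an address."""
    region.put(addr, value.to_bytes(8, "little"))
    return 8  # bytes moved
```

In this toy, the block-style update moves 1024 bytes to change 8, while the memory-style update moves only the 8 bytes it changes.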


So What Do We Have to Do?

• Streamline the API to look more like memory operations
→ Use memory references instead of storage identifiers
→ Focus on puts and gets instead of block read/block write

• Manage asynchronicity – a necessary evil
→ Explicit control over when persistence occurs
→ Create synchronization points using fabric acknowledgements

• Make it FAST
→ Emphasize wire efficiency, eliminate round trips
→ No software in the target


Streamlined APIs

[Figure: software stack split into a Storage side and a Memory side. Labels: user or kernel app; POSIX I/F; load/store, memcopy…; file system; virtual switch; storage client; provider; NIC; DIMM; NVDIMM; NVM devices (SSDs…); remote storage, remote persistent memory.]

Think about accessing PM using memory paradigms, as opposed to storage paradigms… even if the data at the far end is stored as a memory-mapped file.


Managing Asynchronicity

Storage protocols have a sync point built into the protocol, but operations to remote persistent memory do not.
→ Mechanisms to control persistence, and to achieve synchronization, have to be added to the API and the fabric protocol

[Figure: I/O Transaction (I/O Request, Data Transfer, I/O Response, sync point) versus Persistent Memory Operation (Memory Request, ACK).]


Managing Asynchronicity

[Figure: data pipeline from Requester to Responder: local buffer, EP, EP, persistent memory. Other labels: requester (consumer); MSG, RMA request; completion; Persistence notification; Write Persistent.]

Acknowledgement points along the pipeline:

• ack-a: local buffer can be reused
• ack-b: data has been received by the endpoint
• ack-c: data is globally visible
• ack-d: data is persistent
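The four acknowledgement points (ack-a through ack-d) can be read as increasingly strong completion levels that a requester chooses to wait for. The sketch below is a hypothetical model, not a fabric API: `Ack`, `ToyResponder`, and `put` are invented names, and the responder deliberately models the worst case, in which data has progressed no further than the level the requester waited for.

```python
from enum import IntEnum


class Ack(IntEnum):
    """Completion levels, ordered from weakest to strongest guarantee."""
    BUFFER_REUSABLE = 1  # ack-a: local buffer can be reused
    RECEIVED = 2         # ack-b: data has been received by the endpoint
    VISIBLE = 3          # ack-c: data is globally visible
    PERSISTENT = 4       # ack-d: data is persistent


class ToyResponder:
    """Hypothetical responder with a receive buffer, visible memory and PM."""

    def __init__(self):
        self.rx_buffer = {}
        self.visible = {}
        self.persistent = {}

    def advance(self, addr, data, level):
        """Worst case: data moves down the pipeline only as far as `level`."""
        if level >= Ack.RECEIVED:
            self.rx_buffer[addr] = data
        if level >= Ack.VISIBLE:
            self.visible[addr] = data
        if level >= Ack.PERSISTENT:
            self.persistent[addr] = data

    def power_loss(self):
        """Only the persistent copy survives a power loss."""
        self.rx_buffer.clear()
        self.visible.clear()


def put(responder, addr, data, wait_for):
    """Issue a put and return the completion level the requester asked for."""
    responder.advance(addr, bytes(data), wait_for)
    return wait_for
```

A requester that waits only for `Ack.VISIBLE` cannot assume its data survives `power_loss()`; one that waits for `Ack.PERSISTENT` can.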


What We’ll Need

[Figure labels: user app; PM client; provider; NIC; NIC; provider; PM server; PM device(s).]

Enhanced APIs
- Expose PM device characteristics via the API
- Memory registration
- Put/Get style semantics
- Control persistence
- Completion semantics

- more sophisticated access mechanisms
- more capable NICs
- improved wire protocols

- faster wires (always)
- small message performance
- efficient wire utilization
- synchronization (completions) in hardware


What We’ll Need

• Better APIs
→ Match the lingua franca of the application
→ Incorporate semantics to control Persistence

• Faster wires
→ Small messages at high transaction rates look more like memory ops

• Clever target devices
→ Eliminate layers at the target end

Objective – Make references to remote memory as fast as possible … neglecting the speed of light and other practical considerations


Hardware and Software for PMoF Today!


Stephen Bates, Eideticom


The Holy Grail of PMoF

[Figure: client CPUs, each executing mov eax,[edx], connected through the FABRIC to PM Media on the Target. Caption: The knights that say “c”!]

Loads and stores on a client CPU affect Persistent Memory across the fabric!

We are a loooong way from here!


Coming Soon to a Cinema Near You!

GEN Z: A New Fabric
featuring: Optional coherency, NVMe support, Scale
Coming in 2020

OpenCAPI: The Return of the Big Blue
featuring: Off the CPU bus, Accelerator support, Cache coherency
Now Showing in Select Cinemas

CCIX: The ARMpire Strikes Back
featuring: Off the CPU bus, Accelerator support, Cache coherency, Scale?
Coming Soon??

BUT WHAT CAN I SEE TODAY???


A PMoF Target Today: Hardware (v1-ddr)

[Figure: Fabric I/F (RoCE, IB, iWARP, etc.) attached over PCIe to the CPU (Control Plane), which is attached over DDR to three NVDIMM-N devices (PM Media).]


A PMoF Target Today: Hardware (v1-ddr)

Houston, we have some problems….


A PMoF Target Today: Hardware (v1-ddr)

● Fabric I/F
○ Requires CPU utilization on the client side.
○ Not true load/store on client.
○ Challenging to scale.
○ Non-coherent (client and target)

● Control Plane
○ Uh, why is the CPU in the way?
○ Very CPU/ISA dependent
■ DDIO
■ cache effects

● PM Media
○ Expensive
○ Not hot-swappable
○ Capacity/Scale issues
○ MoBo support (ADR) required


NVMe Persistent Memory Regions

● A standards based PCIe PM interface!

● Takes a Controller Memory Buffer and adds persistence.

● Essentially a persistent PCIe BAR.

● Includes methods for write barriers and flushing.

● Not ARCH specific.

● Can be tied to RDMA so remote flush/barrier is possible.

● NVDIMM-PCIe ;-)....

http://news.toshiba.com/press-release/electronic-devices-and-components/toshiba-memory-corporation-introduces-worlds-first-e
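The "write, then flush before treating the data as durable" pattern that a persistent memory-mapped region calls for can be sketched with an ordinary file. This is only a stand-in: a regular temporary file and Python's `mmap.flush()` take the place of a mapped PMR, the helper names `make_region` and `write_then_flush` are hypothetical, and the actual PMR flush and barrier mechanisms are defined by the NVMe specification and are not shown here.

```python
import mmap
import os
import tempfile


def make_region(size=mmap.PAGESIZE):
    """Create a zero-filled backing file that stands in for the mapped region."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(size))
    return path


def write_then_flush(path, offset, payload):
    """Store bytes into the mapping, then flush before reporting success."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as region:
            region[offset:offset + len(payload)] = payload  # the store
            region.flush()  # the flush: write the mapping back before returning
    return len(payload)
```

The point of the sketch is the ordering: the caller learns that the write succeeded only after the flush has returned.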


A PMoF Target Today: Hardware (v2-pcie)

[Figure: Fabric I/F (RoCE, IB, iWARP, etc.) and the CPU (Control Plane) attached over PCIe to a PCIe Switch, which fans out over PCIe to three devices w/PMR (PM Media).]


Building a PMoF Target Today: Hardware (v2-pcie)

Houston, we have fewer problems….


A PMoF Target Today: Hardware (v2-pcie)

● Fabric I/F
○ Requires CPU utilization on the client side.
○ Not true load/store on client.
○ Challenging to scale.
○ Non-coherent (client and target)

● Control Plane
○ Uh, why is the CPU in the way?
○ Very CPU/ISA dependent
■ DDIO
■ cache effects

● PM Media
○ Expensive
○ Not hot-swappable
○ Capacity/Scale issues
○ MoBo support (ADR) required

● Also
○ Decouples target-side CPU DDR from performance
○ Decouples target-side CPU PCIe from performance
○ Fabric I/F can be upgraded in time (Star Wars!)


Building a PMoF Target Today: Software

Is there anybody out there?


A PMoF Target Today: Software

Protocol      Low-Latency   Memory Semantic   Storage Features   Over a Fabric   Comment
NVMe-oF       Yes           No                Yes                Yes             Block today, could be updated for PMRs
Ext4 w/DAX    Yes           Yes               Yes                No              DAX needs to be applied to remote FS
nPFS [1]      Yes           Yes               Yes                Yes             Does not actually exist yet ;-)
ZuFS          Yes           Yes               Yes                No              Cool but does not support fabrics
PMEM [2]      Yes           No                Yes                No              Turns PM into a block device.
librpmem [3]  Yes           Yes               Yes                Yes             Library to build upon!

[1] https://tools.ietf.org/id/draft-hellwig-nfsv4-rdma-layout-00.html
[2] https://nvdimm.wiki.kernel.org/
[3] http://pmem.io/pmdk/librpmem/

Notable mentions include crail.io, memGluster and Octopus….


It’s the software, stupid!


RDMA VERBs Extensions for Persistency and Consistency

Rob Davis, Mellanox


• Remote Direct Memory Access (RDMA) Background


RDMA – What?

• Remote
  • Data transfers between nodes in a network

• Direct
  • No Operating System Kernel involvement in transfers
  • Everything about a transfer offloaded onto Interface Card

• Memory
  • Transfers between user space application virtual memory
  • No extra copying or buffering

• Access
  • Send, receive, read, write, atomic operations
  • Byte or Block Access


RDMA - Why?

• Latency (<1 µsec)
• Zero-copy
• Hardware based one sided memory to remote memory operations
• OS and network stack bypasses
• Reliable credit-based data and control delivery by hardware
• Network resiliency
• Scale out with standard converged network (Ethernet/InfiniBand)


RDMA – How?

• Transport built on simple primitives deployed for 15 years in the industry
  • Queue Pair (QP) – RDMA communication end point
  • Connect for establishing connection mutually
  • RDMA Registration of memory region (REG_MR) for enabling virtual network access to memory
  • SEND and RCV for reliable two-sided messaging
  • RDMA READ and RDMA WRITE for reliable one-sided memory to memory transmission
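How memory registration gates one-sided access can be illustrated with a toy model. This is not the verbs API: `ToyHCA`, `reg_mr`, `rdma_write`, and `rdma_read` are hypothetical Python names chosen to echo the primitives above, and the random 32-bit key is an assumption made for illustration.

```python
import secrets


class ToyHCA:
    """Hypothetical model of the responder-side adapter; not a verbs API."""

    def __init__(self):
        self.regions = {}  # rkey -> registered bytearray

    def reg_mr(self, buf):
        """Register a buffer and return the key a remote peer must present."""
        rkey = secrets.randbits(32)
        while rkey in self.regions:
            rkey = secrets.randbits(32)
        self.regions[rkey] = buf
        return rkey

    def _check(self, rkey, offset, length):
        """Reject access to unregistered memory or outside the region."""
        buf = self.regions.get(rkey)
        if buf is None:
            raise PermissionError("unknown rkey: memory is not registered")
        if offset < 0 or offset + length > len(buf):
            raise IndexError("access outside the registered region")
        return buf

    def rdma_write(self, rkey, offset, data):
        """One-sided write: placed in memory without responder software."""
        buf = self._check(rkey, offset, len(data))
        buf[offset:offset + len(data)] = data

    def rdma_read(self, rkey, offset, length):
        """One-sided read of registered memory."""
        buf = self._check(rkey, offset, length)
        return bytes(buf[offset:offset + length])
```

In this model the responder's application code never runs during a transfer; it only registers the buffer once and hands the key to the requester, which is the sense in which the operations are one-sided.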


SEND/RCV

[Figure: SEND/RCV sequence between requester (Host, HCA) and responder (HCA, Host Memory). Labels: Post RCV; Post SEND; SEND First; SEND Middle; SEND Last; ACK; SEND Completion; RCV Completion.]


RDMA WRITE

[Figure: RDMA WRITE sequence between requester (Host, HCA) and responder (HCA, Host Memory).]


RDMA READ

[Figure: RDMA READ sequence between requester (Host, HCA) and responder (HCA, Host Memory). Labels: Register Memory Region; RDMA READ; READ; READ RESPONSE; READ RESPONSE; RDMA READ Completion.]


• RDMA Memory Reliability Extensions for Persistency and Consistency


RDMA Extensions for Memory Persistency and Consistency

• RDMA FLUSH Extension
  • New RDMA operation
  • Higher Performance and Standard
  • Straightforward Evolutionary fit to RDMA Transport

• Target guarantees
  • Consistency
  • Persistency

[Figure: FLUSH sequence between requester (Host, HCA) and responder (HCA, Host Memory).]


RDMA Non Posted WRITE

• Goal: Eliminate 2-Phase-Commit Requester-Fence-Roundtrip
  • …while maintaining RDMA FLUSH (NP) Semantics without blocking posted operations

• New Transport Operation: Non Posted WRITE
  • Leverages Native Non Posted Operations Semantics

• Natural fit with existing transport protocol
  • Ordering
  • Constrained to responder resources limitation of number of outstanding operations
  • Error Handling (e.g. Repeated)

• Two phase commit example
  • Use Non-Posted WRITE after FLUSH for pointer update
  • Avoids need for Requester Side Fencing (extra roundtrip)

[Figure: sequence between requester (Host, HCA) and responder (HCA, Memory). Labels: Data; FLUSH; FLUSH Memory Subsystem; FLUSH Done; Pointer; FLUSH; FLUSH Memory Subsystem; FLUSH Done.]
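The two phase commit example can be sketched as a toy simulation. The model is an assumption made for illustration: `ToyTarget`, `commit`, and `pointer_is_safe` are hypothetical names, writes are treated as volatile until a FLUSH, and the steps simply execute in order at the target. That in-order execution is the property the slide attributes to issuing the pointer update as a Non-Posted WRITE after the FLUSH; the sketch does not model wire timing or the round trip that is saved.

```python
class ToyTarget:
    """Hypothetical responder: writes are volatile until a FLUSH persists them."""

    def __init__(self):
        self.volatile = {}
        self.persistent = {}

    def write(self, key, value):
        """WRITE: placed in memory, not yet guaranteed persistent."""
        self.volatile[key] = value

    def flush(self):
        """FLUSH: everything written so far becomes persistent."""
        self.persistent.update(self.volatile)

    def crash(self):
        """Power loss: only persistent contents remain."""
        self.volatile = dict(self.persistent)


def commit(target, slot, record, crash_after=None):
    """Run the four-step sequence, optionally crashing after a given step."""
    steps = [
        lambda: target.write(("data", slot), record),  # 1. write the record
        target.flush,                                  # 2. FLUSH the record
        lambda: target.write("pointer", slot),         # 3. pointer update
        target.flush,                                  # 4. FLUSH the pointer
    ]
    for number, step in enumerate(steps, start=1):
        step()
        if crash_after == number:
            target.crash()
            break
    return target.persistent


def pointer_is_safe(persistent):
    """A persistent pointer must only reference a record that is persistent."""
    slot = persistent.get("pointer")
    return slot is None or ("data", slot) in persistent
```

Whichever step the crash follows, the persistent pointer either is absent or refers to a record that was already flushed, because the pointer update never runs ahead of the first FLUSH.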


Matches SNIA NVM PM Model
RDMA to PMEM for High Availability

• MAP
  • Memory address for the file
  • Memory address + Registration of the replication

• SYNC
  • Write all the “dirty” pages to remote replication
  • FLUSH the writes to persistency

• UNMAP
  • Invalidate the registered pages for replication

[Figure: sequence diagram with steps numbered 1 to 3. Participants: App: SW; PeerA: Host SW; PeerA NIC: RNic; PeerB NIC: RNic; PeerB PM: PM; PeerB: Host SW. Messages: Map; RDMAOpen; RDMAMmap; registerMemory; Store; OptimizedFlush; RDMAWrite; Write; Flush; Unmap; RDMAUnmap; unregisterMemory.]
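The MAP, SYNC, and UNMAP steps can be sketched as a small class that tracks dirty pages locally and copies only those pages to a replica. All names here (`ReplicatedMapping`, `store`, `sync`, `unmap`) and the 4096-byte page size are assumptions for illustration; the remote side is simulated with two in-process buffers rather than real RDMA.

```python
PAGE = 4096  # assumed page size


class ReplicatedMapping:
    """Hypothetical MAP / SYNC / UNMAP over a local buffer and a remote copy."""

    def __init__(self, size):
        # MAP: local memory for the file plus a registered remote replica.
        self.local = bytearray(size)
        self.remote_volatile = bytearray(size)    # written, not yet flushed
        self.remote_persistent = bytearray(size)  # survives a remote failure
        self.dirty = set()
        self.mapped = True

    def store(self, offset, data):
        """Plain store to the local mapping; remember which pages changed."""
        if not self.mapped:
            raise RuntimeError("mapping has been unmapped")
        if offset < 0 or offset + len(data) > len(self.local):
            raise IndexError("store outside the mapping")
        self.local[offset:offset + len(data)] = data
        first, last = offset // PAGE, (offset + len(data) - 1) // PAGE
        self.dirty.update(range(first, last + 1))

    def sync(self):
        """SYNC: write each dirty page to the replica, then flush once."""
        written = sorted(self.dirty)
        for page in written:
            lo, hi = page * PAGE, (page + 1) * PAGE
            self.remote_volatile[lo:hi] = self.local[lo:hi]  # stands for RDMA WRITE
        self.remote_persistent[:] = self.remote_volatile     # stands for FLUSH
        self.dirty.clear()
        return written

    def unmap(self):
        """UNMAP: invalidate the registration; later stores are rejected."""
        self.mapped = False
```

A store that straddles a page boundary marks both pages dirty, so a single `sync()` replicates both before the one flush.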


• Platforms for trying RDMA for Persistent Memory over Fabrics (PMoF)


Available Today

• SOC reference platform(s)
  • Pluggable interfaces for PM devices
  • High enough performance to test Persistent Memory over Fabrics (PMoF)
    • Over 8M IOPS with 512B MTU
    • Less than 3µsec latency
    • Less than 1% CPU utilization
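As a rough sanity check on the operation rate, the payload bandwidth it implies can be worked out directly. The calculation assumes that each operation carries exactly 512 bytes of payload and ignores protocol headers; the slide itself states only the rate and the 512B MTU, and the function name is hypothetical.

```python
def payload_bandwidth_gbps(ops_per_second, bytes_per_op):
    """Payload bandwidth in gigabits per second (1 Gb = 10**9 bits)."""
    return ops_per_second * bytes_per_op * 8 / 1e9


# Assumption: each of the 8 million operations per second carries 512 bytes.
rate = payload_bandwidth_gbps(8_000_000, 512)
print(f"{rate:.3f} Gb/s")  # prints "32.768 Gb/s"
```

Under that assumption, 8 million operations per second correspond to about 4.1 GB/s, or roughly 33 Gb/s, of payload.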


Fit the requirements for PMoF testing


• Open source Programmable Control Plane

• Multiple standard 100Gb low latency IO interfaces

• Multiple standard persistent memory interfaces

[Figure labels: Single ASIC; Control Plane; Fabric I/F; PM Media.]