Octopus: an RDMA-enabled Distributed Persistent Memory...

21
Octopus: an RDMA-enabled Distributed Persistent Memory File System Youyou Lu 1 , Jiwu Shu 1 , Youmin Chen 1 , Tao Li 2 1 Tsinghua University 2 University of Florida 1

Transcript of Octopus: an RDMA-enabled Distributed Persistent Memory...

Octopus:anRDMA-enabledDistributedPersistentMemoryFileSystem

YouyouLu1,Jiwu Shu1,YouminChen1,TaoLi2

1TsinghuaUniversity2UniversityofFlorida

1

Outline

•BackgroundandMotivation•OctopusDesign•Evaluation•Conclusion

2

NVMM&RDMA

•NVMM (PCM,ReRAM,etc)• Datapersistency• Byte-addressable• Lowlatency

•RDMA• Remotedirectaccess• Bypassremotekernel• Lowlatencyandhighthroughput

Client ServerRegistered Memory Registered Memory

HCA HCACPU CPU

A

B

C

D

E

3

Modular-DesignedDistributedFileSystem

• DiskGluster• Diskfordatastorage• GigEforcommunication

•MemGluster• Memory fordatastorage• RDMAforcommunication

18ms

Latency(1KBwrite+sync)

98%

2%

Overall

HDD

Software

Network

324us

99.7%

Overall

MEM

Software

RDMA

4

Modular-DesignedDistributedFileSystem

• DiskGluster• Diskfordatastorage• GigEforcommunication

•MemGluster• Memory fordatastorage• RDMAforcommunication

Bandwidth(1MBwrite)

88MB/s

83MB/s

HDD

FileSystem

118MB/sNetwork

6509MB/s

1779MB/s

MEM

FileSystem

6350MB/sRDMA

94%

27%

5

RDMA-enabledDistributedFileSystem•Morethanfasthardware• Itissuboptimaltosimplyreplacethenetwork/storagemodule

•OpportunitiesandChallenges• NVM• Byte-addressability• Significantoverheadofdatacopies

• RDMA• Flexibleprogrammingverbs (message/memorysemantics)• ImbalancedCPUprocessingcapacityvs.networkI/Os

6

Outline

•BackgroundandMotivation•OctopusDesign•Evaluation•Conclusion

7

RDMA-enabledDistributedFileSystem

• ItisnecessarytorethinkthedesignofDFSoverNVM&RDMA

8

Byte-addressabilityofNVMOne-sidedRDMAverbs

Shareddatamanagements

CPUisthenewbottleneckNewdataflowstrategies

FlexibleRDMAverbs EfficientRPCPrimitive

Opportunity Approaches

OctopusArchitecture

N2

NVMM

N2

NVMM

N3

NVMM

HCA HCA HCA

...

SharedPersistentMemoryPool

ClientBClientASelf-IdentifiedRPC

RDMA-basedDataIO

create(“/home/cym”)Read(“/home/lyy”)

It performs remote direct data access just like an Octopus uses its eight legs

9

1.SharedPersistentMemoryPool

• ExistingDFSs• Redundantdatacopy

Client Server

FSImage

UserSpaceBuffer

GlusterFS

PageCache

UserSpaceBuffer

mbuf

NICNIC

mbuf

10

7copy

1.SharedPersistentMemoryPool

• ExistingDFSs• Redundantdatacopy

Client Server

FSImage

UserSpaceBuffer

GlusterFS +DAX

PageCache

UserSpaceBuffer

mbuf

NICNIC

mbuf

11

6 copy

•OctopuswithSPMP• Introducesthesharedpersistentmemorypool• Globalviewofdatalayout

1.SharedPersistentMemoryPool

• ExistingDFSs• Redundantdatacopy

Client Server

FSImage

UserSpaceBuffer UserSpaceBuffer

mbuf

NICNIC

mbuf

12

MessagePool

4 copy

2.Client-ActiveDataI/O

• Server-Active• ServerthreadsprocessthedataI/O• WorkswellforslowEthernet• CPUscaneasilybecomethebottleneckwithfasthardware

• Client-Active• Letclientsread/writedatadirectlyfrom/totheSPMP

C1 timeNIC MEMCPUC2

Lookupfiledata Senddata

Lookupfiledata Sendaddress13

SERVER

3.Self-IdentifiedMetadataRPC• Message-basedRPC• easytoimplement,lowerthroughput• DaRPC[SoCC’14],FaSST[OSDI’16]

• Memory-basedRPC• CPUcoresscanthemessagebuffer• FaRM[NSDI’14]

• Usingrdma_write_with_imm?• Scanbypolling• Imm dataforself-identification

14

MessagePoolHCA

Thead1 Theadn

HCA DATAID

MessagePoolHCA

Thead1 Theadn

HCA DATA

Outline

•BackgroundandMotivation•OctopusDesign•Evaluation•Conclusion

15

EvaluationSetup• EvaluationPlatform

• ConnectedwithMellanoxSX1012switch• EvaluatedDistributedFileSystems•memGluster,runsonmemory,withRDMAconnection• NVFS[OSU],Crail[IBM],optimizedtorunonRDMA•memHDFS,Alluxio,forbigdatacomparison

Cluster CPU Memory ConnectX-3 FDR Number

A E5-2680 * 2 384 GB Yes * 5

B E5-2620 16 GB Yes * 7

16

OverallEfficiency

75%

80%

85%

90%

95%

100%

getattr readdir

Latency Breakdown

software mem network

0

1000

2000

3000

4000

5000

6000

7000

Write Read

Bandwidth Utilization

software mem network

• Softwarelatencyisreducedfrom326usto6us• Achievesread/writebandwidththatapproachestherawstorageandnetworkbandwidth

17

MetadataOperationPerformance

• OctopusprovidesmetadataIOPSintheorderof10#~10%

• Octopuscanscaleslinearly

6.E+03

6.E+04

6.E+05

1 2 3 4 5

MKNOD

glusterfs nvfscrail dmfscrail-poll

8.E+04

8.E+05

8.E+06

1 2 3 4 5

GETATTR

glusterfs nvfscrail dmfscrail-poll

3.E+04

3.E+05

1 2 3 4 5

RMNOD

glusterfs nvfscrail dmfscrail-poll

18

BigDataEvaluation

•Octopuscanalsoprovidebetterperformanceforbigdataapplicationsthanexistingfilesystems.

0

500

1000

1500

2000

2500

3000

write read

TestDFSIO (MB/s)

memHDFS Alluxio NVFS Crail Octopus

0.6

0.7

0.8

0.9

1

1.1

Teragen Wordcount

Normalized Execution Time

memHDFS Alluxio NVFS Crail Octopus

19

Conclusion• ItisnecessarytorethinktheDFSdesignsoveremergingH/Ws•Octopus’sinternalmechanisms• Simplifiesdatamanagementlayerby reducingdatacopies• RebalancesnetworkandserverloadswithClient-ActiveI/O• RedesignsthemetadataRPCanddistributedtransactionwithRDMAprimitives

• EvaluationsshowthatOctopussignificantlyoutperformsexistingfilesystems

20

Q&A

Thanks

21