File System Design for Distributed Persistent Memory
• 9.56 million tickets/day
• 11.6 million payments/day
• 1.48 billion txs/day
• 0.256 million txs/second
NVMM & RDMA
• NVMM (3D XPoint, etc.)
  • Data persistency (see the persist sketch after the figure below)
  • Byte-addressable
  • Low latency
• RDMA
  • Remote direct access
  • Bypass remote kernel
  • Low latency and high throughput
[Figure: RDMA data path between client and server registered memory through the HCAs (steps A–E).]
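Since NVMM is byte-addressable, durability comes from ordinary stores followed by cache-line write-backs and a fence rather than a block I/O path. A minimal C sketch of that persist step, assuming the buffer is already mapped from NVMM (e.g., via DAX) and the CPU supports the clwb instruction (compile with -mclwb):

    /* Persist a byte-addressable NVMM write: store, flush cache lines, fence.
     * Sketch only; assumes the buffer is mapped from NVMM (e.g., a DAX mmap). */
    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>

    #define CACHELINE 64

    static void persist(const void *addr, size_t len)
    {
        uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
        for (; p < (uintptr_t)addr + len; p += CACHELINE)
            _mm_clwb((void *)p);   /* write back dirty cache lines to NVMM */
        _mm_sfence();              /* order the write-backs before later stores */
    }

    void nvmm_write(void *dst, const void *src, size_t len)
    {
        memcpy(dst, src, len);     /* plain store: byte-addressable medium */
        persist(dst, len);         /* make it durable */
    }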
Outline
• Octopus: an RDMA-enabled Distributed Persistent Memory File System
  • Motivation
  • Octopus Design
  • Evaluation
  • Conclusion
• Scalable and Reliable RDMA
Modular-Designed Distributed File System
• DiskGluster
  • Disk for data storage
  • GigE for communication
• MemGluster
  • Memory for data storage
  • RDMA for communication
[Chart: latency breakdown of a 1 KB write+sync. DiskGluster: 18 ms overall, with the HDD accounting for 98% and software plus network for the remaining 2%. MemGluster: 324 us overall, with software accounting for 99.7% of the latency.]
[Chart: bandwidth of a 1 MB write. DiskGluster reaches 83 MB/s, about 94% of the raw HDD bandwidth (88 MB/s; GigE network: 118 MB/s). MemGluster reaches only 323 MB/s (6.9%), far below the raw memory bandwidth (6509 MB/s) and RDMA bandwidth (6350 MB/s).]
RDMA-enabled Distributed File System
• Cannot simply replace the network/storage module
• More than fast hardware…
• New features of NVM
  • Byte-addressability
  • Data persistency
• RDMA verbs (sketched below)
  • Write, Read, Atomics (memory semantics)
  • Send, Recv (channel semantics)
  • Write-with-imm
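For reference, a minimal libibverbs sketch of posting one of the verbs listed above, an RDMA write-with-imm, assuming the connected queue pair, registered memory region, and the peer's address/rkey were set up out of band:

    #include <infiniband/verbs.h>
    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    /* Post an RDMA WRITE_WITH_IMM on an already-connected RC queue pair.
     * qp/mr and the remote address/rkey come from out-of-band setup (assumed). */
    int post_write_with_imm(struct ibv_qp *qp, struct ibv_mr *mr,
                            void *local_buf, uint32_t len,
                            uint64_t remote_addr, uint32_t rkey, uint32_t imm)
    {
        struct ibv_sge sge = {
            .addr   = (uint64_t)(uintptr_t)local_buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode     = IBV_WR_RDMA_WRITE_WITH_IMM; /* one-sided write + immediate */
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_SIGNALED;
        wr.imm_data   = htonl(imm);                 /* delivered to the remote CQ */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad_wr);
    }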
RDMA-enabled Distributed File System
• We choose to redesign the DFS!
Opportunities → Approaches
• Byte-addressability of NVM + one-sided RDMA verbs → shared data management
• CPU is the new bottleneck → new data flow strategies
• Write-with-imm → efficient RPC design
• RDMA atomics → concurrency control
Outline
• Octopus: an RDMA-enabled Distributed Persistent Memory File System
  • Motivation
  • Octopus Design
    • High-Throughput Data I/O
    • Low-Latency Metadata Access
  • Evaluation
  • Conclusion
• Scalable and Reliable RDMA
Octopus Architecture
[Figure: server nodes, each with NVMM and an HCA, export their NVMM as a shared persistent memory pool; clients (e.g., Client 1 issuing create("/home/cym"), Client 2 issuing read("/home/lyy")) access it through self-identified RPC and RDMA-based data I/O.]
1. Shared Persistent Memory Pool
• Existing DFSs: redundant data copies
[Figure: GlusterFS data path from the client's user-space buffer through mbuf and the NICs to the server's mbuf, user-space buffer, page cache, and FS image: 7 copies.]
[Figure: GlusterFS + DAX data path from the client's user-space buffer through mbuf and the NICs to the server-side buffers and FS image: 6 copies.]
• Octopus with SPMP (registration sketched below)
  • Introduces the shared persistent memory pool
  • Global view of data layout
[Figure: Octopus data path from the client's user-space buffer through the message pool and the NICs directly into the FS image in the SPMP: 4 copies.]
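One way to reach this reduced copy count is to register the NVMM-resident file-system image itself with the RDMA NIC, so the HCA can move data directly between the wire and the SPMP without bouncing through a page cache or extra user-space buffers. A hedged sketch; the DAX mount path and pool size are illustrative, not Octopus's actual layout:

    #include <fcntl.h>
    #include <infiniband/verbs.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map an NVMM-backed file (DAX) and register it as one RDMA memory region,
     * making the FS image directly readable/writable by remote one-sided verbs. */
    struct ibv_mr *register_spmp(struct ibv_pd *pd, size_t pool_size)
    {
        int fd = open("/mnt/pmem/octopus_image", O_RDWR);   /* hypothetical path */
        if (fd < 0)
            return NULL;
        void *base = mmap(NULL, pool_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);               /* DAX: loads/stores hit NVMM */
        if (base == MAP_FAILED) {
            close(fd);
            return NULL;
        }
        return ibv_reg_mr(pd, base, pool_size,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);         /* expose to remote clients */
    }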
2. Client-Active Data I/O
• Server-active
  • Server threads process the data I/O
  • Works well for slow Ethernet
  • CPUs can easily become the bottleneck with fast hardware
• Client-active (see the sketch below)
  • Let clients read/write data directly from/to the SPMP
[Figure: timelines contrasting server-active I/O (server looks up the file data and sends the data) with client-active I/O (server looks up the file data and sends only its address).]
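A sketch of the client side of client-active I/O: a small metadata RPC returns where the extent lives in the SPMP (remote address, rkey, length), and the client then pulls the data itself with a one-sided RDMA READ, so no server CPU cycles are spent moving data. The lookup_extent RPC helper is hypothetical:

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Location of a file extent inside the server's SPMP, as returned by a
     * (hypothetical) metadata RPC. */
    struct extent_loc { uint64_t remote_addr; uint32_t rkey; uint32_t len; };

    int lookup_extent(const char *path, struct extent_loc *loc); /* assumed RPC */

    /* Client-active read: fetch file data directly from the SPMP. */
    int client_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                    const char *path, void *buf)
    {
        struct extent_loc loc;
        if (lookup_extent(path, &loc))        /* one small RPC for the address */
            return -1;

        struct ibv_sge sge = {
            .addr   = (uint64_t)(uintptr_t)buf,
            .length = loc.len,
            .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr, *bad = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode     = IBV_WR_RDMA_READ;     /* server CPU is not involved */
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = loc.remote_addr;
        wr.wr.rdma.rkey        = loc.rkey;
        return ibv_post_send(qp, &wr, &bad);  /* completion polled elsewhere */
    }

The single lookup RPC plus one one-sided READ is what shifts data movement from server CPUs to the clients and their HCAs.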
3. Self-Identified Metadata RPC
• Message-based RPC
  • Easy to implement, lower throughput
  • DaRPC, FaSST
• Memory-based RPC
  • CPU cores scan the message buffer
  • FaRM
• Using rdma_write_with_imm (see the sketch below)
  • Requests detected by polling the completion queue
  • Imm data for self-identification
[Figure: server threads and HCA sharing a message pool; with write-with-imm, the immediate data identifies which message-pool entry holds the incoming request.]
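The immediate value travels with the one-sided write and surfaces in the receiver's completion queue, so a server thread can tell which message-pool slot was just written without scanning the whole pool. A hedged sketch of such a server polling loop; the slot layout and handle_request function are illustrative, not Octopus's actual interfaces:

    #include <infiniband/verbs.h>
    #include <arpa/inet.h>
    #include <stddef.h>
    #include <stdint.h>

    #define SLOT_SIZE 4096

    extern char *message_pool;                          /* registered message pool */
    void handle_request(uint32_t client_id, char *req); /* hypothetical handler */

    /* Each completed write-with-imm carries the client id in its immediate data,
     * identifying the slot that was just written. Assumes receive WRs are
     * pre-posted on the QPs. */
    void poll_rpc(struct ibv_cq *cq)
    {
        struct ibv_wc wc;
        for (;;) {
            int n = ibv_poll_cq(cq, 1, &wc);
            if (n == 0)
                continue;                               /* busy polling */
            if (n < 0 || wc.status != IBV_WC_SUCCESS)
                break;                                  /* error handling omitted */
            if (wc.opcode == IBV_WC_RECV_RDMA_WITH_IMM) {
                uint32_t client_id = ntohl(wc.imm_data); /* self-identification */
                handle_request(client_id,
                               message_pool + (size_t)client_id * SLOT_SIZE);
            }
        }
    }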
4. Collect-Dispatch Distributed Transaction
• mkdir and mknod operations need distributed transactions
• Baseline: two-phase locking & commit
  • Distributed logging
  • Two-phase coordination
  • Costs 2 RPCs between coordinator and participants
[Figure: timeline of two-phase locking & commit. Both coordinator and participants take local locks, log begin/context/commit, execute the transaction, write data, and unlock, synchronized by prepare and commit phases; the coordinator waits on the participants in each phase.]
• Collect-dispatch transaction (sketched after the figure)
  • Local logging with remote in-place update
  • Distributed lock service
  • Costs 1 RPC + 2 one-sided operations
[Figure: collect-dispatch timeline. Collect phase: local locking and execution, with the coordinator collecting the participants' results and logging commit/abort locally. Dispatch phase: the coordinator writes updates in place and releases the participants' locks with one-sided verbs.]
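A structural sketch of a collect-dispatch coordinator under the split described above: the collect phase gathers the participants' results over one RPC and logs only at the coordinator, and the dispatch phase applies remote in-place updates and releases locks with one-sided verbs. Every helper here is a placeholder, not Octopus's actual interface:

    #include <stddef.h>
    #include <stdint.h>

    struct writeset { void *buf; size_t len; uint64_t remote_addr; uint32_t rkey; };

    /* Placeholder primitives (RPC, local log, one-sided verbs). */
    int  rpc_collect(int participant, struct writeset *ws);    /* lock + collect results */
    void log_local(const struct writeset *ws, int n);           /* coordinator-only log */
    int  rdma_write(int participant, const struct writeset *ws);/* remote in-place update */
    int  rdma_unlock(int participant);                          /* release remote lock word */

    /* Collect-dispatch: 1 RPC (collect) + one-sided operations (dispatch). */
    int collect_dispatch(int *participants, int n)
    {
        struct writeset ws[16];
        if (n > (int)(sizeof ws / sizeof ws[0]))
            return -1;                       /* sketch assumes few participants */

        /* Collect phase: participants lock locally and return their results;
         * only the coordinator writes a log. */
        for (int i = 0; i < n; i++)
            if (rpc_collect(participants[i], &ws[i]))
                return -1;                   /* abort path omitted */
        log_local(ws, n);                    /* commit point */

        /* Dispatch phase: in-place remote updates and lock release,
         * both with one-sided verbs; no participant-side commit RPC. */
        for (int i = 0; i < n; i++) {
            rdma_write(participants[i], &ws[i]);
            rdma_unlock(participants[i]);
        }
        return 0;
    }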
Evaluation Setup
• Evaluation platform
  • Nodes with 2×2.2 GHz Intel Xeon CPUs
  • Connected with a Mellanox SX1012 switch
• Evaluated distributed file systems
  • memGluster: runs on memory, with RDMA connection
  • NVFS [OSU], Crail [IBM]: optimized to run on RDMA
  • memHDFS, Alluxio: for big data comparison
Overall Efficiency
[Chart: Octopus latency and bandwidth efficiency relative to raw hardware, 75% to 100% scale.]
• Software latency is reduced to 6 us (85% of the total latency)
• Achieves read/write bandwidth that approaches the raw storage and network bandwidth
Metadata Operation Performance
• Octopus provides metadata IOPS in the order of 10^5~10^6
• Octopus scales linearly
Read/Write Performance
• Octopus can easily reach the maximum bandwidth of the hardware with a single client
• Octopus achieves the same bandwidth as Crail even though it adds an extra data copy [not shown]
[Chart: read/write bandwidth; Octopus reaches 5629 MB/s and 6088 MB/s.]
Big Data Evaluation
• Octopus also provides better performance for big data applications than existing file systems
Conclusion
• Octopus provides high efficiency by redesigning the file system software
• Octopus's internal mechanisms
  • Simplify the data management layer by reducing data copies
  • Rebalance network and server loads with client-active I/O
  • Redesign the metadata RPC and distributed transaction with RDMA primitives
• Evaluations show that Octopus significantly outperforms existing file systems
Outline
• Octopus: an RDMA-enabled Distributed Persistent Memory File System
  • Background and Motivation
  • Octopus Design
  • Evaluation
  • Conclusion
• Scalable and Reliable RDMA
RC is hard to scale
[Chart: throughput (Mops/s) vs. number of clients (10 to 800) for RC Write and UD Send; RC Write throughput drops as the client count grows, while UD Send stays stable.]
• MCX353A ConnectX-3 FDR HCA (single port)
• 1 server node, send verbs with 11 client nodes
RC vs UD
• Reliable Connection (RC): one-to-one paradigm
  • Offloading with one-sided verbs
  • Higher performance
  • Reliable
  • Flexible-sized transfers
  • Hard to scale
• Unreliable Datagram (UD): one-to-many paradigm
  • Good scalability
  • Unreliable (risk of packet loss, out-of-order delivery, etc.)
  • Cannot support one-sided verbs
  • MTU is only 4 KB
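The scalability gap already shows up at queue-pair creation: RC needs one connected QP per peer, while a single UD QP can exchange datagrams with many peers. A minimal libibverbs sketch of creating either type, assuming the protection domain and completion queue already exist:

    #include <infiniband/verbs.h>
    #include <string.h>

    /* Create a queue pair of the given type (IBV_QPT_RC or IBV_QPT_UD).
     * With RC, one such QP is needed per remote peer; with UD, one QP can
     * send to and receive from many peers. */
    struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                             enum ibv_qp_type type)
    {
        struct ibv_qp_init_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.send_cq = cq;
        attr.recv_cq = cq;
        attr.qp_type = type;                 /* RC: connected; UD: datagram */
        attr.cap.max_send_wr  = 128;
        attr.cap.max_recv_wr  = 128;
        attr.cap.max_send_sge = 1;
        attr.cap.max_recv_sge = 1;
        return ibv_create_qp(pd, &attr);     /* still needs INIT->RTR->RTS setup */
    }

Per-peer RC QPs are exactly the per-client state that the following slides show overflowing the NIC and CPU caches.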
Why is RC hard to scale?
[Figure: RC write data path between client and server, through the CPUs, last-level caches, memory, and the NIC cache on each side:]
1. Memory-mapped I/O
2. PCIe DMA read
3. Packet sending
4. PCIe DMA write (DDIO enabled)
5. CPU polls message
Why is RC hard to scale?
Two types of resource contention:
• NIC cache [1]
  • Mapping table
  • QP states
  • Work queue elements
• CPU cache
  • DDIO writes incoming data to the LLC
  • Only 10% of the LLC is reserved for DDIO
With RC, the size of cached data is proportional to the number of clients!
[Figure: many clients writing into a server-side message pool.]
Our goal: how to make RC scalable
• Focus on the RPC primitive with RC write
  • RPC is a good abstraction, widely used
  • RC write (one-sided) has higher throughput (FaRM)
• Target the one-to-many data transfer paradigm
  • e.g., MDS, KV store, parameter server
• System-level solution
  • Without any modifications to the hardware
• Deployments
  • Metadata server in Octopus
  • Distributed transactional system