Octopus: an RDMA-enabled Distributed Persistent Memory File System (talk transcript)
Youyou Lu¹, Jiwu Shu¹, Youmin Chen¹, Tao Li²
¹Tsinghua University  ²University of Florida
NVMM & RDMA
• NVMM (PCM, ReRAM, etc.)
  • Data persistency
  • Byte-addressable
  • Low latency
• RDMA
  • Remote direct access (see the sketch after the figure below)
  • Bypass the remote kernel
  • Low latency and high throughput
[Figure: RDMA data path — the client's and server's HCAs transfer data directly between registered memory regions, without involving the server's CPU.]
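To make "remote direct access" concrete (this sketch is not from the talk): a minimal libibverbs example that posts a one-sided RDMA READ against a remote registered buffer. The queue pair `qp`, the local registered region `mr`, and the remote address/rkey pair are assumed to have been exchanged during connection setup.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Minimal sketch of a one-sided RDMA READ: the remote CPU and kernel
 * are not involved; the remote HCA serves the request directly from
 * its registered memory. qp, mr, remote_addr and rkey are assumed to
 * come from prior connection setup. */
static int rdma_read_remote(struct ibv_qp *qp, struct ibv_mr *mr,
                            uint64_t remote_addr, uint32_t rkey,
                            uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)mr->addr, /* local destination */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* remote virtual address */
    wr.wr.rdma.rkey        = rkey;               /* remote region's key */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```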
Modular-Designed Distributed File System
• DiskGluster
  • Disk for data storage
  • GigE for communication
• MemGluster
  • Memory for data storage
  • RDMA for communication
[Chart: latency breakdown for a 1 KB write + sync. DiskGluster: 18 ms overall, dominated (~98%) by the hardware (HDD and network), with software around 2%. MemGluster: 324 us overall, with software accounting for 99.7%.]
Modular-Designed Distributed File System
• DiskGluster
  • Disk for data storage
  • GigE for communication
• MemGluster
  • Memory for data storage
  • RDMA for communication
[Chart: bandwidth for a 1 MB write. DiskGluster: 83 MB/s at the file-system level against 88 MB/s raw HDD and 118 MB/s network bandwidth, i.e. ~94% utilization. MemGluster: 1779 MB/s at the file-system level against 6509 MB/s raw memory and 6350 MB/s RDMA bandwidth, i.e. only ~27% utilization.]
RDMA-enabled Distributed File System
• More than fast hardware
  • It is suboptimal to simply replace the network/storage module
• Opportunities and challenges
  • NVM
    • Byte-addressability
    • Significant overhead of data copies
  • RDMA
    • Flexible programming verbs (message/memory semantics)
    • Imbalanced CPU processing capacity vs. network I/Os
RDMA-enabled Distributed File System
• It is necessary to rethink the design of DFS over NVM & RDMA

Opportunity → Approach
• Byte-addressability of NVM + one-sided RDMA verbs → shared data management
• CPU is the new bottleneck → new data flow strategies
• Flexible RDMA verbs → efficient RPC primitive
Octopus Architecture
[Figure: Octopus architecture — server nodes, each contributing NVMM through its HCA, form a Shared Persistent Memory Pool; clients A and B issue metadata operations such as create("/home/cym") and read("/home/lyy") via Self-Identified RPC and transfer file data via RDMA-based data I/O.]
Octopus performs remote direct data access just as an octopus uses its eight legs.
1. Shared Persistent Memory Pool
• Existing DFSs
  • Redundant data copy
[Figure: data path in GlusterFS — data crosses the client's user-space buffer and mbuf, the NICs, and the server's mbuf, user-space buffer, GlusterFS layer, page cache, and FS image: 7 copies in total.]
1. Shared Persistent Memory Pool
• Existing DFSs
  • Redundant data copy
[Figure: the same data path with GlusterFS + DAX — one copy is eliminated, leaving 6 copies.]
1. Shared Persistent Memory Pool
• Existing DFSs
  • Redundant data copy
• Octopus with SPMP
  • Introduces the shared persistent memory pool
  • Global view of data layout
[Figure: data path in Octopus — the client's user-space buffer and mbuf, the NICs, and the server-side message pool backed directly by the FS image in the SPMP: 4 copies in total.]
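One way to picture the SPMP in code (a sketch under assumed names, not Octopus's actual source): the server maps its NVMM once, uses that mapping directly as the file-system image, and registers the same range with the HCA, so clients can reach the data area with one-sided verbs and the page-cache and staging copies drop out of the data path. The /dev/dax0.0 device path and pool size are placeholders.

```c
#include <infiniband/verbs.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stddef.h>

/* Sketch: expose one NVMM region both as the local FS image and as an
 * RDMA-registered buffer. With the pool registered for remote access,
 * clients reach file data directly with one-sided verbs. */
static struct ibv_mr *map_spmp(struct ibv_pd *pd, size_t pool_size)
{
    int fd = open("/dev/dax0.0", O_RDWR);  /* placeholder DAX device */
    if (fd < 0)
        return NULL;

    void *pool = mmap(NULL, pool_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);  /* direct NVMM mapping */
    if (pool == MAP_FAILED)
        return NULL;

    /* Register the whole pool so remote HCAs may READ/WRITE it. */
    return ibv_reg_mr(pd, pool, pool_size,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```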
2. Client-Active Data I/O
• Server-Active
  • Server threads process the data I/O
  • Works well for slow Ethernet
  • CPUs can easily become the bottleneck with fast hardware
• Client-Active
  • Let clients read/write data directly from/to the SPMP (see the sketch after the figure)
[Figure: server-side timeline — with server-active I/O the server looks up the file data and then sends the data itself (CPU on the data path); with client-active I/O the server looks up the file data and sends only its address, and clients C1/C2 access the data directly.]
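A minimal sketch of the client-active read path, with rpc_lookup and rdma_read as hypothetical helpers standing in for Octopus's RPC layer and the verbs wrapper shown earlier: the server only resolves where the data lives; the transfer itself is a one-sided read that never touches the server's CPU.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers standing in for the metadata RPC layer and a
 * one-sided RDMA READ wrapper (both assumed, for illustration only). */
struct extent { uint64_t remote_addr; uint32_t rkey; uint32_t len; };
extern struct extent rpc_lookup(const char *path, uint64_t offset);
extern int rdma_read(void *dst, uint64_t remote_addr, uint32_t rkey,
                     uint32_t len);

/* Client-active read: the server answers a small metadata RPC with the
 * extent's (address, rkey); the client then pulls the data itself,
 * keeping the server CPU off the data path. */
static int client_active_read(const char *path, uint64_t off, void *buf)
{
    struct extent e = rpc_lookup(path, off); /* server: lookup only   */
    return rdma_read(buf, e.remote_addr,     /* client: fetch payload */
                     e.rkey, e.len);
}
```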
3. Self-Identified Metadata RPC
• Message-based RPC
  • Easy to implement, lower throughput
  • DaRPC [SoCC'14], FaSST [OSDI'16]
• Memory-based RPC
  • CPU cores scan the message buffer
  • FaRM [NSDI'14]
• Using rdma_write_with_imm
  • Scan by polling the completion queue
  • imm data for self-identification (see the sketch below)
[Figure: memory-based RPC — server threads 1…n scan the message pool for incoming DATA; self-identified RPC — the incoming DATA carries an ID in the immediate field, so a thread can locate it directly from the completion queue.]
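A hedged sketch of how rdma_write_with_imm can carry self-identification (assumed buffer layout; not Octopus's exact code): the sender writes its request into its own slot of the server's message pool and puts its client id in the 32-bit immediate field; the receiver polls the completion queue and uses the immediate value to find the slot, instead of scanning the whole pool.

```c
#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Sender sketch: RDMA-write the request into the sender's own slot in
 * the server's message pool, carrying the client id in the immediate
 * field so the write "identifies itself". slot_addr/rkey are assumed
 * to come from setup. */
static int send_self_identified(struct ibv_qp *qp, struct ibv_mr *mr,
                                uint64_t slot_addr, uint32_t rkey,
                                uint32_t client_id, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)mr->addr,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.imm_data            = htonl(client_id);  /* self-identification */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = slot_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad);
}

/* Receiver sketch: instead of scanning the message pool, poll the CQ;
 * the immediate value names the slot to read. (A receive work request
 * must have been pre-posted for each incoming immediate.) */
static int poll_request(struct ibv_cq *cq, uint32_t *client_id)
{
    struct ibv_wc wc;

    if (ibv_poll_cq(cq, 1, &wc) <= 0)
        return -1;                                /* nothing yet */
    if (wc.status == IBV_WC_SUCCESS &&
        wc.opcode == IBV_WC_RECV_RDMA_WITH_IMM) {
        *client_id = ntohl(wc.imm_data);          /* which slot */
        return 0;
    }
    return -1;
}
```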
Evaluation Setup
• Evaluation platform (see table below)
  • Connected with a Mellanox SX1012 switch
• Evaluated distributed file systems
  • memGluster: runs on memory, with RDMA connection
  • NVFS [OSU], Crail [IBM]: optimized to run on RDMA
  • memHDFS, Alluxio: for big-data comparison

Cluster  CPU          Memory  ConnectX-3 FDR  Number
A        E5-2680 × 2  384 GB  Yes             5
B        E5-2620      16 GB   Yes             7
Overall Efficiency
[Charts: latency breakdown for getattr and readdir (stacked percentage) and bandwidth utilization for write and read (0–7000 MB/s), each decomposed into software, mem, and network components.]
• Software latency is reduced from 326 us to 6 us
• Achieves read/write bandwidth that approaches the raw storage and network bandwidth
Metadata Operation Performance
• Octopus provides metadata IOPS on the order of 10^5 ~ 10^6
• Octopus scales linearly
[Charts: MKNOD, GETATTR, and RMNOD throughput (log scale, ops/s) on 1–5 server nodes, comparing glusterfs, nvfs, crail, dmfs, and crail-poll.]
Big Data Evaluation
• Octopus can also provide better performance for big-data applications than existing file systems.
[Charts: TestDFSIO write/read bandwidth (0–3000 MB/s) and normalized execution time for Teragen and Wordcount, comparing memHDFS, Alluxio, NVFS, Crail, and Octopus.]
Conclusion
• It is necessary to rethink DFS designs over emerging hardware
• Octopus's internal mechanisms
  • Simplify the data management layer by reducing data copies
  • Rebalance network and server loads with client-active I/O
  • Redesign the metadata RPC and distributed transactions with RDMA primitives
• Evaluations show that Octopus significantly outperforms existing file systems