Dagstuhl, May 2017
echofs: Enabling Transparent Access to Node-local
NVM Burst Buffers for Legacy Applications
[…with the scheduler’s collaboration]
Alberto Miranda, PhD
Researcher in HPC I/O
www.bsc.es
The NEXTGenIO project has received funding from the European Union’s Horizon 2020 Research and Innovation programme under Grant Agreement no. 671951
• Petascale already struggles with I/O…
– Extreme parallelism w/ millions of threads
– Job’s input data read from external PFS
– Checkpoints: periodic writes to external PFS
– Job’s output data written to external PFS
• HPC & Data-Intensive systems merging
– Modelling, simulation and analytics workloads increasing…
• And it will only get worse at Exascale…
I/O -> a fundamental challenge;
[Diagram: compute nodes connected through a high-performance network to an external filesystem]
• Fast storage devices that temporarily store application data before sending it to the PFS
– Goal: absorb peak I/O to avoid overtaxing the PFS
– Cray DataWarp, DDN IME
• Growing interest in adding them to next-gen HPC architectures
– NERSC’s Cori, LLNL’s Sierra, ANL’s Aurora, …
– Typically a resource separate from the PFS
– Usage, allocation, data movement, etc. become the user’s responsibility
Burst Buffers -> remote;
[Diagram: compute nodes connected through a high-performance network to an external filesystem and a separate burst filesystem]
• Non-volatile memory coming to the node
– Argonne’s Theta has 128GB SSDs in each compute node
Burst Buffers -> on-node;
[Diagram: compute nodes, each with node-local storage, connected through a high-performance network to an external filesystem]
• Node-local, high-density NVRAM becomes a fundamental component
– Intel’s 3D XPoint™
– Capacity much larger than DRAM
– Slightly slower than DRAM but significantly faster than SSDs
– DIMM form factor ⇒ standard memory controller
– No refresh ⇒ no/low energy leakage
NEXTGenIO EU Project [http://www.nextgenio.eu]
[Diagram: the I/O stack grows from cache / memory / storage to cache / memory / nvram / fast storage / slow storage]
1. How do we manage access to these layers?
2. How can we bring the benefits of these layers to legacy code?
OUR SOLUTION: MANAGING ACCESS THROUGH A USER-LEVEL FILESYSTEM
• First goal: allow legacy applications to transparently benefit from new storage layers
– Accessible storage layers under a single mount point
– Make new layers readily available to applications
– I/O stack complexity hidden from applications
– Allows for automatic management of data location
– POSIX interface [sorry]
echofs -> objectives;
[Diagram: the I/O stack (cache / memory / nvram / SSD / Lustre) exposed under /mnt/echofs; the PFS namespace is “echoed”: /mnt/PFS/User/App ↔ /mnt/ECHOFS/User/App]
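To make the namespace “echoing” concrete, here is a minimal C++ sketch of the path translation it implies. The two mount points come from the diagram; the function name and logic are illustrative, not echofs’ actual code.

```cpp
// Hypothetical sketch: rewrite a PFS path under the echofs mount point
// so legacy applications can keep using the paths they already know.
#include <iostream>
#include <string>

const std::string PFS_MOUNT    = "/mnt/PFS";     // external parallel filesystem
const std::string ECHOFS_MOUNT = "/mnt/ECHOFS";  // job-local echofs instance

// Translate a PFS path into its "echoed" counterpart; paths outside
// the PFS namespace are returned unchanged.
std::string echo_path(const std::string& path) {
    if (path.rfind(PFS_MOUNT, 0) == 0)  // does the path start with the PFS mount?
        return ECHOFS_MOUNT + path.substr(PFS_MOUNT.size());
    return path;
}

int main() {
    std::cout << echo_path("/mnt/PFS/User/App") << '\n';  // /mnt/ECHOFS/User/App
}
```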
• Second goal: construct a collaborative burst buffer [CBB] by joining the NVM regions assigned to a batch job by the scheduler [SLURM]
– Filesystem’s lifetime linked to the batch job’s lifetime
– Input files staged into NVM before the job starts
– Allow HPC jobs to perform collaborative NVM I/O
– Output files staged out to the PFS when the job ends
echofs -> objectives;
[Diagram: parallel processes issue POSIX reads/writes to echofs, which joins each compute node’s NVM into a collaborative burst buffer and handles stage-in/stage-out against the external filesystem]
• User provides job I/O requirements through SLURM
– Nodes required, files accessed, type of access [in|out|inout], expected lifetime [temporary|persistent], expected “survivability”, required POSIX semantics [?], …
• SLURM allocates nodes and mounts echofs across them
– Also forwards the I/O requirements through an API
• echofs builds the CBB and fills it with the input files
– When finished, SLURM starts the batch job
echofs -> intended workflow;
We can’t expect optimization details from users, but maybe we can expect them to offer us enough hints…
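The talk does not show the actual submission interface; the following is a minimal C++ sketch of how the per-job I/O requirements listed above could be modelled. All type and field names are assumptions.

```cpp
// Hypothetical data model for the I/O hints a user attaches to a batch job.
#include <string>
#include <vector>

enum class AccessType { In, Out, InOut };         // [in|out|inout]
enum class Lifetime   { Temporary, Persistent };  // e.g. checkpoint vs. result

struct FileRequirement {
    std::string path;          // file the job will access
    AccessType  access;        // stage-in, stage-out, or both
    Lifetime    lifetime;      // temporaries never need to reach the PFS
    bool        strict_posix;  // does this file need full POSIX semantics?
};

struct JobIORequirements {
    unsigned nodes_required;             // size of the collaborative burst buffer
    std::vector<FileRequirement> files;  // forwarded to echofs through the API
};

int main() {
    JobIORequirements req{4, {
        {"/mnt/PFS/User/App/input.nc",  AccessType::In,    Lifetime::Persistent, false},
        {"/mnt/PFS/User/App/ckpt",      AccessType::Out,   Lifetime::Temporary,  false},
        {"/mnt/PFS/User/App/output.h5", AccessType::InOut, Lifetime::Persistent, true},
    }};
    (void)req;  // would be serialized and forwarded to echofs by SLURM
}
```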
• Job I/O absorbed by the collaborative burst buffer
– Non-CBB open()s forwarded to the PFS (throttled to limit PFS congestion)
– Temporary files never need to reach the PFS (e.g. checkpoints)
– Metadata attributes for temporary files cached in a distributed key-value store
• When the job completes, the fate of its files is managed by echofs
– Persistent files eventually sync’d to the PFS
– Decision orchestrated by SLURM & the Data Scheduler component, depending on the requirements of upcoming jobs
echofs -> intended workflow;
If some other job reuses these files, we can leave them “as is”
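A minimal sketch of the stage-out policy just described: persistent files are flushed to the PFS, temporaries are dropped with the filesystem instance. The helper and types here are hypothetical.

```cpp
#include <iostream>
#include <string>
#include <vector>

struct StagedFile {
    std::string path;
    bool persistent;  // from the job's I/O requirements
};

// Hypothetical helper: flush a file's NVM-resident segments to the PFS.
void sync_to_pfs(const std::string& path) {
    std::cout << "staging out " << path << '\n';  // stub
}

// Persistent files eventually reach the PFS; temporaries (e.g.
// checkpoints) never do.
void stage_out(const std::vector<StagedFile>& files) {
    for (const auto& f : files)
        if (f.persistent)
            sync_to_pfs(f.path);
}

int main() {
    stage_out({{"/mnt/ECHOFS/User/App/result.dat", true},
               {"/mnt/ECHOFS/User/App/ckpt.0001",  false}});
}
```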
• Distributed data servers
– Job’s data space partitioned across compute nodes
– Each node acts as data server for its partition
– Each node acts as data client for other partitions
• Pseudo-random file segment distribution
– No replication ⇒ avoids coherence mechanisms
– Resiliency through erasure codes (eventually)
– Each node acts as lock manager for its partition
echofs -> data distribution;
[Diagram: a shared file’s data space is split into segments ([0-8MB), [8-16MB), [16-32MB), …) that a hash function maps onto the NVM of the compute nodes]
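Because the segment-to-node mapping is a pure function of the file and the offset, every client can compute it locally. A minimal sketch, assuming 8 MB segments as in the diagram and a splitmix64-style mixer; the talk does not specify the actual hash:

```cpp
#include <cstdint>
#include <iostream>

constexpr uint64_t SEGMENT_SIZE = 8ull << 20;  // 8 MB, as in the diagram

// splitmix64 finalizer: a cheap, well-mixed 64-bit hash.
uint64_t mix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ull;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebull;
    return x ^ (x >> 31);
}

// Owner node of the segment holding byte `offset` of file `file_id`:
// any client computes this locally, with no metadata request.
uint32_t owner_node(uint64_t file_id, uint64_t offset, uint32_t num_nodes) {
    uint64_t segment = offset / SEGMENT_SIZE;
    return static_cast<uint32_t>(mix64(mix64(file_id) ^ segment) % num_nodes);
}

int main() {
    // First three segments of file 42 on a 4-node job:
    for (uint64_t off : {0ull, 8ull << 20, 16ull << 20})
        std::cout << "offset " << off << " -> node "
                  << owner_node(42, off, 4) << '\n';
}
```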
• Why pseudo-random?
– Efficient & decentralized segment lookup [no metadata request needed to locate a segment]
– Balances workload w.r.t. partition size
– Allows for collaborative I/O
– Guarantees minimal movement of data if the node allocation changes [future research on elasticity]
echofs -> data distribution;
[Diagram: job phases over time; the job scheduler grows the allocation by +1 node, then +1 node, then shrinks it by -2 nodes, with data transfers between phases redistributing segments across the allocated nodes’ NVM]
Other strategies would be possible depending on job semantics
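The slides do not name the placement function, but rendezvous (highest-random-weight) hashing is one well-known way to obtain the minimal-movement property: when a node joins or leaves, only the segments whose highest-scoring node changed have to move. A sketch under that assumption:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// splitmix64 finalizer: a cheap, well-mixed 64-bit hash.
uint64_t mix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ull;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebull;
    return x ^ (x >> 31);
}

// Rendezvous hashing: the segment lives on the node with the highest score.
uint32_t hrw_owner(uint64_t segment_key, const std::vector<uint32_t>& nodes) {
    uint32_t best = nodes.front();
    uint64_t best_score = 0;
    for (uint32_t n : nodes) {
        uint64_t score = mix64(segment_key ^ (uint64_t{n} << 32));
        if (score >= best_score) { best_score = score; best = n; }
    }
    return best;
}

int main() {
    std::vector<uint32_t> before = {0, 1, 2};
    std::vector<uint32_t> after  = {0, 1, 2, 3};  // +1 node, as in the diagram
    int moved = 0;
    for (uint64_t seg = 0; seg < 1000; ++seg)
        if (hrw_owner(seg, before) != hrw_owner(seg, after)) ++moved;
    // Expect roughly 1000/4 = 250 segments to move, all of them to the
    // new node; nothing shuffles between the three surviving nodes.
    std::cout << moved << " of 1000 segments moved\n";
}
```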
• Data Scheduler daemon external to echofs
– Interfaces SLURM & echofs: allows SLURM to send requests to echofs, and echofs to ACK these requests
– Offers an API to [non-legacy] applications willing to send I/O hints to echofs
– In the future, will coordinate w/ SLURM to decide when different echofs instances should access the PFS [data-aware job scheduling]
echofs -> integration with batch scheduler;
[Diagram: SLURM sends asynchronous stage-in/stage-out requests and static I/O requirements to the data scheduler, applications send dynamic I/O requirements, and the data scheduler relays them to echofs]
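The talk does not show the hint API itself; the following is a hypothetical C++ sketch of what a dynamic-hint call from a non-legacy application might look like. Every name here is an assumption.

```cpp
#include <iostream>
#include <string>

enum class Hint {
    WillRead,   // prefetch this file into the CBB soon
    WillWrite,  // reserve NVM space for upcoming output
    DoneWith    // safe to evict / stage out early
};

// Stubbed client call: in a real deployment this would deliver the hint
// to the Data Scheduler daemon, which relays it to echofs asynchronously
// and ACKs the request.
bool send_io_hint(const std::string& path, Hint hint) {
    std::cout << "hint " << static_cast<int>(hint) << " for " << path << '\n';
    return true;  // daemon ACK
}

int main() {
    send_io_hint("/mnt/ECHOFS/User/App/output.h5", Hint::WillWrite);
}
```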
• Main features:
– Ephemeral filesystem linked to the job’s lifetime
– Allows legacy applications to benefit from newer storage technologies
– Provides aggregated I/O for applications
• Research goals:
– Improve coordination w/ the job scheduler and other HPC management infrastructure
– Investigate ad-hoc data distributions tailored to each job’s I/O
– Scheduler-triggered optimizations for specific jobs/files
Summary;
• POSIX compliance is hard…
– But maybe we don’t need FULL COMPLIANCE for ALL jobs…
• Adding I/O-awareness to the scheduler is important…
– Avoids wasting I/O work already done…
– … but requires user/developer collaboration (tricky…)
• User-level filesystems/libraries solve very specific I/O problems…
– Can we reuse/integrate these efforts? Can we learn what works for a specific application, characterize it & automatically run similar ones in a “best fit” FS?
Food for thought;