Early Evaluation of the "Infinite Memory Engine" Burst Buffer Solution
Wolfram Schenck, Faculty of Engineering and Mathematics, Bielefeld University of Applied Sciences, Bielefeld, Germany
Salem El Sayed, Maciej Foszczynski, Wilhelm Homberg, Dirk Pleiter, Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany
WOPSSS 2016 − Frankfurt, 23.06.2016
Slide 2Evaluation of IME · 23.06.2016 · WOPSSS 2016Wolfram Schenck, Salem El Sayed, Maciej Foszczynski, Wilhelm Homberg, Dirk Pleiter
Outline
Introduction: The Burst Buffer Concept
Test System
General Benchmarks (IOR)
NEST Benchmarks
Data Retention Time Analysis
Conclusions and Outlook
Introduction: The Burst Buffer Concept
Need for New Storage Architectures
Address growing performance gap
Floating-point performance Bfp grows faster than I/O bandwidth Bio, i.e. Bio/Bfp becomes smaller
• For JUQUEEN we have Bio/Bfp = 1 Byte / 40,000 Flops
Mitigation strategy: Hierarchical storage architecture
• Fast but low capacity storage tier
• Large capacity but slow storage tier
Emerging data-intensive applications
Need for
• large storage capacity Cio, and
• high bandwidth Bio, and
• high IOPs rates
Application Classes
Dominant read
Applications processing data retrieved by experiments or collected by observatories
Applications analyzing data from huge databases ("big data")
Dominant write
Applications from the area of simulation science, generating large amounts of data
Transient write/read
Applications (or sets of applications) producing and consuming significant amounts of data on the same system
Transient data: Long-term storage often not necessary
[Figure: cluster connected to the main storage system]
Conventional Storage System
[Figure: timeline of a cluster writing directly to the main storage system; the run takes 10 time steps. Legend: time steps spent with I/O, time steps spent with non-I/O operations. Arrow direction: dominant write.]
Enhanced by Burst Buffer
Scenario: Sustained Performance
[Figure: two timelines. With burst buffer (cluster, burst buffer, main storage system): 6 time steps. Without burst buffer (cluster, main storage system): 10 time steps. Legend: full simulation cycle, I/O burst.]
SPEEDUP = 10/6 = 1.67
Enhanced by Burst Buffer
Scenario: Short-Term Peak Performance
[Figure: two timelines. With burst buffer (cluster, burst buffer, main storage system): 6 time steps. Without burst buffer (cluster, main storage system): 18 time steps. Legend: full simulation cycle, I/O burst.]
SPEEDUP = 18/6 = 3.0
Burst Buffer Concept
Capacities:
Conventional main storage: Large
Burst buffer: Small
Bandwidth:
Between cluster and burst buffer: High
Between burst buffer and main storage: Low
In theory, the speedup obtained via a burst buffer depends on (for dominant write):
I/O pattern of application: Continuous vs. in bursts
I/O intensity of application: Low vs. high
Runtime of application: Long vs. short
(Speedup increases from the first to the second option in each line.)
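The dependencies listed above can be illustrated with a deliberately simple model. This is only a sketch under assumptions that are not part of the slides: the application alternates one compute phase with one write burst per cycle, the burst buffer drains to main storage in the background, and all function and parameter names are made up for illustration. The numbers are illustrative and not taken from the benchmarks.

```python
def write_time(burst_bytes, bandwidth):
    """Time needed to write one burst at the given bandwidth."""
    return burst_bytes / bandwidth


def speedup_short_run(compute_time, burst_bytes, bw_fast, bw_slow):
    """Short run: the buffer absorbs all bursts, draining never stalls the application."""
    without_bb = compute_time + write_time(burst_bytes, bw_slow)
    with_bb = compute_time + write_time(burst_bytes, bw_fast)
    return without_bb / with_bb


def speedup_sustained(compute_time, burst_bytes, bw_fast, bw_slow):
    """Long run: in steady state a cycle cannot be shorter than the time
    needed to drain one burst to main storage."""
    without_bb = compute_time + write_time(burst_bytes, bw_slow)
    with_bb = max(compute_time + write_time(burst_bytes, bw_fast),
                  write_time(burst_bytes, bw_slow))
    return without_bb / with_bb


# Hypothetical values: 1 time unit of compute, bursts of 8 units of data,
# buffer bandwidth 16 units/time, main storage bandwidth 4 units/time.
print(speedup_short_run(1.0, 8.0, 16.0, 4.0))   # 2.0
print(speedup_sustained(1.0, 8.0, 16.0, 4.0))   # 1.5
```

In this model a short run always gains at least as much as a sustained one, because the sustained case is additionally limited by the drain rate to main storage.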
Infinite Memory Engine (by DDN)
Realisation of storage hierarchy
Upper tier = IME
Very small Cio / Bio ≈ 10 min
Leverage NVM technologies
External storage
Very large Cio / Bio ≈ O(1 day)
Leverage HDD technologies
Benefits
High bandwidth + IOPs rate
Compatible with any POSIX-compliant parallel file system
Challenges
Re-organisation of I/O may be required to leverage performance
[Figure: storage hierarchy with compute servers, IME, and external storage]
Using IME
MPI I/O interface
Use of namespace of parallel file system (PFS)
Prefix controls where created file is allocated, e.g. ime://gpfs/data/pleiter/file.dat
Software-controlled sync from IME to PFS
POSIX interface
IME storage devices mounted using FUSE
Use of namespace of parallel file system (PFS), but:
Special mountpoint for IME (use path via this mountpoint for direct access to IME)
Choice of path controls whether IME or PFS is used
Software-controlled sync from IME to PFS
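The path-based selection described above can be sketched as two small helpers. The `ime://` prefix and the example path come from the slide; how the prefix is joined to the PFS path, the mountpoint name `/ime`, and the layout below that mountpoint are assumptions made for illustration, as are the function names.

```python
from pathlib import PurePosixPath


def mpiio_target(pfs_path, use_ime):
    """File name to pass to MPI I/O: prefixed with ime:// when IME should be used.

    Assumption: the prefix replaces the leading slash of the PFS path,
    as suggested by the slide example ime://gpfs/data/pleiter/file.dat.
    """
    if use_ime:
        return "ime://" + pfs_path.lstrip("/")
    return pfs_path


def posix_target(pfs_path, use_ime, ime_mountpoint="/ime"):
    """Path for POSIX I/O: routed through the IME FUSE mountpoint when requested.

    Assumption: the hypothetical mountpoint mirrors the PFS namespace beneath it.
    """
    if use_ime:
        return str(PurePosixPath(ime_mountpoint) / pfs_path.lstrip("/"))
    return pfs_path


print(mpiio_target("/gpfs/data/pleiter/file.dat", use_ime=True))
print(posix_target("/gpfs/data/pleiter/file.dat", use_ime=True))
```

Because only the path changes, an application can switch between IME and the PFS without touching its I/O calls.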
Benchmarking
Central goal of our study:
Benchmarking on a real-world system to check whether IME fulfills theoretical expectations
Benchmarks:
General performance: IOR [LLNL, 2003]
• Benchmarking tool for testing performance of parallel filesystems using various interfaces and access patterns
Computational science software from the dominant write class: NEST
Test System
JUlich Dedicated GPU Environment (JUDGE) (decommissioned end of 2015)
JUDGE:
For our tests: Up to 64 compute nodes from JUDGE
Scientific Linux 6.7
Pre-release version of IME software stack (Dec. 2015)
Figure: JSC
Test System
Schematic overview of the integration of the IME servers at JSC:
IME = IME Server
24 SSDs with 200 GiB each (overall ca. 4.7 TiB)
2 IB host adapters (QDR)
[Figure: network schematic including JUST; link labels: 64 Gbit/s, 64 Gbit/s, 32 Gbit/s, 20 Gbit/s, 10 Gbit/s]
Bandwidth to IME: 128 Gbit/s = 16 GByte/s
Bandwidth to GPFS: 20 Gbit/s = 2.5 GByte/s
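The nominal figures on this slide follow from simple unit conversions, sketched below. The helper names are made up for illustration; the input values are the ones stated on the slide.

```python
def gbit_to_gbyte(gbit_per_s):
    """Convert a link rate from Gbit/s to GByte/s (8 bits per byte)."""
    return gbit_per_s / 8


def total_capacity_tib(num_ssds, gib_per_ssd):
    """Total SSD capacity in TiB (1 TiB = 1024 GiB)."""
    return num_ssds * gib_per_ssd / 1024


print(gbit_to_gbyte(128))            # 16.0 GByte/s to IME
print(gbit_to_gbyte(20))             # 2.5 GByte/s to GPFS
print(total_capacity_tib(24, 200))   # 4.6875 TiB, i.e. ca. 4.7 TiB
```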
General Benchmarks (IOR)
IOR Settings
IOR Read Performance
Bandwidth saturation reached with 4 nodes (GPFS) or 8 nodes (IME)
Max. GPFS read bandwidth: 0.63 GByte/s (25% of nominal value)
Max. IME read bandwidth: 13.8 GByte/s (86% of nominal value)
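The percentages above relate the measured maximum to the nominal bandwidth from the test system slide (16 GByte/s to IME, 2.5 GByte/s to GPFS). A minimal sketch of that calculation, with a hypothetical helper name:

```python
def percent_of_nominal(measured_gbyte_s, nominal_gbyte_s):
    """Measured bandwidth as a rounded percentage of the nominal bandwidth."""
    return round(100 * measured_gbyte_s / nominal_gbyte_s)


print(percent_of_nominal(13.8, 16.0))  # IME read: 86
print(percent_of_nominal(0.63, 2.5))   # GPFS read: 25
```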
IOR Write Performance
Bandwidth saturation reached with 4 nodes (GPFS) or 8 nodes (IME)
Max. GPFS write bandwidth: 0.75 GByte/s (33% of nominal value)
Max. IME write bandwidth: 15.63 GByte/s (98% of nominal value)
NEST Benchmarks
The Human Brain Project
HBP: Future & Emerging Technologies flagship project (co-)funded by European Commission
Science-driven, seeded from FET, extending beyond ICT
Ambitious, unifying goal, large-scale
Goal
To build an integrated ICT infrastructure enabling a global collaborative effort towards understanding the human brain, and ultimately to emulate its computational capabilities
Brain Simulation (1)
[Figure: neuron with labeled dendrites, soma, and axon; a spike is marked]
Simulation software: NEST (NEural Simulation Tool)
Open source: www.nest-simulator.org / www.nest-initiative.org
Purpose: Large-scale simulations of biologically realistic neuronal networks (focus on large networks, use of simple point neurons)
Brain Simulation (2)
Right fig.: E. Torre, INM-6, Forschungszentrum Jülich
In the human brain:
ca. 100 bn neurons
ca. 10,000 incoming connections per neuron
Largest simulation so far: Simulation with 1 bn neurons (feasibility study on the K computer in Japan)
I/O challenge: Simulations can produce huge amounts of data
Parallel Processing in NEST (VP: Virtual Process)
[Figure: grid of virtual processes VP0 to VP5, each holding NVP neurons; horizontal axis: number of MPI ranks M; vertical axis: number of threads per rank T]
In the whole network: N neurons with N = M·T·NVP
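The relation N = M·T·NVP can be sketched in a few lines. The VP numbering used here (VP id = thread index times M plus rank index) is an assumption inferred from the figure layout, where VP0 to VP2 form the first row and VP3 to VP5 the second; the function names and the value of NVP in the example are made up for illustration.

```python
def vp_id(rank, thread, num_ranks):
    """VP index for a (rank, thread) pair; numbering assumed from the figure."""
    return thread * num_ranks + rank


def total_neurons(num_ranks, threads_per_rank, neurons_per_vp):
    """N = M * T * NVP."""
    return num_ranks * threads_per_rank * neurons_per_vp


M, T = 3, 2  # as in the figure: 3 MPI ranks, 2 threads per rank
grid = [[vp_id(r, t, M) for r in range(M)] for t in range(T)]
print(grid)                        # [[0, 1, 2], [3, 4, 5]]
print(total_neurons(M, T, 1000))   # 6000 neurons for a hypothetical NVP of 1000
```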
Simulation Cycle
1. Process-internal routing of spike events to their target neurons (incl. synapse update)
2. Updating of neuronal states (incl. spike generation)
3. Exchange of spike events between MPI processes
Communication interval
Creating Spike Events during Neuron Update
[Figure: grid of virtual processes VP0 to VP5, each holding NVP neurons; horizontal axis: number of MPI ranks M; vertical axis: number of threads per rank T. Each red dot marks a single spike event.]
Simulation Cycle (revisited)
1. Process-internal routing of spike events to their target neurons (incl. synapse update)
2. Updating of neuronal states (incl. spike generation)
3. Exchange of spike events between MPI processes
Communication interval
Creation of Rank-Local Spike Buffers
[Figure: grid of virtual processes VP0 to VP5 (axes: number of MPI ranks M, number of threads per rank T); the spike events of all VPs belonging to one rank are collected in a rank-local spike buffer]
MPI Communication: Every rank receives all spike events
[Figure: grid of virtual processes VP0 to VP5 (axes: number of MPI ranks M, number of threads per rank T); the rank-local spike buffers are exchanged via MPI so that every rank holds the spike events of all ranks]
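The outcome of this exchange can be sketched without MPI: after the communication step, every rank holds the concatenation of all rank-local spike buffers. The sketch below only simulates that result in plain Python; representing a spike event as a `(neuron_id, time_ms)` tuple is an assumption for illustration. With MPI, such an exchange would typically be a collective like `MPI_Allgatherv`; which call NEST actually uses is not stated on the slide.

```python
def exchange_spike_buffers(rank_buffers):
    """Simulate an all-gather: every rank receives the spike events of all ranks.

    rank_buffers: one list of spike events per rank (rank-local spike buffers).
    Returns one list per rank, each containing all events in rank order.
    """
    combined = [event for buffer in rank_buffers for event in buffer]
    return [list(combined) for _ in rank_buffers]


# Hypothetical spike events as (neuron_id, time_ms) tuples for 3 ranks.
local_buffers = [[(0, 0.1), (4, 0.3)], [], [(7, 0.2)]]
received = exchange_spike_buffers(local_buffers)
print(received[1])  # [(0, 0.1), (4, 0.3), (7, 0.2)]
```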
Simulation Cycle (revisited)
1. Process-internal routing of spike events to their target neurons (incl. synapse update)
2. Updating of neuronal states (incl. spike generation)
3. Exchange of spike events between MPI processes
Communication interval
I/O in NEST
Data collected during simulations:
Spike events
• Recording device: Spike detector
State variables (e.g., membrane potential of neurons)
• Recording device: Multimeter
Recording devices belong to abstract node class:
Connected to neurons (from which measurements are collected)
Receive spike events (spike detector)
Send out measurement events (multimeter)
Updated like neurons (writing data during update)
Each recording device exists on every virtual process (VP) and writes data via a C++ output stream into a text file (one file per device per VP)
Simulation Script for Benchmark: Random Balanced Network
One spike detector and one multimeter per population (created last after all neurons)
Overall 4 recording devices (= C++ output streams) per VP
Fig.: Nadine Daivandy (JSC)
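Since every recording device writes one file per VP, the number of output files grows with the number of VPs. A minimal sketch of that count follows; the function name is made up, and the number of threads per rank is not given on the slides, so the value 1 used in the example is purely an assumption.

```python
def output_file_count(devices_per_vp, num_ranks, threads_per_rank):
    """One output file per recording device per VP, with M * T VPs in total."""
    return devices_per_vp * num_ranks * threads_per_rank


# 4 recording devices per VP (from the slide); 16 nodes with 23 MPI ranks each
# (from the experiment design); 1 thread per rank is an assumption.
print(output_file_count(4, 16 * 23, 1))  # 1472
```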
Simulation Cycle (revisited)
1. Process-internal routing of spike events to their target neurons (incl. synapse update)
2. Updating of neuronal states (incl. spike generation)
3. Exchange of spike events between MPI processes
Communication interval
Update of recording devices: I/O BURST
Design of Experiment
Factor 1: Number of compute nodes
1, 2, 4, 8, 16
• Strict weak scaling design: Number of neurons per node constant
Factor 2: Amount of written data per node; manipulated via number of state variables recorded by each multimeter
1 – 22
• Corresponds to 1 GiB/node – 8 GiB/node (amount of spike data insignificant)
Factor 3: Output file system
1. POSIX I/O to GPFS
2. POSIX I/O to IME
3. POSIX I/O to /dev/null: Baseline condition, "infinitely fast storage device"
Further experimental settings:
Simulated biological time: 100 ms
Network size: 258,750 neurons per compute node, ca. 3e8 synapses per compute node
23 MPI ranks per compute node
5 runs per task condition, minimum reported
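The weak scaling design implies that ranks, neurons, and written data all grow linearly with the node count. The sketch below tabulates these totals from the per-node values stated above; the constant and function names are made up for illustration.

```python
NEURONS_PER_NODE = 258_750
RANKS_PER_NODE = 23
NODE_COUNTS = (1, 2, 4, 8, 16)
DATA_PER_NODE_GIB = (1, 8)  # smallest and largest I/O load per node


def weak_scaling_rows():
    """Totals per node count under strict weak scaling."""
    rows = []
    for nodes in NODE_COUNTS:
        rows.append({
            "nodes": nodes,
            "mpi_ranks": nodes * RANKS_PER_NODE,
            "neurons": nodes * NEURONS_PER_NODE,
            "min_data_gib": nodes * DATA_PER_NODE_GIB[0],
            "max_data_gib": nodes * DATA_PER_NODE_GIB[1],
        })
    return rows


for row in weak_scaling_rows():
    print(row)
```

At 16 nodes this gives 368 MPI ranks, 4,140,000 neurons, and between 16 GiB and 128 GiB of written data per run.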
Bandwidth (1 GiB/node)
Bandwidth (8 GiB/node)
Bandwidth (1 and 8 GiB/node)
POSIX2IME very close to POSIX2DEVNULL: IME close to "ideal" performance
Very good scaling behavior of IME: Observed bandwidth nearly doubles with doubling of number of compute nodes
Bad scaling behavior of GPFS beyond 4 compute nodes
Observed bandwidth small compared to IOR measurements
Simulation Cycle (revisited)
1. Process-internal routing of spike events to their target neurons (incl. synapse update)
2. Updating of neuronal states (incl. spike generation)
3. Exchange of spike events between MPI processes
Communication interval
Update of recording devices: I/O BURST
Simulation Time (1 GiB/node)
Effective simulation time = simulation time without step 3 (MPI synchronization)
Simulation Time (8 GiB/node)
Effective simulation time = simulation time without step 3 (MPI synchronization)
Simulation Time: Observations
The larger the number of nodes, the stronger the advantage of writing to IME or /dev/null
Very good scaling behavior of IME clearly visible in plots
GPFS setting suffers heavily from imbalance between ranks
IME nearly reaches the performance of /dev/null; barely any additional I/O-induced imbalance between ranks
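Why imbalance between ranks is costly can be illustrated with a small model. It rests on an assumption not spelled out on the slides: the spike exchange acts as a synchronization point, so each cycle lasts as long as the slowest rank needs for its local work (including I/O), and all other ranks wait. The function names and timings are made up for illustration.

```python
def cycle_wall_time(local_times, exchange_time):
    """Wall-clock time of one cycle: slowest rank's local work plus the exchange."""
    return max(local_times) + exchange_time


def wait_times(local_times):
    """Time each rank idles at the synchronization point."""
    slowest = max(local_times)
    return [slowest - t for t in local_times]


# Hypothetical local times per rank (routing + update + I/O), in seconds.
balanced = [1.0, 1.0, 1.0]
imbalanced = [1.0, 1.0, 3.0]   # one rank is slowed down by I/O

print(cycle_wall_time(balanced, 0.5))    # 1.5
print(cycle_wall_time(imbalanced, 0.5))  # 3.5
print(wait_times(imbalanced))            # [2.0, 2.0, 0.0]
```

In this picture a single slow writer delays all ranks, which is consistent with the observation that the GPFS setting suffers while IME adds barely any imbalance.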
Relative Runtime Reduction
Reported values based on average over all measured I/O loads
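The slide does not spell out how the relative runtime reduction is defined. A common definition, assumed here, is the saved runtime divided by the reference runtime; under that assumption it is linked to the speedup S by 1 − 1/S. The function names are made up for illustration.

```python
def relative_runtime_reduction(t_reference, t_new):
    """Fraction of the reference runtime that is saved (assumed definition)."""
    return (t_reference - t_new) / t_reference


def reduction_from_speedup(speedup):
    """Equivalent form in terms of the speedup S = t_reference / t_new."""
    return 1.0 - 1.0 / speedup


# Hypothetical runtimes: 4.0 s with the reference setting, 1.0 s with the new one.
print(relative_runtime_reduction(4.0, 1.0))  # 0.75
print(reduction_from_speedup(4.0))           # 0.75
```

Under this definition, a speedup of 4 corresponds to a runtime reduction of 75%.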
Data Retention Time Analysis
Motivation: Interactive Supercomputing
Data retention time analysis: Classification of data depending on how long it will be retained
Interactive supercomputing/HPC:
User can interact with the application(s) that run on the supercomputer/cluster
Misc. use cases for NEST
NEST: Data Retention Times
Data retention time analysis: Classification of data depending on how long it will be retained
Conclusions and Outlook
Conclusions
IOR Results: IME reached ca. 90% of nominal bandwidth in reading and writing
Promising finding for all considered application classes
NEST Results:
Barely any I/O-induced imbalance between ranks with IME (in contrast to GPFS)
IME performance close to baseline condition (/dev/null), nearly perfect weak scaling behavior
At largest problem size: Speedup of nearly 4 achieved vs. GPFS
Easy handling: No code changes in NEST required
Conclusions:
IME actually works as theoretically expected for applications from the dominant write class (writing in bursts)
NEST users would strongly profit from the incorporation of IME in compute clusters (I/O no longer a limiting factor in gathering simulation results)
Outlook and Recommendations
Recommendations for the future development of IME:
Data pre-fetching: For "dominant read" applications, data pre-fetching before job start would be highly beneficial
• Integration into job managers?
Development of tools for managing short-term and transient data, integration into job managers
Support for end-to-end data integrity like within GPFS
Final word:
IME shows: Working burst buffer solutions exist for complex parallel applications
Opportunity to scale compute and I/O performance
Alternatively: Opportunity to reduce bandwidth requirements for external storage system
Questions?
Thank you for your attention!
Acknowledgements: We would like to thank DDN for making an IME test system available at Jülich Supercomputing Centre. In particular, we gratefully acknowledge the continuous support by Tommaso Cecchi and Toine Beckers.