
Early Evaluation of the "Infinite Memory Engine" Burst Buffer Solution

Wolfram Schenck, Faculty of Engineering and Mathematics, Bielefeld University of Applied Sciences, Bielefeld, Germany

Salem El Sayed, Maciej Foszczynski, Wilhelm Homberg, Dirk Pleiter, Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany

WOPSSS 2016 − Frankfurt, 23.06.2016


Outline

Introduction: The Burst Buffer Concept

Test System

General Benchmarks (IOR)

NEST Benchmarks

Data Retention Time Analysis

Conclusions and Outlook


Introduction: The Burst Buffer Concept


Need for New Storage Architectures

Address growing performance gap

Floating-point performance Bfp grows faster than I/O bandwidth Bio, i.e. Bio/Bfp becomes smaller

• For JUQUEEN we have Bio/Bfp = 1 Byte / 40,000 Flops

Mitigation strategy: Hierarchical storage architecture

• Fast but low capacity storage tier

• Large capacity but slow storage tier

Emerging data-intensive applications

Need for

• large storage capacity Cio, and

• high bandwidth Bio, and

• high IOPs rates


Application Classes

Dominant read

Applications processing data retrieved by experiments or collected by observatories

Applications analyzing data from huge databases ("big data")

Dominant write

Applications from the area of simulation science, generating large amounts of data

Transient write/read

Applications (or sets of applications) producing and consuming significant amounts of data on the same system

Transient data: Long-term storage often not necessary



Conventional Storage System

[Figure: Cluster writing to the Main Storage System; timeline t of 10 time steps, each marked as spent with I/O or with non-I/O operations. Arrow direction: dominant write.]


Enhanced by Burst Buffer

Scenario: Sustained Performance

[Figure: Cluster, Burst Buffer, and Main Storage System. Timeline t with burst buffer: 6 time steps; timeline t without burst buffer: 10 time steps. Labels: full simulation cycle, I/O burst.]

SPEEDUP = 10/6 = 1.67


Enhanced by Burst Buffer

Scenario: Short-Term Peak Performance

[Figure: Cluster, Burst Buffer, and Main Storage System. Timeline t with burst buffer: 6 time steps; timeline t without burst buffer: 18 time steps. Labels: full simulation cycle, I/O burst.]

SPEEDUP = 18/6 = 3.0


Burst Buffer Concept

Capacities:

Conventional main storage: Large

Burst buffer: Small

Bandwidth:

Between cluster and burst buffer: High

Between burst buffer and main storage: Low

For the dominant write class, the speedup obtained via a burst buffer theoretically depends on:

I/O pattern of application: Continuous vs. in bursts

I/O intensity of application: Low vs. high

Runtime of application: Long vs. short

In each pair, the speedup increases from the first to the second option
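The dependence on I/O pattern, I/O intensity, and runtime can be illustrated with a toy model. The sketch below is not taken from the study. It assumes an application that alternates compute phases with write bursts, a burst buffer that absorbs each burst at its full bandwidth, and draining to main storage that runs continuously in the background. All function names and numbers are hypothetical.

```python
def runtime_conventional(cycles, compute_s, burst_bytes, bw_main):
    """Application runtime when every burst is written straight to main storage."""
    return cycles * (compute_s + burst_bytes / bw_main)


def runtime_burst_buffer(cycles, compute_s, burst_bytes, bw_buffer):
    """Application runtime when every burst is absorbed by the burst buffer.

    Assumption: the buffer never runs full, so the application only waits
    for the fast write into the buffer.
    """
    return cycles * (compute_s + burst_bytes / bw_buffer)


def backlog_growth_per_cycle(compute_s, burst_bytes, bw_buffer, bw_main):
    """Net growth of buffered data per cycle if draining runs all the time.

    Zero means the buffer empties before the next burst (sustained case).
    A positive value means the buffer fills up over time, so the gain only
    lasts for short runs (short-term peak case).
    """
    cycle_s = compute_s + burst_bytes / bw_buffer
    return max(0.0, burst_bytes - bw_main * cycle_s)


# Hypothetical numbers: 10 cycles, 60 s compute, 120 GB per burst,
# 2 GB/s to main storage, 16 GB/s to the burst buffer.
GB = 1e9
t_conv = runtime_conventional(10, 60.0, 120 * GB, 2 * GB)
t_bb = runtime_burst_buffer(10, 60.0, 120 * GB, 16 * GB)
speedup = t_conv / t_bb
growth = backlog_growth_per_cycle(60.0, 120 * GB, 16 * GB, 2 * GB)
print(f"speedup = {speedup:.2f}, backlog growth per cycle = {growth / GB:.1f} GB")
```

Shortening the compute phase in this model (higher I/O intensity) raises the speedup but eventually makes the backlog grow, which is why the achievable gain then depends on how long the application runs.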


Infinite Memory Engine (by DDN)

Realisation of storage hierarchy

Upper tier = IME

Very small ratio Cio / Bio ≈ 10 min (time needed to fill the tier at full bandwidth)

Leverage NVM technologies

External storage

Very large ratio Cio / Bio ≈ O(1 day)

Leverage HDD technologies
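The ratio Cio / Bio has the unit of time: it is how long it takes to fill a tier when writing at full bandwidth. A minimal sketch of that arithmetic follows. The capacities and bandwidths are illustrative values chosen to land on the orders of magnitude quoted above; they are not the specifications of any system in this study.

```python
def fill_time_s(capacity_bytes, bandwidth_bytes_per_s):
    """Time in seconds needed to fill a storage tier at full bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


GB = 10**9
TB = 10**12

# Hypothetical upper tier: 9.6 TB at 16 GB/s.
upper_tier_s = fill_time_s(9_600 * GB, 16 * GB)

# Hypothetical external storage: 1728 TB at 20 GB/s.
external_s = fill_time_s(1_728 * TB, 20 * GB)

print(f"upper tier: {upper_tier_s / 60:.0f} min")
print(f"external storage: {external_s / 86_400:.1f} days")
```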

Benefits

High bandwidth + IOPs rate

Compatible with and supports any POSIX-compliant parallel file system

Challenges

Re-organisation of I/O may be required to leverage performance

Compute servers

IME

External storage


Using IME

MPI I/O interface

Use of namespace of parallel file system (PFS)

Prefix controls where created file is allocated, e.g. ime://gpfs/data/pleiter/file.dat

Software-controlled sync from IME to PFS

POSIX interface

IME storage devices mounted using FUSE

Use of namespace of parallel file system (PFS), but:

Special mountpoint for IME (use path via this mountpoint for direct access to IME)

Choice of path controls whether IME or PFS is used

Software-controlled sync from IME to PFS
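Both interfaces select the target tier through the file name alone. The sketch below shows what such a selection could look like in application code. It assumes that the MPI I/O prefix is formed by prepending `ime:/` to the absolute PFS path, which reproduces the example name above, and it uses `/ime` as a purely hypothetical FUSE mountpoint. Neither detail is confirmed by the slides.

```python
import posixpath

# Assumption: "ime:/" + "/gpfs/..." yields the "ime://gpfs/..." form shown above.
IME_MPIIO_PREFIX = "ime:/"
# Hypothetical mountpoint; the real one depends on the installation.
IME_POSIX_MOUNTPOINT = "/ime"


def mpiio_filename(pfs_path, use_ime):
    """File name to pass to MPI I/O: prefixed for IME, unchanged for the PFS."""
    return IME_MPIIO_PREFIX + pfs_path if use_ime else pfs_path


def posix_filename(pfs_path, use_ime):
    """Path for POSIX I/O: below the IME mountpoint, or the plain PFS path."""
    if not use_ime:
        return pfs_path
    return posixpath.join(IME_POSIX_MOUNTPOINT, pfs_path.lstrip("/"))


path = "/gpfs/data/pleiter/file.dat"
print(mpiio_filename(path, use_ime=True))
print(posix_filename(path, use_ime=True))
print(posix_filename(path, use_ime=False))
```

Because only the name changes, an application can switch between the two tiers without touching its I/O calls.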


Benchmarking

Central goal of our study:

Benchmarking on a real-world system to check if IME fulfills theoretical expectations

Benchmarks:

General performance: IOR [LLNL, 2003]

• Benchmarking tool for testing performance of parallel filesystems using various interfaces and access patterns

Computational science software from the dominant write class: NEST


Test System


JUlich Dedicated GPU Environment (JUDGE) (decommissioned end of 2015)

JUDGE:

For our tests: Up to 64 compute nodes from JUDGE

Scientific Linux 6.7

Pre-release version of IME software stack (Dec. 2015)

Figure: JSC


Test System

Schematic overview of the integration of the IME servers at JSC:

IME = IME Server

24 SSDs with 200 GiB each (overall ca. 4.7 TiB)

2 IB host adapters (QDR)

[Figure: schematic of the IME integration; links labeled 64 Gbit/s, 64 Gbit/s, 32 Gbit/s, 20 Gbit/s, and 10 Gbit/s; storage system labeled JUST]

Bandwidth to IME: 128 Gbit/s = 16 GByte/s

Bandwidth to GPFS: 20 Gbit/s = 2.5 GByte/s


General Benchmarks (IOR)

IOR Settings


IOR Read Performance

Bandwidth saturation reached with 4 nodes (GPFS) or 8 nodes (IME)

Max. GPFS read bandwidth: 0.63 GByte/s (25% of nominal value)

Max. IME read bandwidth: 13.8 GByte/s (86% of nominal value)
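The percentages relate the measured read bandwidth to the nominal link bandwidth of the test system (128 Gbit/s to IME, 20 Gbit/s to GPFS). The following sketch redoes that arithmetic with the values from the slides; the function names are only illustrative.

```python
def gbit_to_gbyte(gbit_per_s):
    """Convert a bandwidth from Gbit/s to GByte/s (8 bits per byte)."""
    return gbit_per_s / 8


def percent_of_nominal(measured, nominal):
    """Measured bandwidth as a percentage of the nominal bandwidth."""
    return 100 * measured / nominal


nominal_ime = gbit_to_gbyte(128)   # GByte/s
nominal_gpfs = gbit_to_gbyte(20)   # GByte/s

gpfs_read_pct = percent_of_nominal(0.63, nominal_gpfs)
ime_read_pct = percent_of_nominal(13.8, nominal_ime)

print(f"GPFS read: {gpfs_read_pct:.0f}% of {nominal_gpfs} GByte/s")
print(f"IME read:  {ime_read_pct:.0f}% of {nominal_ime} GByte/s")
```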


IOR Write Performance

Bandwidth saturation reached with 4 nodes (GPFS) or 8 nodes (IME)

Max. GPFS write bandwidth: 0.75 GByte/s (33% of nominal value)

Max. IME write bandwidth: 15.63 GByte/s (98% of nominal value)


NEST Benchmarks


The Human Brain Project

HBP: Future & Emerging Technologies flagship project (co-)funded by the European Commission

Science-driven, seeded from FET, extending beyond ICT

Ambitious, unifying goal, large-scale

Goal

To build an integrated ICT infrastructure enabling a global collaborative effort towards understanding the human brain, and ultimately to emulate its computational capabilities


Brain Simulation (1)

[Figure: neuron with labeled dendrites, soma, and axon; spike]

Simulation software: NEST (NEural Simulation Tool)

Open source: www.nest-simulator.org / www.nest-initiative.org

Purpose: Large-scale simulations of biologically realistic neuronal networks (focus on large networks, use of simple point neurons)


Brain Simulation (2)

Right fig.: E. Torre, INM-6, Forschungszentrum Jülich

In the human brain:

ca. 100 bn neurons

ca. 10,000 incoming connections per neuron

Largest simulation so far: Simulation with 1 bn neurons (feasibility study on the K computer in Japan)

I/O challenge: Simulations can produce huge amounts of data


Parallel Processing in NEST (VP: Virtual Process)

[Figure: grid of virtual processes VP0 to VP5, arranged by number of MPI ranks (M) and number of threads per rank (T); each VP hosts NVP neurons]

In the whole network: N neurons with N = M·T·NVP
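The relation N = M·T·NVP can be checked with a few lines of code. The sketch below also maps a VP index to a rank and a thread, assuming that VPs are assigned to MPI ranks round-robin. That assignment matches the layout of the figure (VP0 to VP2 in the first row, VP3 to VP5 in the second) but is an assumption here, and the function names are hypothetical.

```python
def total_neurons(num_ranks, threads_per_rank, neurons_per_vp):
    """N = M * T * N_VP for M ranks, T threads per rank, N_VP neurons per VP."""
    return num_ranks * threads_per_rank * neurons_per_vp


def vp_location(vp, num_ranks):
    """Return (rank, thread) of a VP, assuming round-robin assignment to ranks."""
    return vp % num_ranks, vp // num_ranks


M, T = 3, 2  # illustrative: 3 MPI ranks with 2 threads each give 6 VPs
for vp in range(M * T):
    rank, thread = vp_location(vp, M)
    print(f"VP{vp}: rank {rank}, thread {thread}")

print("N =", total_neurons(M, T, neurons_per_vp=1000))
```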


Simulation Cycle

1. Process-internal routing of spike events to their target neurons (incl. synapse update)

2. Updating of neuronal states (incl. spike generation)

3. Exchange of spike events between MPI processes

One pass through steps 1 to 3 covers one communication interval


Creating Spike Events during Neuron Update

[Figure: virtual processes VP0 to VP5, arranged by number of MPI ranks (M) and number of threads per rank (T); each red dot marks a single spike event]


Simulation Cycle (revisited)






Creation of Rank-Local Spike Buffers

[Figure: virtual processes VP0 to VP5, arranged by number of MPI ranks (M) and number of threads per rank (T); the spike events of all VPs on a rank are collected into one rank-local spike buffer]

MPI Communication: Every rank receives all spike events

[Figure: virtual processes VP0 to VP5, arranged by number of MPI ranks (M) and number of threads per rank (T); the rank-local spike buffers are exchanged via MPI so that each rank holds the spike events of all ranks]
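The two steps, building rank-local spike buffers and then distributing them to every rank, can be mimicked without MPI. The sketch below is a plain-Python stand-in in the style of an all-gather collective. It represents a spike event by the ID of the emitting neuron, which is a simplification, and it is not NEST's actual implementation.

```python
def rank_local_buffer(spikes_per_thread):
    """Concatenate the spike events of all threads (VPs) on one rank."""
    buffer = []
    for thread_spikes in spikes_per_thread:
        buffer.extend(thread_spikes)
    return buffer


def allgather(rank_buffers):
    """Stand-in for an MPI all-gather: every rank gets all buffers, in rank order."""
    combined = [spike for buffer in rank_buffers for spike in buffer]
    return [list(combined) for _ in rank_buffers]


# Hypothetical spike events, indexed as spikes[rank][thread].
spikes = [
    [[3, 7], [12]],       # rank 0
    [[], [21, 25, 28]],   # rank 1
    [[31], [44, 47]],     # rank 2
]

local_buffers = [rank_local_buffer(per_thread) for per_thread in spikes]
received = allgather(local_buffers)

for rank, events in enumerate(received):
    print(f"rank {rank} received {len(events)} spike events")
```

Every rank ends up with the same global list, so the amount of received data per rank grows with the total spike count of the network, not with the local one.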


I/O in NEST

Data collected during simulations:

Spike events

• Recording device: Spike detector

State variables (e.g., membrane potential of neurons)

• Recording device: Multimeter

Recording devices belong to abstract node class:

Connected to neurons (from which measurements are collected)

Receive spike events (spike detector)

Send out measurement events (multimeter)

Updated like neurons (writing data during update)

Each recording device exists on every virtual process (VP) and writes data via a C++ output stream into a text file (one file per device per VP)


Simulation Script for Benchmark: Random Balanced Network

One spike detector and one multimeter per population (created last after all neurons)

Overall 4 recording devices (= C++ output streams) per VP
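With one text file per recording device per VP, the number of output files grows with the number of VPs. The sketch below illustrates this for 4 recording devices. The device names, the file naming scheme, and the VP count are placeholders and do not reflect NEST's own conventions.

```python
import os
import tempfile


def recording_file_names(devices, num_vps):
    """One text file per recording device per virtual process (VP)."""
    # The naming scheme "<device>-<vp>.dat" is a placeholder, not NEST's own.
    return [f"{device}-{vp}.dat" for device in devices for vp in range(num_vps)]


devices = ["spike_detector_0", "multimeter_0", "spike_detector_1", "multimeter_1"]
num_vps = 6  # hypothetical, e.g. 3 MPI ranks with 2 threads each

names = recording_file_names(devices, num_vps)
with tempfile.TemporaryDirectory() as out_dir:
    for name in names:
        # Each VP writes its measurements as plain text through its own stream.
        with open(os.path.join(out_dir, name), "w") as stream:
            stream.write("# time\tvalue\n")
    created = len(os.listdir(out_dir))

print(f"{created} files for {len(devices)} devices on {num_vps} VPs")
```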

Fig.: Nadine Daivandy (JSC)







[Figure: simulation cycle; the update of recording devices is marked as the I/O burst]


Design of Experiment

Factor 1: Number of compute nodes

1, 2, 4, 8, 16

• Strict weak scaling design: Number of neurons per node constant

Factor 2: Amount of written data per node; manipulated via number of state variables recorded by each multimeter

1 – 22

• Corresponds to 1 GiB/node – 8 GiB/node (amount of spike data insignificant)

Factor 3: Output file system

1. POSIX I/O to GPFS

2. POSIX I/O to IME

3. POSIX I/O to /dev/null: Baseline condition, "infinitely fast storage device"

Further experimental settings:

Simulated biological time: 100 ms

Network size: 258,750 neurons per compute node, ca. 3e8 synapses per compute node

23 MPI ranks per compute node

5 runs per task condition, minimum reported
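The design can be written down as a small parameter grid. The sketch below enumerates the combinations of node count and output target, derives the network sizes implied by the weak scaling design, and reports the minimum over repeated runs. The levels of factor 2 are left out because only their range is given above, and the run times in the example are made up.

```python
from itertools import product

NODE_COUNTS = (1, 2, 4, 8, 16)
TARGETS = ("GPFS", "IME", "/dev/null")
NEURONS_PER_NODE = 258_750
RANKS_PER_NODE = 23


def network_size(nodes):
    """Weak scaling: the network grows linearly with the number of nodes."""
    return nodes * NEURONS_PER_NODE


def neurons_per_rank():
    """Neurons handled by each MPI rank, independent of the node count."""
    return NEURONS_PER_NODE // RANKS_PER_NODE


def reported_time(run_times_s):
    """The minimum over the repeated runs of one task condition is reported."""
    return min(run_times_s)


conditions = list(product(NODE_COUNTS, TARGETS))
print(len(conditions), "combinations of node count and output target")
print("neurons per rank:", neurons_per_rank())
print("largest network:", network_size(max(NODE_COUNTS)), "neurons")

# Made-up run times for one condition, in seconds.
print("reported:", reported_time([41.2, 39.8, 40.5, 40.1, 39.9]))
```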


Bandwidth (1 GiB/node)


Bandwidth (8 GiB/node)


Bandwidth (1 and 8 GiB/node)

POSIX2IME very close to POSIX2DEVNULL: IME close to "ideal" performance

Very good scaling behavior of IME: Observed bandwidth nearly doubles with doubling of number of compute nodes

Bad scaling behavior of GPFS beyond 4 compute nodes

Observed bandwidth small compared to IOR measurements









Simulation Time (1 GiB/node)

Effective simulation time = simulation time without step 3 (MPI synchr.)


Simulation Time (8 GiB/node)

Effective simulation time = simulation time without step 3 (MPI synchr.)


Simulation Time: Observations

The larger the number of nodes, the stronger the advantage of writing to IME or /dev/null

Very good scaling behavior of IME clearly visible in plots

GPFS setting suffers heavily from imbalance between ranks

IME nearly reaches the performance of /dev/null; barely any additional I/O-induced imbalance between ranks


Relative Runtime Reduction

Reported values based on average over all measured I/O loads
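The slides do not spell out how the relative runtime reduction is computed. A common definition, assumed in the sketch below, is the saved time divided by the reference time, averaged over the I/O loads. The sketch also shows how a reduction translates into a speedup. All timings are made up for illustration.

```python
def relative_runtime_reduction(t_reference_s, t_improved_s):
    """Fraction of the reference runtime that is saved (assumed definition)."""
    return (t_reference_s - t_improved_s) / t_reference_s


def mean_reduction(timing_pairs):
    """Average the per-load reductions over all measured I/O loads."""
    reductions = [relative_runtime_reduction(ref, imp) for ref, imp in timing_pairs]
    return sum(reductions) / len(reductions)


def speedup_from_reduction(reduction):
    """A runtime reduction r corresponds to a speedup of 1 / (1 - r)."""
    return 1.0 / (1.0 - reduction)


# Made-up (reference, improved) runtimes in seconds for three I/O loads.
pairs = [(100.0, 50.0), (200.0, 80.0), (400.0, 100.0)]
avg = mean_reduction(pairs)

print(f"mean relative runtime reduction: {avg:.1%}")
print(f"a reduction of 75% equals a speedup of {speedup_from_reduction(0.75):.1f}")
```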


Data Retention Time Analysis


Motivation: Interactive Supercomputing

Data retention time analysis: Classification of data depending on how long it will be retained

Interactive supercomputing/HPC:

User can interact with the application(s) that run on the supercomputer/cluster

Misc. use cases for NEST


NEST: Data Retention Times

Data retention time analysis: Classification of data depending on how long it will be retained


Conclusions and Outlook


Conclusions

IOR Results: IME reached ca. 90% of nominal bandwidth in reading and writing

Promising finding for all considered application classes

NEST Results:

Barely any I/O-induced imbalance between ranks with IME (in contrast to GPFS)

IME performance close to baseline condition (/dev/null), nearly perfect weak scaling behavior

At the largest problem size: Speedup of nearly 4 achieved vs. GPFS

Easy handling: No code changes in NEST required

Conclusions:

IME actually works as theoretically expected for applications from the dominant write class (writing in bursts)

NEST users would strongly profit from the incorporation of IME in compute clusters (I/O no longer a limiting factor in gathering simulation results)


Outlook and Recommendations

Recommendations for the future development of IME:

Data pre-fetching: For "dominant read" applications, data pre-fetching before job start would be highly beneficial

• Integration into job managers?

Development of tools for managing short-term and transient data, integration into job managers

Support for end-to-end data integrity like within GPFS

Final word:

IME shows: Working burst buffer solutions exist for complex parallel applications

Opportunity to scale compute and I/O performance

Alternatively: Opportunity to reduce bandwidth requirements for external storage system


Questions?

Thank you for your attention!

Acknowledgements: We would like to thank DDN for making an IME test system available at Jülich Supercomputing Centre. In particular, we gratefully acknowledge the continuous support by Tommaso Cecchi and Toine Beckers.