IME Storage System

MSST 2019, May 20th

Paul Nowoczynski
Jean-Yves Vet

Courtesy of Jean-Thomas Acquaviva, Kenneth de Mello, Pharthiphan Asokan, Jean-François Le Fillâtre, James Coomer, John Bent, Lokesh Jaliminche

ddn.com © 2017 DataDirect Networks, Inc. * Other names and brands may be claimed as the property of others. Any statements or representations around future events are subject to change. DDN INTERNAL! Not for external use.


Page 2

ddn.com© 2019 DataDirect Networks, Inc. * Other names and brands may be claimed as the property of others.Any statements or representations around future events are subject to change.

Outline

Introduction

How IME Works

Cost Efficiency

Manageability

Performance

Use Cases

Fault Tolerance

Q&A

Page 3

Introduction: What’s IME (Infinite Memory Engine)? (1/3)

Traditional architecture

► Diverse, high-concurrency applications run on the compute nodes.

► The application issues IO directly to the parallel scratch file system.

► Application IO can be random and unaligned, which is not ideal for peak parallel file system efficiency.

► The parallel file system does not operate at peak efficiency.

► Persistent data (disk)

Page 4

Introduction: What’s IME (Infinite Memory Engine)? (2/3)

IME’s Active I/O Tier is inserted right between compute and the parallel file system.

IME software intelligently virtualizes disparate NVMe SSDs into a single pool of shared memory that accelerates I/O, PFS & applications.

► Scale-out flash cache layer using NVMe SSDs, inserted between the compute cluster and the Parallel File System (PFS):

• IME is configured as a cluster with multiple NVMe servers

• All compute nodes can access cached data on IME

► Accelerates difficult IO patterns (small, random, shared-file, high-concurrency) thanks to a thin software IO management layer

► Configured as a massive scale-out cache layer with huge IO bandwidth and IOPs

Page 5

Introduction: What’s IME (Infinite Memory Engine)? (3/3)

[Figure: IME data path, from diverse, high-concurrency applications on the compute nodes, through fast data (NVM & SSD), to persistent data (disk)]

► Application issues IO to the IME client.

► Erasure coding applied.

► IME client sends fragments to the IME servers.

► IME servers write buffers to NVM and manage internal metadata.

► IME servers write aligned sequential I/O to the SFA backend.

► Parallel Backing File System (BFS) operates at maximum efficiency.

► New system: no need to oversize the PFS to deliver large BW

► Accelerate an existing PFS (Lustre, Spectrum Scale, NFS)

Page 6

Introduction: IO500 - Largest B/W (1/2)

Largest bandwidth, relying on the PFS for metadata

► POSIX results

► Largest IOR scores (bulk data B/W)

► Metadata are forwarded to the BFS

Page 7

Introduction: IO500 - Largest B/W (2/2)

► Example: KISTI

− Theoretical peak: 96 EDR ~ 1 TB/s

− Almost no gap between:

■ Easy Read / Easy Write (large IOs, File per Process)

■ Easy / Hard write (47008 Bytes IOs, Single shared file)

− Hard read remains challenging

Page 8

Introduction: Bottom line

IME is a fully distributed, resilient, transparent and flash-native IO accelerator.

It uses an automated tiering system to write data (8MB chunks) back to the BFS (Lustre, Spectrum Scale or NFS).

Page 9

How IME works

Page 10

How IME works: Namespace - Delegated to the Backing File System

The servers are clients of the BFS

► Stand-alone, no modification required

► Can be a subdirectory of a mount; it doesn’t need to be at the root
• The subdirectory name contains a UUID identifying the BFS

► Metadata operations (create, delete, stat, …) are forwarded to the BFS
• Adds a hop, but in some cases gives higher metadata rates (fewer client nodes than IME servers)
• IME server uses the open_by_handle_at() and name_to_handle_at() system calls

Page 11

How IME works: IME is Log Structured (1/2)

► Writes changes only (byte addressable)
► Consecutive changes are split into chunks (max 128KB) called fragments

► Fragments remove implicit locking: performance is agnostic to file sharing
► Flash native (reduced write amplification on NAND devices)

[Figure: accumulation of fragments, plotted as file offset against time]
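The fragment rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Fragment` class, its `seq` field and the `append_write` helper are hypothetical names, and an in-memory list stands in for the log. Only the 128KB cap and the "append changes only" behaviour come from the slide.

```python
from dataclasses import dataclass

MAX_FRAGMENT_SIZE = 128 * 1024  # 128KB cap per fragment, as stated on the slide


@dataclass(frozen=True)
class Fragment:
    seq: int      # position in the log (hypothetical field)
    offset: int   # byte offset of this fragment in the file
    data: bytes   # the changed bytes only


def append_write(log, offset, data):
    """Append one write to the log, split into fragments of at most 128KB."""
    for start in range(0, len(data), MAX_FRAGMENT_SIZE):
        chunk = data[start:start + MAX_FRAGMENT_SIZE]
        log.append(Fragment(seq=len(log), offset=offset + start, data=chunk))


log = []
append_write(log, offset=0, data=b"a" * (300 * 1024))  # 300KB becomes 3 fragments
append_write(log, offset=4096, data=b"b" * 10)         # an overwrite is simply appended
```

Nothing is modified in place: the second write lands at the end of the log even though it overlaps the first, which is why no lock on the overwritten range is needed in this model.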

Page 12

How IME works: IME is Log Structured (2/2)

► During reads, fragments are virtually rendered to determine how to reconstruct the data from them

[Figure: accumulation of fragments (file offset against time) next to the rendered view (file offset); a hole in the rendered view is read from the BFS]
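A minimal sketch of the rendering step, assuming fragments are `(offset, data)` pairs listed oldest first and that a hypothetical `read_bfs(offset, size)` callback stands in for the backing file system. It works byte by byte for clarity; a real implementation would presumably track extents instead.

```python
def render(fragments, length, read_bfs):
    """Flatten fragments (oldest first) into the bytes of [0, length).

    Ranges that no fragment covers are holes and are fetched through
    read_bfs(offset, size), a hypothetical stand-in for the BFS.
    """
    view = bytearray(length)
    covered = [False] * length
    for offset, data in fragments:
        if offset >= length:
            continue
        end = min(offset + len(data), length)
        view[offset:end] = data[:end - offset]   # later fragments overwrite earlier ones
        covered[offset:end] = [True] * (end - offset)

    pos = 0
    while pos < length:
        if covered[pos]:
            pos += 1
            continue
        hole_end = pos
        while hole_end < length and not covered[hole_end]:
            hole_end += 1
        view[pos:hole_end] = read_bfs(pos, hole_end - pos)  # fill the hole from the BFS
        pos = hole_end
    return bytes(view)


backing = b"." * 12
fragments = [(0, b"AAAA"), (2, b"BB"), (8, b"CC")]
rendered = render(fragments, 12, lambda off, size: backing[off:off + size])
# rendered == b"AABB....CC.."
```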

Page 13

How IME works: IME Metadata (1/3)

Purpose of IME Metadata
► Describe each data fragment
► Know the state of the data fragments
► Know on which data block, on which NVMe device the fragments are located
► Know which blocks (other servers) are needed for reconstructions

Each server contains one (or two) device(s) for logs
► The logs are kept on a separate NVMe device called the commit log device
► Metadata are in RAM, reconstructed at server boot from the commit log device

Page 14

How IME works: IME Metadata (2/3)

IME Metadata are distributed
► Partial metadata view in RAM on each server

► All metadata accessible to all servers through a Distributed Hash-Table (DHT)

► Hash function provides a tuple containing ids of servers

Data     Peer tuple
File1    2,3,1,4
File4    4,1,3,2
File3    1,2,4,3
File6    3,4,2,1

[Figure: data keys go through the hash function, which yields peer tuples; the peers form the distributed network]

Page 15

How IME works: IME Metadata (3/3)

► Network parallelism: hash function assigns metadata of each file to a server

► Node-level fault tolerance: hash function tuple provides the list of servers containing copies (dhtcopies in the IME configuration file defines the number of copies)

► Self-optimising for noisy fabrics: backup servers provided by the hash function may be picked instead
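The slides do not say which hash function IME uses. As an illustration only, the sketch below swaps in rendezvous (highest-random-weight) hashing built on SHA-256 to produce a deterministic ordering of server ids per file, then keeps the first `dhtcopies` entries as the metadata holders. The function names `peer_tuple` and `metadata_holders` are hypothetical.

```python
import hashlib


def peer_tuple(key, num_servers):
    """Return a deterministic ordering of server ids 1..num_servers for a key.

    Rendezvous hashing is used here as a stand-in; it is not IME's function.
    """
    def weight(server_id):
        return hashlib.sha256(f"{key}:{server_id}".encode()).digest()

    return sorted(range(1, num_servers + 1), key=weight)


def metadata_holders(key, num_servers, dhtcopies):
    """First dhtcopies servers of the tuple: the primary, then the backups."""
    return peer_tuple(key, num_servers)[:dhtcopies]


holders = metadata_holders("File1", num_servers=4, dhtcopies=2)
```

Because every node computes the same tuple from the same key, any client or server can locate the metadata copies without a central lookup, and the remaining entries of the tuple give the fallback order.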

Page 16

How IME works: Bulk data - Mapping to IME servers

► Files are virtually split into 8MB chunks (buckets): each chunk is assigned to a server

Bucket 1: IME server N
Bucket 2: IME server N+x
Bucket 3: IME server N+y

[Figure: fragments plotted as file offset against time, grouped into buckets]
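A rough sketch of the bucket mapping, assuming 8MB means 8 × 1024 × 1024 bytes. The way a bucket is assigned to a server is not described on the slide (only "N", "N+x", "N+y"), so the CRC32-based `server_for` below is a hypothetical placeholder for that assignment.

```python
import zlib

BUCKET_SIZE = 8 * 1024 * 1024  # 8MB buckets, as stated on the slide


def bucket_index(offset):
    """Index of the bucket containing a file offset."""
    return offset // BUCKET_SIZE


def buckets_touched(offset, size):
    """Buckets covered by the byte range [offset, offset + size), size > 0."""
    return list(range(bucket_index(offset), bucket_index(offset + size - 1) + 1))


def server_for(file_id, offset, num_servers):
    """Hypothetical assignment: hash (file, bucket) onto a server index."""
    key = f"{file_id}:{bucket_index(offset)}".encode()
    return zlib.crc32(key) % num_servers


spanning = buckets_touched(BUCKET_SIZE - 1, 2)  # a 2-byte write across a boundary
```

Every offset inside one bucket maps to the same server, so a large sequential file is spread across servers bucket by bucket.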

Page 17

How IME works: Bulk data - Sending fragments to servers

Aggregating IOs (Write)

RDMA buffer from client node to a server: contains 8 sub-buffers

“Preferred” network write IO: 1MB

A sub-buffer is either a parity block or contains data fragments of the same file. This sub-buffer will be written to an NVMe device.

[Figure labels: 128KB, fragments]
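As a worked example of the aggregation, assume the 128KB label in the figure is the sub-buffer size: 8 sub-buffers × 128KB then gives exactly the 1MB preferred network write. The packing routine below is a hypothetical sketch of grouping data fragments so that each sub-buffer only holds fragments of one file; parity sub-buffers are left out.

```python
SUB_BUFFER_SIZE = 128 * 1024          # assumed from the 128KB figure label
SUB_BUFFERS_PER_RDMA = 8              # from the slide
RDMA_BUFFER_SIZE = SUB_BUFFER_SIZE * SUB_BUFFERS_PER_RDMA  # 1MB


def pack_sub_buffers(fragments):
    """Group (file_id, size) fragments into sub-buffers, one file per sub-buffer."""
    sub_buffers = []     # each entry: {"file": id, "used": bytes, "fragments": count}
    open_by_file = {}    # file_id -> index of the sub-buffer still being filled
    for file_id, size in fragments:
        if size > SUB_BUFFER_SIZE:
            raise ValueError("a fragment cannot exceed one sub-buffer")
        idx = open_by_file.get(file_id)
        if idx is None or sub_buffers[idx]["used"] + size > SUB_BUFFER_SIZE:
            sub_buffers.append({"file": file_id, "used": 0, "fragments": 0})
            idx = len(sub_buffers) - 1
            open_by_file[file_id] = idx
        sub_buffers[idx]["used"] += size
        sub_buffers[idx]["fragments"] += 1
    return sub_buffers


packed = pack_sub_buffers([("a", 4096), ("b", 4096), ("a", 4096), ("a", 128 * 1024)])
```

Small writes to the same file share a sub-buffer, which is how many tiny application IOs can turn into a few large device writes.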

Page 18

How IME works: Bulk data - Reading data

Client nodes know which server(s) to query to retrieve data

Client prefetcher
► Client nodes enable their prefetch engines when they detect a pattern in data accesses
► Works on strided IOs

Client node

Server N
- Flattens the fragments
- Retrieves data via local or remote NVMe device(s) or BFS
- Prepares the RDMA to client node

Page 19

How IME works: Resilient to SSD Failures (1/4)

[Figure: file cache spread across IME Server0, IME Server1, IME Server2 … IME ServerN, with files protected by different geometries: 8+3, 8+1, 6+0, 8+2, 6+1, 4+1, 8+0, 1+1]

► IME supports multiple resilience levels through flexible, adaptive erasure coding.

► System-wide default up to 15+3.

► Runtime flexible: applications can override defaults and select a specific erasure coding scheme.

Erasure coding options:
none
1+1    1+2    1+3
2+1    2+2    2+3
3+1    3+2    3+3
...    ...    ...
15+1   15+2   15+3

Page 20

How IME works: Resilient to SSD Failures (2/4)

IME implements erasure coding across servers
► Allows for the loss of data drives as well as entire servers
► Clients do all the erasure coding computations

• Except for data recovery after a device loss, which is done between servers

Clients work with parity groups (PG)
► A PG is made of D+P buffers (D data + P parity, also written M+N or N+K)

Parity geometry can be defined in several places
► Globally, in ime.conf → def_pgeom = N+K
► For one FUSE client, in ime-fuse.conf → --pgeom=N+K
► For one native IME client application → IM_CLIENT_PGEOM=N+K
► The client options take precedence over the global default value
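The precedence rule can be sketched as follows. The helper names are hypothetical and the function only models the rule stated above (a client-level value, when present, wins over the global `def_pgeom`); it does not read `ime.conf`, `ime-fuse.conf` or the environment. The efficiency helper is plain arithmetic: D data buffers out of D+P stored buffers.

```python
def parse_pgeom(spec):
    """Turn a geometry string such as '3+1' into (data, parity)."""
    data, parity = spec.split("+")
    return int(data), int(parity)


def effective_pgeom(global_default, client_option=None):
    """Client-level setting, when present, overrides the global default."""
    return parse_pgeom(client_option if client_option else global_default)


def storage_efficiency(data, parity):
    """Fraction of written buffers that carry user data."""
    return data / (data + parity)


default_geom = effective_pgeom("3+1")            # global def_pgeom only
override_geom = effective_pgeom("3+1", "8+2")    # one client asks for 8+2
```

With the 3+1 default three quarters of the written buffers hold user data, while an 8+2 override stores 80% user data and tolerates two lost buffers per parity group.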

Page 21

How IME works: Resilient to SSD Failures (3/4)

What happens if there isn’t enough client data to fill all the buffers?
► Parity geometry adapts
► The number of parity buffers stays the same
► The number of data buffers is lower than in the requested geometry
► Example: 4+2 default may result in 3+2, 2+2 or 1+2
► The recovery guarantee is always respected

• 4+2 cannot result in 3+1

What’s the recovery model of IME?
► When a drive fails, the buffers that belonged to PGs on that drive are reconstructed automatically on other drives of the same server.
• As all other buffers from those PGs are on other servers, a single server can lose a lot of drives without losing parity-protected data.
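The adaptation rule reduces to a one-line computation: the data count shrinks to what is available, the parity count never changes. The function name and the `available_data` parameter are hypothetical; the 4+2 values are the example from the slide.

```python
def adapt_geometry(requested_data, parity, available_data):
    """Shrink the data side of a D+P geometry; the parity count is never reduced."""
    if available_data < 1:
        raise ValueError("at least one data buffer is needed to form a parity group")
    return min(requested_data, available_data), parity


# Requested 4+2 with 4, 3, 2 and 1 data buffers available.
adapted = [adapt_geometry(4, 2, n) for n in (4, 3, 2, 1)]
```

Since the parity count is preserved, each of these groups still survives the loss of two buffers, which is why 4+2 can never degrade to 3+1.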

Page 22

How IME works: Resilient to SSD Failures (4/4)

Page 23

How IME works: Recovery Models - Recap

Erasure Coding with PG (IME Data)
► Allows NVMe drive reconstruction and node-level fault tolerance (bulk data)
► def_pgeom = N+K defines the parity geometry in the IME configuration file

DHT copies (IME Metadata)
► Allows node-level fault tolerance (IME metadata)
► dhtcopies defines the number of copies in the IME configuration file

Page 24

How IME works: Automated tiering (1/2)

► IME IO Operations

• Reading – IME clients reading data

o If data exist in IME then data are read from IME

o If data don’t exist in IME then they are read from the BFS

o Autoprestage could be enabled in IME1.3

• Prestaging

o Preloading data from the BFS to IME.

o Prestaged data are clean.

• Writing

o New or modified data is written to IME, regardless of its original location

o Data written to IME are dirty.

• Releasing

o Data are marked as deletable, and then freed from IME, regardless of their state

• Syncing

o Modified data are copied back to the BFS, automatically or per user request

o Data waiting for sync are pending, and they become clean again once synced

Page 25

How IME works: Automated tiering (2/2)

► States of data in IME

• Clean – Data in IME cache has an exact copy on the BFS. It has either been synchronized or prestaged.

• Dirty – New or modified data in IME without an exact copy on the BFS.

• Pending – Dirty data awaiting synchronization to the BFS.

• Deletable – Data that may be freed from IME. Data can be manually marked as deletable regardless of its state.

► Read Coherency

• Clean and Dirty data are read from IME.

• Non-cached data are retrieved from BFS

• /!\ Overwriting directly into the BFS is currently not supported. IME does not track changes on the BFS. If data are clean or dirty in IME, they should not be modified on the BFS.
o IME-1.3.x will support “mtime change” detection so that clean data may be automatically evicted from IME.
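The operations and states described on the two tiering slides can be read as a small state machine. The sketch below is one possible reading, not IME's implementation: the operation names (`prestage`, `write`, `sync_start`, `sync_done`, `release`) are hypothetical labels, `None` stands for data that is not cached in IME, and only transitions the slides mention are modelled.

```python
from enum import Enum


class State(Enum):
    CLEAN = "clean"
    DIRTY = "dirty"
    PENDING = "pending"
    DELETABLE = "deletable"


def next_state(state, op):
    """Return the state after op; state None means the data is not cached in IME."""
    if op == "write":
        return State.DIRTY                      # written data are dirty
    if op == "prestage" and state is None:
        return State.CLEAN                      # prestaged data are clean
    if op == "sync_start" and state is State.DIRTY:
        return State.PENDING                    # waiting for sync to the BFS
    if op == "sync_done" and state is State.PENDING:
        return State.CLEAN                      # clean again once synced
    if op == "release" and state is not None:
        return State.DELETABLE                  # regardless of the current state
    raise ValueError(f"operation {op!r} is not modelled for state {state}")


def read_source(state):
    """Clean and dirty data are read from IME; non-cached data come from the BFS."""
    if state in (State.CLEAN, State.DIRTY):
        return "IME"
    if state is None:
        return "BFS"
    raise ValueError(f"read source for {state} is not stated on the slides")


history = []
state = None
for op in ("prestage", "write", "sync_start", "sync_done", "release"):
    state = next_state(state, op)
    history.append(state)
```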

Page 26

How IME works: IO Management - Data Flow to Servers

Page 27

Introduction: IME Client Interfaces

IME Native API Spec.: https://github.com/DDNStorage/ime_native

MPI implementations supporting IME MPI-IO upstream:
► MVAPICH2 2.3.1
► Open MPI > 4.0.1
► MPICH 3.3

Page 28

Return on Investment

Simple server nodes with single-port NVMe drives

Device bandwidth aligned with network bandwidth

Better endurance: reduced write amplification

Page 29

ROI: Cost-efficiency - Simple Hardware

140
► 10x single-port NVMe devices
► Dual Intel 4108 CPU (8C, 1.8GHz)
► 12x 8GB DIMMs
► OS drive (128GB SATA DoM)
► Two 16x PCIe slots for HDR100/EDR/OPA/40 or 100 GbE

240
► 24x single-port NVMe devices
► Dual Intel E5-2680 v4 CPU (14C, 2.4GHz)
► 8x 16GB DIMMs
► Two OS drives (1TB SATA)
► Two 16x PCIe slots for HDR100/EDR/OPA/40 or 100 GbE

⌨ Live Demo: Listing NVMe devices

► Simple server nodes with single-port NVMe drives
► Limited switch footprint: 1 switch HDR port per box (split cable)

Page 30

ROI: Cost-efficiency - Bandwidth Balanced

240: >600K IOPs | 20GB/s | 2 Rack Units

► Device bandwidth aligned with network bandwidth

Random 4k write > 1M IOPs

Random 4k read > 600K IOPs

Page 31

Introduction: IME is Flash Native (1/2)

Log Structured at the storage device level:

• High performance device throughput (NAND Flash)

• Increases device lifetime due to reduced write amplification

[Figure: random writes arrive with offsets corresponding to existing data; with the IME log-structured filesystem, the SSD sees writes to new block ranges. Legend: valid sectors with user data; free, unused blocks]

Page 32

ROI: Cost-efficiency - Endurance

[Figure: endurance chart; lower is better]

► Main factors contributing to extended endurance:

• Log-structured (no read-modify-write including parity protected data)

• IOs aggregated in 128KB (aligned) chunks on disk

• Unmap (trim) commands on released blocks

Page 33

Manageability

User Space: simple installation

Integration with open source tools

Easy to monitor

Automated tiering

Page 34

Manageability: User Space - Simple Installation (1/4)

Server Packages
► ime-240-1.00-219.el7.x86_64.rpm
► ime-nvme-1.8-213.el7.noarch.rpm
► ime-mem-dev-1.0-454.el7.noarch.rpm
► libfabric-1.4.0.ddn1.0-el7.x86_64.rpm
► ime-net-libfabric-1.2.2-1573.el7.x86_64.rpm
► libisal-2.16.0-el7.x86_64.rpm
► ime-3rdparty-1.2.2-1573.el7.x86_64.rpm
► ime-common-1.2.2-1573.el7.x86_64.rpm
► ime-server-1.2.2-1573.el7.x86_64.rpm

Client Packages
► libfabric-1.4.0.ddn1.0-el7.x86_64.rpm
► ime-net-libfabric-1.2.2-1573.el7.x86_64.rpm
► libisal-2.16.0-el7.x86_64.rpm
► ime-ulockmgr-1.2.2-1573.el7.x86_64.rpm
► ime-3rdparty-1.2.2-1573.el7.x86_64.rpm
► ime-common-1.2.2-1573.el7.x86_64.rpm
► ime-client-1.2.2-1573.el7.x86_64.rpm
► mvapich-verbs-2.2.ddn1.4-el7.x86_64.rpm

[Figure labels grouping the packages by component: Platform, NVMe U. Space, Network Stack, Erasure Coding, Locking, Libfuse + Hwloc, IME, MPI]

⌨ Live Demo: Installation

Page 35

Manageability: User Space - Simple Installation (2/4)

⌨ Live Demo: Configuration

Preparing the Backing File System
► Register a path on the BFS (it doesn’t need to be at the root):

ime-bfs-prepare -b <path> -n <name>

Preparing all Nodes (clients and servers)
► Configuration file: /etc/ddn/ime/ime.conf

• Should be identical on all nodes

Preparing the Client Nodes
► IME FUSE configuration file: /etc/ddn/ime/ime-fuse.conf

Preparing the Server Nodes
► Register the NVMe devices: ime-nvm-toolbox -AO

ime.conf

# Declare a backing file system. There may be several.
ime_bfs mybfs { # logical name
    bfs_uuid = 271ba1a5-ae21-4b14-818a-28ccd97968e0;
    mount_point = /bfs/lustre/ime;
    bfs_type = lustre;
}

# Define node profile
im_node_profile ime240 {
    devs = /dev/ime_nvme{0,1,2,3,4,5,...};
    cmtl_devs = /dev/ime_nvme23;
    netdevs = [ib0-verbs,ib1-verbs];
    server_port = 13814;
    client_port = 5813,5814,5815,5816;
    heartbeat_port = 2222;
}

im_pool pool_msst {
    #peerno:hostname:enabled:profile:ipquad
    0:dime240-02:yes:ime240:172.30.3.12,172.30.3.22;
    1:dime240-03:yes:ime240:172.30.3.13,172.30.3.23;
    2:dime240-04:yes:ime240:172.30.3.14,172.30.3.24;
    3:dime240-05:yes:ime240:172.30.3.15,172.30.3.25;
}

im_cluster main {
    pools = [pool_msst];
    bfs = mybfs;
    dhtcopies = 2;
    def_pgeom = 3+1;
}

Page 36

Manageability: User Space - Simple Installation (3/4)

► Dockerized IME client (1 FUSE mount point per container):
• Available in a future IME release

► Early evaluations on NVIDIA DGX show similar benefits

[Figure: two charts; higher is better]

Page 37

Manageability: User Space - Simple Installation (4/4)

⌨ Live Demo: Starting IME

Start the IME service on all server nodes
► service ime-server start
► Default path for log file: /var/log/ime/ime-server.log

Mount IME on client nodes
► service ime-fuse start
► Default path for log file: /var/log/ime/ime-fuse.log

Page 38

Manageability: Integration with Open Source Tools

⌨ Live Demo: IO tools

FIO (https://github.com/axboe/fio)
► Write in the FUSE mount point with standard FIO engines
► IME engines (sync and async) relying on the IME Native interface

IOR (https://github.com/hpc/ior)
► POSIX interface and MPIIO interface (if not compiled with an MPIIO supporting IME)
► MPIIO interface (compiled with IME support)
► IME interface

YAPIO, Yet Another Parallel I/O Testing Tool (https://github.com/00pauln00/yapio)
► POSIX interface
► IME interface

MDtest (https://github.com/hpc/ior)
► POSIX interface
► IME interface

[Figure labels: each entry is tagged POSIX, IME MPIIO or IME Native]

Page 39

Manageability: Easy to Monitor (1/7)

► Monitoring tool in terminal

► Metrics can be retrieved in JSON format (and used with your own tools!)

► Preconfigured Grafana solution

► Turn-key solution: DDN Insight

Page 40

Manageability: Easy to Monitor (2/7)

► Monitoring tool in terminal (ime-monitor --nvm-stats)

⌨ Live Demo: Monitoring IME Metrics

NVM Metrics

- Internal ID to identify a unique drive in IME

- Name or path used to open the device

- Read throughput (IO/s), and bandwidth (MB/s)

- Write throughput (IO/s), and bandwidth (MB/s)

- Read throughput (IO/s)

- Free space (%)

- Device state

Page 41

Manageability: Easy to Monitor (2/7)

► Monitoring tool in terminal (ime-monitor --rpc)

⌨ Live Demo: Monitoring IME Metrics


Manageability

Easy to Monitor (3/7)

► Monitoring tool in terminal (ime-monitor --frgbs)

⌨ Live Demo: Monitoring IME Metrics

Data Migration Statistics

- BFS Sync Subsystem: Auto sync and manual sync subsystem status.

- Auto Flush: Auto sync subsystem status.

- Data Cleanup: Auto cleanup subsystem status.

- Percentage Free: Percentage of free space on the server.


Manageability

Easy to Monitor (4/7)

► Monitoring tool in terminal (ime-monitor --bfs-stats)

⌨ Live Demo: Monitoring IME Metrics


Manageability

Easy to Monitor (5/7)

► Metrics can be retrieved in JSON format

► Preconfigured Telegraf - InfluxDB - Grafana solution

[Diagram: a Telegraf agent runs on each IME server node and feeds InfluxDB on the IME monitoring node, where Grafana provides real-time monitoring. Licenses: Grafana (Apache 2.0 License), InfluxDB (MIT License), Telegraf (MIT License).]
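To make the Telegraf - InfluxDB - Grafana pipeline more concrete, the following Python sketch formats one metrics sample as an InfluxDB line-protocol record (`measurement,tags fields timestamp`), which is the text format InfluxDB ingests. The measurement name `ime_nvm`, the tag names and the field names are hypothetical; the preconfigured solution mentioned on the slide may collect and name metrics differently.

```python
from typing import Dict


def _escape(text: str) -> str:
    """Escape the characters that are special in tag and field keys/values."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line_protocol(measurement: str,
                     tags: Dict[str, str],
                     fields: Dict[str, float],
                     timestamp_ns: int) -> str:
    """Build one line-protocol record with float fields only.

    Tags are sorted by key, which InfluxDB recommends for ingest performance.
    """
    if not fields:
        raise ValueError("line protocol requires at least one field")
    tag_part = ",".join(f"{_escape(k)}={_escape(str(v))}"
                        for k, v in sorted(tags.items()))
    field_part = ",".join(f"{_escape(k)}={float(v)}" for k, v in fields.items())
    head = f"{measurement},{tag_part}" if tag_part else measurement
    return f"{head} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    # Hypothetical sample: server and device names are made up.
    print(to_line_protocol(
        "ime_nvm",
        {"server": "ime01", "device": "nvme0n1"},
        {"free_pct": 62.5, "read_mbps": 950},
        1558310400000000000,
    ))
```

A script like this could, for instance, be wired into Telegraf through an input plugin that accepts line protocol; whether the IME solution does so is not stated in the slides.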


Manageability

Easy to Monitor (6/7)

⌨ Live Demo: Monitoring IME Metrics

Aggregated dashboard

Per server dashboard


Manageability

Easy to Monitor (7/7)

► Turn-key solution (next release): IME Metrics integrated in DDN Insight:

○ Aggregated views for Clients, Servers, Devices

○ Performance and status data collection

○ Event monitoring and alerts

○ Live and historical data analysis


Manageability

Automated tiering

Initial values in IME conf file

► flush_threshold_ratio = when to start syncing (default if not specified: 40%)

► min_free_space_ratio = when to start releasing (default if not specified: 15%)

Can be changed on the fly from a client node (requires root credentials)

► ime-sys-ctl -u <flush_threshold_ratio>

► ime-sys-ctl -n <min_free_space_ratio>

Auto sync can be toggled on and off per file (= pinning)

► ime-ctl -m <file> = auto sync on

► ime-ctl -M <file> = auto sync off
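The two thresholds can be read as a simple decision rule. The Python sketch below is one possible interpretation, not the IME implementation: it assumes that syncing starts once the share of not-yet-synced data reaches `flush_threshold_ratio` of the capacity, and that releasing starts once free space drops to `min_free_space_ratio` of the capacity. The slides give only the parameter names and defaults, so the exact quantities being compared are assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TieringConfig:
    """Defaults taken from the slide; semantics below are assumed."""
    flush_threshold_ratio: float = 0.40   # when to start syncing
    min_free_space_ratio: float = 0.15    # when to start releasing


def tiering_actions(capacity_bytes: int,
                    unsynced_bytes: int,
                    free_bytes: int,
                    cfg: TieringConfig = TieringConfig()) -> Dict[str, bool]:
    """Decide which background actions would be triggered (assumed semantics)."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    start_sync = unsynced_bytes / capacity_bytes >= cfg.flush_threshold_ratio
    start_release = free_bytes / capacity_bytes <= cfg.min_free_space_ratio
    return {"sync": start_sync, "release": start_release}


if __name__ == "__main__":
    print(tiering_actions(capacity_bytes=100, unsynced_bytes=45, free_bytes=10))
```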


Performance

Close to peak BW even with single shared file

Low random read latency

Dynamic load balancing

Declustered rebuild


Performance

Close to peak BW - Even with single shared file (1/5)

[Diagram labels: FILESYSTEM, Shared File, Performance barrier, file]

► Parallel file systems can exhibit extremely poor performance for shared-file IO: they manage files in large lock units, so internal lock management creates contention

► IME eliminates contention by managing IO fragments directly and coalescing IOs prior to flushing to the parallel file system

⌨ Live Demo: Single-shared-file “Hard” I/O
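The idea of coalescing IO fragments before flushing can be sketched as a plain interval merge. This is a generic illustration in Python, not IME's actual algorithm: fragments are modeled as hypothetical `(offset, length)` pairs within one file, and adjacent or overlapping fragments are merged into larger contiguous extents so that fewer, larger writes reach the parallel file system.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (offset, length) in bytes; a modeling assumption


def coalesce(fragments: List[Extent]) -> List[Extent]:
    """Merge adjacent or overlapping fragments into contiguous extents."""
    merged: List[Extent] = []
    for offset, length in sorted(fragments):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # Touches or overlaps the previous extent: grow it.
            last_offset, last_length = merged[-1]
            end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, end - last_offset)
        else:
            merged.append((offset, length))
    return merged


if __name__ == "__main__":
    # Four 4 KiB fragments written out of order by different processes.
    writes = [(8192, 4096), (0, 4096), (4096, 4096), (20480, 4096)]
    print(coalesce(writes))
```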


Performance

Close to peak BW - Even with single shared file (2/5)

MEMORY: 8x 16GB DDR4-2400 RAM

CPUS: 2x Intel E5 Series Processors, 2x PCIe 3.0 ports (16x) for drives, 2x PCIe 3.0 ports (16x) for fabrics

FABRIC ADAPTERS: 2x InfiniBand EDR/FDR, or 2x 10/40/100GbE Ethernet, or 2x Intel Omni-Path

DATA DRIVES: 23x Intel NVMe P3520 1.2TB

[Figure: IME240 hard limits per component, with columns LATENCY (4K IO) and AGGREGATED B/W (1MB IO), each split into READ and WRITE. Bandwidth values in slide order: 25 GB/s, 105 GB/s, 30 GB/s, 25 GB/s, 38.2 GB/s; then 25 GB/s, 105 GB/s, 30 GB/s, 25 GB/s, 29.6 GB/s. Latency values in slide order: 23 µs, ~20 µs* (* 180 µs with QD 64), ~3 µs; then 18 µs, ~15 µs* (* 210 µs with QD 64), ~3 µs; ~0.1 µs for both read and write.]


Performance

Close to peak BW - Even with single shared file (3/5)

[Figure: two charts, higher is better (single IME240 server saturation)]


Performance

Close to peak BW - Even with single shared file (4/5)

[Figure: I/O roofline model - Single Shared File - IME server bandwidths (native interface). X axis: I/O Payload (Bytes). Y axis: Bandwidth (GB/s), higher is better. Ceilings: 23x P3520 NVMe 1.2 TB, InfiniBand (2x EDR), PCIe 3.0 32x lanes; latency-driven region at 4K. Series: Write Sequential, Write Random, Read Sequential, Read Random.]
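A roofline model of this kind bounds the attainable bandwidth by the lowest ceiling on the data path, and by a latency-driven bound for small payloads. The Python sketch below shows the general shape of such a model under simplifying assumptions: the latency is treated as constant across payload sizes, and the ceiling values and their assignment to components are illustrative placeholders, not measurements taken from the figure.

```python
from typing import Dict


def attainable_bandwidth_gbs(payload_bytes: int,
                             ceilings_gbs: Dict[str, float],
                             latency_us: float,
                             queue_depth: int = 1) -> float:
    """Upper bound on bandwidth (GB/s) for a given I/O payload size.

    The latency bound is queue_depth * payload / latency; the result is the
    minimum of that bound and every hardware ceiling.
    """
    latency_bound = queue_depth * payload_bytes / (latency_us * 1e-6) / 1e9
    return min(latency_bound, *ceilings_gbs.values())


# Illustrative placeholder ceilings (assumed, not read off the figure).
CEILINGS = {"network": 25.0, "pcie": 30.0, "drives": 38.2}

if __name__ == "__main__":
    for size in (4096, 65536, 1 << 20):
        bw = attainable_bandwidth_gbs(size, CEILINGS, latency_us=168.0,
                                      queue_depth=64)
        print(f"{size:>8} B -> {bw:.2f} GB/s")
```

With small payloads the latency term dominates, which corresponds to the latency-driven 4K region of the figure; with large payloads the lowest hardware ceiling takes over.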


Performance

Close to peak BW - Even with single shared file (5/5)

► Results extracted from IO500 where the client count is 100 nodes or more

► Filesystem options show huge degradation when the IO pattern is tough

► Only IME is able to present Flash to the applications efficiently

IO500 November 2017


Low random read latency

I/O critical path break down on a random read 4K IO (1/6)

Client node processing

Communication layer
- Protocol
- Hardware transport layer

Server(s) processing

Disk access (no software cache)

[Diagram: compute nodes running IME clients, connected through a switch to IME appliances running IME servers; the labels a, b, c, d mark the four stages listed here]


Low random read latency

I/O critical path break down on a random read 4K IO (2/6)

Client side
- Erasure coding management (slide 22)
- IO Aggregation (write)
- Prefetcher (sequential read)
- FUSE vs Native overhead

10 µs
- 4K random (SSF) read


Low random read latency

I/O critical path break down on a random read 4K IO (3/6)

Network layer
- Network protocol (Verbs over InfiniBand, TCP over Ethernet, ...)
- Network topology (hops?)
- IRQs (+ context switches)
- NUMA effects
- Network software stack (OFED drivers + CCI* + IME network layer)

19 µs
- InfiniBand FDR

* Common Communication Interface (cci-forum.com)


Low random read latency

I/O critical path break down on a random read 4K IO (4/6)

Server processing
- Fetch data on Backing File System
- IME internal read application
- IME tasks engine

25 µs
- No data fetched on the BFS


Low random read latency

I/O critical path break down on a random read 4K IO (5/6)

Drive accesses
- Queue depth

114 µs
- Intel NVMe P3520 1.2 TB
- Queue depth between 1 and 10
- Disk has free blocks


Low random read latency

I/O critical path break down on a random read 4K IO (6/6)

168 µs
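The 168 µs figure is consistent with the sum of the per-stage values shown on the previous four slides (10 µs client side, 19 µs network layer, 25 µs server processing, 114 µs drive access). The short Python sketch below reproduces that arithmetic and each stage's share of the total; the stage names are shorthand chosen for this example.

```python
from typing import Dict, List, Tuple

# Per-stage latencies in microseconds, as shown on slides (2/6) to (5/6).
STAGES_US: Dict[str, int] = {
    "client side": 10,
    "network layer": 19,
    "server processing": 25,
    "drive access": 114,
}


def breakdown(stages: Dict[str, int]) -> List[Tuple[str, int, float]]:
    """Return (stage, latency in µs, percent of total) for each stage."""
    total = sum(stages.values())
    return [(name, us, 100.0 * us / total) for name, us in stages.items()]


if __name__ == "__main__":
    for name, us, pct in breakdown(STAGES_US):
        print(f"{name:<18} {us:>4} µs  {pct:5.1f}%")
    print(f"{'total':<18} {sum(STAGES_US.values()):>4} µs")
```

On these numbers the drive access accounts for roughly two thirds of the critical path.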


Performance

Dynamic Load Balancers - Data Placement on NVMe devices

IME IO scheduler in a Server

► Performance oriented
○ Slow drives receive fewer requests

► NUMA aware
○ Data in a network buffer is preferentially placed on NVMe devices in the same NUMA node


Performance

Dynamic Load Balancers - Data Placement on Server Nodes

IME Non-Deterministic (ND) Data Placement

► Enabled via an environment variable on the client

IM_CLIENT_DATA_PLACEMENT_TYPE=NONDETERMINISTIC

► IME performance-aware load balancing ensures all available performance is utilized:
○ ½ Server Lost: ½ Server Performance Lost
○ 1 Server Lost: 1 Server Performance Lost

Deterministic Data Placement

► In traditional systems all servers run at the rate of the slowest server:
○ ½ Server Lost: ½ All Performance Lost
○ 1 Server Lost: All Performance Lost
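The contrast between the two placement modes can be checked with a toy throughput model in Python. It rests on two idealized assumptions that the slides do not spell out: with deterministic placement every server receives an equal share of the data, so the aggregate rate is the number of servers times the slowest server's rate; with non-deterministic placement the load is balanced perfectly, so the aggregate rate is the sum of the individual rates.

```python
from typing import Sequence


def deterministic_throughput(rates: Sequence[float]) -> float:
    """Equal shares per server: everyone waits for the slowest one."""
    return len(rates) * min(rates)


def nondeterministic_throughput(rates: Sequence[float]) -> float:
    """Idealized performance-aware balancing: each server runs at its own rate."""
    return float(sum(rates))


if __name__ == "__main__":
    scenarios = {
        "all healthy": [1.0, 1.0, 1.0, 1.0],
        "half a server lost": [1.0, 1.0, 1.0, 0.5],
        "one server lost": [1.0, 1.0, 1.0, 0.0],
    }
    for label, rates in scenarios.items():
        print(f"{label:<20} deterministic={deterministic_throughput(rates):.1f} "
              f"non-deterministic={nondeterministic_throughput(rates):.1f}")
```

With four servers of unit rate, losing half of one server halves the deterministic aggregate (4.0 to 2.0) but costs the non-deterministic aggregate only half a server (4.0 to 3.5), matching the bullet points above.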


Performance

Declustered rebuild

When Disk Failure Occurs:

► Reconstruction is triggered automatically on the other drives of the same server

► Uses parity blocks located in the other server nodes

Reconstruction is fast

► Only the data which resided on the failed device needs to be reconstructed
○ If only 10% of the capacity was used, only that 10% is rebuilt

► Performance is mainly limited by the hardware of the server on which the drive is reconstructed
○ Aggregated BW of network adapters
○ Aggregated BW of the remaining NVMe devices
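A back-of-the-envelope estimate follows from these two points: the amount of data to rebuild is the used part of the failed device, and the rebuild rate is capped by the slower of the network adapters and the remaining NVMe devices. The Python sketch below encodes that estimate. It ignores parity-read amplification and concurrent application traffic, and the bandwidth numbers in the example are hypothetical.

```python
def rebuild_seconds(device_capacity_bytes: float,
                    used_fraction: float,
                    network_bw_bytes_s: float,
                    nvme_bw_bytes_s: float) -> float:
    """Lower-bound estimate of the time needed to rebuild one failed device."""
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError("used_fraction must be between 0 and 1")
    data_to_rebuild = device_capacity_bytes * used_fraction
    bottleneck = min(network_bw_bytes_s, nvme_bw_bytes_s)
    return data_to_rebuild / bottleneck


if __name__ == "__main__":
    # Hypothetical inputs: a 1.2 TB drive that is 10% full, a 25 GB/s network
    # ceiling and 20 GB/s of aggregate write bandwidth on the surviving drives.
    seconds = rebuild_seconds(1.2e12, 0.10, 25e9, 20e9)
    print(f"estimated rebuild time: {seconds:.1f} s")
```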


Performance

We keep improving performance


Use cases (1/3)

Simulating an “AI workload”

Simulating an “AI workload” (random 67KB reads from 576 processes)

► Big file on BFS

► Reading from BFS

► Prestage

► Read from IME (uncached)

► Read from IME (cached)

► Write output

► Sync & release
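For readers who want to reproduce the access pattern on a small scale, the Python sketch below issues random reads of a fixed size against a single file from one process. It is not the tool used in the demo: the demo runs 576 processes, and the interpretation of "67KB" as 67 x 1024 bytes, the function name and the parameters are assumptions made for this example.

```python
import os
import random


def random_reads(path: str,
                 read_size: int = 67 * 1024,
                 count: int = 100,
                 seed: int = 0) -> int:
    """Perform `count` random reads of `read_size` bytes; return bytes read."""
    rng = random.Random(seed)
    file_size = os.path.getsize(path)
    if file_size < read_size:
        raise ValueError("file is smaller than one read")
    total = 0
    with open(path, "rb") as handle:
        for _ in range(count):
            # Pick an offset such that the whole read fits inside the file.
            offset = rng.randrange(0, file_size - read_size + 1)
            handle.seek(offset)
            total += len(handle.read(read_size))
    return total
```

Running one such reader per process, each with a different seed, would approximate the multi-process pattern described on the slide.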

⌨ Live Demo: Use case 1


Use cases (2/3)

Accelerating your applications (MPI-IO)

PnetCDF I/O Benchmark using S3D Application I/O Kernel

https://github.com/wkliao/S3D-IO

A data checkpoint is performed at regular time intervals, and its data consist of three- and four-dimensional array variables of type double.

► Run with PFS

► Effect of recompiling the app with MPI supporting IME

► Run with IME

⌨ Live Demo: Use case 2


Use cases (3/3)

Accelerating NFS

[Diagram labels: COMPUTE, IME, NFS]

► Brings scale-out Flash native performance to NFS access

► Shields the NFS server from "tough" IO

► Increases IO throughput from NFS hardware

► Zero application changes - replace NFS mount by IME mount


Fault Tolerance


Fault Tolerance (1/3)

IME already supports

► Data drive failures (erasure coding)

► Node failure (node ejection relying on the Raft consensus algorithm)

IME 1.3 (upcoming release)

► Disk Hot Swap

► Node Reinsertion


Fault Tolerance (2/3)

Demo with the following scenario

► Data 4x IME240 nodes (parity 2+1, DHT copies 3)

1) Server 3 is ejected, 1TB of data needs to be rebuilt

2) 3 NVMe drives are ejected on server 4. 800GB of data is rebuilt on server 4
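To illustrate what a 2+1 parity layout makes possible, here is a minimal Python sketch in which the single parity block is the XOR of the two data blocks, so any one missing block of a stripe can be recomputed from the two survivors. XOR parity is an assumption chosen for simplicity; the slides do not state which erasure code IME uses, and all function names are hypothetical.

```python
from typing import List, Optional


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_2_plus_1(d0: bytes, d1: bytes) -> List[bytes]:
    """Build a stripe of two data blocks plus one XOR parity block."""
    return [d0, d1, xor_blocks(d0, d1)]


def rebuild_missing(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Recompute the single missing block (marked None) from the survivors."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("a 2+1 stripe can repair exactly one missing block")
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    repaired[missing[0]] = xor_blocks(survivors[0], survivors[1])
    return repaired


if __name__ == "__main__":
    stripe = encode_2_plus_1(b"IME-data", b"fragment")
    damaged = [stripe[0], None, stripe[2]]  # the server holding block 1 is gone
    print(rebuild_missing(damaged) == stripe)
```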


Parity 2+1, DHT copies 3


Takeaways

Hardware/ROI

► Simple server nodes with single-port NVMe drives

► Device bandwidth aligned with network bandwidth

► Better endurance: reduced write amplification

Manageability

► User Space: simple installation

► Integration with open source tools

► Easy to monitor

► Automated tiering

► Flexible erasure coding scheme

► Node Fault Tolerant

Performance

► Close to peak BW even with single shared file

► Low random read latency

► Dynamic Load Balancers

► Declustered rebuild


Q&A

Feature suggestions

► Multi BFS

► Encryption

► Data compression

► Multi network

► ...


Thank You!

Keep in touch with us

[email protected]

@ddn_limitless

company/datadirect-networks

9351 Deering Avenue, Chatsworth, CA 91311

1.800.837.2298

1.818.700.4000


Backup Slides

POSIX Compliance (1/2)

► Introducing PJD test POSIX Compliance suite

► 8789 tests in 17 categories

► Open source: https://github.com/pjd/pjdfstest.git


Backup Slides

POSIX Compliance (2/2)


Backup Slides

IME240 Block diagram


DDN Storage | ©2018 DataDirect Networks, Inc.

IME1.2 FAULT TOLERANCE

► 4xIME240 with parity=2+1 dhtcopy=3

► Device/Server failures are transparent for the application

► Automatic data rebuild with no service interruption

► Native De-Clustered Distributed Erasure Coding ensures fast rebuild


IME1.2 FAULT TOLERANCE

[Figure: timeline with four numbered events. Labels: I/O write intensive job startup; Server 3 fails with 1TB data; Data Rebuild Zone (~3 mins); Normal Service Resumed; Continued Production.]


IME1.2 FAULT TOLERANCE

Continued Production

► Even after a single node failure, the rebuilt data are still protected against further failures:

• 3 failing devices on surviving servers

• 2nd node failing
