SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation...

33
SDS: A Framework for Scientific Data Services Bin Dong, Suren Byna*, John Wu Scientific Data Management Group Lawrence Berkeley National Laboratory

Transcript of SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation...

Page 1: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

SDS: A Framework for Scientific Data Services

Bin Dong, Suren Byna*, John Wu Scientific Data Management Group Lawrence Berkeley National Laboratory

Page 2: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Finding news articles of your interest in the old era of print newspaper •  Turn pages •  Read headlines •  Searching for specific news was

time consuming

§  Online newspapers •  Search box •  Limited to one newspaper

§  Customized News •  Articles are organized based on

reader’s interest •  Search numerous online

newspapers

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 2

Finding Newspaper Articles of Interest

Page 3: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Finding news articles of your interest in the old era of print newspaper •  Turn pages •  Read headlines •  Searching for specific news was

time consuming

§  Online newspapers •  Search box •  Limited to one newspaper

§  Customized News •  Articles are organized based on

reader’s interest •  Search numerous online

newspapers

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 3

Finding Newspaper Articles of Interest

Page 4: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Finding news articles of your interest in the old era of print newspaper •  Turn pages •  Read headlines •  Searching for specific news was

time consuming

§  Online newspapers •  Search box •  Limited to one newspaper

§  Customized News •  Articles are organized based on

reader’s interest •  Search numerous online

newspapers

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 4

Finding Newspaper Articles of Interest

Page 5: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Finding news articles of your interest in the old era of print newspaper •  Turn pages •  Read headlines •  Searching for specific news was

time consuming

§  Online newspapers •  Search box •  Limited to one newspaper

§  Customized News •  Articles are organized based on

reader’s interest •  Search numerous online

newspapers

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 5

Finding Newspaper Articles of Interest

àBetter organization, faster access to news

Page 6: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Modern scientific discoveries are driven by massive data •  Scientific simulations and

experiments in many domains produce data in the range of terabytes and petabytes

§  Fast data analysis is an important requirement for discovery •  Data is typically stored as files

on disks that are managed by file systems

•  High data read performance from storage is critical

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 6

Scientific Discovery needs Fast Data Analysis

Image Credits http://www.science.tamu.edu/articles/911 http://www.lsst.org/lsst/gallery/data http://nsidc.org/icelights/category/research-2/

LHC: Higgs Boson search LSST image simulator

NCAR’s CESM data

Light source experiments at LCLS, ALS, SNS, etc.

Page 7: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

Shared storage

Shared storage

Exascale Simulation Machine + analysis

Archive

Parallel Storage

Simulation Site

Analysis Machine

Analysis Machine Analysis Machines

Experiment/observation Site

Analysis Sites

Experiment/observation Processing Machine

Archive

(Parallel) Storage

Shared storage

Need to reduce EBs and PBs of data, and move only TBs from simulation sites

Perform some data analysis on exascale machine (e.g. in situ pattern identification)

Reduce and prepare data for further exploratory Analysis (e.g., Data mining)

A Generic Data Analysis Use Case

11/18/2013 7

Page 8: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Data stored on file systems is immutable •  After data producers (simulations and

experiments) store data, the file format/organization is unchanged over time

•  Data stored in files, which is treated by the system as a sequence of bytes

•  User code (and libraries) has to impose meaning of the file layout and conduct analyses

•  Burden of optimizing read performance falls on application developer

•  However, optimizations are complex due to several layers of parallel I/O stack

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 8

Current Practice: Immutable Data in File Systems

Applications

High-level I/O libraries and formats

High-level I/O libraries and formats

I/O Optimization Layers

Parallel File System

Storage Hardware

Parallel I/O Software Stack

Page 9: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Separation of logical and physical data organization through reorganization

§  Example •  Access all variables where 1.15

< Energy < 1.4 •  Full scan of data or random

accesses to non-contiguous data locations results in poor performance

•  Sorting the data and accessing contiguous chunks of data leads to good performance

§  Several studies showing reorganization can help •  Listed some in the related work

section

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 9

Changing layout of data on file systems is helpful

Energy X Y Z … 1.1692 165.834 -28.0448 -2.76863 …

2.5364 166.249 -27.9321 -2.87165 … 1.4364 166.538 -27.9735 -3.16969 … 1.0862 166.663 -27.9769 -2.79716 … 1.2862

166.621 -27.9046 -2.78382 …

1.5862

167.001 -27.9366

-3.09605

1.3173 167.222

-27.9551 -3.22406 …

Page 10: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Separation of logical and physical data organization through reorganization

§  Example •  Access all variables where 1.15

< Energy < 1.4 •  Full scan of data or random

accesses to non-contiguous data locations results in poor performance

•  Sorting the data and accessing contiguous chunks of data leads to good performance

§  Several studies showing reorganization can help •  Listed some in the related work

section

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 10

Changing layout of data on file systems is helpful

Energy X Y Z … 1.1692 165.834 -28.0448 -2.76863 …

2.5364 166.249 -27.9321 -2.87165 … 1.4364 166.538 -27.9735 -3.16969 … 1.0862 166.663 -27.9769 -2.79716 … 1.2862

166.621 -27.9046 -2.78382 …

1.5862

167.001 -27.9366

-3.09605 …

1.3173 167.222

-27.9551 -3.22406 …

Page 11: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Separation of logical and physical data organization through reorganization

§  Example •  Access all variables where 1.15

< Energy < 1.4 •  Full scan of data or random

accesses to non-contiguous data locations results in poor performance

•  Sorting the data and accessing contiguous chunks of data leads to good performance

§  Several studies showing reorganization can help •  Listed some in the related work

section

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 11

Changing layout of data on file systems is helpful

Energy X Y Z …

1.0862 166.663 -27.9769 -2.79716 …

1.1692 165.834 -28.0448 -2.76863 …

1.2862 166.621 -27.9046 -2.78382 … 1.3173 167.222 -27.9551 -3.22406 … 1.5862 167.001 -27.9366 -3.09605 … 1.4364 166.538 -27.9735 -3.16969 … 2.5364 166.249 -27.9321 -2.87165 …

Page 12: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Database management systems perform various optimizations without user involvement

§  Many efforts from database community reach out to manage scientific data

•  Objectivity/DB, Array QL, SciQL, SciDB, etc.

§  SciDB •  A database system for scientific applications •  Key features:

–  Array-oriented data model –  Append-only storage –  First-class support for user-defined functions –  Massively parallel computations

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 12

Database Systems for Scientific Data Management

Page 13: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Database systems use various optimization tricks and perform them automatically

•  Cache frequently accessed data •  Use indexes •  Store materialized views •  …

§  However, preparing and loading of scientific data to fit into database schemas is cumbersome

•  Scientific data management requirements are different from DBMS •  Established data formats (HDF5, NetCDF, ROOT, FITS, ADIOS-BP, etc.) •  Arrays are first class citizens •  Highly concurrent applications

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 13

More DB Optimizations, but …

Page 14: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Preserving scientific data formats and adding database optimizations as services

§  A framework for bringing the merits of database technologies to scientific data

§  Examples of services •  Indexing •  Sorting •  Reorganization to improve data

locality •  Compression •  Remote and selective data

movement •  …

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 14

Our Solution: Scientific Data Services (SDS)

Page 15: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Finding an optimal data organization strategy •  Capture common read patterns •  Estimate cost with various data organization strategies •  Rank data organizations for each pattern and select the best for a pattern

§  Performing data reorganization •  Similar to database management systems, perform data reorganization

transparently •  Resolving permissions to the original data •  Preserve the original permissions when data is reorganized

§  Using an optimally reorganized dataset transparently •  Based on a read pattern, recognize the best available organization of data •  Redirect to read the selected organization •  Perform any needed post processing, such as decompression, transposition

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 15

First step: Automatic Data Reorganization

Page 16: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Initial implementation of the SDS framework •  Client-server approach •  SDS Client library for each MPI process •  SDS Server to find the best dataset from available reorganized datasets •  Supporting HDF5 Read API •  Added SDS Query API for answering range queries

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 16

The Design of the SDS Framework

Network(

MPI(Process(0(

Applica4on(

HDF5(API(

SDS(Client(

Parallel&File&System&&SDS(Server(

Database( Batch(System(

MPI(Process(N(

Applica4on(

HDF5(API(

SDS(Client(

SDS(Query(API( SDS(Query(API(

Page 17: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 17

SDS Client Implementation

Parser&

Reader&

Post&Processor&

SDS&Query&Interface&

File&Handle&and&Query&String&&

Original&File&Metadata&or&&Reorganized&File&Metadata&

Data&

Requested&Data&

Server&Connector&

SDS&&Server&

ApplicaAon&

Parallel&File&System&

SDS&&Client&HDF5&API& File&Handle&and&Read&Parameters&

File&Handle&and&Final&Query&String&

HDF5 API

•  The HDF5 Virtual Object Layer (VOL) feature allows capturing HDF5 calls

•  Developed a VOL plugin for SDS for capturing file open, read, and close functions

Page 18: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 18

SDS Client Implementation

Parser&

Reader&

Post&Processor&

SDS&Query&Interface&

File&Handle&and&Query&String&&

Original&File&Metadata&or&&Reorganized&File&Metadata&

Data&

Requested&Data&

Server&Connector&

SDS&&Server&

ApplicaAon&

Parallel&File&System&

SDS&&Client&HDF5&API& File&Handle&and&Read&Parameters&

File&Handle&and&Final&Query&String&

SDS Query Interface •  An interface to perform

SQL-style queries on arrays

•  Function-based API that can be used from C/C++ applications

Page 19: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 19

SDS Client Implementation

Parser&

Reader&

Post&Processor&

SDS&Query&Interface&

File&Handle&and&Query&String&&

Original&File&Metadata&or&&Reorganized&File&Metadata&

Data&

Requested&Data&

Server&Connector&

SDS&&Server&

ApplicaAon&

Parallel&File&System&

SDS&&Client&HDF5&API& File&Handle&and&Read&Parameters&

File&Handle&and&Final&Query&String&

Parser •  Checks the conditions in a

query •  Verifies the validity of files

Page 20: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 20

SDS Client Implementation

Parser&

Reader&

Post&Processor&

SDS&Query&Interface&

File&Handle&and&Query&String&&

Original&File&Metadata&or&&Reorganized&File&Metadata&

Data&

Requested&Data&

Server&Connector&

SDS&&Server&

ApplicaAon&

Parallel&File&System&

SDS&&Client&HDF5&API& File&Handle&and&Read&Parameters&

File&Handle&and&Final&Query&String&

Server Connector •  Packages a query or

HDF5 read call information and sends to the SDS server

•  Using protocol buffers for communication

•  MPI Rank 0 communicates with the server and then informs the remaining MPI processes

Page 21: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 21

SDS Client Implementation

Parser&

Reader&

Post&Processor&

SDS&Query&Interface&

File&Handle&and&Query&String&&

Original&File&Metadata&or&&Reorganized&File&Metadata&

Data&

Requested&Data&

Server&Connector&

SDS&&Server&

ApplicaAon&

Parallel&File&System&

SDS&&Client&HDF5&API& File&Handle&and&Read&Parameters&

File&Handle&and&Final&Query&String&

Reader •  Reads data from the

dataset location returned by the SDS Server

•  Makes use of the native HDF5 read calls

Page 22: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 22

SDS Client Implementation

Parser&

Reader&

Post&Processor&

SDS&Query&Interface&

File&Handle&and&Query&String&&

Original&File&Metadata&or&&Reorganized&File&Metadata&

Data&

Requested&Data&

Server&Connector&

SDS&&Server&

ApplicaAon&

Parallel&File&System&

SDS&&Client&HDF5&API& File&Handle&and&Read&Parameters&

File&Handle&and&Final&Query&String&

Post Processor •  Performs any post-

processing needed before returning the data to the application memory

•  Eg. Decompression, transposition, etc.

Page 23: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 23

SDS Server Implementation

SDS#Client#

Request#Dispatcher#

Data#Organiza6on#Recommender#

Data#Reorganizer#

SDS#Metadata#Manager#

Query#Evaluator#

Reorganiza6on#Evaluator#

Client#Request#

Query#Request#

Query#Response#

Organiza6on#List#

Read#Sta6s6cs#

SDS#Metadata#

SDS#Metadata#

Frequently#Read#File#Handle#and#its#SDS#Metadata#

ACribute#of##Reorganized#file#

File#Handle##and##Reorganiza6on#Type#

Reorganiza6on##Job##Running#

Parallel#File#System#

Original#File##

Reorganized#File#

Job#Script#

Job#Results#

SDS#Metadata#

Organiza6on#List#

File#Data#

SDS#Server#

Reorganiza6on##Request#

Periodic#SelfJstart#Request#

SDS#Admin#Interface#

User#Reorganiza6on#Request#

Page 24: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 24

SDS Server Implementation

SDS#Client#

Request#Dispatcher#

Data#Organiza6on#Recommender#

Data#Reorganizer#

SDS#Metadata#Manager#

Query#Evaluator#

Reorganiza6on#Evaluator#

Client#Request#

Query#Request#

Query#Response#

Organiza6on#List#

Read#Sta6s6cs#

SDS#Metadata#

SDS#Metadata#

Frequently#Read#File#Handle#and#its#SDS#Metadata#

ACribute#of##Reorganized#file#

File#Handle##and##Reorganiza6on#Type#

Reorganiza6on##Job##Running#

Parallel#File#System#

Original#File##

Reorganized#File#Job#Script#

Job#Results#

SDS#Metadata#

Organiza6on#List#

File#Data#

SDS#Server#

Reorganiza6on##Request#

Periodic#SelfJstart#Request#

SDS#Admin#Interface#

User#Reorganiza6on#Request#

Request Dispatcher

•  Receives SDS client requests and SDS Admin interface

•  SDS Admin interface issues reorganization commands

•  Based on the request, dispatcher passes on the request to Query Evaluator and Reorganization Evaluator

Page 25: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 25

SDS Server Implementation

SDS#Client#

Request#Dispatcher#

Data#Organiza6on#Recommender#

Data#Reorganizer#

SDS#Metadata#Manager#

Query#Evaluator#

Reorganiza6on#Evaluator#

Client#Request#

Query#Request#

Query#Response#

Organiza6on#List#

Read#Sta6s6cs#

SDS#Metadata#

SDS#Metadata#

Frequently#Read#File#Handle#and#its#SDS#Metadata#

ACribute#of##Reorganized#file#

File#Handle##and##Reorganiza6on#Type#

Reorganiza6on##Job##Running#

Parallel#File#System#

Original#File##

Reorganized#File#Job#Script#

Job#Results#

SDS#Metadata#

Organiza6on#List#

File#Data#

SDS#Server#

Reorganiza6on##Request#

Periodic#SelfJstart#Request#

SDS#Admin#Interface#

User#Reorganiza6on#Request#

Query Evaluator

•  Looks up SDS Metadata for finding available reorganized datasets and their locations for a given dataset

•  SDS Metadata •  File name, HDF5

dataset info, permissions

Page 26: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 26

SDS Server Implementation

SDS#Client#

Request#Dispatcher#

Data#Organiza6on#Recommender#

Data#Reorganizer#

SDS#Metadata#Manager#

Query#Evaluator#

Reorganiza6on#Evaluator#

Client#Request#

Query#Request#

Query#Response#

Organiza6on#List#

Read#Sta6s6cs#

SDS#Metadata#

SDS#Metadata#

Frequently#Read#File#Handle#and#its#SDS#Metadata#

ACribute#of##Reorganized#file#

File#Handle##and##Reorganiza6on#Type#

Reorganiza6on##Job##Running#

Parallel#File#System#

Original#File##

Reorganized#File#Job#Script#

Job#Results#

SDS#Metadata#

Organiza6on#List#

File#Data#

SDS#Server#

Reorganiza6on##Request#

Periodic#SelfJstart#Request#

SDS#Admin#Interface#

User#Reorganiza6on#Request#

Reorganization Evaluator

•  Decides whether to reorganize based on the frequency of read accesses

•  Takes commands from the Admin interface

•  Instructs Data Reorganizer to create a reorganization job script

Page 27: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 27

SDS Server Implementation

SDS#Client#

Request#Dispatcher#

Data#Organiza6on#Recommender#

Data#Reorganizer#

SDS#Metadata#Manager#

Query#Evaluator#

Reorganiza6on#Evaluator#

Client#Request#

Query#Request#

Query#Response#

Organiza6on#List#

Read#Sta6s6cs#

SDS#Metadata#

SDS#Metadata#

Frequently#Read#File#Handle#and#its#SDS#Metadata#

ACribute#of##Reorganized#file#

File#Handle##and##Reorganiza6on#Type#

Reorganiza6on##Job##Running#

Parallel#File#System#

Original#File##

Reorganized#File#Job#Script#

Job#Results#

SDS#Metadata#

Organiza6on#List#

File#Data#

SDS#Server#

Reorganiza6on##Request#

Periodic#SelfJstart#Request#

SDS#Admin#Interface#

User#Reorganiza6on#Request#

Data Organization Recommender

•  Identifies optimal data

reorganization •  Informs the

Reorganization Evaluator with the selected strategy

Page 28: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 28

SDS Server Implementation

SDS#Client#

Request#Dispatcher#

Data#Organiza6on#Recommender#

Data#Reorganizer#

SDS#Metadata#Manager#

Query#Evaluator#

Reorganiza6on#Evaluator#

Client#Request#

Query#Request#

Query#Response#

Organiza6on#List#

Read#Sta6s6cs#

SDS#Metadata#

SDS#Metadata#

Frequently#Read#File#Handle#and#its#SDS#Metadata#

ACribute#of##Reorganized#file#

File#Handle##and##Reorganiza6on#Type#

Reorganiza6on##Job##Running#

Parallel#File#System#

Original#File##

Reorganized#File#Job#Script#

Job#Results#

SDS#Metadata#

Organiza6on#List#

File#Data#

SDS#Server#

Reorganiza6on##Request#

Periodic#SelfJstart#Request#

SDS#Admin#Interface#

User#Reorganiza6on#Request#Data Reorganizer

•  Locates reorganization

code, such as sorting, indexing algorithms

•  Decides on the number of cores to use

•  Prepares a batch job script

•  Monitors the job execution

•  After reorganization, stores the new data location in the SDS Metadata Manager

Page 29: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  NERSC Hopper supercomputer •  6,384 compute nodes •  2 twelve-core AMD 'MagnyCours' 2.1-GHz processors per node •  32 GB DDR3 1333-MHz memory per node •  Employs the Gemini interconnect with a 3D torus topology •  Lustre parallel file system with 156 OSTs at a peak BW of 35 GB/s

§  Software •  Cray’s MPI library •  HDF5 Virtual Object Layer code base

§  Data •  Particle data from a plasma physics simulation (VPIC), simulating

magnetic reconnection phenomenon in space weather

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 29

Performance Evaluation: Setup

Page 30: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  SDS Metadata Manager uses Berkeley DB for storing metadata of reorganized data

§  Measured response time of multiple concurrent SDS Clients requesting metadata from one SDS Server

§  Low overhead of < 0.5 seconds with 240 concurrent clients

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 30

Performance Evaluation: Overhead of SDS

0"

0.2"

0.4"

0.6"

0.8"

1"

40" 80" 120" 160" 200" 240" 280" 320"

Time"(sec)"

Number"of"Concurent"Clients"

HDF5%Open+Close% SDS%Metadata%Read%

Page 31: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  Ran various queries on particle energy conditions

§  Performance benefit over full scan varies based on selectivity of data

§  20X to 50X speedup when data selectivity less than 5%

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 31

Performance Evaluation: Benefits of SDS

1.2X%

2.3X%

8.1X% 18.9X% 25.7X% 36.3X% 42.0X% 44.8X% 51.2X%0.00%20.00%40.00%60.00%80.00%100.00%120.00%140.00%

E>1.10% E>1.15% E>1.20% E>1.25% E>1.30% E>1.35% E>1.40% E>1.45% E>1.50%

Tim

e%(sec)%

Query%

Full%Data%Scan% Read%Reorganized%Data%

Selectivity: 79% 32% 11% 4.6% 2.2% 1.2% 0.75% 0.5% 0.33%

Page 32: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

§  File systems on current supercomputers are analogous to print newspapers

§  The SDS framework is vehicle for applying various optimizations •  Live customization of data organization based on data usage (in progress) •  To work with existing scientific data formats

§  Status: Platform for performing various optimizations is ready

§  Low overhead and high performance benefits §  Ongoing and future work:

•  Dynamic recognition of read patterns •  Model-driven data organization optimizations •  In-memory query execution and query plan optimization •  FastBit indexing to support querying •  Support for more data formats

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 32

Conclusion and Future Work

Page 33: SDS: A Framework for Scientific Data Services · PBs of data, and move only TBs from simulation sites Perform some data analysis on exascale machine (e.g. in situ pattern identification)

11/18/2013 Computational Research Division | Lawrence Berkeley National Laboratory | Department of Energy 33

Thanks!

Contact: Suren Byna [[email protected]]

Thanks to: