Introduction to Scientific Data Management

damien.francois@uclouvain.beOctober 2015

http://www.cism.ucl.ac.be/training

Goal of this session:

“Share tools, tips and tricks related to the storage, transfer, and sharing

of scientific data”

Data storageBlock – File – Object

Databases

Data storage

Medium (Flash, Hard Drives, Tapes, DRAM)

RAID JBODErasure coding

software RAID

Local filesystem Block storage

Attachment (IDE, SAS, SATA, iSCSI, ATAoE, FC)

RDBMSObj store

Global filesystemNoSQL

Schema Serialization (file formats, etc)

Storage Medium Technologies

http://www.slideshare.net/IMEXresearch/ss-ds-ready-for-enterprise-cloud

Storage performances

http://www.slideshare.net/IMEXresearch/ss-ds-ready-for-enterprise-cloud

Storage safety (RAID)

https://en.wikipedia.org/wiki/Standard_RAID_levels

Data storage

Medium (Flash, Hard Drives, Tapes, DRAM)

RAID JBODErasure coding

software RAID

Local filesystem Block storage

Attachment (IDE, SAS, SATA, iSCSI, ATAoE, FC)

RDBMSObj store

Global filesystemNoSQL

Schema Serialization (file formats, etc)

Storage abstraction levels

(local) Filesystems

http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/

Generation 0: No system at all. There was just an arbitrary stream of data. Think punchcards, data on audiocassette, Atari 2600 ROM carts.

Generation 1: Early random access. Here, there are multiple named files on one device with no folders or other metadata. Think Apple ][ DOS (but not ProDOS!) as one example.

Generation 2: Early organization (aka folders). When devices became capable of holding hundreds of files, better organization became necessary. We're referring to TRS-DOS, Apple //c ProDOS, MS-DOS FAT/FAT32, etc.

Generation 3: Metadata—ownership, permissions, etc. As the user count on machines grew higher, the ability to restrict and control access became necessary. This includes AT&T UNIX, Netware, early NTFS, etc.

Generation 4: Journaling! This is the killer feature defining all current, modern filesystems—ext4, modern NTFS, UFS2, you name it. Journaling keeps the filesystem from becoming inconsistent in the event of a crash, making it much less likely that you'll lose data, or even an entire disk, when the power goes off or the kernel crashes.

Generation 5: Copy on Write snapshots, Per-block checksumming, Volume management, Far-future scalability, Asynchronous incremental replication, Online compression. Generation 5 filesystems are Btrfs and ZFS.

Network filesystem

NAS: ex. NFS SAN: ex. GFS

One source many consumers

Pictures from https://www.redhat.com/magazine/008jun05/features/gfs_nfs/

Parallel filesystem

ex: Lustre, GPFS, BeeGeeFS GlusterFSMany sources many consumers

Pictures from https://www.redhat.com/magazine/008jun05/features/gfs_nfs/

Special filesystems – in memory

Filesystems

What filesystem for what usage

● Home (NFS) : Small size, Small I/Os

● Global scratch (parallel FS) : Large size, Large I/Os

● Local scratch (local FS): Medium size, Large I/Os

● In-memory (tmpfs): Small Size, Large I/Os

● Mass storage (NFS); Large size, Small I/Os

Text File Formats – JSON, YML, XML

Text File Formats – CSV,TSV

Binary File Formats – CDF, HDF

http://pro.arcgis.com/en/pro-app/help/data/multidimensional/fundamentals-of-netcdf-data-storage.htm

Binary File Formats – CDF, HDF

https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/

What file format for what usage

● Meta data

– Configuration file: INI, YAML

– Result with context information: JSON● Data

– Small data (kBs): CSV, TSV

– Medium data (MBs): compressed CSV

– Large data (GBs): netCDF, HDF5, DXMF

– Huge data (TBs): Database, Object store (“loss of innocence”)

Use dedicated libraries to write and read them

Object storage

● Object: data (e.g. file) + meta data

● Often built on erasure coding

● Scale out easily

● Useful for web applications

● Examples:

– Openstack Swift

– Ceph

Pictures from http://www.tomsitpro.com/articles/rdbms-sql-cassandra-dba-developer,2-547-2.html

Pictures from http://www.ibm.com/developerworks/library/x-matters8/

SQL: SELECT AuthorName, Title FROM Books JOIN Authors ON Authors.AuthorID=Books.AuthorID

Pictures from http://www.ibm.com/developerworks/library/x-matters8/

● Mostly needed for categorical data and alphanumerical data (not suited for matrices, but good for end-results)

● Indexes make finding a data element is very fast(and computing sums, maxima, etc.)

● Encodes relations between data (constraints, etc)

● Atomicity, Consistency, Isolation, and Durability

When to use?

– when you have a large number of small files

– when you perform a lot of direct writes in a large file

– when you want to keep structure/relations between data

– when software crashes have a non-negligible probability

– when files are update by several processes● When not to use:

– only sequential access

– simple matrices/vectors, etc.

– direct access on fixed-size records and no structure

Data transferfaster and less secure

parallel transfers

scp -c cipher ...

http://blog.famzah.net/2010/06/11/openssh-ciphers-performance-benchmark/

Fastest: No encryption at all

● Need friendly firewall (choose direction accordingly)

● Only over trusted networks

● If rsh is installed: rcp instead of scp

Fastest: No encryption at all

● Need friendly firewall (choose direction accordingly)

● Only over trusted networks

● If rsh is installed: rcp instead of scp

● If rsh is not installed: nc on both ends

Resuming transfers

● When nothing changed but the transfer was interrupted

– size-only: do not perform byte-level file comparison

Resuming transfers

● When nothing changed but the transfer was interrupted

– append: do not re-check partially transmitted files and resume the transfer where it was abandoned assuming first transfer attempt was with scp or with rsync --inplace

Parallel data transfer: bbcp

● Better use of the bandwidth than SCP

● Needs to be installed on both sides (easy to install)

● Needs friendly firewalls

Parallel data transfers: parsync

http://moo.nac.uci.edu/~hjm/parsync/

Parallel data transfers: sbcast

Transferring ZOT files

● Zillions Of Tiny files

● More meta-data than data → large overhead for rsync

● Solution: Pre-tar or tar on the fly

● Needs friendly firewall

● Also avoid 'ls' and '*' as they sort the output. Favor 'find'

Data sharingwith other users (Unix permissions, Encryption)

with external users (Owncloud)

Data sharing

Data sharing with other users

Sharing with all other users

Sharing with the group

Sharing and hiding

Sharing and encrypting

Data sharing

Data sharing with external users

● owncloud

CISM login

Dropbox-like

External SFTP connectors

Dropbox-like

My home on Manneback

Can create a share URL

And distribute it

Summary:

Storage: choose the right filesystem and the right file format

Transfer: use the parallel tools when possible and limit encryption in favor of throughput

Sharing: use all the potential of the UNIX permissions and try Owncloud

Introduction to Scientific Data Management · – Result with context information: JSON Data –...

Transcript of Introduction to Scientific Data Management · – Result with context information: JSON Data –...

Introduction to Scientific Data Management · – Result with context information: JSON Data –...

Documents

Transcript of Introduction to Scientific Data Management · – Result with context information: JSON Data –...

TSV Ohmden e.V.

Saving to .csv for CVB Statistics Data Formats to .csv for CVB Statistics Data Formats CVB Statistics June 2, 2011 1 Nomenclature: File names and variables CSV tables created from

creator-static.line.me · (20.42%) 2015.07.27 2015.07.27 2015.07.2 2015.06.25 2015.03.13 8, 8, PayPal PayPal PayPal PayPal PayPal csv csv csv csv csv PDF PDF PDF PDF 2015.09.10

Historical Option Data in CSV and SQL Formats

TSV - YUDO EU

How to Open the CSV Data in MapInfo

Tutorial: Exporting Open Green Map Data as CSV and KML ...

341 TSV Green

Catalogo tsv

Texas Ethics Commission Public Access CSV Data Flat Format · Texas Ethics Commission – Public Access CSV Data – Flat Format Page | 1 Doc Date: 07/31/2015 The CSV file contains

TSV 158 - economy.gov.mk

User Sync Client Guidelines - BlackBerry...capabilities for LDAP and CSV data sources. Capability LDAP data source CSV data source User data Yes Yes Organization hierarchy structure

Quantum ViewSM Data Subscription Comma Separated …Comma-Separated Value (CSV) Files – Description The type of records included in your Quantum View Data CSV files is dependent

CSV Considerations for Data Integrity - Rx-360 · 2016. 3. 3. · CSV Considerations Around Data Integrity Data integrity is a current hot topic, but not a new one, within the life

SESSION 14 Train Personnel on CSV and Data Integrity ... 14 Train Personnel on CSV and Data Integrity Compliance Chris Wubbolt –QACV Consulting Denise Botto –inVentiv Health April

CSV Considerations for Data Integrity - Rx-360 …rx-360.org/wp-content/uploads/CSV-Considerations-Around-Data... · CSV Considerations Around Data Integrity Data integrity is a current

e-data - Documentation - LGTverwaltung... · 2.8 Adjusting the delivery frequency of LGT Standard data (csv/xml) The delivery frequency of LGT Standard files (csv/xml) changes with

csv,conf 2014 - Open data within organizations

MomentumPro V3.1 MomentumPro XML Transform Guide Guide.pdf · familiar CSV data formats and then convert CSV data into the required XML data file. Using this tool requires no specialist

Viewing CSV Data Files in Microsoft Excel spreadsheet ...download.flukecal.com/pub/literature/3392856B_Viewing...2 Fluke Calibration Viewing CSV Data Files in Microsoft Excel® spreadsheet