Versioning of Data Sets: Why, How, What and Where?

Lesley Wyborn, NCI
Jens Klump, CSIRO
Adrian Burton, ANDS

These slides are available at: http://bit.ly/2dDmXHE


Outline

● Group introductions (Lesley - facilitator, ~2 mins)
● Why - the growing need for data versioning (Lesley, 5 mins)
● How and what (Jens, 5 mins)
● Where (current practices) (Adrian, 5 mins)
● Two case studies
  ○ AAL (Yeshe, 5 mins)
  ○ IMOS (Natalia, 5 mins)
● Group discussions (Lesley/Jens, 10 mins, Flipchart - Adrian)
● Summary and next steps (Jens, 5 mins)

© National Computational Infrastructure 2016

Why: the Growing Need for Data Versioning

Lesley Wyborn
National Computational Infrastructure, ANU


Why is Versioning Important in eResearch today?

• Historically we have moved on from the traditional ‘book on the shelf model’ for datasets:

– A single researcher/individual research team collected all data used in each research paper

– Versioning was straightforward: there was only one data set that was unique to that paper

eResearch Australasia 2016: BoF on Versioning

Source: http://commons.wikimedia.org/wiki/File:Shelves_of_Language_Books_in_Library.JPG
Source: http://en.wikipedia.org/wiki/Library_catalog#/media/File:Schlagwortkatalog.jpg


We have moved on to the era of sharing and reusing data

• Team-based research is now more common, and it often uses publicly available data sets that can be reused or repurposed to support new research directions

• Data can be sourced from national/institutional repositories (49 PB in Australia in RDS):

– Many of these data sets are being continually added to and/or revised

– Data are being copied between repositories or to local sites

– Colocation with HPC/cloud means new data products can be created in very short time frames

• Some of the more mature data centers offer web services access:

– Enables users to dynamically select subsets based on spatial and/or temporal queries

– Rarely are queries the same, particularly where the user graphically draws a spatial bounding box to define the area of interest


Modern eResearch Data Conundrum

• It is now much harder for a researcher to cite the exact data extract that was used to support a research project or government investigation

– Particularly if the source data set is being dynamically modified

– And/or data are accessed via dynamic queries

• With increasing duplication of data sets between the main data repositories and local stores it is also getting harder to identify the canonical or point of truth data set
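One way to make a dynamically selected extract citable, sketched here under stated assumptions, is to record the query, the time it was run, and a checksum of the rows returned. The function name `fingerprint_extract`, the query fields and the record layout are hypothetical and are not taken from any repository's API.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint_extract(query: dict, rows: list, executed_at: datetime) -> dict:
    """Return a citable record for one dynamic query result (hypothetical layout)."""
    # Serialise with sorted keys so the same query always gives the same text.
    query_text = json.dumps(query, sort_keys=True)
    # Sort rows so the checksum does not depend on the order the service returned them.
    rows_text = json.dumps(sorted(rows))
    return {
        "query": query_text,
        "executed_at": executed_at.astimezone(timezone.utc).isoformat(),
        "result_sha256": hashlib.sha256(rows_text.encode("utf-8")).hexdigest(),
    }


# Illustrative values only: a bounding-box query and two returned rows.
query = {"bbox": [110.0, -45.0, 155.0, -10.0], "start": "2015-01-01", "end": "2015-12-31"}
rows = [["2015-03-01", 21.4], ["2015-02-01", 22.9]]
record = fingerprint_extract(query, rows, datetime(2016, 10, 10, tzinfo=timezone.utc))
```

Comparing `result_sha256` values is also one possible check of whether a local copy still matches the copy held by the source repository.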

Source: http://generator-meme.com/meme/sad-tiger/


Basic versioning requirements

• Agreed procedures for versioning data sets and derived data products in a systematized way so that it is possible to reference the exact version of the data that was used:

a. to underpin the research findings and/or

b. to generate higher level data products

• When do we attach persistent identifiers to datasets?

a. Do we associate persistent identifiers with particular versions of each data set?

b. If a data set is constantly changing, how do we determine when a new version is declared?

c. Who assigns the PID when the data are generated and stored on a 3rd party site?


Source: http://www.fanpop.com/clubs/save-the-tigers/images/8696291/title/tigers-wallpaper


Satellite Data Use Case


• 857,000 Landsat source scenes in this data set (~52 × 10^12 pixels)

• It is constantly being added to

• Historical errors found in a few scenes (going back >30 years)

• New data products are constantly being derived and older versions are being revised


I AM THE ONE assigning THE DOI to THIS data set


Versioning: next steps?

• Do we agree that we need to be concerned about versioning or do we put it on ice?

• How should we approach versioning – Jens?


Source: http://www.latimes.com/world/la-fg-c1-china-siberian-tiger-20131001-dto-htmlstory.html

Where?

Software, e.g. 2.1.5

● Major: non backward compatible
○ Minor: backward compatible new functionality
■ Patch: backward compatible bug fix

Social conventions to a point
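The Major.Minor.Patch convention for software can be sketched as follows. This is a minimal illustration; the `Version` class and the change labels `"breaking"`, `"feature"` and `"bugfix"` are hypothetical names chosen for this example.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Expects exactly three dot-separated integers, e.g. "2.1.5".
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def bump(self, change: str) -> "Version":
        if change == "breaking":   # non backward compatible
            return Version(self.major + 1, 0, 0)
        if change == "feature":    # backward compatible new functionality
            return Version(self.major, self.minor + 1, 0)
        if change == "bugfix":     # backward compatible bug fix
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change type: {change!r}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


current = Version.parse("2.1.5")
```

Because `Version` is a tuple, versions compare numerically field by field, so 2.10.0 sorts after 2.9.0, which plain string comparison would get wrong.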

Data?

● Any significant change

v.1, v.2

Data?

● Major: change of scope, context or intended use of data
○ Minor: QA updates

e.g. 1.2

Data? NASA

● Level 0: ‘raw’ data from the satellite.
● Level 1: calibrated and geolocated, with original sampling pattern.
● Level 2: converted into geophysical parameters with the original sampling pattern.
● Level 3: resampled, averaged over space, interpolated/averaged over time.

Data? AIMS Weather Station Data

● Level 0: raw unprocessed data as received from the AWS (automatic weather station). No QA.
● Level 1: all suspect data points removed but no suspect data are corrected.
● Level 2: all suspect data points corrected where possible.
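The three levels can be illustrated with a short sketch. The readings, the `suspect` flags and the correction rule (mean of the two immediate neighbours) are assumptions made for illustration; the slide does not say how AIMS flags or corrects data.

```python
from typing import List, Optional


def to_level1(values: List[float], suspect: List[bool]) -> List[Optional[float]]:
    """Level 1: replace suspect points with None; nothing is corrected."""
    return [None if bad else v for v, bad in zip(values, suspect)]


def to_level2(level1: List[Optional[float]]) -> List[Optional[float]]:
    """Level 2: fill a gap with the mean of its immediate neighbours where both exist.

    The correction rule is a placeholder assumption, not the AIMS procedure.
    """
    out = list(level1)
    for i, v in enumerate(level1):
        if v is None and 0 < i < len(level1) - 1:
            before, after = level1[i - 1], level1[i + 1]
            if before is not None and after is not None:
                out[i] = (before + after) / 2
    return out


level0 = [24.0, 24.5, 99.0, 25.5, -50.0]   # raw readings, illustrative only
flags = [False, False, True, False, True]  # hypothetical QA flags
level1 = to_level1(level0, flags)
level2 = to_level2(level1)
```

The last point stays missing at Level 2 because it has no neighbour on one side, which matches the "where possible" wording.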

Data? IMOS netCDF Conversion

5 levels combining the levels of quality control and the levels of scientific interpretation

Raw data -> Knowledge products

Data? DOI? CSIRO Astronomy

New version = new metadata page = new DOI

● Relation to other versions
● How different?
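A minimal sketch of "new version = new DOI" with explicit links between versions is given below. The relation names `IsNewVersionOf` and `IsPreviousVersionOf` are borrowed from the DataCite metadata schema's relation types; whether CSIRO records the relation this way is not stated in the slides. The `VersionRecord` class and the DOIs are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VersionRecord:
    doi: str                 # placeholder DOIs below, not real identifiers
    version: str
    change_summary: str      # answers "How different?"
    relations: List[Tuple[str, str]] = field(default_factory=list)


def release_new_version(previous: VersionRecord, doi: str, version: str,
                        change_summary: str) -> VersionRecord:
    """Create a record for the new version and link it to the previous one in both directions."""
    new = VersionRecord(doi, version, change_summary)
    new.relations.append(("IsNewVersionOf", previous.doi))
    previous.relations.append(("IsPreviousVersionOf", new.doi))
    return new


v1 = VersionRecord("10.99999/example.v1", "1", "initial release")
v2 = release_new_version(v1, "10.99999/example.v2", "2", "recalibrated 12 observations")
```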

Versioning of Data Sets: IMOS/AODN Approach

Natalia Atkins, Sebastien Mancini, Roger Proctor

What is IMOS Data?
– Most is dynamic
– New data is continuously added
– Existing data can be modified or updated

File type
• File based (e.g. NetCDFs)
• Databases (e.g. Animal tracking)
• Other (AUV images, acoustic recordings ….)

Data access
• Accessed via web services (WMS, WFS and WPS)
• THREDDS
• Direct data download (from our S3 storage)

No formal data versioning.

We ask the following: The citation in a list of references is: "IMOS [year-of-data-download], [Title], [data-access-URL], accessed [date-of-access]."
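The citation template above can be filled in programmatically. In this sketch the function name `imos_citation`, the example title and the example URL are hypothetical. The ISO date format and the use of the access year as the download year are also assumptions, since the template does not specify either.

```python
from datetime import date


def imos_citation(title: str, data_access_url: str, accessed: date) -> str:
    """Fill the IMOS citation template; date format is an assumption (ISO 8601)."""
    return (
        f"IMOS {accessed.year}, {title}, {data_access_url}, "
        f"accessed {accessed.isoformat()}."
    )


# Hypothetical title and URL, for illustration only.
citation = imos_citation("Example data set", "https://example.org/data", date(2016, 10, 11))
```

Note that such a citation records when the data were downloaded, but not which state of a dynamic data set was retrieved.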

State of play in terms of DOI

• Set up to manually mint DOIs.

• We have been asked for DOIs for 2 AODN datasets (Australian Phytoplankton Database, and Glider Climatology product), and have archived 2 static versions.

Current approaches and possibilities

Data stored on Amazon S3 (object storage)
• Since March 2016
• Use of versioning feature
• All previous versions kept except for satellite data

Web services (WMS, WFS, WPS using GeoServer)
• Currently not storing user queries
• Most of the RDA’s recommendations are achievable for small datasets

NetCDFs
• History is captured within the file
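NetCDF files following the CF conventions can carry a global `history` attribute in which each processing step adds a timestamped line; this is presumably what "history is captured within the file" refers to. The sketch below works on the attribute's text only, so it runs without a NetCDF library; reading the attribute from a file and writing it back are left out. The function name and the command strings are hypothetical.

```python
from datetime import datetime, timezone


def append_history(history: str, command: str, when: datetime) -> str:
    """Return the history text with one new timestamped line appended."""
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    entry = f"{stamp}: {command}"
    return f"{history}\n{entry}" if history else entry


# Hypothetical processing steps, for illustration only.
history = append_history("", "convert raw.csv to netCDF", datetime(2016, 10, 11, 2, 30, tzinfo=timezone.utc))
history = append_history(history, "qc --flag-spikes", datetime(2016, 10, 12, 4, 0, tzinfo=timezone.utc))
```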

Astronomy Virtual Observatory: data versioning

Yeshe Fenner
Astronomy Australia Ltd

Data models and formats

Data types: images, spectra, image-spectral data cubes, raw visibility data (radio interferometry), catalogues

Data size: ~2 petabytes of optical, >12 petabytes of radio, 100s of terabytes of theory

Data formats/models: FITS, HDF5, PostgreSQL, Hadoop/Spark

Data access: 1) web UI, 2) third-party VO apps, 3) APIs

Data ingest: mostly dynamic, but only released publicly at discrete time points

Data and pipeline versioning

General approach to versioning:

● Data Release: new DOIs for each public data release (manual validation & release process), e.g. _v01 for the first version, _v02 for the second version

● Branches (if applicable): indicating different data processing pipelines to derive same quantity (therefore each property of the astronomical object has its own version)

● Old versions of data/branches remain available, but the current version is the default
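The release-naming scheme and the "current version is the default" rule can be sketched as below. The helper names, the data set name and the dictionary layout are hypothetical, and branches are left out for brevity.

```python
from typing import Dict, List, Optional


def release_label(name: str, release_number: int) -> str:
    """Build a label with a two-digit suffix, following the _v01/_v02 pattern."""
    return f"{name}_v{release_number:02d}"


def resolve(releases: Dict[str, List[int]], name: str,
            release_number: Optional[int] = None) -> str:
    """Return the requested release, or the latest one when none is specified."""
    available = releases[name]
    chosen = max(available) if release_number is None else release_number
    if chosen not in available:
        raise KeyError(f"{name} has no release {chosen}")
    return release_label(name, chosen)


# Hypothetical data set with two public releases.
releases = {"example_catalogue": [1, 2]}
default_label = resolve(releases, "example_catalogue")
first_label = resolve(releases, "example_catalogue", 1)
```

Older releases stay addressable by number, while a request without a number falls through to the most recent one.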