Transcript of Adrian Burton, ANDS · Versioning of Data Sets: Why, How ... · eResearch Australasia 2016: BoF on Versioning
Versioning of Data Sets: Why, How, What and Where?
Lesley Wyborn, NCI
Jens Klump, CSIRO
Adrian Burton, ANDS
These slides are available at: http://bit.ly/2dDmXHE
Outline
● Group introductions (Lesley - facilitator, ~2 mins)
● Why - the growing need for data versioning (Lesley, 5 mins)
● How and what (Jens, 5 mins)
● Where (current practices) (Adrian, 5 mins)
● Two case studies
  ○ AAL (Yeshe, 5 mins)
  ○ IMOS (Natalia, 5 mins)
● Group discussions (Lesley/Jens, 10 mins, flipchart - Adrian)
● Summary and next steps (Jens, 5 mins)
© National Computational Infrastructure 2016
Why: the Growing Need for Data Versioning
Lesley Wyborn
National Computational Infrastructure, ANU
Why is Versioning Important in eResearch today?
• Historically we have moved on from the traditional ‘book on the shelf model’ for datasets:
– A single researcher/individual research team collected all data used in each research paper
– Versioning was straightforward: there was only one data set that was unique to that paper
eResearch Australasia 2016: BoF on Versioning
Source: http://commons.wikimedia.org/wiki/File:Shelves_of_Language_Books_in_Library.JPG Source: http://en.wikipedia.org/wiki/Library_catalog#/media/File:Schlagwortkatalog.jpg
We have moved on to the era of sharing and reusing data
• Modern research is more commonly team-based, and often utilises publicly available data sets that can be reused or repurposed to support new research directions
• Data can be sourced from national/institutional repositories (49 PB in Australia in RDS):
– Many of these data sets are being continually added to and/or revised
– Data are being copied between repositories or to local sites
– Colocation with HPC/cloud means new data products can be created in very short time frames
• Some of the more mature data centers offer web services access:
– Enables users to dynamically select subsets based on spatial and/or temporal queries
– Rarely are queries the same, particularly where the user graphically draws a spatial bounding box to define the area of interest
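One way to make such ad-hoc subset queries re-identifiable is to normalise and hash the query parameters together with an execution timestamp, so the exact extract can be cited even though the underlying data set keeps growing. A minimal sketch (all names, the dataset identifier, and the bounding box are hypothetical):

```python
import hashlib
import json
from datetime import datetime, timezone

def make_query_record(bbox, time_range, dataset_id):
    """Build a stable, citable record for a dynamic subset query.

    Normalises the query parameters, hashes them, and records when the
    query was executed, so the same extract can be re-identified later.
    """
    query = {
        "dataset": dataset_id,
        "bbox": [round(c, 6) for c in bbox],   # normalise coordinate precision
        "time_range": list(time_range),
    }
    digest = hashlib.sha256(
        json.dumps(query, sort_keys=True).encode()   # canonical serialisation
    ).hexdigest()[:16]
    executed = datetime.now(timezone.utc).isoformat()
    return {"query": query, "executed": executed, "query_hash": digest}

# Identical queries produce identical hashes; a different bounding box does not.
a = make_query_record((110.0, -45.0, 155.0, -10.0), ("2015-01-01", "2015-12-31"), "example-scenes")
b = make_query_record((110.0, -45.0, 155.0, -10.0), ("2015-01-01", "2015-12-31"), "example-scenes")
c = make_query_record((115.0, -40.0, 150.0, -12.0), ("2015-01-01", "2015-12-31"), "example-scenes")
assert a["query_hash"] == b["query_hash"]
assert a["query_hash"] != c["query_hash"]
```

The timestamp is kept alongside the hash rather than inside it, so re-running the same query still resolves to the same hash while the record shows when that extract was taken.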
Modern eResearch Data Conundrum
• It is now much harder for a researcher to cite the exact data extract that was used to support a research project or government investigation
– Particularly if the source data set is being dynamically modified
– And/or data are accessed via dynamic queries
• With increasing duplication of data sets between the main data repositories and local stores it is also getting harder to identify the canonical or point of truth data set
Source: http://generator-meme.com/meme/sad-tiger/
Basic versioning requirements
• Agreed procedures for versioning data sets and derived data products in a systematised way, so that it is possible to reference the exact version of the data that was used:
  a. to underpin the research findings, and/or
  b. to generate higher level data products
• When do we attach persistent identifiers to datasets?
  a. Do we associate persistent identifiers with particular versions of each data set?
  b. If data sets are constantly changing, how do we determine when a new version is declared?
  c. Who assigns the PID when the data are generated and stored on a 3rd-party site?
Source: http://www.fanpop.com/clubs/save-the-tigers/images/8696291/title/tigers-wallpaper
Satellite Data Use Case
• 857,000 Landsat source scenes in this data set (~52 × 10¹² pixels)
• It is constantly being added to
• Historical errors found in a few scenes (going back >30 years)
• New data products are constantly being derived, and older versions are being revised
I AM THE ONE assigning THE DOI to THIS data set
Versioning: next steps?
• Do we agree that we need to be concerned about versioning or do we put it on ice?
• How should we approach versioning – Jens?
Source: http://www.latimes.com/world/la-fg-c1-china-siberian-tiger-20131001-dto-htmlstory.html
Software, e.g. 2.1.5
● Major - non-backward-compatible change
  ○ Minor - backward-compatible new functionality
    ■ Patch - backward-compatible bug fix
Social conventions, up to a point
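The MAJOR.MINOR.PATCH convention on this slide can be sketched as a small rule for deciding which component to bump (function and change labels are illustrative, not from any particular library):

```python
def bump(version, change):
    """Bump a MAJOR.MINOR.PATCH version string according to the
    compatibility of the change, per the semantic-versioning convention."""
    major, minor, patch = (int(p) for p in version.split("."))
    if change == "breaking":    # non-backward-compatible change
        return f"{major + 1}.0.0"
    if change == "feature":     # backward-compatible new functionality
        return f"{major}.{minor + 1}.0"
    if change == "fix":         # backward-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

# Starting from the slide's example version 2.1.5:
assert bump("2.1.5", "fix") == "2.1.6"
assert bump("2.1.5", "feature") == "2.2.0"
assert bump("2.1.5", "breaking") == "3.0.0"
```

Note the reset of the lower components on a bigger bump: a new minor version restarts the patch counter at 0, and a new major version restarts both.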
Data? NASA
● Level 0: 'raw' data from the satellite
● Level 1: calibrated and geolocated, with the original sampling pattern
● Level 2: converted into geophysical parameters, with the original sampling pattern
● Level 3: resampled, averaged over space, interpolated/averaged over time
Data? AIMS Weather Station Data
● Level 0: raw unprocessed data as received from the AWS; no QA
● Level 1: all suspect data points removed, but no suspect data are corrected
● Level 2: all suspect data points corrected where possible
Data? IMOS netCDF Conversion
5 levels, combining the levels of quality control with the levels of scientific interpretation
Raw data -> Knowledge products
Data? DOI? CSIRO Astronomy
New version = new metadata page = new DOI
● Relation to other versions
● How different?
What is IMOS Data?
– Most is dynamic
– New data is continuously added
– Existing data can be modified or updated
File type
• File-based (e.g. NetCDFs)
• Databases (e.g. animal tracking)
• Other (AUV images, acoustic recordings, …)
Data access
• Accessed via web services (WMS, WFS and WPS)
• THREDDS
• Direct data download (from our S3 storage)
No formal data versioning.
We ask the following: the citation in a list of references is:
"IMOS [year-of-data-download], [Title], [data-access-URL], accessed [date-of-access]."
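Filled in, the template above might look like the following sketch (the title, URL, and dates are hypothetical placeholders, not a real IMOS record):

```python
def imos_citation(year_of_download, title, url, date_of_access):
    """Fill the bracketed fields of the IMOS reference template quoted above."""
    return f"IMOS {year_of_download}, {title}, {url}, accessed {date_of_access}."

# Hypothetical example values for each template field:
example = imos_citation(2016, "Example Glider Mission",
                        "https://example.org/data", "2016-10-12")
assert example == ("IMOS 2016, Example Glider Mission, "
                   "https://example.org/data, accessed 2016-10-12.")
```

Because there is no formal data versioning, the year of download and date of access together stand in for a version identifier: they record *when* the dynamic data set was read, not which fixed release was used.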
State of play in terms of DOI
• Set up to manually mint DOIs.
• We have been asked for DOIs for 2 AODN datasets (Australian Phytoplankton Database, and Glider Climatology product), and have archived 2 static versions.
Current approaches and possibilities
Data stored on Amazon S3 (object storage)
• Since March 2016
• Use of versioning feature
• All previous versions kept, except for satellite data

Web services (WMS, WFS, WPS using GeoServer)
• Currently not storing user queries
• Most of RDA's recommendations are achievable for small datasets

NetCDFs
• History is captured within the file
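The keep-all-previous-versions behaviour that S3 object versioning provides can be illustrated with a minimal in-memory sketch (class and key names are invented for illustration, not an S3 API):

```python
import itertools

class VersionedStore:
    """Minimal in-memory sketch of S3-style object versioning:
    every put keeps the previous versions under the same key,
    and a plain get returns the latest one by default."""

    def __init__(self):
        self._versions = {}          # key -> list of (version_id, data)
        self._ids = itertools.count(1)

    def put(self, key, data):
        vid = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((vid, data))
        return vid

    def get(self, key, version_id=None):
        versions = self._versions[key]
        if version_id is None:
            return versions[-1][1]          # latest version by default
        return dict(versions)[version_id]   # or an explicit older version

store = VersionedStore()
v1 = store.put("profile.nc", b"original")
store.put("profile.nc", b"corrected")
assert store.get("profile.nc") == b"corrected"      # current data
assert store.get("profile.nc", v1) == b"original"   # superseded data still citable
```

The point of the sketch is that a correction does not overwrite the data a past paper may have cited: the old bytes remain retrievable via their version identifier.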
Data models and formats
Data types: images, spectra, image-spectral data cubes, raw visibility data (radio interferometry), catalogues
Data size: ~2 petabytes of optical, >12 petabytes of radio, 100s of terabytes of theory
Data formats/models: FITS, HDF5, PostgreSQL, Hadoop/Spark
Data access: 1) web UI, 2) third-party VO apps, 3) APIs
Data ingest: mostly dynamic, but only released publicly at discrete time points
Data and pipeline versioning
General approach to versioning:
● Data release: new DOIs for each public data release (manual validation & release process), e.g. _v01 for the first version, _v02 for the 2nd
● Branches (if applicable): indicating different data processing pipelines used to derive the same quantity (therefore each property of the astronomical object has its own version)
● Old versions of data/branches remain available, but the current version is the default
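The release model above can be sketched as a small catalogue where each publish step appends a _v01/_v02-style suffix, old releases stay resolvable, and the newest one is the default (the class name and DOI prefix are hypothetical):

```python
class ReleaseCatalogue:
    """Sketch of the release model described above: each public data
    release gets its own suffixed identifier (_v01, _v02, ...), older
    releases remain available, and the current one is the default."""

    def __init__(self, base_id):
        self.base_id = base_id
        self.releases = []

    def publish(self):
        suffix = f"_v{len(self.releases) + 1:02d}"   # _v01, _v02, ...
        release_id = self.base_id + suffix
        self.releases.append(release_id)
        return release_id

    def resolve(self, version=None):
        if version is None:
            return self.releases[-1]       # current version is the default
        return self.releases[version - 1]  # old versions remain available

cat = ReleaseCatalogue("10.9999/example.survey")   # hypothetical identifier
assert cat.publish() == "10.9999/example.survey_v01"
assert cat.publish() == "10.9999/example.survey_v02"
assert cat.resolve() == "10.9999/example.survey_v02"
assert cat.resolve(1) == "10.9999/example.survey_v01"
```

Branches would extend this by keying releases per processing pipeline; the essential property shown here is that publishing a new release never removes the resolvability of an old one.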