1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata...

70
1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information (provenance)

Transcript of 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata...

Page 1: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

1

Peter Fox

Data Science – ITEC/CSCI/ERTH

Week 3, September 9, 2014

Data formats, metadata standards, conventions,

reading and writing data and information (provenance)

Page 2: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Contents• Assignment on data collection exercise

• Reading from last week

• Data formats

• Metadata standards, conventions,

• Reading and writing data and information (embedded), where does provenance show up?

• Next week (data analysis I)

2

Page 3: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Data Formats• We will cover some (not all)

– ASCII, UTF-8, ISO 8859-1– Self-describing formats– Table-driven– Markup languages and other web-based– Database– Graphs– Unstructured

3

Page 4: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

ASCII• American Standard Code for Information

Interchange

• http://www.webopedia.com/TERM/A/ASCII.html

• Table of characters http://www.webopedia.com/quick_ref/asciicode.asp

• ISO-8859-1 (aka ISO Latin 1) is a superset of ASCII – used on the web to represent ‘non-ASCII’ characters

• Non-printing characters

4

Page 5: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Example – good or bad?

5

Page 6: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Example – good or bad?

6

Page 7: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Example – good or bad?

7

Page 8: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Example – good or bad?

8

Where is the data?Where is the provenance?

Page 9: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Example – good or bad?

9

Page 10: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

EBCDIC• Extended Binary-Coded Decimal Interchange

Code

• IBM

• 7-bit, 8-bit, 9-bit

• If someone mentions this just RUN away

• Seriously, it is okay to convert

10

Page 11: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Making ASCII more useful• Delimited: CSV or tab or (gulp) ASCII space

– Improves parsing– How to handle special characters?– How to handle ambiguous delimiters?

• Moving them in/out of “Excel” for e.g.

• Templates

• Encoding strings, e.g.– ‘f7.4, c32, i8,f5.3,e9.4’

11

Page 12: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Reading and writing ASCII• Text editor (vi, emacs, wordpad)

– Dangers exist in applications that add hidden formatting, i.e. characters

– Even cr, lf (two ASCII characters, non-printable) can often cause major reading and interoperability problems

• Procedural languages (e.g. C)

• Interpreted languages (e.g. Perl, Python)

• Data structures are utilized (e.g. typed arrays, often multidimensional) to provide logical organization to data that is read in, or in preparation for writing it out

12

Page 13: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Data in “data structures”• JSON – JavaScript Object Notation

json.org/example

{"menu": {

"id": "file",

"value": "File",

"popup": {

"menuitem": [

{"value": "New", "onclick": "CreateNewDoc()"},

{"value": "Open", "onclick": "OpenDoc()"},

{"value": "Close", "onclick": "CloseDoc()"}

]

}

}}

13

Page 14: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Data in “data structures”• JSON – JavaScript Object Notation

json.org/example

{"menu": {

"id": "file",

"value": "File",

"popup": {

"menuitem": [

{"value": "New", "onclick": "CreateNewDoc()"},

{"value": "Open", "onclick": "OpenDoc()"},

{"value": "Close", "onclick": "CloseDoc()"}

]

}

}}

14

The same text expressed as XML:<menu id="file" value="File"> <popup> <menuitem value="New" onclick="CreateNewDoc()" /> <menuitem value="Open" onclick="OpenDoc()" /> <menuitem value="Close" onclick="CloseDoc()" /> </popup></menu>

Page 15: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Data in “applications”• Increasing trend in storing data in application

files, e.g. Excel, .mat (Matlab), .sav (IDL), …

• What advantages?– Ready to use– Data structures are provided

• What problems?– Data structures may not match the underlying

data representation (model), i.e. information and data may be lost (e.g. float instead of double)

– Format versions– Interoperability – can it be read by another app?

15

Page 16: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

FreeForm• 10+ years ago there was an attempt to

provide a templated (almost table driven) approach

• Good homework assignment when you are bored – find out why it was created and what happened to it

16

Page 17: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Spreadsheets• E.g. Excel – import data, Save As csv

17

Page 18: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Documentation?

18

Page 19: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

CDF• Common Data Format• The Common Data Format (CDF) is a self-

describing data format for the storage and manipulation of scalar and multidimensional data in a platform- and discipline-independent fashion

• Although CDF has its own internal self describing format, it consists of more than just a data format. CDF is a scientific data management package (known as the "CDF Library") which allows programmers and application developers to manage and manipulate scalar, vector, and multi-dimensional data arrays 19

Page 20: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

CDFML• The CDF office realized that scientific

progress is often impeded by the lack of, or excessive multiplicity of, available standards for data formats and structures and/or data format translators. In a bid to facilitate and promote data sharing with other data formats, the CDF office has decided to adopt Extensible Markup Language (XML) as a basis for establishing interoperability with other scientific data formats and created CDF Markup Language (CDFML) to describe CDF data and metadata.

20

Page 21: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

netCDF• Network Common Data Format (and API)

• Self describing – what does this mean?

• Variables, dimensions, types, attributes, coordinates

• nc_dump

• nc_open

• nc_inquire

• nc_dim

• nc_varget/ put

• nc_attget/ put21

Page 22: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

22

Page 23: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

NcML• NcML is an XML representation of netCDF

metadata, (approximately) the header information one gets from a netCDF file with the "ncdump -h" command.

• NcML is similar to the netCDF CDL (network Common data form Description Language), except, of course, it uses XML syntax.

• http://www.unidata.ucar.edu/software/netcdf/ncml/• NetCDF-Java library support -

http://www.unidata.ucar.edu/software/netcdf-java/index.html 23

Page 24: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

HDF5• HDF5 is a data model, library, and file format for

storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.

• HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5.

• The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.

• VERY complex API

24

Page 25: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

HDF4• At its lowest level, HDF is a physical file

format for storing scientific data

• At its highest level, HDF is a collection of utilities and applications for manipulating, viewing, and analyzing data in HDF files

• Between these levels, HDF is a software library that provides high-level APIs and a low-level data interface

25

Page 26: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

HDFEOS• A variant of HDF for the Earth Observing

System (EOS)

• http://hdfeos.org/

• More under metadata later

26

Page 27: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

HDFEOS Profiles over time

27

Page 28: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Common Data Model• Versus 1.0 combines netCDF and HDF into

one model, and API

• Uses the underlying HDF format representation but uses the netCDF v4 API

• Simplifies access

28

Page 29: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

FITS• FITS stands for `Flexible Image Transport

System' and is the standard astronomical data format endorsed by both NASA and the IAU.

• FITS is much more than an image format (such as JPG or GIF) and is primarily designed to store scientific data sets consisting of multi-dimensional arrays (1-D spectra, 2-D images or 3-D data cubes) and 2-dimensional tables containing rows and columns of data.

• Many APIs

29

Page 30: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

TIFF/GeoTIFF• Tagged Image File Format 24-bit support• http://www.libtiff.org/ • GeoTIFF is a public domain metadata standard

which allows georeferencing information to be embedded within a TIFF file.

• The potential additional information includes projections, coordinate systems, ellipsoids, datums, and everything else necessary to establish the exact spatial reference for the file.

• The GeoTIFF format is fully compliant with TIFF 6.0, so software incapable of reading and interpreting the specialized metadata will still be able to open a GeoTIFF file.

30

Page 31: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

RDBMS (in one slide)• A Relational database management system

(RDBMS) is a database management system (DBMS) that is based on the relational model as introduced by E. F. Codd.

• Most popular commercial and open source databases currently in use are based on the relational model.

• A short definition of an RDBMS may be a DBMS in which data is stored in the form of tables and the relationship among the data is also stored in the form of tables. 31

Page 32: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

BUFR• Binary Universal Form for the Representation

of meteorological data (BUFR) is a binary data format maintained by the World Meteorological Organization

• The latest version is BUFR Edition 4

• BUFR Edition 3 is also considered current for operational use

• http://www.wmo.ch/pages/prog/www/WMOCodes/OperationalCodes.html

32

Page 33: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

BUFR structure• A BUFR message is composed of six sections, numbered

zero through five.• Sections 0, 1 and 5 contain static metadata, mostly for

message identification.• Section 2 is optional; if used, it may contain arbitrary data in

any form wished for by the creator of the message (this is only advisable for local use).

• Section 3 contains a sequence of so-called descriptors that define the form and contents of the BUFR data product.

• Section 4 is a bit-stream containing the message's core data and meta-data values as laid out by Section 3.The product description contained in Section 3 can be made sophisticated and non-trivial by the use of replication and/or operator descriptors. 33

Page 34: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

GriB• GRIB (GRIdded Binary) is a mathematically

concise data format commonly used in meteorology to store historical and forecast weather data

• Significant amount of software available

• See wikipedia page for more details

34

Page 35: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

ESML• ESML is an interchange technology that enables data (both

structural and semantic) interoperability with applications without enforcing a standard format within the Earth science community.

• Users can write external files using ESML schema to describe the structure of the data file.

• Applications can utilize the ESML Library to parse this description file and decode the data format.

• Software developers can build data format independent scientific applications utilizing the ESML technology.

• Semantic tags can be added to the ESML files by linking different domain ontologies to provide a complete machine understandable data description.

• ESML description file allows the development of intelligent applications that can now understand and "use" the data.

35

Page 36: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

ESML• Earth Science Markup Language

• http://sourceforge.net/projects/esml/

• Schema

• Editor

• Library

• Tutorial

• Application API - IDL

36

Page 37: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

CSML• http://csml.badc.rl.ac.uk/

• Climate Science Markup Language

• CSML is a standards-based data model and GML (Geography Markup Language) application schema for atmospheric and oceanographic data with associated software tools developed at the Rutherford Appleton Laboratory.

• Java library at: http://csml.badc.rl.ac.uk/java/

37

Page 38: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

CSML Java codeProfileCoverage cov = ...;

PrintStream out = ...; // e.g. System.out

RecordType rangeType = cov.getRangeType();

out.println("<table>");

for (Record record : cov.getRange()) { // Each Record is a row in the table

out.print("<tr>"); for (String memberName : rangeType.getMemberNames()) {

// Each member represents a different Phenomenon and

// is a column in the table.

out.print("<td>" + record.getValue(memberName) + "</td>");

}

out.println("</tr>");

}out.println("</table>");

38

Page 39: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

RDF• http://www.w3.org/RDF/ - Resource

Description Framework– Read the introduction and overview

• Graph representation and encoding

• RDF the model and RDF/XML the encoding

• Many tools, and very good language support

• Is the foundation of ‘data on the web’, see www.linkeddata.org

• JSON-LD (JSON for Linked Data)

• We cover this more in a later class 39

Page 40: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Break?

40

Page 41: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Metadata formats• Fall into three categories

– Unstructured and disconnected– With the data– ‘Close’ to the data

• See the ASCII example and contrast this with the netCDF example

• Structure around metadata is very important

• Vocabulary (constraints) are also very useful

• We dream of contextual metadata…41

Page 42: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Dublin Core• DCMI is an open organization engaged in the

development of interoperable online metadata standards that support a broad range of purposes and business models. – ISO Standard 15836-2003 of February 2003– ANSI/NISO Standard Z39.85-2007 of May 2007– IETF RFC 5013 of August 2007

• Metadata element set - http://dublincore.org/documents/dces/

• Metadata terms - http://dublincore.org/documents/dcmi-terms/

42

Page 43: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

DC Type Vocabulary - Sect. 7

• Collection• Dataset• Event• Image• InteractiveResource• MovingImage• PhysicalObject• Service• Software• Sound• StillImage

43

Page 44: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

44

Page 45: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

METS• The METS schema is a standard for

encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium. The standard is maintained in the Network Development and MARC Standards Office of the Library of Congress, and is being developed as an initiative of the Digital Library Federation.

45

Page 46: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

METS example

46

Page 47: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

47

Page 48: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

48

Page 49: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

METS example profile

49

Page 50: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Time• ISO 8601 specifies numeric representations

of date and time. – helps to avoid confusion in international

communication due to different national notations– increases the portability of computer user

interfaces– Good read: http://www.cl.cam.ac.uk/~mgk25/iso-

time.html

• In XML encodings, see xsd:datetime– http://www.w3.org/TR/NOTE-datetime– http://www.w3.org/TR/xmlschema-2/ 50

Page 51: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

ISO 19xxx• http://www.geoportal-idec.cat/geoportal/eng/

familiaiso.html

• Is a family of standards *yes real ones*

• Covers– Geospatial– Features– Many more

• How to read and write?51

Page 52: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Spatial representation• ISO 19115:2003 defines the schema required for

describing geographic information and services• It provides information about the identification, the

extent, the quality, the spatial and temporal schema, spatial reference, and distribution of digital geographic data

• ISO 19115:2003 is applicable to:– the cataloguing of datasets, clearinghouse activities, and

the full description of datasets– geographic datasets, dataset series, and individual

geographic features and feature properties.

52

Page 53: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Spatial representation• ISO 19115:2003 defines

– mandatory and conditional metadata sections, metadata entities, and metadata elements

– the minimum set of metadata required to serve the full range of metadata applications (data discovery, determining data fitness for use, data access, data transfer, and use of digital data)

– optional metadata elements - to allow for a more extensive standard description of geographic data, if required

– a method for extending metadata to fit specialized needs.

• Though ISO 19115:2003 is applicable to digital data, its principles can be extended to many other forms of geographic data such as maps, charts, and textual documents as well as non-geographic data.

53

Page 54: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Spatial representation• ISO 19115-2:2009 extends the existing geographic metadata

standard by defining the schema required for describing imagery and gridded data

• It provides information about the properties of the measuring equipment used to acquire the data, the geometry of the measuring process employed by the equipment, and the production process used to digitize the raw data

• This extension deals with metadata needed to describe the derivation of geographic information from raw data, including the properties of the measuring system, and the numerical methods and computational procedures used in the derivation

• The metadata required to address coverage data in general is addressed sufficiently in the general part of ISO 19115. 54

Page 55: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

netCDF+CF• netCDF – good but there are NO required

attributes or controlled vocabularies

• In the atmospheric sciences, a group formed to developed COARDS –

• And then CF – the climate and forecast conventions

• http://cf-pcmdi.llnl.gov/• The conventions define metadata that provide a definitive description of

what the data in each variable represents, and the spatial and temporal properties of the data. This enables users of data from different sources to decide which quantities are comparable, and facilitates building applications with powerful extraction, regridding, and display capabilities. 55

Page 56: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

CF standard names • http://cf-pcmdi.llnl.gov/documents/cf-

standard-names/

• E.g. air_pressure_at_cloud_base (def: cloud_base refers to the base of the lowest cloud, units: Pa)

• Used in netCDF attribute fields

56

Page 57: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

COARDS• http://ferret.wrc.noaa.gov/noaa_coop/

coop_cdf_profile.html

• Cooperative Ocean/Atmosphere Research Data Service

• For example, if an oceanographic netCDF file encodes the depth of the surface as 0 and the depth of 1000 meters as 1000 then the axis would use attributes as follows:– axis_name:units="meters"; axis_name:positive="down";

• If, on the other hand, the depth of 1000 meters were represented as -1000 then the value of the positive attribute would have been "up". If the units attribute value is a valid pressure unit the default value of the positive attribute is "down”. 57

Page 58: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

HDFEOS• http://eospso.gsfc.nasa.gov/

eos_homepage/mission_profiles/docs/mission2.gif

• This was an effort to add a profile/ convention for NASA use of HDF in the EOS mission

• A lot of custom library software was built to use it (e.g. the IDL HDFEOS library)

58

Page 59: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

More markup languages• GML - Geography Markup Language –

developed as a way to standardize geographic representations (to facilitate interoperability) ISO 19136:2007– Stores data and metadata– Because it focuses on coordinates, is important as

representing structural elements, such as points, lines, polygons used in a specific discipline

– Features application schema to represent roads, rivers, etc.

– Is stored statically as well as generated dynamically– http://schemas.opengis.net/gml/3.2.1/

59

Page 60: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Markup languages• KML – Keyhole Markup Language – developed as

an interlingua for a specific application, i.e. Google Earth– Currently stores data and metadata– XML tag and nesting provides for embedding structure

and associations between metadata and data– Uses other markup languages, e.g. GML

• Currently, KML 2.2 utilizes certain geometry elements derived from GML 2.1.2. These elements include point, line string, linear ring, and polygon.

– Can contain links (external) to other content– Increasingly is now generated dynamically rather than

being a storage format– KMZ – compressed version of KML

60

Page 61: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

61

Page 62: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Provenance – metadata in a given context – think this way

• Who?

• What?

• Where?

• Why?

• When?

• How?

62

Page 63: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

20080602 Fox VSTO et al.

63

• Provenance in this data pipeline

• Provenance is metadata in context

• What context?– Who you are?– What you are

asking?– What you will

use the answer for?

Page 64: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

At the least• Keyword-value pair

– Obs_start_time=“Mon 1 Sep 2014 19:22:30 EDT”

– Observer=“Peter Fox”

64

Page 65: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Provenance standards• International standards:

– ISO lineage (see reading)– PROV

(reading)

65

Page 66: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Wire protocols• Data Access Protocol (www.opendap.org)

– Devised as a superset protocol/ data structure so that any data format could be translated into it, serialized and transported over a network (and de-serialized)

– At present (except caching) data is not stored in DAP ‘format’

– We’ll cover this in more detail in a later class

66

Page 67: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

What to do when none exist?• ISO 19109

• UML and tools….

67

Page 68: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

ISO 19109• ISO 19109:2005(E) defines rules for creating

and documenting application schemas, including principles for the definition of features

• Its scope includes the following– conceptual modeling of features and their properties from a universe

of discourse– definition of application schemas– use of the conceptual schema language for application schemas– transition from the concepts in the conceptual model to the data types

in the application schema– integration of standardized schemas from other ISO geographic

information standards with the application schema. 68

Page 69: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

Summary and discussion• What is in common about the data and

metadata formats?

• Many choices for both – what are the key criteria for choosing?– Read and write capability– Faithful representation of structure of data– Accurate representation of metadata with no (or

minimal) loss of information

69

Page 70: 1 Peter Fox Data Science – ITEC/CSCI/ERTH Week 3, September 9, 2014 Data formats, metadata standards, conventions, reading and writing data and information.

What is next• Assignment 2 on the web today

• Assignment 1 due today, feedback in preparation for week 5

• Next week (Data Analysis I)

• Reading– Data formats: netCDF– Metadata resources – METS– OAI-PMH Protocol for Metadata Harvesting– Climate and Forecast (CF) conventions 70