Why Data Science Matters - 2014 WDS Data Stewardship Award Lecture

TWC Why Data Science Matters Xiaogang (Marshall) Ma Tetherless World Constellation Rensselaer Polytechnic Institute Email: [email protected]; Twitter: @MarshallXMa ICSU-WDS Data Stewardship Award Lecture SciDataCon 2014, New Delhi, India, Nov. 02-05


TWC

Why Data Science Matters

Xiaogang (Marshall) Ma

Tetherless World Constellation

Rensselaer Polytechnic Institute

Email: [email protected]; Twitter: @MarshallXMa

ICSU-WDS Data Stewardship Award Lecture

SciDataCon 2014, New Delhi, India, Nov. 02-05

Acknowledgements

• Dr. Mustapha Mokrane and Dr. Simon Hodson

• Colleagues at TWC/RPI, CODATA-ECDP, ESIP, CGI-IUGS, AGU/ESSI, ICSU-WDS, RDA, ITC, and more

• My mentor Prof. Peter Fox

• My family

• All of you

Outline

• Technical trends

– Data management, publication & citation

• Methodology

– Interoperability & Provenance

• Data management is just a start

– Data analysis

– Semantic eScience

Data Management

data work

Image courtesy Randy Glasbergen

Data Management Plan

• Data Management Plan

– A formal document that outlines what you will do with your data during and after you complete your research

• Resources and tools that help create DMPs:

– NSF Data Management Plan Requirements:

http://www.nsf.gov/eng/general/dmp.jsp

– DCC Data Management Plans:

http://www.dcc.ac.uk/resources/data-management-plans

– DMPTool: https://dmptool.org

– DCC DMPOnline: https://dmponline.dcc.ac.uk

Data Publication

• Data as first class products of research

– e.g., NSF bio-sketches can include data publications

Image from j4h.net

See: http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp

“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.”

“…authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications.”

“…authors must make materials, data, and associated protocols available to readers.”

“…it is a condition of publication that authors make available the data and research materials supporting the results in the article.”

“…require authors to make all data underlying the findings described in their manuscript fully available without restriction…”

“Earth and space science data should be widely accessible in multiple formats and long‐term preservation of data is an integral responsibility of scientists and sponsoring institutions.”

“…support the principle that research data should be made freely available to all researchers…”

“…recommends depositing data that correspond to journal articles in reliable data repositories…”

• Ways of data publication

– Data as supplemental material of a paper

– Standalone data

– Data paper: data in a repository + descriptive ‘data paper’


Strasser, GeoData 2014 Workshop Presentation (2014)

Examples:

• Standalone data journals: Nature Scientific Data, Geoscience Data Journal, Ecological Archives, Data in Brief …

• Journals that publish data papers: Earth and Space Science, GigaScience, F1000 Research, Internet Archaeology …

An isolated data island?!

Image from nature.com

Data Citation

• Data Citation Index

– Indexes the world's leading data repositories

– Connects datasets to related refereed literature indexed in the Web of Science™

– Efficient access to data across subjects and regions

Image courtesy http://wokinfo.com

Data interoperability

Ma et al., Nature Geoscience (2011)

Interoperability:

“Data should be discoverable, accessible, decodable, understandable and usable, and data sharing should be legal and ethical for all participants.”

Original image from: http://ehna.org

Provenance of research

Image from nature.com

Ma et al., Nature Climate Change (2014)

http://data.globalchange.gov

Provenance documentation

“Linking a range of observations and model outputs, research activities, people and organizations involved in the production of scientific findings with the supporting data sets and methods used to generate them”
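A minimal sketch of the linking idea quoted above: provenance is stored as plain subject-relation-object links and then walked back from a finding to everything that supports it. The relation names loosely follow the W3C PROV vocabulary; every identifier (finding:1, dataset:obs, person:analyst, and so on) is hypothetical and is not taken from data.globalchange.gov.

```python
from collections import defaultdict

# Hypothetical provenance links: (subject, relation, object).
# Relation names loosely follow W3C PROV; all identifiers are made up.
LINKS = [
    ("finding:1", "wasDerivedFrom", "figure:1"),
    ("figure:1", "wasGeneratedBy", "activity:plot"),
    ("activity:plot", "used", "dataset:obs"),
    ("activity:plot", "used", "dataset:model"),
    ("activity:plot", "wasAssociatedWith", "person:analyst"),
    ("person:analyst", "actedOnBehalfOf", "org:lab"),
]


def build_index(links):
    """Map each subject to its outgoing (relation, object) pairs."""
    index = defaultdict(list)
    for subject, relation, obj in links:
        index[subject].append((relation, obj))
    return index


def trace(start, links):
    """Return every resource reachable from `start`, in breadth-first order."""
    index = build_index(links)
    seen, queue, order = {start}, [start], []
    while queue:
        node = queue.pop(0)
        for _relation, obj in index[node]:
            if obj not in seen:
                seen.add(obj)
                order.append(obj)
                queue.append(obj)
    return order


if __name__ == "__main__":
    # List the figure, activity, data sets, person and organization
    # that stand behind the hypothetical finding.
    for resource in trace("finding:1", LINKS):
        print(resource)
```

Tracing from a data set instead of a finding returns an empty list here, because the links are stored in one direction only (from result back to source).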

• IPython Notebook:

A web-based interactive computational environment

Di Stefano et al., ESIP 2014 Summer Meeting Presentation (2014)

[Figure labels: Codes, APIs, datasets, text…; PDF document]

• We extended the IPython Notebook environment to enable automatic provenance capture during a scientific workflow
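To illustrate what automatic provenance capture can look like in a notebook session, here is a minimal sketch that is not the extension described above: a plain Python decorator that records each step's name, inputs, output and timestamps as the workflow runs. The names capture_provenance, PROVENANCE_LOG, load_temperatures and mean, and the sample values, are all hypothetical.

```python
import functools
from datetime import datetime, timezone

# Hypothetical in-memory store; a real system would persist these records.
PROVENANCE_LOG = []


def capture_provenance(func):
    """Record name, inputs, output and timestamps for every call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "inputs": {"args": repr(args), "kwargs": repr(kwargs)},
            "output": repr(result),
            "started": started,
            "ended": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper


@capture_provenance
def load_temperatures():
    # Stand-in for reading a data set; the values are made up.
    return [14.1, 14.3, 14.6]


@capture_provenance
def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    mean(load_temperatures())
    # Each workflow step left a record without the analyst writing one by hand.
    for record in PROVENANCE_LOG:
        print(record["activity"], "->", record["output"])
```

The analysis code itself stays unchanged; only the decorator line is added, which is the sense in which the capture is automatic.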


Semantic eScience

• Artificial Intelligence accelerates scientific discovery

– Data search, synthesis and hypothesis representation

– Data analysis: reasoning with models of the data

Gil et al., Science (2014)

Image from science.com

A state-of-the-art example: Hanalyzer (high-throughput analyzer)

• Uses natural language processing to automatically extract a semantic network from all PubMed papers relevant to a scientist

• Uses Semantic Web technology to integrate assertions from other biomedical sources

• Reasons about the network to find new correlations that suggest new genes to investigate
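The three steps above can be pictured with a toy sketch. It is not Hanalyzer's algorithm: it merges gene-concept assertions from two made-up sources into one network and ranks other genes by how many concepts they share with a gene of interest. The gene and concept names, and the functions build_network and suggest, are hypothetical.

```python
from collections import defaultdict

# Hypothetical assertions as (gene, concept) pairs from two sources.
LITERATURE = [("geneA", "process:x"), ("geneB", "process:x"), ("geneC", "process:y")]
DATABASE = [("geneA", "tissue:t"), ("geneB", "tissue:t"), ("geneC", "tissue:t")]


def build_network(*sources):
    """Merge assertions from all sources into gene -> set of linked concepts."""
    network = defaultdict(set)
    for source in sources:
        for gene, concept in source:
            network[gene].add(concept)
    return dict(network)


def suggest(known_gene, network):
    """Rank other genes by the number of concepts shared with `known_gene`."""
    known = network.get(known_gene, set())
    scored = [
        (gene, len(known & concepts))
        for gene, concepts in network.items()
        if gene != known_gene
    ]
    # Keep genes with at least one shared concept, strongest overlap first.
    return sorted(
        (pair for pair in scored if pair[1] > 0),
        key=lambda pair: (-pair[1], pair[0]),
    )


if __name__ == "__main__":
    network = build_network(LITERATURE, DATABASE)
    for gene, shared in suggest("geneA", network):
        print(gene, shared)
```

Integrating the second source matters in this toy case: geneC has no link to geneA in the literature assertions alone and only becomes a candidate once the database assertions are merged in.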


Leach et al., PLoS Comput Bio (2009)

Deep Carbon Virtual Observatory

Fox, RDA Fourth Plenary Meeting Presentation (2014)

http://deepcarbon.net

A cyber-enabled platform for linked science

Summary

• Data as first class products of research

• eScience: the digital or electronic facilitation of science

• Semantic eScience

– A virtuous circle between science and semantic technologies

– Data driven + Knowledge driven?

Image courtesy @WileyExchanges


More information:

Marshall X Ma

[email protected]

Thank you!