Why Data Science Matters
Xiaogang (Marshall) Ma
Tetherless World Constellation (TWC), Rensselaer Polytechnic Institute
Email: [email protected]; Twitter: @MarshallXMa
ICSU-WDS Data Stewardship Award Lecture
SciDataCon 2014, New Delhi, India, Nov. 02-05
Acknowledgements
• Dr. Mustapha Mokrane and Dr. Simon Hodson
• Colleagues at TWC/RPI, CODATA-ECDP, ESIP, CGI-IUGS, AGU/ESSI, ICSU-WDS, RDA, ITC, and more
• My mentor Prof. Peter Fox
• My family
• All of you
Outline
• Technical trends
  – Data management, publication & citation
• Methodology
  – Interoperability & Provenance
• Data management is just a start
  – Data analysis
  – Semantic eScience
Data Management Plan
• Data Management Plan (DMP)
  – A formal document that outlines what you will do with your data during and after you complete your research
• Resources and tools that help create DMPs:
  – NSF Data Management Plan Requirements: http://www.nsf.gov/eng/general/dmp.jsp
  – DCC Data Management Plans: http://www.dcc.ac.uk/resources/data-management-plans
  – DMPTool: https://dmptool.org
  – DCC DMPOnline: https://dmponline.dcc.ac.uk
Data Publication
• Data as first class products of research
  – e.g., NSF bio-sketches can include data publications
Image from j4h.net
See: http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp
“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.”
“…authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications.”
“…authors must make materials, data, and associated protocols available to readers.”
“…it is a condition of publication that authors make available the data and research materials supporting the results in the article.”
“…require authors to make all data underlying the findings described in their manuscript fully available without restriction…”
“Earth and space science data should be widely accessible in multiple formats and long‐term preservation of data is an integral responsibility of scientists and sponsoring institutions.”
“…support the principle that research data should be made freely available to all researchers…”
“…recommends depositing data that correspond to journal articles in reliable data repositories…”
• Ways of data publication
  – Data as supplemental material of a paper
  – Standalone data
  – Data paper: data in a repository + descriptive ‘data paper’
Strasser, GeoData 2014 Workshop Presentation (2014)
Examples:
• Standalone data journals: Nature Scientific Data, Geoscience Data Journal, Ecological Archives, Data in Brief …
• Journals that publish data papers: Earth and Space Science, GigaScience, F1000 Research, Internet Archaeology …
Data Citation
• Data Citation Index
  – Indexes the world's leading data repositories
  – Connects datasets to related refereed literature indexed in the Web of Science™
  – Efficient access to data across subjects and regions
Image courtesy http://wokinfo.com
Data interoperability
Ma et al., Nature Geoscience (2011)
Interoperability: “Data should be discoverable, accessible, decodable, understandable and usable, and data sharing should be legal and ethical for all participants.”
Original image from: http://ehna.org
Provenance of research
Image from nature.com
Ma et al., Nature Climate Change (2014)
http://data.globalchange.gov
Provenance documentation “Linking a range of observations and model outputs, research activities, people and organizations involved in the production of scientific findings with the supporting data sets and methods used to generate them”
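The linking described in the quote can be pictured as a small graph of typed records. What follows is a minimal sketch in plain Python, assuming a simplified vocabulary loosely modelled on W3C PROV terms (entity, activity, agent). All identifiers and relation names are hypothetical and are not taken from data.globalchange.gov.

```python
from collections import defaultdict

# Hypothetical provenance records as (subject, relation, object) triples.
# Relation names loosely follow W3C PROV terms; every identifier is invented.
TRIPLES = [
    ("finding:1", "wasDerivedFrom", "dataset:A"),
    ("finding:1", "wasGeneratedBy", "activity:analysis"),
    ("activity:analysis", "used", "dataset:A"),
    ("activity:analysis", "used", "method:regression"),
    ("activity:analysis", "wasAssociatedWith", "person:alice"),
    ("person:alice", "actedOnBehalfOf", "org:lab"),
    ("dataset:A", "wasGeneratedBy", "activity:observation"),
]


def build_index(triples):
    """Map each subject to its outgoing (relation, object) links."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[subject].append((relation, obj))
    return index


def trace(start, triples):
    """Return every resource reachable from `start` by following links."""
    index = build_index(triples)
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for _relation, target in index.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


if __name__ == "__main__":
    # List the data sets, methods, activities, people and organizations
    # that stand behind one finding.
    for resource in sorted(trace("finding:1", TRIPLES)):
        print(resource)
```

Tracing from a finding returns the supporting data sets, methods, activities, people and organizations in one pass, which is the kind of question provenance documentation is meant to answer.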
• IPython Notebook: a web-based interactive computational environment
Di Stefano et al., ESIP 2014 Summer Meeting Presentation (2014)
Codes, APIs, datasets, text…
PDF document
• We extended the IPython Notebook environment to enable automatic provenance capture during a scientific workflow
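One way to picture automatic provenance capture is a wrapper that records each workflow step as it runs. The sketch below is an illustration only, not the IPython Notebook extension described in the talk: the decorator `capture_provenance`, the in-memory `PROVENANCE_LOG`, and the two workflow steps are all hypothetical names.

```python
import functools
from datetime import datetime, timezone

# Hypothetical in-memory store; a real system would persist these records.
PROVENANCE_LOG = []


def capture_provenance(func):
    """Record the name, inputs, output and start time of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "started_at": started,
        })
        return result
    return wrapper


@capture_provenance
def load_values():
    # Stand-in for reading a data set.
    return [2.0, 4.0, 6.0]


@capture_provenance
def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    print(mean(load_values()))
    for record in PROVENANCE_LOG:
        print(record["activity"], "->", record["output"])
```

The scientist writes the workflow steps as usual; the record of what ran, on which inputs, and with what result accumulates as a side effect.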
Semantic eScience
• Artificial Intelligence accelerates scientific discovery
  – Data search, synthesis and hypothesis representation
  – Data analysis: reasoning with models of the data
Gil et al., Science (2014)
Image from science.com
A state-of-the-art example: Hanalyzer (high-throughput analyzer)
• Uses natural language processing to automatically extract a semantic network from all PubMed papers relevant to a scientist
• Uses Semantic Web technology to integrate assertions from other biomedical sources
• Reasons about the network to find new correlations that suggest new genes to investigate
Leach et al., PLoS Comput Bio (2009)
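The idea of merging assertions from several sources into one network and then reasoning over it can be shown at toy scale. The sketch below is not Hanalyzer's algorithm: it assumes a much simpler heuristic (rank unknown genes by how many already-known genes they are linked to), and the gene names, sources and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical assertions: (gene_a, gene_b, source). All values are invented.
ASSERTIONS = [
    ("G1", "G2", "literature"),
    ("G1", "G3", "literature"),
    ("G2", "G3", "database"),
    ("G2", "G4", "database"),
    ("G1", "G4", "literature"),
    ("G5", "G4", "database"),
]


def build_network(assertions):
    """Merge assertions from all sources into one undirected network."""
    network = defaultdict(set)
    for gene_a, gene_b, _source in assertions:
        network[gene_a].add(gene_b)
        network[gene_b].add(gene_a)
    return network


def suggest_genes(known, assertions):
    """Rank genes outside `known` by the number of known genes they touch."""
    network = build_network(assertions)
    scores = {}
    for gene, neighbours in network.items():
        if gene in known:
            continue
        hits = len(neighbours & known)
        if hits:
            scores[gene] = hits
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(suggest_genes({"G1", "G2"}, ASSERTIONS))
```

With G1 and G2 as the genes already under study, G3 and G4 are each linked to both and are suggested; G5 is linked to neither and is left out.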
Deep Carbon Virtual Observatory
Fox, RDA Fourth Plenary Meeting Presentation (2014)
http://deepcarbon.net
A cyber-enabled platform for linked science
Summary
• Data as first class products of research
• eScience: the digital or electronic facilitation of science
• Semantic eScience
  – A virtuous circle between science and semantic technologies
  – Data driven + Knowledge driven?
Image courtesy @WileyExchanges